Граф коммитов

1370 Коммитов

Автор SHA1 Сообщение Дата
Paul Turner c006fac556 sched: Warn on long periods of pending need_resched
CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.

This feature (LATENCY_WARN) is default disabled.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
2021-04-21 13:55:41 +02:00
Peter Zijlstra 1011dcce99 sched,preempt: Move preempt_dynamic to debug.c
Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to
collect all the debugfs bits.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra 8a99b6833c sched: Move SCHED_DEBUG sysctl to debugfs
Stop polluting sysctl with undocumented knobs that really are debug
only, move them all to /debug/sched/ along with the existing
/debug/sched_* files that already exist.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra b5c4477366 sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
Use the new cpu_dying() state to simplify and fix the balance_push()
vs CPU hotplug rollback state.

Specifically, we currently rely on notifiers sched_cpu_dying() /
sched_cpu_activate() to terminate balance_push, however if the
cpu_down() fails when we're past sched_cpu_deactivate(), it should
terminate balance_push at that point and not wait until we hit
sched_cpu_activate().

Similarly, when cpu_up() fails and we're going back down, balance_push
should be active, where it currently is not.

So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE
(when !cpu_active()), and gate it's utility with cpu_dying().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
2021-04-16 17:06:32 +02:00
Rafael J. Wysocki 0210b8eb72 Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull ARM cpufreq updates for v5.13 from Viresh Kumar:

"- Fix typos in s5pv210 cpufreq driver (Bhaskar Chowdhury).

 - Armada 37xx: Fix cpufreq changing base CPU speed to 800 MHz from
   1000 MHz (Pali Rohár and Marek Behún).

 - cpufreq-dt: Return -EPROBE_DEFER on failure to add table (Quanyang
   Wang).

 - Minor cleanup in cppc driver (Tom Saeger).

 - Add frequency invariance support for CPPC driver and generalize
   freq invariance support arch-topology driver (Viresh Kumar)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: armada-37xx: Fix module unloading
  cpufreq: armada-37xx: Remove cur_frequency variable
  cpufreq: armada-37xx: Fix determining base CPU frequency
  cpufreq: armada-37xx: Fix driver cleanup when registration failed
  clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
  clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
  cpufreq: armada-37xx: Fix the AVS value for load L1
  clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
  cpufreq: armada-37xx: Fix setting TBG parent for load levels
  cpufreq: dt: dev_pm_opp_of_cpumask_add_table() may return -EPROBE_DEFER
  cpufreq: cppc: simplify default delay_us setting
  cpufreq: Rudimentary typos fix in the file s5pv210-cpufreq.c
  cpufreq: CPPC: Add support for frequency invariance
  arch_topology: Export arch_freq_scale and helpers
  arch_topology: Allow multiple entities to provide sched_freq_tick() callback
  arch_topology: Rename freq_scale as arch_freq_scale
2021-04-12 14:46:33 +02:00
Peter Zijlstra 9432bbd969 static_call: Relax static_call_update() function argument type
static_call_update() had stronger type requirements than regular C,
relax them to match. Instead of requiring the @func argument has the
exact matching type, allow any type which C is willing to promote to the
right (function) pointer type. Specifically this allows (void *)
arguments.

This cleans up a bunch of static_call_update() callers for
PREEMPT_DYNAMIC and should get around silly GCC11 warnings for free.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YFoN7nCl8OfGtpeh@hirez.programming.kicks-ass.net
2021-04-09 13:22:12 +02:00
Rasmus Villemoes c4681f3f1c sched/core: Use -EINVAL in sched_dynamic_mode()
-1 is -EPERM which is a somewhat odd error to return from
sched_dynamic_write(). No other callers care about which negative
value is used.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20210325004515.531631-2-linux@rasmusvillemoes.dk
2021-03-25 11:39:13 +01:00
Rasmus Villemoes 7e1b2eb749 sched/core: Stop using magic values in sched_dynamic_mode()
Use the enum names which are also what is used in the switch() in
sched_dynamic_update().

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20210325004515.531631-1-linux@rasmusvillemoes.dk
2021-03-25 11:39:12 +01:00
Viresh Kumar 4c38f2df71 cpufreq: CPPC: Add support for frequency invariance
The Frequency Invariance Engine (FIE) is providing a frequency scaling
correction factor that helps achieve more accurate load-tracking.

Normally, this scaling factor can be obtained directly with the help of
the cpufreq drivers as they know the exact frequency the hardware is
running at. But that isn't the case for CPPC cpufreq driver.

Another way of obtaining that is using the arch specific counter
support, which is already present in kernel, but that hardware is
optional for platforms.

This patch updates the CPPC driver to register itself with the topology
core to provide its own implementation (cppc_scale_freq_tick()) of
topology_scale_freq_tick() which gets called by the scheduler on every
tick. Note that the arch specific counters have higher priority than
CPPC counters, if available, though the CPPC driver doesn't need to have
any special handling for that.

On an invocation of cppc_scale_freq_tick(), we schedule an irq work
(since we reach here from hard-irq context), which then schedules a
normal work item and cppc_scale_freq_workfn() updates the per_cpu
arch_freq_scale variable based on the counter updates since the last
tick.

To allow platforms to disable this CPPC counter-based frequency
invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
which is enabled by default.

This also exports sched_setattr_nocheck() as the CPPC driver can be
built as a module.

Cc: linux-acpi@vger.kernel.org
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-03-22 08:55:28 +05:30
Ingo Molnar 3b03706fa6 sched: Fix various typos
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org
2021-03-22 00:11:52 +01:00
Edmundo Carmona Antoranz 13c2235b2b sched: Remove unnecessary variable from schedule_tail()
Since 565790d28b (sched: Fix balance_callback(), 2020-05-11), there
is no longer a need to reuse the result value of the call to finish_task_switch()
inside schedule_tail(), therefore the variable used to hold that value
(rq) is no longer needed.

Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
2021-03-10 09:51:49 +01:00
Chengming Zhou 7fae6c8171 psi: Use ONCPU state tracking machinery to detect reclaim
Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. And we
also add task psi_flags changes checking in the psi_task_switch()
optimization to update the parents properly.

In terms of performance and cost, this ONCPU task state tracking
is not cheaper than previous timer tick in aggregate. But the code is
simpler and shorter this way, so it's a maintainability win. And
Johannes did some testing with perf bench, the performace and cost
changes would be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization things and the clearer changelog.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
2021-03-06 12:40:22 +01:00
Vincent Guittot c6f886546c sched/fair: Trigger the update of blocked load on newly idle cpu
Instead of waking up a random and already idle CPU, we can take advantage
of this_cpu being about to enter idle to run the ILB and update the
blocked load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org
2021-03-06 12:40:22 +01:00
Peter Zijlstra 50caf9c14b sched: Simplify set_affinity_pending refcounts
Now that we have set_affinity_pending::stop_pending to indicate if a
stopper is in progress, and we have the guarantee that if that stopper
exists, it will (eventually) complete our @pending we can simplify the
refcount scheme by no longer counting the stopper thread.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.724130207@infradead.org
2021-03-06 12:40:21 +01:00
Peter Zijlstra 9e81889c76 sched: Fix affine_move_task() self-concurrency
Consider:

   sched_setaffinity(p, X);		sched_setaffinity(p, Y);

Then the first will install p->migration_pending = &my_pending; and
issue stop_one_cpu_nowait(pending); and the second one will read
p->migration_pending and _also_ issue: stop_one_cpu_nowait(pending),
the _SAME_ @pending.

This causes stopper list corruption.

Add set_affinity_pending::stop_pending, to indicate if a stopper is in
progress.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.649146419@infradead.org
2021-03-06 12:40:21 +01:00
Peter Zijlstra 3f1bc119cd sched: Optimize migration_cpu_stop()
When the purpose of migration_cpu_stop() is to migrate the task to
'any' valid CPU, don't migrate the task when it's already running on a
valid CPU.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.569238629@infradead.org
2021-03-06 12:40:21 +01:00
Peter Zijlstra 58b1a45086 sched: Collate affine_move_task() stoppers
The SCA_MIGRATE_ENABLE and task_running() cases are almost identical,
collapse them to avoid further duplication.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.500108964@infradead.org
2021-03-06 12:40:21 +01:00
Valentin Schneider e140749c9f sched: Simplify migration_cpu_stop()
Since, when ->stop_pending, only the stopper can uninstall
p->migration_pending. This could simplify a few ifs, because:

  (pending != NULL) => (pending == p->migration_pending)

Also, the fatty comment above affine_move_task() probably needs a bit
of gardening.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-06 12:40:21 +01:00
Peter Zijlstra c20cf065d4 sched: Simplify migration_cpu_stop()
When affine_move_task() issues a migration_cpu_stop(), the purpose of
that function is to complete that @pending, not any random other
p->migration_pending that might have gotten installed since.

This realization much simplifies migration_cpu_stop() and allows
further necessary steps to fix all this as it provides the guarantee
that @pending's stopper will complete @pending (and not some random
other @pending).

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.430014682@infradead.org
2021-03-06 12:40:20 +01:00
Peter Zijlstra 8a6edb5257 sched: Fix migration_cpu_stop() requeueing
When affine_move_task(p) is called on a running task @p, which is not
otherwise already changing affinity, we'll first set
p->migration_pending and then do:

	 stop_one_cpu(cpu_of_rq(rq), migration_cpu_stop, &arg);

This then gets us to migration_cpu_stop() running on the CPU that was
previously running our victim task @p.

If we find that our task is no longer on that runqueue (this can
happen because of a concurrent migration due to load-balance etc.),
then we'll end up at the:

	} else if (dest_cpu < 1 || pending) {

branch. Which we'll take because we set pending earlier. Here we first
check if the task @p has already satisfied the affinity constraints,
if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
onto the CPU that is now hosting our task @p:

	stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
			    &pending->arg, &pending->stop_work);

Except, we've never initialized pending->arg, which will be all 0s.

This then results in running migration_cpu_stop() on the next CPU with
arg->p == NULL, which gives the by now obvious result of fireworks.

The cure is to change affine_move_task() to always use pending->arg,
furthermore we can use the exact same pattern as the
SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
completion anyway, no point in adding yet another completion in
stop_one_cpu().

This then gives a clear distinction between the two
migration_cpu_stop() use cases:

  - sched_exec() / migrate_task_to() : arg->pending == NULL
  - affine_move_task() : arg->pending != NULL;

And we can have it ignore p->migration_pending when !arg->pending. Any
stop work from sched_exec() / migrate_task_to() is in addition to stop
works from affine_move_task(), which will be sufficient to issue the
completion.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.357743989@infradead.org
2021-03-06 12:40:20 +01:00
Linus Torvalds 3e10585335 x86:
- Support for userspace to emulate Xen hypercalls
 - Raise the maximum number of user memslots
 - Scalability improvements for the new MMU.  Instead of the complex
   "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an
   rwlock so that page faults are concurrent, but the code that can run
   against page faults is limited.  Right now only page faults take the
   lock for reading; in the future this will be extended to some
   cases of page table destruction.  I hope to switch the default MMU
   around 5.12-rc3 (some testing was delayed due to Chinese New Year).
 - Cleanups for MAXPHYADDR checks
 - Use static calls for vendor-specific callbacks
 - On AMD, use VMLOAD/VMSAVE to save and restore host state
 - Stop using deprecated jump label APIs
 - Workaround for AMD erratum that made nested virtualization unreliable
 - Support for LBR emulation in the guest
 - Support for communicating bus lock vmexits to userspace
 - Add support for SEV attestation command
 - Miscellaneous cleanups
 
 PPC:
 - Support for second data watchpoint on POWER10
 - Remove some complex workarounds for buggy early versions of POWER9
 - Guest entry/exit fixes
 
 ARM64
 - Make the nVHE EL2 object relocatable
 - Cleanups for concurrent translation faults hitting the same page
 - Support for the standard TRNG hypervisor call
 - A bunch of small PMU/Debug fixes
 - Simplification of the early init hypercall handling
 
 Non-KVM changes (with acks):
 - Detection of contended rwlocks (implemented only for qrwlocks,
   because KVM only needs it for x86)
 - Allow __DISABLE_EXPORTS from assembly code
 - Provide a saner follow_pfn replacements for modules
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs
 Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF
 PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4
 5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf
 P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR
 eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ==
 =uFZU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "x86:

   - Support for userspace to emulate Xen hypercalls

   - Raise the maximum number of user memslots

   - Scalability improvements for the new MMU.

     Instead of the complex "fast page fault" logic that is used in
     mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
     but the code that can run against page faults is limited. Right now
     only page faults take the lock for reading; in the future this will
     be extended to some cases of page table destruction. I hope to
     switch the default MMU around 5.12-rc3 (some testing was delayed
     due to Chinese New Year).

   - Cleanups for MAXPHYADDR checks

   - Use static calls for vendor-specific callbacks

   - On AMD, use VMLOAD/VMSAVE to save and restore host state

   - Stop using deprecated jump label APIs

   - Workaround for AMD erratum that made nested virtualization
     unreliable

   - Support for LBR emulation in the guest

   - Support for communicating bus lock vmexits to userspace

   - Add support for SEV attestation command

   - Miscellaneous cleanups

  PPC:

   - Support for second data watchpoint on POWER10

   - Remove some complex workarounds for buggy early versions of POWER9

   - Guest entry/exit fixes

  ARM64:

   - Make the nVHE EL2 object relocatable

   - Cleanups for concurrent translation faults hitting the same page

   - Support for the standard TRNG hypervisor call

   - A bunch of small PMU/Debug fixes

   - Simplification of the early init hypercall handling

  Non-KVM changes (with acks):

   - Detection of contended rwlocks (implemented only for qrwlocks,
     because KVM only needs it for x86)

   - Allow __DISABLE_EXPORTS from assembly code

   - Provide a saner follow_pfn replacements for modules"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
  KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
  KVM: selftests: Don't bother mapping GVA for Xen shinfo test
  KVM: selftests: Fix hex vs. decimal snafu in Xen test
  KVM: selftests: Fix size of memslots created by Xen tests
  KVM: selftests: Ignore recently added Xen tests' build output
  KVM: selftests: Add missing header file needed by xAPIC IPI tests
  KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
  KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
  locking/arch: Move qrwlock.h include after qspinlock.h
  KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
  KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
  KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
  KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
  KVM: PPC: remove unneeded semicolon
  KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
  KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
  KVM: PPC: Book3S HV: Fix radix guest SLB side channel
  KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
  KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
  KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
  ...
2021-02-21 13:31:43 -08:00
Linus Torvalds 657bd90c93 Scheduler updates for v5.12:
[ NOTE: unfortunately this tree had to be freshly rebased today,
         it's a same-content tree of 82891be90f3c (-next published)
         merged with v5.11.
 
         The main reason for the rebase was an authorship misattribution
         problem with a new commit, which we noticed in the last minute,
         and which we didn't want to be merged upstream. The offending
         commit was deep in the tree, and dependent commits had to be
         rebased as well. ]
 
 - Core scheduler updates:
 
   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full),
     to allow distros to build a PREEMPT kernel but fall back to
     close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling
     behavior via a boot time selection.
 
     There's also the /debug/sched_debug switch to do this runtime.
 
     This feature is implemented via runtime patching (a new variant of static calls).
 
     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.
 
     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )
 
     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority
     of distributions, are supposed to be unaffected.
 
   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.
 
     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address
     the underlying issue of missed preemption events. These are the
     initial fixes that should fix current incarnations of the bug.
 
   - Clean up rbtree usage in the scheduler, by providing & using the following
     consistent set of rbtree APIs:
 
      partial-order; less() based:
        - rb_add(): add a new entry to the rbtree
        - rb_add_cached(): like rb_add(), but for a rb_root_cached
 
      total-order; cmp() based:
        - rb_find(): find an entry in an rbtree
        - rb_find_add(): find an entry, and add if not found
 
        - rb_find_first(): find the first (leftmost) matching entry
        - rb_next_match(): continue from rb_find_first()
        - rb_for_each(): iterate a sub-tree using the previous two
 
   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass.
     This is a 4-commit series where each commit improves one aspect of the idle
     sibling scan logic.
 
   - Improve the cpufreq cooling driver by getting the effective CPU utilization
     metrics from the scheduler
 
   - Improve the fair scheduler's active load-balancing logic by reducing the number
     of active LB attempts & lengthen the load-balancing interval. This improves
     stress-ng mmapfork performance.
 
   - Fix CFS's estimated utilization (util_est) calculation bug that can result in
     too high utilization values
 
 - Misc updates & fixes:
 
    - Fix the HRTICK reprogramming & optimization feature
    - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code
    - Reduce dl_add_task_root_domain() overhead
    - Fix uprobes refcount bug
    - Process pending softirqs in flush_smp_call_function_from_idle()
    - Clean up task priority related defines, remove *USER_*PRIO and
      USER_PRIO()
    - Simplify the sched_init_numa() deduplication sort
    - Documentation updates
    - Fix EAS bug in update_misfit_status(), which degraded the quality
      of energy-balancing
    - Smaller cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAtHBsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1itgg/+NGed12pgPjYBzesdou60Lvx7LZLGjfOt
 M1F1EnmQGn/hEH2fCY6ZoqIZQTVltm7GIcBNabzYTzlaHZsdtyuDUJBZyj19vTlk
 zekcj7WVt+qvfjChaNwEJhQ9nnOM/eohMgEOHMAAJd9zlnQvve7NOLQ56UDM+kn/
 9taFJ5ZPvb4avP6C5p3KivvKex6Bjof/Tl0m3utpNyPpI/qK3FyGxwdgCxU0yepT
 ABWQX5ZQCufFvo1bgnBPfqyzab4MqhoM3bNKBsLQfuAlssG1xRv4KQOev4dRwrt9
 pXJikV5C9yez5d2lGe5p0ltH5IZS/l9x2yI/ZQj3OUDTFyV1ic6WfFAqJgDzVF8E
 i/vvA4NPQiI241Bkps+ErcCw4aVOgiY6TWli74cHjLUIX0+As6aHrFWXGSxUmiHB
 WR+B8KmdfzRTTlhOxMA+cvlpZcKCfxWkJJmXzr/lDZzIuKPqM3QCE2wD9sixkfVo
 JNICT0IvZghWOdbMEfZba8Psh/e2LVI9RzdpEiuYJz1ZrVlt1hO0M6jBxY0hMz9n
 k54z81xODw0a8P2FHMtpmB1vhAeqCmvwA6DO8z0Oxs0DFi+KM2bLf2efHsCKafI+
 Bm5v9YFaOk/55R76hJVh+aYLlyFgFkKd+P/niJTPDnxOk3SqJuXvTrql1HeGHkNr
 kYgQa23dsZk=
 =pyaG
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address the
     underlying issue of missed preemption events. These are the initial
     fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthen the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...
2021-02-21 12:35:04 -08:00
Juri Lelli e0ee463c93 sched/features: Distinguish between NORMAL and DEADLINE hrtick
The HRTICK feature has traditionally been servicing configurations that
need precise preemptions point for NORMAL tasks. More recently, the
feature has been extended to also service DEADLINE tasks with stringent
runtime enforcement needs (e.g., runtime < 1ms with HZ=1000).

Enabling HRTICK sched feature currently enables the additional timer and
task tick for both classes, which might introduced undesired overhead
for no additional benefit if one needed it only for one of the cases.

Separate HRTICK sched feature in two (and leave the traditional case
name unmodified) so that it can be selectively enabled when needed.

With:

  $ echo HRTICK > /sys/kernel/debug/sched_features

the NORMAL/fair hrtick gets enabled.

With:

  $ echo HRTICK_DL > /sys/kernel/debug/sched_features

the DEADLINE hrtick gets enabled.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210208073554.14629-3-juri.lelli@redhat.com
2021-02-17 14:12:42 +01:00
Juri Lelli 156ec6f42b sched/features: Fix hrtick reprogramming
Hung tasks and RCU stall cases were reported on systems which were not
100% busy. Investigation of such unexpected cases (no sign of potential
starvation caused by tasks hogging the system) pointed out that the
periodic sched tick timer wasn't serviced anymore after a certain point
and that caused all machinery that depends on it (timers, RCU, etc.) to
stop working as well. This issues was however only reproducible if
HRTICK was enabled.

Looking at core dumps it was found that the rbtree of the hrtimer base
used also for the hrtick was corrupted (i.e. next as seen from the base
root and actual leftmost obtained by traversing the tree are different).
Same base is also used for periodic tick hrtimer, which might get "lost"
if the rbtree gets corrupted.

Much alike what described in commit 1f71addd34 ("tick/sched: Do not
mess with an enqueued hrtimer") there is a race window between
hrtimer_set_expires() in hrtick_start and hrtimer_start_expires() in
__hrtick_restart() in which the former might be operating on an already
queued hrtick hrtimer, which might lead to corruption of the base.

Use hrtick_start() (which removes the timer before enqueuing it back) to
ensure hrtick hrtimer reprogramming is entirely guarded by the base
lock, so that no race conditions can occur.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com
2021-02-17 14:12:42 +01:00
Peter Zijlstra ef72661e28 sched: Harden PREEMPT_DYNAMIC
Use the new EXPORT_STATIC_CALL_TRAMP() / static_call_mod() to unexport
the static_call_key for the PREEMPT_DYNAMIC calls such that modules
can no longer update these calls.

Having modules change/hi-jack the preemption calls would be horrible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-17 14:12:42 +01:00
Peter Zijlstra e59e10f8ef sched: Add /debug/sched_preempt
Add a debugfs file to muck about with the preempt mode at runtime.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/YAsGiUYf6NyaTplX@hirez.programming.kicks-ass.net
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel) 826bfeb37b preempt/dynamic: Support dynamic preempt with preempt= boot option
Support the preempt= boot option and patch the static call sites
accordingly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-9-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel) 2c9a98d3bc preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
Provide static calls to control preempt_schedule[_notrace]()
(called in CONFIG_PREEMPT) so that we can override their behaviour when
preempt= is overriden.

Since the default behaviour is full preemption, both their calls are
initialized to the arch provided wrapper, if any.

[fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
           dependent on x86 with __preempt_schedule_func]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-7-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel) b965f1ddb4 preempt/dynamic: Provide cond_resched() and might_resched() static calls
Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) to that we
can override their behaviour when preempt= is overriden.

Since the default behaviour is full preemption, both their calls are
ignored when preempt= isn't passed.

  [fweisbec: branch might_resched() directly to __cond_resched(), only
             define static calls when PREEMPT_DYNAMIC]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Dietmar Eggemann c541bb7835 sched/core: Update task_prio() function header
The description of the RT offset and the values for 'normal' tasks needs
update. Moreover there are DL tasks now.
task_prio() has to stay like it is to guarantee compatibility with the
/proc/<pid>/stat priority field:

  # cat /proc/<pid>/stat | awk '{ print $18; }'

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggemann@arm.com
2021-02-17 14:08:30 +01:00
Dietmar Eggemann ae18ad281e sched: Remove MAX_USER_RT_PRIO
Commit d46523ea32 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
was introduced due to a a small time period in which the realtime patch
set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
instead.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggemann@arm.com
2021-02-17 14:08:11 +01:00
Ingo Molnar ed3cd45f8c Linux 5.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmAppPgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGeXYH/imZPBd4A1jIMehN
 5HV2A53Z+MXmmaMuGj9X1KV6vsf55/xB+IhOoFdtRAIsO8c2yYSCO8i4+4R0XfYA
 +/YFJeq672rojQnmh6XbpR8dugaAV7CUHy6n7KDsyvtT6EOCpwFSwkOb4X3tBRX6
 TlYgm2d/xgV/wRHSgLVugK0MdFCLMAnyb7mkPfar9QrMgG1BiDKLq07xmwnS23On
 TkqpJ9yZ/rJpUrrUqQYPShSO/FmA+fSfWs0CDv7EIrJ40LUScD6PZxSHWTIHtjLk
 E4jFda6wuqLRVWsBwaBzUIdD0zk7X5quHRzEpbC5ga16SK6yrWvE5YJJXCguIEuZ
 f3FMRYs=
 =CAjn
 -----END PGP SIGNATURE-----

Merge tag 'v5.11' into sched/core, to pick up fixes & refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-17 14:04:39 +01:00
Ingo Molnar 85e853c5ec Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Documentation updates.

- Miscellaneous fixes.

- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
  addresses to more easily locate bugs.  This has a couple of RCU-related commits,
  but is mostly MM.  Was pulled in with akpm's agreement.

- Per-callback-batch tracking of numbers of callbacks,
  which enables better debugging information and smarter
  reactions to large numbers of callbacks.

- The first round of changes to allow CPUs to be runtime switched from and to
  callback-offloaded state.

- CONFIG_PREEMPT_RT-related changes.

- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.

- Torture-test and torture-test scripting updates, including a "torture everything"
  script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
  Plus does an allmodconfig build.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 12:56:55 +01:00
Ben Gardon f3d4b4b1dc sched: Add cond_resched_rwlock
Safely rescheduling while holding a spin lock is essential for keeping
long running kernel operations running smoothly. Add the facility to
cond_resched rwlocks.

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:43 -05:00
Paul E. McKenney 0d2460ba61 Merge branches 'doc.2021.01.06a', 'fixes.2021.01.04b', 'kfree_rcu.2021.01.04a', 'mmdumpobj.2021.01.22a', 'nocb.2021.01.06a', 'rt.2021.01.04a', 'stall.2021.01.06a', 'torture.2021.01.12a' and 'tortureall.2021.01.06a' into HEAD
doc.2021.01.06a: Documentation updates.
fixes.2021.01.04b: Miscellaneous fixes.
kfree_rcu.2021.01.04a: kfree_rcu() updates.
mmdumpobj.2021.01.22a: Dump allocation point for memory blocks.
nocb.2021.01.06a: RCU callback offload updates and cblist segment lengths.
rt.2021.01.04a: Real-time updates.
stall.2021.01.06a: RCU CPU stall warning updates.
torture.2021.01.12a: Torture-test updates and polling SRCU grace-period API.
tortureall.2021.01.06a: Torture-test script updates.
2021-01-22 15:26:44 -08:00
Peter Zijlstra 741ba80f6f sched: Relax the set_cpus_allowed_ptr() semantics
Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
tasks to retain during CPU offline, we can relax the warning in
set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
the last minute will get pushed off before it can run.

While during CPU online there is no harm, and actual benefit, to
allowing kthreads back on early, it simplifies hotplug code and fixes
a number of outstanding races.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org
2021-01-22 15:09:44 +01:00
Peter Zijlstra 5ba2ffba13 sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
Prior to commit 1cf12e08bc ("sched/hotplug: Consolidate task
migration on CPU unplug") we'd leave any task on the dying CPU and
break affinity and force them off at the very end.

This scheme had to change in order to enable migrate_disable(). One
cannot wait for migrate_disable() to complete while stuck in
stop_machine(). Furthermore, since we need at the very least: idle,
hotplug and stop threads at any point before stop_machine, we can't
break affinity and/or push those away.

Under the assumption that all per-cpu kthreads are sanely handled by
CPU hotplug, the new code no long breaks affinity or migrates any of
them (which then includes the critical ones above).

However, there's an important difference between per-cpu kthreads and
kthreads that happen to have a single CPU affinity which is lost. The
latter class very much relies on the forced affinity breaking and
migration semantics previously provided.

Use the new kthread_is_per_cpu() infrastructure to tighten
is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
change.

Fixes: 1cf12e08bc ("sched/hotplug: Consolidate task migration on CPU unplug")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org
2021-01-22 15:09:44 +01:00
Peter Zijlstra 975707f227 sched: Prepare to use balance_push in ttwu()
In preparation of using the balance_push state in ttwu() we need it to
provide a reliable and consistent state.

The immediate problem is that rq->balance_callback gets cleared every
schedule() and then re-set in the balance_push_callback() itself. This
is not a reliable signal, so add a variable that stays set during the
entire time.

Also move setting it before the synchronize_rcu() in
sched_cpu_deactivate(), such that we get guaranteed visibility to
ttwu(), which is a preempt-disable region.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.966069627@infradead.org
2021-01-22 15:09:43 +01:00
Peter Zijlstra 22f667c97a sched: Don't run cpu-online with balance_push() enabled
We don't need to push away tasks when we come online, mark the push
complete right before the CPU dies.

XXX hotplug state machine has trouble with rollback here.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.415606087@infradead.org
2021-01-22 15:09:42 +01:00
Valentin Schneider 36c6e17bf1 sched/core: Print out straggler tasks in sched_cpu_dying()
Since commit

  1cf12e08bc ("sched/hotplug: Consolidate task migration on CPU unplug")

tasks are expected to move themselves out of a out-going CPU. For most
tasks this will be done automagically via BALANCE_PUSH, but percpu kthreads
will have to cooperate and move themselves away one way or another.

Currently, some percpu kthreads (workqueues being a notable exemple) do not
cooperate nicely and can end up on an out-going CPU at the time
sched_cpu_dying() is invoked.

Print the dying rq's tasks to shed some light on the stragglers.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210113183141.11974-1-valentin.schneider@arm.com
2021-01-22 15:09:41 +01:00
Anna-Maria Behnsen e0b257c3b7 sched: Prevent raising SCHED_SOFTIRQ when CPU is !active
SCHED_SOFTIRQ is raised to trigger periodic load balancing. When CPU is not
active, CPU should not participate in load balancing.

The scheduler uses nohz.idle_cpus_mask to keep track of the CPUs which can
do idle load balancing. When bringing a CPU up the CPU is added to the mask
when it reaches the active state, but on teardown the CPU stays in the mask
until it goes offline and invokes sched_cpu_dying().

When SCHED_SOFTIRQ is raised on a !active CPU, there might be a pending
softirq when stopping the tick which triggers a warning in NOHZ code. The
SCHED_SOFTIRQ can also be raised by the scheduler tick which has the same
issue.

Therefore remove the CPU from nohz.idle_cpus_mask when it is marked
inactive and also prevent the scheduler_tick() from raising SCHED_SOFTIRQ
after this point.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201215104400.9435-1-anna-maria@linutronix.de
2021-01-14 11:20:09 +01:00
Viresh Kumar a5418be9df sched/core: Rename schedutil_cpu_util() and allow rest of the kernel to use it
There is nothing schedutil specific in schedutil_cpu_util(), rename it
to effective_cpu_util(). Also create and expose another wrapper
sched_cpu_util() which can be used by other parts of the kernel, like
thermal core (that will be done in a later commit).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/db011961fb3bb8bef1c0eda5cd64564637d3ef31.1607400596.git.viresh.kumar@linaro.org
2021-01-14 11:20:09 +01:00
Viresh Kumar 7d6a905f3d sched/core: Move schedutil_cpu_util() to core.c
There is nothing schedutil specific in schedutil_cpu_util(), move it to
core.c and define it only for CONFIG_SMP.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/c921a362c78e1324f8ebc5aaa12f53e309c5a8a2.1607400596.git.viresh.kumar@linaro.org
2021-01-14 11:20:08 +01:00
Peter Zijlstra 1b7af29554 sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled
The try_invoke_on_locked_down_task() function currently requires
that interrupts be enabled, but it is called with interrupts
disabled from rcu_print_task_stall(), resulting in an "IRQs not
enabled as expected" diagnostic.  This commit therefore updates
try_invoke_on_locked_down_task() to use raw_spin_lock_irqsave() instead
of raw_spin_lock_irq(), thus allowing use from either context.

Link: https://lore.kernel.org/lkml/000000000000903d5805ab908fc4@google.com/
Link: https://lore.kernel.org/lkml/20200928075729.GC2611@hirez.programming.kicks-ass.net/
Reported-by: syzbot+cb3b69ae80afd6535b0e@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-04 15:49:52 -08:00
Linus Torvalds 3b80dee70e Fix a context switch performance regression.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl/oTrQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gFcBAAtVljuMTvy9RhyX5s+Q8XEa81+iSTckht
 gdd26WbfGBmKMqEXdKtwlG+ZwPHBzHKIipy4thSb7B0SbuYpiyBOlir6aGwpQZD5
 puMRCVuGzyoW02oGExGnuEOteNUQ+hyj6z351G6R0152Tp/5WPZSM8Wvr745Pjkb
 mmAx3VELRRoq0q4ecz/MUHiZ+XVGpN/rbMj1O9hm5RFdUQHROFqwxAIJ7Hnan3v9
 fSOiFRVTNtIflvIHhR8w052pPx/5Sg+UNi/T8n6gSP5WeKamTEPIs/q6nROgX9Qm
 4SEK8PM0epkhVhoLzKNgaP7GpXYKTpifZ/04Y6QZ5sRveo7tHvlNVQvE+uN82ARm
 SFmJvhbrHi00CRdYmOOERivOJahkNrEgsJTj5Nd/kmno92lkBv5S/+hHl2JEtLDb
 P2d3GWh+8aUEFUh+VA73Z4SoCaVA/VlzErdCm4EBY/efu3fFhKafCcs/nh3gQ9cU
 KK5gBWFt/pG3EDPH6d89d/O7akZcOjnB6jelaUbVxtbG/xCO8uh2RZ16gV1Bvvnn
 gqjNTXolY9jeFCt9FB+Tg3cxRbITEiqivr7nG7KluiWdsdujEV05OkpOegQCkq74
 HE/UzH2GZzoVHYKm6rBOlOuMDV77ClE8vrmOKz4sb4oquXHkr/78uBaScHcIRG4c
 nap1c0DJ4nc=
 =soZP
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a context switch performance regression"

* tag 'sched-urgent-2020-12-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Optimize finish_lock_switch()
2020-12-27 09:00:47 -08:00
Peter Zijlstra ae79270232 sched: Optimize finish_lock_switch()
The kernel test robot measured a -1.6% performance regression on
will-it-scale/sched_yield due to commit:

  2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")

Even though we were careful to replace a single load with another
single load from the same cacheline.

Restore finish_lock_switch() to the exact state before the offending
patch and solve the problem differently.

Fixes: 2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201210161408.GX3021@hirez.programming.kicks-ass.net
2020-12-15 11:27:53 +01:00
Linus Torvalds edd7ab7684 The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic implementation
     which builds the base for the kmap_local() API and make the
     kmap_atomic() interface wrappers which handle the disabling/enabling of
     preemption and pagefaults.
 
   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.
 
   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a mapping
     is established. It has to disable migration instead to guarantee that
     the virtual address of the mapped slot is the same accross preemption.
 
   - Provide better debug facilities: guard pages and enforced utilization
     of the mapping mechanics on 64bit systems when the architecture allows
     it.
 
   - Provide the new kmap_local() API which can now be used to cleanup the
     kmap_atomic() usage sites all over the place. Most of the usage sites
     do not require the implicit disabling of preemption and pagefaults so
     the penalty on 64bit and 32bit non-highmem systems is removed and quite
     some of the code can be simplified. A wholesale conversion is not
     possible because some usage depends on the implicit side effects and
     some need to be cleaned up because they work around these side effects.
 
     The migrate disable side effect is only effective on highmem systems
     and when enforced debugging is enabled. On 64bit and 32bit non-highmem
     systems the overhead is completely avoided.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
 0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
 4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
 p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
 RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
 +x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
 ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
 Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
 XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
 FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
 HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
 +jlfoJhMbtx5Gg==
 =n71I
 -----END PGP SIGNATURE-----

Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull kmap updates from Thomas Gleixner:
 "The new preemtible kmap_local() implementation:

   - Consolidate all kmap_atomic() internals into a generic
     implementation which builds the base for the kmap_local() API and
     make the kmap_atomic() interface wrappers which handle the
     disabling/enabling of preemption and pagefaults.

   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.

   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a
     mapping is established. It has to disable migration instead to
     guarantee that the virtual address of the mapped slot is the same
     across preemption.

   - Provide better debug facilities: guard pages and enforced
     utilization of the mapping mechanics on 64bit systems when the
     architecture allows it.

   - Provide the new kmap_local() API which can now be used to cleanup
     the kmap_atomic() usage sites all over the place. Most of the usage
     sites do not require the implicit disabling of preemption and
     pagefaults so the penalty on 64bit and 32bit non-highmem systems is
     removed and quite some of the code can be simplified. A wholesale
     conversion is not possible because some usage depends on the
     implicit side effects and some need to be cleaned up because they
     work around these side effects.

     The migrate disable side effect is only effective on highmem
     systems and when enforced debugging is enabled. On 64bit and 32bit
     non-highmem systems the overhead is completely avoided"

* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  ARM: highmem: Fix cache_is_vivt() reference
  x86/crashdump/32: Simplify copy_oldmem_page()
  io-mapping: Provide iomap_local variant
  mm/highmem: Provide kmap_local*
  sched: highmem: Store local kmaps in task struct
  x86: Support kmap_local() forced debugging
  mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
  microblaze/mm/highmem: Add dropped #ifdef back
  xtensa/mm/highmem: Make generic kmap_atomic() work correctly
  mm/highmem: Take kmap_high_get() properly into account
  highmem: High implementation details and document API
  Documentation/io-mapping: Remove outdated blurb
  io-mapping: Cleanup atomic iomap
  mm/highmem: Remove the old kmap_atomic cruft
  highmem: Get rid of kmap_types.h
  xtensa/mm/highmem: Switch to generic kmap atomic
  sparc/mm/highmem: Switch to generic kmap atomic
  powerpc/mm/highmem: Switch to generic kmap atomic
  nds32/mm/highmem: Switch to generic kmap atomic
  ...
2020-12-14 18:35:53 -08:00
Linus Torvalds adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Linus Torvalds 1ac0884d54 A set of updates for entry/exit handling:
- More generalization of entry/exit functionality
 
  - The consolidation work to reclaim TIF flags on x86 and also for non-x86
    specific TIF flags which are solely relevant for syscall related work
    and have been moved into their own storage space. The x86 specific part
    had to be merged in to avoid a major conflict.
 
  - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
    delivery mode of task work and results in an impressive performance
    improvement for io_uring. The non-x86 consolidation of this is going to
    come seperate via Jens.
 
  - The selective syscall redirection facility which provides a clean and
    efficient way to support the non-Linux syscalls of WINE by catching them
    at syscall entry and redirecting them to the user space emulation. This
    can be utilized for other purposes as well and has been designed
    carefully to avoid overhead for the regular fastpath. This includes the
    core changes and the x86 support code.
 
  - Simplification of the context tracking entry/exit handling for the users
    of the generic entry code which guarantee the proper ordering and
    protection.
 
  - Preparatory changes to make the generic entry code accomodate S390
    specific requirements which are mostly related to their syscall restart
    mechanism.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XoPoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoe0tD/4jSKHIogVM9kVpiYfwjDGS1NluaBXn
 71ZoASbX9GZebyGandMyF2QP1iJ24ZO0RztBwHEVH6fyomKB2iFNedssCpO9yfWV
 3eFRpOvMpbszY2W2bd0QG3GrqaTttjVfB4ahkGLzqeSbchdob6hZpNDYtBZnujA6
 GSnrrurfJkCGoQny+yJQYdQJXQU+BIX90B2a2Q+jW123Luy/iHXC1f/krZSA1m14
 fC9xYLSUjPphTzh2ZOW+C3DgdjOL5PfAm/6F+DArt4GtLgrEGD7R74aLSFhvetky
 dn5QtG+yAsz1i0cc5Wu/JBcT9tOkY92rPYSyLI9bYQUSQ/bMyuprz6oYKj3dubsu
 ZSsKPdkNFPIniL4fLdCMWZcIXX5xgnrxKjdgXZXW3gtrcxSns8w8uED3Sh7dgE08
 pgIeq67E5g/OB8kJXH1VxdewmeQb9cOmnzzHwNO7TrrGbBKjDTYHNdYOKf1dUTTK
 ZX1UjLfGwxTkMYAbQD1k0JGZ2OLRshzSaH5BW/ZKa3bvJW6yYOq+/YT8B8hbJ8U3
 vThlO75/55IJxS5r5Y3vZd/IHdsYbPuETD+TA8tNYtPqNZasW8nnk4TYctWqzDuO
 /Ka1wvWYid3c6ySznQn4zSyRjr968AfHeZ9YTUMhWufy5waXVmdBMG41u3IKfsVt
 osyzNc4EK19/Mg==
 =hsjV
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:

   - More generalization of entry/exit functionality

   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.

   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come seperate via Jens.

   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.

   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.

   - Preparatory changes to make the generic entry code accomodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"

* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry_Add_enter_from_user_mode_wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ...
2020-12-14 17:13:53 -08:00
Mauro Carvalho Chehab 59a74b1544 sched: Fix kernel-doc markup
Kernel-doc requires that a kernel-doc markup to be immediately
below the function prototype, as otherwise it will rename it.
So, move sys_sched_yield() markup to the right place.

Also fix the cpu_util() markup: Kernel-doc markups
should use this format:
        identifier - description

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
2020-12-11 10:30:31 +01:00
Ingo Molnar a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Peter Zijlstra 545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Thomas Gleixner 5fbda3ecd1 sched: highmem: Store local kmaps in task struct
Instead of storing the map per CPU provide and use per task storage. That
prepares for local kmaps which are preemptible.

The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. Will be made usable in the
next step.

The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified so any nesting kmap in an interrupt will use unused indices
and on return the counter is the same as before.

Also add an assert into the return to user space code. Going back to user
space with an active kmap local is a nono.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.372935758@linutronix.de
2020-11-24 14:42:09 +01:00
Thomas Gleixner 74d862b682 sched: Make migrate_disable/enable() independent of RT
Now that the scheduler can deal with migrate disable properly, there is no
real compelling reason to make it only available for RT.

There are quite some code pathes which needlessly disable preemption in
order to prevent migration and some constructs like kmap_atomic() enforce
it implicitly.

Making it available independent of RT allows to provide a preemptible
variant of kmap_atomic() and makes the code more consistent in general.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Grudgingly-Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.269943012@linutronix.de
2020-11-24 11:25:44 +01:00
Dietmar Eggemann 480a6ca2dc sched/uclamp: Allow to reset a task uclamp constraint value
In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.

The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has be used in conjunction with the existing uclamp related flags.

The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
2020-11-19 11:25:47 +01:00
Tal Zussman b19a888c1e sched/core: Fix typos in comments
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201113005156.GA8408@charmander
2020-11-19 11:25:46 +01:00
Peter Zijlstra 1293771e43 sched: Fix migration_cpu_stop() WARN
Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
case is just plain wrong to begin with, since per the earlier branch
that isn't the actual CPU of the task.

Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
test using task_cpu().

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-11-19 11:25:45 +01:00
Valentin Schneider d707faa64d sched/core: Add missing completion for affine_move_task() waiters
Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
a wait_for_completion(). The problematic pattern seems to be:

  affine_move_task()
      // task_running() case
      stop_one_cpu();
      wait_for_completion(&pending->done);

Combined with, on the stopper side:

  migration_cpu_stop()
    // Task moved between unlocks and scheduling the stopper
    task_rq(p) != rq &&
    // task_running() case
    dest_cpu >= 0

    => no complete_all()

This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
be more likely to see this given the targeted task has a much bigger window
to block and be woken up elsewhere before the stopper runs.

Make migration_cpu_stop() always look at pending affinity requests; signal
their completion if the stopper hits a rq mismatch but the task is
still within its allowed mask. When Migrate-Disable isn't involved, this
matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
behaviour.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
2020-11-19 11:25:45 +01:00
Frederic Weisbecker 6775de4984 context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
schedule_user() was traditionally used by the entry code's tail to
preempt userspace after the call to user_enter(). Indeed the call to
user_enter() used to be performed upon syscall exit slow path which was
right before the last opportunity to schedule() while resuming to
userspace. The context tracking state had to be saved on the task stack
and set back to CONTEXT_KERNEL temporarily in order to safely switch to
another task.

Only a few archs use it now (namely sparc64 and powerpc64) and those
implementing HAVE_CONTEXT_TRACKING_OFFSTACK definetly can't rely on it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-5-frederic@kernel.org
2020-11-19 11:25:42 +01:00
Frederic Weisbecker 9f68b5b74c sched: Detect call to schedule from critical entry code
Detect calls to schedule() between user_enter() and user_exit(). Those
are symptoms of early entry code that either forgot to protect a call
to schedule() inside exception_enter()/exception_exit() or, in the case
of HAVE_CONTEXT_TRACKING_OFFSTACK, enabled interrupts or preemption in
a wrong spot.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-4-frederic@kernel.org
2020-11-19 11:25:42 +01:00
Juri Lelli 2279f540ea sched/deadline: Fix priority inheritance with multiple scheduling classes
Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]—

He also provided a simple reproducer creating the situation below:

 So the execution order of locking steps are the following
 (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
 are mutexes that are enabled * with priority inheritance.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |             Lock(M2)           |
 |               |                |
 |               |              Lock(M2)
 |               |                |
 |             Lock(M1)           |
 |             (!!bug triggered!) |

Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).

Problem is that boosted entities (Priority Inheritance) use static
DEADLINE parameters of the top priority waiter. However, there might be
cases where top waiter could be a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, top waiter static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().

Fix this by keeping track of the original donor and using its parameters
when a task is boosted.

Reported-by: Glenn Elliott <glenn@aurora.tech>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
2020-11-17 13:15:28 +01:00
Peter Zijlstra ec618b84f6 sched: Fix rq->nr_iowait ordering
schedule()				ttwu()
    deactivate_task();			  if (p->on_rq && ...) // false
					    atomic_dec(&task_rq(p)->nr_iowait);
    if (prev->in_iowait)
      atomic_inc(&rq->nr_iowait);

Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.

Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
2020-11-17 13:15:28 +01:00
Valentin Schneider 3aef1551e9 sched: Remove select_task_rq()'s sd_flag parameter
Only select_task_rq_fair() uses that parameter to do an actual domain
search, other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. For select_task_rq_fair(), we can
simply use the shiny new WF_flag : SD_flag mapping.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
2020-11-10 18:39:06 +01:00
Peter Zijlstra 12fa97c64d Merge branch 'sched/migrate-disable' 2020-11-10 18:39:04 +01:00
Valentin Schneider c777d84710 sched: Comment affine_move_task()
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201013140116.26651-2-valentin.schneider@arm.com
2020-11-10 18:39:02 +01:00
Valentin Schneider 885b3ba47a sched: Deny self-issued __set_cpus_allowed_ptr() when migrate_disable()
migrate_disable();
  set_cpus_allowed_ptr(current, {something excluding task_cpu(current)});
  affine_move_task(); <-- never returns

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201013140116.26651-1-valentin.schneider@arm.com
2020-11-10 18:39:02 +01:00
Peter Zijlstra a7c81556ec sched: Fix migrate_disable() vs rt/dl balancing
In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task. Teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).

This adds migration interference to the higher priority task, but
restores bandwidth to system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.

This way we can still approximate running the M highest priority tasks
on the system.

Migrating the top task away is (ofcourse) still subject to
migrate_disable() too, which means the lower task is subject to an
interference equivalent to the worst case migrate_disable() section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
2020-11-10 18:39:01 +01:00
Peter Zijlstra ded467dc83 sched, lockdep: Annotate ->pi_lock recursion
There's a valid ->pi_lock recursion issue where the actual PI code
tries to wake up the stop task. Make lockdep aware so it doesn't
complain about this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.406912197@infradead.org
2020-11-10 18:39:01 +01:00
Thomas Gleixner 3015ef4b98 sched/core: Make migrate disable and CPU hotplug cooperative
On CPU unplug tasks which are in a migrate disabled region cannot be pushed
to a different CPU until they returned to migrateable state.

Account the number of tasks on a runqueue which are in a migrate disabled
section and make the hotplug wait mechanism respect that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.067278757@infradead.org
2020-11-10 18:39:00 +01:00
Peter Zijlstra 6d337eab04 sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
Concurrent migrate_disable() and set_cpus_allowed_ptr() has
interesting features. We rely on set_cpus_allowed_ptr() to not return
until the task runs inside the provided mask. This expectation is
exported to userspace.

This means that any set_cpus_allowed_ptr() caller must wait until
migrate_enable() allows migrations.

At the same time, we don't want migrate_enable() to schedule, due to
patterns like:

	preempt_disable();
	migrate_disable();
	...
	migrate_enable();
	preempt_enable();

And:

	raw_spin_lock(&B);
	spin_unlock(&A);

this means that when migrate_enable() must restore the affinity
mask, it cannot wait for completion thereof. Luck will have it that
that is exactly the case where there is a pending
set_cpus_allowed_ptr(), so let that provide storage for the async stop
machine.

Much thanks to Valentin who used TLA+ most effective and found lots of
'interesting' cases.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.921768277@infradead.org
2020-11-10 18:39:00 +01:00
Peter Zijlstra af449901b8 sched: Add migrate_disable()
Add the base migrate_disable() support (under protest).

While migrate_disable() is (currently) required for PREEMPT_RT, it is
also one of the biggest flaws in the system.

Notably this is just the base implementation, it is broken vs
sched_setaffinity() and hotplug, both solved in additional patches for
ease of review.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.818170844@infradead.org
2020-11-10 18:38:59 +01:00
Peter Zijlstra 9cfc3e18ad sched: Massage set_cpus_allowed()
Thread a u32 flags word through the *set_cpus_allowed*() callchain.
This will allow adding behavioural tweaks for future users.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.729082820@infradead.org
2020-11-10 18:38:59 +01:00
Peter Zijlstra 120455c514 sched: Fix hotplug vs CPU bandwidth control
Since we now migrate tasks away before DYING, we should also move
bandwidth unthrottle, otherwise we can gain tasks from unthrottle
after we expect all tasks to be gone already.

Also; it looks like the RT balancers don't respect cpu_active() and
instead rely on rq->online in part, complete this. This too requires
we do set_rq_offline() earlier to match the cpu_active() semantics.
(The bigger patch is to convert RT to cpu_active() entirely)

Since set_rq_online() is called from sched_cpu_activate(), place
set_rq_offline() in sched_cpu_deactivate().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
2020-11-10 18:38:59 +01:00
Thomas Gleixner 1cf12e08bc sched/hotplug: Consolidate task migration on CPU unplug
With the new mechanism which kicks tasks off the outgoing CPU at the end of
schedule() the situation on an outgoing CPU right before the stopper thread
brings it down completely is:

 - All user tasks and all unbound kernel threads have either been migrated
   away or are not running and the next wakeup will move them to a online CPU.

 - All per CPU kernel threads, except cpu hotplug thread and the stopper
   thread have either been unbound or parked by the responsible CPU hotplug
   callback.

That means that at the last step before the stopper thread is invoked the
cpu hotplug thread is the last legitimate running task on the outgoing
CPU.

Add a final wait step right before the stopper thread is kicked which
ensures that any still running tasks on the way to park or on the way to
kick themself of the CPU are either sleeping or gone.

This allows to remove the migrate_tasks() crutch in sched_cpu_dying(). If
sched_cpu_dying() detects that there is still another running task aside of
the stopper thread then it will explode with the appropriate fireworks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.547163969@infradead.org
2020-11-10 18:38:58 +01:00
Thomas Gleixner f2469a1fb4 sched/core: Wait for tasks being pushed away on hotplug
RT kernels need to ensure that all tasks which are not per CPU kthreads
have left the outgoing CPU to guarantee that no tasks are force migrated
within a migrate disabled section.

There is also some desire to (ab)use fine grained CPU hotplug control to
clear a CPU from active state to force migrate tasks which are not per CPU
kthreads away for power control purposes.

Add a mechanism which waits until all tasks which should leave the CPU
after the CPU active flag is cleared have moved to a different online CPU.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.377836842@infradead.org
2020-11-10 18:38:58 +01:00
Peter Zijlstra 2558aacff8 sched/hotplug: Ensure only per-cpu kthreads run during hotplug
In preparation for migrate_disable(), make sure only per-cpu kthreads
are allowed to run on !active CPUs.

This is ran (as one of the very first steps) from the cpu-hotplug
task which is a per-cpu kthread and completion of the hotplug
operation only requires such tasks.

This constraint enables the migrate_disable() implementation to wait
for completion of all migrate_disable regions on this CPU at hotplug
time without fear of any new ones starting.

This replaces the unlikely(rq->balance_callbacks) test at the tail of
context_switch with an unlikely(rq->balance_work), the fast path is
not affected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.292709163@infradead.org
2020-11-10 18:38:57 +01:00
Peter Zijlstra 565790d28b sched: Fix balance_callback()
The intent of balance_callback() has always been to delay executing
balancing operations until the end of the current rq->lock section.
This is because balance operations must often drop rq->lock, and that
isn't safe in general.

However, as noted by Scott, there were a few holes in that scheme;
balance_callback() was called after rq->lock was dropped, which means
another CPU can interleave and touch the callback list.

Rework code to call the balance callbacks before dropping rq->lock
where possible, and otherwise splice the balance list onto a local
stack.

This guarantees that the balance list must be empty when we take
rq->lock. IOW, we'll only ever run our own balance callbacks.

Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.203901269@infradead.org
2020-11-10 18:38:57 +01:00
Peter Zijlstra a8b62fd085 stop_machine: Add function and caller debug info
Crashes in stop-machine are hard to connect to the calling code, add a
little something to help with that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.116513635@infradead.org
2020-11-10 18:38:57 +01:00
Thomas Gleixner 345a957fcc sched: Reenable interrupts in do_sched_yield()
do_sched_yield() invokes schedule() with interrupts disabled which is
not allowed. This goes back to the pre git era to commit a6efb709806c
("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

Reenable interrupts and remove the misleading comment which "explains" it.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
2020-10-29 11:00:31 +01:00
Juri Lelli a73f863af4 sched/features: Fix !CONFIG_JUMP_LABEL case
Commit:

  765cc3a4b2 ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

For the latter echoing changes to /sys/kernel/debug/sched_features has
the nasty effect of effectively changing what sched_features reports,
but without actually changing the scheduler behaviour (since different
translation units get different sysctl_sched_features).

Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.

Fixes: 765cc3a4b2 ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
2020-10-14 19:55:46 +02:00
Linus Torvalds edaa5ddf38 Scheduler changes for v5.10:
- Reorganize & clean up the SD* flags definitions and add a bunch
    of sanity checks. These new checks caught quite a few bugs or at
    least inconsistencies, resulting in another set of patches.
 
  - Rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
 
  - Add a new tracepoint to improve CPU capacity tracking
 
  - Improve overloaded SMP system load-balancing behavior
 
  - Tweak SMT balancing
 
  - Energy-aware scheduling updates
 
  - NUMA balancing improvements
 
  - Deadline scheduler fixes and improvements
 
  - CPU isolation fixes
 
  - Misc cleanups, simplifications and smaller optimizations.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EWRERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hV8A/7BB0nt/zYVZ8Z3Di8V0b9hMtr0d1xtRM5
 ZAvg4hcZl/fVgobFndxBw6KdlK8lSce9Mcq+bTTWeD46CS13cK5Vrpiaf7x7Q00P
 m8YHeYEH13ME0pbBrhDoRCR4XzfXukzjkUl7LiyrTekAvRUtFikJ/uKl8MeJtYGZ
 gANEkadqforxUW0v45iUEGepmCWAl8hSlSMb2mDKsVhw4DFMD+px0EBmmA0VDqjE
 e0rkh6dEoUVNqlic2KoaXULld1rLg1xiaOcLUbTAXnucfhmuv5p/H11AC4ABuf+s
 7d0zLrLEfZrcLJkthYxfMHs7DYMtARiQM9Db/a5hAq9Af4Z2bvvVAaHt3gCGvkV1
 llB6BB2yWCki9Qv7oiGOAhANnyJHG/cU4r6WwMuHdlYi4dFT/iN5qkOMUL1IrDgi
 a6ZzvECChXBeisQXHSlMd8Y5O+j0gRvDR7E18z2q0/PlmO8PGJq4w34mEWveWIg3
 LaVF16bmvaARuNFJTQH/zaHhjqVQANSMx5OIv9swp0OkwvQkw21ICYHG0YxfzWCr
 oa/FESEpOL9XdYp8UwMPI0bmVIsEfx79pmDMF3zInYTpJpwMUhV2yjHE8uYVMqEf
 7U8rZv7gdbZ2us38Gjf2l73hY+recp/GrgZKnk0R98OUeMk1l/iVP6dwco6ITUV5
 czGmKlIB1ec=
 =bXy6
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - reorganize & clean up the SD* flags definitions and add a bunch of
   sanity checks. These new checks caught quite a few bugs or at least
   inconsistencies, resulting in another set of patches.

 - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ

 - add a new tracepoint to improve CPU capacity tracking

 - improve overloaded SMP system load-balancing behavior

 - tweak SMT balancing

 - energy-aware scheduling updates

 - NUMA balancing improvements

 - deadline scheduler fixes and improvements

 - CPU isolation fixes

 - misc cleanups, simplifications and smaller optimizations

* tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  sched/deadline: Unthrottle PI boosted threads while enqueuing
  sched/debug: Add new tracepoint to track cpu_capacity
  sched/fair: Tweak pick_next_entity()
  rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
  rseq/selftests,x86_64: Add rseq_offset_deref_addv()
  rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
  sched/fair: Use dst group while checking imbalance for NUMA balancer
  sched/fair: Reduce busy load balance interval
  sched/fair: Minimize concurrent LBs between domain level
  sched/fair: Reduce minimal imbalance threshold
  sched/fair: Relax constraint on task's load during load balance
  sched/fair: Remove the force parameter of update_tg_load_avg()
  sched/fair: Fix wrong cpu selecting from isolated domain
  sched: Remove unused inline function uclamp_bucket_base_value()
  sched/rt: Disable RT_RUNTIME_SHARE by default
  sched/deadline: Fix stale throttling on de-/boosted tasks
  sched/numa: Use runnable_avg to classify node
  sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
  MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
  sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
  ...
2020-10-12 12:56:01 -07:00
Vincent Donnefort 51cf18c90c sched/debug: Add new tracepoint to track cpu_capacity
rq->cpu_capacity is a key element in several scheduler parts, such as EAS
task placement and load balancing. Tracking this value enables testing
and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
2020-10-03 16:30:52 +02:00
YueHaibing 51bd5121c4 sched: Remove unused inline function uclamp_bucket_base_value()
There is no caller in tree, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
2020-09-25 14:23:25 +02:00
Sebastian Andrzej Siewior c1cecf884a sched: Cache task_struct::flags in sched_submit_work()
sched_submit_work() is considered to be a hot path. The preempt_disable()
instruction is a compiler barrier and forces the compiler to load
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819200025.lqvmyefqnbok5i4f@linutronix.de
2020-08-26 12:41:58 +02:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Linus Torvalds 1195d58f00 Two fixes: fix a new tracepoint's output value, and fix the formatting of show-state syslog printouts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl83xXMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hwRQ/+LC7yzLFMy+OpvuRp/ZY02VtL7oZdCVAS
 QFYrvmelsPrfbOzfuevGEg5jCHfJ6sL6Q4O06O/ktMUSsQ1HNc+esbTpbea9L/8X
 ynpujYXDm2AwiYQS2Bh/jDQVIUqJRfyNVpYWgIWTUq4QULh248vx4LGGYk/LQJtD
 FmuHT/Hc2xIPc01gAY24npSrPOlTJEm9HsfSpFqinXkNFlyocvRc2VwBnI1q/Dxt
 NVT18/8gb5dpaB3kRJyjuyNz88wJj7Rh65I/NebW9vvWincQzt7OJOutjnx/BzGG
 k5hMo/oPwCBRlPZ5X1fbsEjv/vXsXYtByNtNMljP3yFaR42F+pZ+5ySYNTtzyya8
 BuicHMlrj+kueEXzfYIxcFaI0u0zZV9OCxNQI7T86j5YJyKj2c5xIvkj20r+4U3N
 4biuCawvGNyfbw5X8se9yy1EEsw36UaeKNpoMQKcdpGDVskj2POMcyC06qMqahXX
 /LcIwKyXDwCKbJOz+NOQNY4ZvJSS3kcCYfTmEcaBs7UR6gFRAlwfrh54SDGLp8au
 t6MEj5GI51RWjo8S0KFBhqg+1sNqdRw2mvcabeRX1vHb/ter3AcHi2of4bSoAF4E
 GRKK2gfAkmvGc7cLjHEWvSjUPBS/gQgzNMhnyyFL8fEiL/juY5fCLnamuajWEmnF
 k6LA71AwkNY=
 =ffEv
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Two fixes: fix a new tracepoint's output value, and fix the formatting
  of show-state syslog printouts"

* tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix the alignment of the show-state debug output
  sched: Fix use of count for nr_running tracepoint
2020-08-15 10:36:40 -07:00
Libing Zhou cc172ff301 sched/debug: Fix the alignment of the show-state debug output
Current sysrq(t) output task fields name are not aligned with
actual task fields value, e.g.:

	kernel: sysrq: Show State
	kernel:  task                        PC stack   pid father
	kernel: systemd         S12456     1      0 0x00000000
	kernel: Call Trace:
	kernel: ? __schedule+0x240/0x740

To make it more readable, print fields name together with task fields
value in the same line, with fixed width:

	kernel: sysrq: Show State
	kernel: task:systemd         state:S stack:12920 pid:    1 ppid:     0 flags:0x00000000
	kernel: Call Trace:
	kernel: __schedule+0x282/0x620

Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814030236.37835-1-libing.zhou@nokia-sbell.com
2020-08-14 12:36:18 +02:00
Linus Torvalds 6d2b84a4e5 This tree adds the sched_set_fifo*() encapsulation APIs to remove
static priority level knowledge from non-scheduler code.
 
 The three APIs for non-scheduler code to set SCHED_FIFO are:
 
  - sched_set_fifo()
  - sched_set_fifo_low()
  - sched_set_normal()
 
 These are two FIFO priority levels: default (high), and a 'low' priority level,
 plus sched_set_normal() to set the policy back to non-SCHED_FIFO.
 
 Since the changes affect a lot of non-scheduler code, we kept this in a separate
 tree.
 
 When merging to the latest upstream tree there's a conflict in drivers/spi/spi.c,
 which can be resolved via:
 
 	sched_set_fifo(ctlr->kworker_task);
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8pPQIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j0Jw/+LlSyX6gD2ATy3cizGL7DFPZogD5MVKTb
 IXbhXH/ACpuPQlBe1+haRLbJj6XfXqbOlAleVKt7eh+jZ1jYjC972RCSTO4566mJ
 0v8Iy9kkEeb2TDbYx1H3bnk78lf85t0CB+sCzyKUYFuTrXU04eRj7MtN3vAQyRQU
 xJg83x/sT5DGdDTP50sL7lpbwk3INWkD0aDCJEaO/a9yHElMsTZiZBKoXxN/s30o
 FsfzW56jqtng771H2bo8ERN7+abwJg10crQU5mIaLhacNMETuz0NZ/f8fY/fydCL
 Ju8HAdNKNXyphWkAOmixQuyYtWKe2/GfbHg8hld0jmpwxkOSTgZjY+pFcv7/w306
 g2l1TPOt8e1n5jbfnY3eig+9Kr8y0qHkXPfLfgRqKwMMaOqTTYixEzj+NdxEIRX9
 Kr7oFAv6VEFfXGSpb5L1qyjIGVgQ5/JE/p3OC3GHEsw5VKiy5yjhNLoSmSGzdS61
 1YurVvypSEUAn3DqTXgeGX76f0HH365fIKqmbFrUWxliF+YyflMhtrj2JFtejGzH
 Md3RgAzxusE9S6k3gw1ev4byh167bPBbY8jz0w3Gd7IBRKy9vo92h6ZRYIl6xeoC
 BU2To1IhCAydIr6hNsIiCSDTgiLbsYQzPuVVovUxNh+l1ZvKV2X+csEHhs8oW4pr
 4BRU7dKL2NE=
 =/7JH
 -----END PGP SIGNATURE-----

Merge tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull sched/fifo updates from Ingo Molnar:
 "This adds the sched_set_fifo*() encapsulation APIs to remove static
  priority level knowledge from non-scheduler code.

  The three APIs for non-scheduler code to set SCHED_FIFO are:

   - sched_set_fifo()
   - sched_set_fifo_low()
   - sched_set_normal()

  These are two FIFO priority levels: default (high), and a 'low'
  priority level, plus sched_set_normal() to set the policy back to
  non-SCHED_FIFO.

  Since the changes affect a lot of non-scheduler code, we kept this in
  a separate tree"

* tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched,tracing: Convert to sched_set_fifo()
  sched: Remove sched_set_*() return value
  sched: Remove sched_setscheduler*() EXPORTs
  sched,psi: Convert to sched_set_fifo_low()
  sched,rcutorture: Convert to sched_set_fifo_low()
  sched,rcuperf: Convert to sched_set_fifo_low()
  sched,locktorture: Convert to sched_set_fifo()
  sched,irq: Convert to sched_set_fifo()
  sched,watchdog: Convert to sched_set_fifo()
  sched,serial: Convert to sched_set_fifo()
  sched,powerclamp: Convert to sched_set_fifo()
  sched,ion: Convert to sched_set_normal()
  sched,powercap: Convert to sched_set_fifo*()
  sched,spi: Convert to sched_set_fifo*()
  sched,mmc: Convert to sched_set_fifo*()
  sched,ivtv: Convert to sched_set_fifo*()
  sched,drm/scheduler: Convert to sched_set_fifo*()
  sched,msm: Convert to sched_set_fifo*()
  sched,psci: Convert to sched_set_fifo*()
  sched,drbd: Convert to sched_set_fifo*()
  ...
2020-08-06 11:55:43 -07:00
Linus Torvalds e4cbce4d13 The main changes in this cycle were:
- Improve uclamp performance by using a static key for the fast path
 
  - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)
 
  - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the values
    become larger. This is now replaced with more precise arithmetics,
    using the new mul_u64_u64_div_u64() helper in math64.h.
 
  - Improve the deadline scheduler, such as making it capacity aware
 
  - Improve frequency-invariant scheduling
 
  - Misc cleanups in energy/power aware scheduling
 
  - Add sched_update_nr_running tracepoint to track changes to nr_running
 
  - Documentation additions and updates
 
  - Misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oJDURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ixLg//bqWzFlfWirvngTgDxDnplwUTyKXmMCcq
 R1IYhlyK2O5FxvhbRmdmW11W3yzyTPvgCs6Q/70negGaPNe2w1OxfxiK9NMKz5eu
 M1LoXas7pL5g7Pr/ZxxHk/8VqJLV4t9MkodiiInmV6lTaznT3sU6a/kpYQjJyFnG
 Tuu9jd6JhdRKmePDJnNmUBoGQ7JiOQDcX4HtkcQ3OA+An3624tmJzbW1yts+uj7J
 ZWo2EY60RfbA9MxQXGPOaR/nAjngWs4Q6tddAh10mftsPq1gR2iFUKju1d31MQt/
 RHLdiqJf+AyUC4popKG7a+7ilCKMBwPociSreTJNPyEUQ1X4AM3vUVk4yjUoiDph
 k2WdsCF8/JRdhXg0NnrpPUqOaAbQj53EeXnitEb92E7WyTZgLOvAtpV//xZo6utp
 2QHerfrQ9SoGQjz/ho78za5vQtV1x25yDhd+X4XV4QEhIy85G9/2JCpC/Kc/TXLf
 OO7A4X69XztKTEJhP60g8ldCPUe4N2vbh1vKY6oAD8AFQVVNZ6n7375/Qa//b0/k
 ++hcYkPc2EK97/aBFdvzDgqb7aUo7Mtn2ibke16sQU4szulaoRuAHQG4jdGKMwbD
 dk2VBoxyxeYFXWHsNneSe87+ha3sd0dSN0ul1EB/SlFrVELMvy634YXnMYGW8ima
 PzyPB0ezpuA=
 =PbO7
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve uclamp performance by using a static key for the fast path

 - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
   better power efficiency of RT tasks on battery powered devices.
   (The default is to maximize performance & reduce RT latencies.)

 - Improve utime and stime tracking accuracy, which had a fixed boundary
   of error, which created larger and larger relative errors as the
   values become larger. This is now replaced with more precise
   arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

 - Improve the deadline scheduler, such as making it capacity aware

 - Improve frequency-invariant scheduling

 - Misc cleanups in energy/power aware scheduling

 - Add sched_update_nr_running tracepoint to track changes to nr_running

 - Documentation additions and updates

 - Misc cleanups and smaller fixes

* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
  sched/doc: Document capacity aware scheduling
  sched: Document arch_scale_*_capacity()
  arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Add a new sysctl to control RT default boost value
  sched/uclamp: Fix a deadlock when enabling uclamp static key
  sched: Remove duplicated tick_nohz_full_enabled() check
  sched: Fix a typo in a comment
  sched/uclamp: Remove unnecessary mutex_init()
  arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
  sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
  arch_topology, sched/core: Cleanup thermal pressure definition
  trace/events/sched.h: fix duplicated word
  linux/sched/mm.h: drop duplicated words in comments
  smp: Fix a potential usage of stale nr_cpus
  sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
  sched: nohz: stop passing around unused "ticks" parameter.
  sched: Better document ttwu()
  sched: Add a tracepoint to track rq->nr_running
  ...
2020-08-03 14:58:38 -07:00
Qais Yousef 13685c4a08 sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Qais Yousef e65855a52b sched/uclamp: Fix a deadlock when enabling uclamp static key
The following splat was caught when setting uclamp value of a task:

  BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

   cpus_read_lock+0x68/0x130
   static_key_enable+0x1c/0x38
   __sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce227 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Qinglang Miao 13efa61612 sched/uclamp: Remove unnecessary mutex_init()
The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(),
it is unnecessary to initialize it runtime via mutex_init().

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
2020-07-25 12:10:36 +02:00
Chris Wilson 062d3f95b6 sched: Warn if garbage is passed to default_wake_function()
Since the default_wake_function() passes its flags onto
try_to_wake_up(), warn if those flags collide with internal values.

Given that the supplied flags are garbage, no repair can be done but at
least alert the user to the damage they are causing.

In the belief that these errors should be picked up during testing, the
warning is only compiled in under CONFIG_SCHED_DEBUG.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200723201042.18861-1-chris@chris-wilson.co.uk
2020-07-24 14:40:25 +02:00
Valentin Schneider 25980c7a79 arch_topology, sched/core: Cleanup thermal pressure definition
The following commit:

  14533a16c4 ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code")

moved the definition of arch_set_thermal_pressure() to sched/core.c, but
kept its declaration in linux/arch_topology.h. When building e.g. an x86
kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up
getting the declaration of arch_set_thermal_pressure() from
include/linux/arch_topology.h, which is somewhat awkward.

On top of this, sched/core.c unconditionally defines
o The thermal_pressure percpu variable
o arch_set_thermal_pressure()

while arch_scale_thermal_pressure() does nothing unless redefined by the
architecture.

arch_*() functions are meant to be defined by architectures, so revert the
aforementioned commit and re-implement it in a way that keeps
arch_set_thermal_pressure() architecture-definable, and doesn't define the
thermal pressure percpu variable for kernels that don't need
it (CONFIG_SCHED_THERMAL_PRESSURE=n).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
2020-07-22 10:22:05 +02:00
Peter Zijlstra 58877d347b sched: Better document ttwu()
Dave hit the problem fixed by commit:

  b6e13e8582 ("sched/core: Fix ttwu() race")

and failed to understand much of the code involved. Per his request a
few comments to (hopefully) clarify things.

Requested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200702125211.GQ4800@hirez.programming.kicks-ass.net
2020-07-22 10:22:03 +02:00
Peter Zijlstra 015dc08918 Merge branch 'sched/urgent' 2020-07-22 10:22:02 +02:00
Peter Zijlstra d136122f58 sched: Fix race against ptrace_freeze_trace()
There is apparently one site that violates the rule that only current
and ttwu() will modify task->state, namely ptrace_{,un}freeze_traced()
will change task->state for a remote task.

Oleg explains:

  "TASK_TRACED/TASK_STOPPED was always protected by siglock. In
particular, ttwu(__TASK_TRACED) must be always called with siglock
held. That is why ptrace_freeze_traced() assumes it can safely do
s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."

This breaks the ordering scheme introduced by commit:

  dbfb089d36 ("sched: Fix loadavg accounting race")

Specifically, the reload not matching no longer implies we don't have
to block.

Simply things by noting that what we need is a LOAD->STORE ordering
and this can be provided by a control dependency.

So replace:

	prev_state = prev->state;
	raw_spin_lock(&rq->lock);
	smp_mb__after_spinlock(); /* SMP-MB */
	if (... && prev_state && prev_state == prev->state)
		deactivate_task();

with:

	prev_state = prev->state;
	if (... && prev_state) /* CTRL-DEP */
		deactivate_task();

Since that already implies the 'prev->state' load must be complete
before allowing the 'prev->on_rq = 0' store to become visible.

Fixes: dbfb089d36 ("sched: Fix loadavg accounting race")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-22 10:22:00 +02:00
Phil Auld 9d246053a6 sched: Add a tracepoint to track rq->nr_running
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map.  The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:

  https://github.com/auldp/tracepoints-helpers.git

A script to turn the trace-cmd report output into a heatmap plot can be
found at:

  https://github.com/jirvoz/plot-nr-running

The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in
the header a wrapper call is used and the trace/events/sched.h include
is moved before sched.h in kernel/sched/core.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200629192303.GC120228@lorien.usersys.redhat.com
2020-07-08 11:39:02 +02:00
Qais Yousef 46609ce227 sched/uclamp: Protect uclamp fast path code with static key
There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

While investigating the root cause, there were no sign that the uclamp
code is doing anything particularly expensive but could suffer from bad
cache behavior under certain circumstances that are yet to be
understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that is
by default will skip executing uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.

As soon as the user start using util clamp by:

	1. Changing uclamp value of a task with sched_setattr()
	2. Modifying the default sysctl_sched_util_clamp_{min, max}
	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

We flip the static key now that the user has opted to use util clamp.
Effectively re-introducing uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.

SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end
up with unbalanced call to uclamp_rq_dec_id() if we flip the key while
a task is running in the rq. Since we know it is harmless we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_enabled() helper which takes
the static key into account to ensure RT boosting behavior is retained.

The following results demonstrates how this helps on 2 Sockets Xeon E5
2x10-Cores system.

                                   nouclamp                 uclamp      uclamp-static-key
Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:

     8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
     0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
     0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:

     8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
     0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
     0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
2020-07-08 11:39:01 +02:00
Qais Yousef d81ae8aac8 sched/uclamp: Fix initialization of struct uclamp_rq
struct uclamp_rq was zeroed out entirely in assumption that in the first
call to uclamp_rq_inc() they'd be initialized correctly in accordance to
default settings.

But when next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because the
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
2020-07-08 11:39:01 +02:00
Peter Zijlstra faa2fd7cba Merge branch 'sched/urgent' 2020-07-08 11:38:59 +02:00
Mathieu Desnoyers ce3614daab sched: Fix unreliable rseq cpu_id for new tasks
While integrating rseq into glibc and replacing glibc's sched_getcpu
implementation with rseq, glibc's tests discovered an issue with
incorrect __rseq_abi.cpu_id field value right after the first time
a newly created process issues sched_setaffinity.

For the records, it triggers after building glibc and running tests, and
then issuing:

  for x in {1..2000} ; do posix/tst-affinity-static  & done

and shows up as:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

This is caused by the scheduler invoking __set_task_cpu() directly from
sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
is done by set_task_cpu().

Add the missing rseq_migrate() to both functions. The only other direct
use of __set_task_cpu() is done by init_idle(), which does not involve a
user-space task.

Based on my testing with the glibc test-case, just adding rseq_migrate()
to wake_up_new_task() is sufficient to fix the observed issue. Also add
it to sched_fork() to keep things consistent.

The reason why this never triggered so far with the rseq/basic_test
selftest is unclear.

The current use of sched_getcpu(3) does not typically require it to be
always accurate. However, use of the __rseq_abi.cpu_id field within rseq
critical sections requires it to be accurate. If it is not accurate, it
can cause corruption in the per-cpu data targeted by rseq critical
sections in user-space.

Reported-By: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-By: Florian Weimer <fweimer@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
2020-07-08 11:38:50 +02:00
Peter Zijlstra dbfb089d36 sched: Fix loadavg accounting race
The recent commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

up before:

	smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change it's ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to avoid weak hardware from re-ordering things for us. This can fairly
easily be achieved by relying on the control-dependency already in
place.

The second problem, which makes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make this same true for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to he a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state after. However, we
know that if it changed, we no longer have to worry about the blocking
path. This gives us the required ordering, if we block, we did the
prev->state load before an (effective) smp_mb() and the p->on_rq store
needs not change.

With this we end up with the effective ordering:

	LOAD p->state           LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0       STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
2020-07-08 11:38:49 +02:00
Peter Zijlstra 8c4890d1c3 smp, irq_work: Continue smp_call_function*() and irq_work*() integration
Instead of relying on BUG_ON() to ensure the various data structures
line up, use a bunch of horrible unions to make it all automatic.

Much of the union magic is to ensure irq_work and smp_call_function do
not (yet) see the members of their respective data structures change
name.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20200622100825.844455025@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra 739f70b476 sched/core: s/WF_ON_RQ/WQ_ON_CPU/
Use a better name for this poorly named flag, to avoid confusion...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200622100825.785115830@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra b6e13e8582 sched/core: Fix ttwu() race
Paul reported rcutorture occasionally hitting a NULL deref:

  sched_ttwu_pending()
    ttwu_do_wakeup()
      check_preempt_curr() := check_preempt_wakeup()
        find_matching_se()
          is_same_group()
            if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

  2ebb177175 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existant
parents and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

					X->cpu = 1
					rq(1)->curr = X

	CPU0				CPU1				CPU2

					// switch away from X
					LOCK rq(1)->lock
					smp_mb__after_spinlock
					dequeue_task(X)
					  X->on_rq = 9
					switch_to(Z)
					  X->on_cpu = 0
					UNLOCK rq(1)->lock

									// migrate X to cpu 0
									LOCK rq(1)->lock
									dequeue_task(X)
									set_task_cpu(X, 0)
									  X->cpu = 0
									UNLOCK rq(1)->lock

									LOCK rq(0)->lock
									enqueue_task(X)
									  X->on_rq = 1
									UNLOCK rq(0)->lock

	// switch to X
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	switch_to(X)
	  X->on_cpu = 1
	UNLOCK rq(0)->lock

	// X goes sleep
	X->state = TASK_UNINTERRUPTIBLE
	smp_mb();			// wake X
					ttwu()
					  LOCK X->pi_lock
					  smp_mb__after_spinlock

					  if (p->state)

					  cpu = X->cpu; // =? 1

					  smp_rmb()

	// X calls schedule()
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	dequeue_task(X)
	  X->on_rq = 0

					  if (p->on_rq)

					  smp_rmb();

					  if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

					  smp_cond_load_acquire(&p->on_cpu, !VAL)

					  cpu = select_task_rq(X, X->wake_cpu, ...)
					  if (X->cpu != cpu)
	switch_to(Y)
	  X->on_cpu = 0
	UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net
2020-06-28 17:01:20 +02:00
Juri Lelli 740797ce3a sched/core: Fix PI boosting between RT and DEADLINE tasks
syzbot reported the following warning:

 WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504

At deadline.c:628 we have:

 623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 624 {
 625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 626 	struct rq *rq = rq_of_dl_rq(dl_rq);
 627
 628 	WARN_ON(dl_se->dl_boosted);
 629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
        [...]
     }

Which means that setup_new_dl_entity() has been called on a task
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
condition.

Digging through the PI code I noticed that what above might in fact happen
if an RT tasks blocks on an rt_mutex hold by a DEADLINE task. In the
first branch of boosting conditions we check only if a pi_task 'dynamic'
deadline is earlier than mutex holder's and in this case we set mutex
holder to be dl_boosted. However, since RT 'dynamic' deadlines are only
initialized if such tasks get boosted at some point (or if they become
DEADLINE of course), in general RT 'dynamic' deadlines are usually equal
to 0 and this verifies the aforementioned condition.

Fix it by checking that the potential donor task is actually (even if
temporary because in turn boosted) running at DEADLINE priority before
using its 'dynamic' deadline value.

Fixes: 2d3d891d33 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
2020-06-28 17:01:20 +02:00
Scott Wood fd844ba9ae sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
2020-06-28 17:01:20 +02:00
Kirill Tkhai aa93cd53bc sched: Micro optimization in pick_next_task() and in check_preempt_curr()
This introduces an optimization based on xxx_sched_class addresses
in two hot scheduler functions: pick_next_task() and check_preempt_curr().

It is possible to compare pointers to sched classes to check, which
of them has a higher priority, instead of current iterations using
for_each_class().

One more result of the patch is that size of object file becomes a little
less (excluding added BUG_ON(), which goes in __init section):

$size kernel/sched/core.o
         text     data      bss	    dec	    hex	filename
before:  66446    18957	    676	  86079	  1503f	kernel/sched/core.o
after:   66398    18957	    676	  86031	  1500f	kernel/sched/core.o

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/711a9c4b-ff32-1136-b848-17c622d548f3@yandex.ru
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware) c3a340f7e7 sched: Have sched_class_highest define by vmlinux.lds.h
Now that the sched_class descriptors are defined by the linker script, and
this needs to be aware of the existance of stop_sched_class when SMP is
enabled or not, as it is used as the "highest" priority when defined. Move
the declaration of sched_class_highest to the same location in the linker
script that inserts stop_sched_class, and this will also make it easier to
see what should be defined as the highest class, as this linker script
location defines the priorities as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.682913590@goodmis.org
2020-06-25 13:45:44 +02:00
Peter Zijlstra 8b700983de sched: Remove sched_set_*() return value
Ingo suggested that since the new sched_set_*() functions are
implemented using the 'nocheck' variants, they really shouldn't ever
fail, so remove the return value.

Cc: axboe@kernel.dk
Cc: daniel.lezcano@linaro.org
Cc: sudeep.holla@arm.com
Cc: airlied@redhat.com
Cc: broonie@kernel.org
Cc: paulmck@kernel.org
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2020-06-15 14:10:26 +02:00
Peter Zijlstra 616d91b68c sched: Remove sched_setscheduler*() EXPORTs
Now that nothing (modular) still uses sched_setscheduler(), remove the
exports.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2020-06-15 14:10:25 +02:00
Peter Zijlstra 7318d4cc14 sched: Provide sched_set_fifo()
SCHED_FIFO (or any static priority scheduler) is a broken scheduler
model; it is fundamentally incapable of resource management, the one
thing an OS is actually supposed to do.

It is impossible to compose static priority workloads. One cannot take
two well designed and functional static priority workloads and mash
them together and still expect them to work.

Therefore it doesn't make sense to expose the priority field; the
kernel is fundamentally incapable of setting a sensible value, it
needs systems knowledge that it doesn't have.

Take away sched_setschedule() / sched_setattr() from modules and
replace them with:

  - sched_set_fifo(p); create a FIFO task (at prio 50)
  - sched_set_fifo_low(p); create a task higher than NORMAL,
	which ends up being a FIFO task at prio 1.
  - sched_set_normal(p, nice); (re)set the task to normal

This stops the proliferation of randomly chosen, and irrelevant, FIFO
priorities that dont't really mean anything anyway.

The system administrator/integrator, whoever has insight into the
actual system design and requirements (userspace) can set-up
appropriate priorities if and when needed.

Cc: airlied@redhat.com
Cc: alexander.deucher@amd.com
Cc: awalls@md.metrocast.net
Cc: axboe@kernel.dk
Cc: broonie@kernel.org
Cc: daniel.lezcano@linaro.org
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: herbert@gondor.apana.org.au
Cc: hverkuil@xs4all.nl
Cc: john.stultz@linaro.org
Cc: nico@fluxnic.net
Cc: paulmck@kernel.org
Cc: rafael.j.wysocki@intel.com
Cc: rmk+kernel@arm.linux.org.uk
Cc: sudeep.holla@arm.com
Cc: tglx@linutronix.de
Cc: ulf.hansson@linaro.org
Cc: wim@linux-watchdog.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-15 14:10:20 +02:00
Vincent Donnefort 4581bea8b4 sched/debug: Add new tracepoints to track util_est
The util_est signals are key elements for EAS task placement and
frequency selection. Having tracepoints to track these signals enables
load-tracking and schedutil testing and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1590597554-370150-1-git-send-email-vincent.donnefort@arm.com
2020-06-15 14:10:02 +02:00
Dietmar Eggemann 0900acf2d8 sched/core: Remove redundant 'preempt' param from sched_class->yield_to_task()
Commit 6d1cafd8b5 ("sched: Resched proper CPU on yield_to()") moved
the code to resched the CPU from yield_to_task_fair() to yield_to()
making the preempt parameter in sched_class->yield_to_task()
unnecessary. Remove it. No other sched_class implements yield_to_task().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-3-dietmar.eggemann@arm.com
2020-06-15 14:10:01 +02:00
Dmitry Safonov 9cb8f069de kernel: rename show_stack_loglvl() => show_stack()
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dmitry Safonov 8ba09b1dc1 sched: print stack trace with KERN_INFO
Aligning with other messages printed in sched_show_task() - use KERN_INFO
to print the backtrace.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-49-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:12 -07:00
Dmitry Safonov 2062a4e8ae kallsyms/printk: add loglvl to print_ip_sym()
Patch series "Add log level to show_stack()", v3.

Add log level argument to show_stack().

Done in three stages:
1. Introducing show_stack_loglvl() for every architecture
2. Migrating old users with an explicit log level
3. Renaming show_stack_loglvl() into show_stack()

Justification:

- It's a design mistake to move a business-logic decision into platform
  realization detail.

- I have currently two patches sets that would benefit from this work:
  Removing console_loglevel jumps in sysrq driver [1] Hung task warning
  before panic [2] - suggested by Tetsuo (but he probably didn't realise
  what it would involve).

- While doing (1), (2) the backtraces were adjusted to headers and other
  messages for each situation - so there won't be a situation when the
  backtrace is printed, but the headers are missing because they have
  lesser log level (or the reverse).

- As the result in (2) plays with console_loglevel for kdb are removed.

The least important for upstream, but maybe still worth to note that every
company I've worked in so far had an off-list patch to print backtrace
with the needed log level (but only for the architecture they cared
about).  If you have other ideas how you will benefit from show_stack()
with a log level - please, reply to this cover letter.

See also discussion on v1:
https://lore.kernel.org/linux-riscv/20191106083538.z5nlpuf64cigxigh@pathway.suse.cz/

This patch (of 50):

print_ip_sym() needs to have a log level parameter to comply with other
parts being printed.  Otherwise, half of the expected backtrace would be
printed and other may be missing with some logging level.

The following callee(s) are using now the adjusted log level:
- microblaze/unwind: the same level as headers & userspace unwind.
  Note that pr_debug()'s there are for debugging the unwinder itself.
- nds32/traps: symbol addresses are printed with the same log level
  as backtrace headers.
- lockdep: ip for locking issues is printed with the same log level
  as other part of the warning.
- sched: ip where preemption was disabled is printed as error like
  the rest part of the message.
- ftrace: bug reports are now consistent in the log level being used.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Link: http://lkml.kernel.org/r/20200418201944.482088-2-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:10 -07:00
Linus Torvalds cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds d479c5a191 The changes in this cycle are:
- Optimize the task wakeup CPU selection logic, to improve scalability and
    reduce wakeup latency spikes
 
  - PELT enhancements
 
  - CFS bandwidth handling fixes
 
  - Optimize the wakeup path by remove rq->wake_list and replacing it with ->ttwu_pending
 
  - Optimize IPI cross-calls by making flush_smp_call_function_queue()
    process sync callbacks first.
 
  - Misc fixes and enhancements.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7WPL0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i0ThAAs0fbvMzNJ5SWFdwOQ4KZIlA+Im4dEBMK
 sx/XAZqa/hGxvkm1jS0RDVQl1V1JdOlru5UF4C42ctnAFGtBBHDriO5rn9oCpkSw
 DAoLc4eZqzldIXN6sDZ0xMtC14Eu15UAP40OyM4qxBc4GqGlOnnale6Vhn+n+pLQ
 jAuZlMJIkmmzeA6cuvtultevrVh+QUqJ/5oNUANlTER4OM48umjr5rNTOb8cIW53
 9K3vbS3nmqSvJuIyqfRFoMy5GFM6+Jj2+nYuq8aTuYLEtF4qqWzttS3wBzC9699g
 XYRKILkCK8ZP4RB5Ps/DIKj6maZGZoICBxTJEkIgXujJlxlKKTD3mddk+0LBXChW
 Ijznanxn67akoAFpqi/Dnkhieg7cUrE9v1OPRS2J0xy550synSPFcSgOK3viizga
 iqbjptY4scUWkCwHQNjABerxc7MWzrwbIrRt+uNvCaqJLweUh0GnEcV5va8R+4I8
 K20XwOdrzuPLo5KdDWA/BKOEv49guHZDvoykzlwMlR3gFfwHS/UsjzmSQIWK3gZG
 9OMn8ibO2f1OzhRcEpDLFzp7IIj6NJmPFVSW+7xHyL9/vTveUx3ZXPLteb2qxJVP
 BYPsduVx8YeGRBlLya0PJriB23ajQr0lnHWo15g0uR9o/0Ds1ephcymiF3QJmCaA
 To3CyIuQN8M=
 =C2OP
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "The changes in this cycle are:

   - Optimize the task wakeup CPU selection logic, to improve
     scalability and reduce wakeup latency spikes

   - PELT enhancements

   - CFS bandwidth handling fixes

   - Optimize the wakeup path by remove rq->wake_list and replacing it
     with ->ttwu_pending

   - Optimize IPI cross-calls by making flush_smp_call_function_queue()
     process sync callbacks first.

   - Misc fixes and enhancements"

* tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
  sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
  sched: Replace rq::wake_list
  sched: Add rq::ttwu_pending
  irq_work, smp: Allow irq_work on call_single_queue
  smp: Optimize send_call_function_single_ipi()
  smp: Move irq_work_run() out of flush_smp_call_function_queue()
  smp: Optimize flush_smp_call_function_queue()
  sched: Fix smp_call_function_single_async() usage for ILB
  sched/core: Offload wakee task activation if it the wakee is descheduling
  sched/core: Optimize ttwu() spinning on p->on_cpu
  sched: Defend cfs and rt bandwidth quota against overflow
  sched/cpuacct: Fix charge cpuacct.usage_sys
  sched/fair: Replace zero-length array with flexible-array
  sched/pelt: Sync util/runnable_sum with PELT window when propagating
  sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
  sched/fair: Optimize enqueue_task_fair()
  sched: Make scheduler_ipi inline
  sched: Clean up scheduler_ipi()
  sched/core: Simplify sched_init()
  ...
2020-06-03 13:06:42 -07:00
Linus Torvalds 533b220f7b arm64 updates for 5.8
- Branch Target Identification (BTI)
 	* Support for ARMv8.5-BTI in both user- and kernel-space. This
 	  allows branch targets to limit the types of branch from which
 	  they can be called and additionally prevents branching to
 	  arbitrary code, although kernel support requires a very recent
 	  toolchain.
 
 	* Function annotation via SYM_FUNC_START() so that assembly
 	  functions are wrapped with the relevant "landing pad"
 	  instructions.
 
 	* BPF and vDSO updates to use the new instructions.
 
 	* Addition of a new HWCAP and exposure of BTI capability to
 	  userspace via ID register emulation, along with ELF loader
 	  support for the BTI feature in .note.gnu.property.
 
 	* Non-critical fixes to CFI unwind annotations in the sigreturn
 	  trampoline.
 
 - Shadow Call Stack (SCS)
 	* Support for Clang's Shadow Call Stack feature, which reserves
 	  platform register x18 to point at a separate stack for each
 	  task that holds only return addresses. This protects function
 	  return control flow from buffer overruns on the main stack.
 
 	* Save/restore of x18 across problematic boundaries (user-mode,
 	  hypervisor, EFI, suspend, etc).
 
 	* Core support for SCS, should other architectures want to use it
 	  too.
 
 	* SCS overflow checking on context-switch as part of the existing
 	  stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
 
 - CPU feature detection
 	* Removed numerous "SANITY CHECK" errors when running on a system
 	  with mismatched AArch32 support at EL1. This is primarily a
 	  concern for KVM, which disabled support for 32-bit guests on
 	  such a system.
 
 	* Addition of new ID registers and fields as the architecture has
 	  been extended.
 
 - Perf and PMU drivers
 	* Minor fixes and cleanups to system PMU drivers.
 
 - Hardware errata
 	* Unify KVM workarounds for VHE and nVHE configurations.
 
 	* Sort vendor errata entries in Kconfig.
 
 - Secure Monitor Call Calling Convention (SMCCC)
 	* Update to the latest specification from Arm (v1.2).
 
 	* Allow PSCI code to query the SMCCC version.
 
 - Software Delegated Exception Interface (SDEI)
 	* Unexport a bunch of unused symbols.
 
 	* Minor fixes to handling of firmware data.
 
 - Pointer authentication
 	* Add support for dumping the kernel PAC mask in vmcoreinfo so
 	  that the stack can be unwound by tools such as kdump.
 
 	* Simplification of key initialisation during CPU bringup.
 
 - BPF backend
 	* Improve immediate generation for logical and add/sub
 	  instructions.
 
 - vDSO
 	- Minor fixes to the linker flags for consistency with other
 	  architectures and support for LLVM's unwinder.
 
 	- Clean up logic to initialise and map the vDSO into userspace.
 
 - ACPI
 	- Work around for an ambiguity in the IORT specification relating
 	  to the "num_ids" field.
 
 	- Support _DMA method for all named components rather than only
 	  PCIe root complexes.
 
 	- Minor other IORT-related fixes.
 
 - Miscellaneous
 	* Initialise debug traps early for KGDB and fix KDB cacheflushing
 	  deadlock.
 
 	* Minor tweaks to early boot state (documentation update, set
 	  TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
 yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
 JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
 HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
 znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
 =w3qi
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.

  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support

  Branch Target Identification (BTI):

   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.

   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.

   - BPF and vDSO updates to use the new instructions.

   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.

   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.

  Shadow Call Stack (SCS):

   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.

   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).

   - Core support for SCS, should other architectures want to use it
     too.

   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

  CPU feature detection:

   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.

   - Addition of new ID registers and fields as the architecture has
     been extended.

  Perf and PMU drivers:

   - Minor fixes and cleanups to system PMU drivers.

  Hardware errata:

   - Unify KVM workarounds for VHE and nVHE configurations.

   - Sort vendor errata entries in Kconfig.

  Secure Monitor Call Calling Convention (SMCCC):

   - Update to the latest specification from Arm (v1.2).

   - Allow PSCI code to query the SMCCC version.

  Software Delegated Exception Interface (SDEI):

   - Unexport a bunch of unused symbols.

   - Minor fixes to handling of firmware data.

  Pointer authentication:

   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.

   - Simplification of key initialisation during CPU bringup.

  BPF backend:

   - Improve immediate generation for logical and add/sub instructions.

  vDSO:

   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.

   - Clean up logic to initialise and map the vDSO into userspace.

  ACPI:

   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.

   - Support _DMA method for all named components rather than only PCIe
     root complexes.

   - Minor other IORT-related fixes.

  Miscellaneous:

   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.

   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ...
2020-06-01 15:18:27 -07:00
Ingo Molnar 1f8db41505 sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
Move the prototypes for sched_ttwu_pending() and send_call_function_single_ipi()
into the newly created kernel/sched/smp.h header, to make sure they are all
the same, and to architectures happy that use -Wmissing-prototypes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 11:03:20 +02:00
Peter Zijlstra a148866489 sched: Replace rq::wake_list
The recent commit: 90b5363acd ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in ttwu_queue_remote() got this wrong.

While on first reading ttwu_queue_remote() has an atomic test-and-set
that appears to serialize the use, the matching 'release' is not in
the right place to actually guarantee this serialization.

The actual race is vs the sched_ttwu_pending() call in the idle loop;
that can run the wakeup-list without consuming the CSD.

Instead of trying to chain the lists, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
2020-05-28 10:54:16 +02:00
Peter Zijlstra 126c2092e5 sched: Add rq::ttwu_pending
In preparation of removing rq->wake_list, replace the
!list_empty(rq->wake_list) with rq->ttwu_pending. This is not fully
equivalent as this new variable is racy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org
2020-05-28 10:54:16 +02:00
Peter Zijlstra b2a02fc43a smp: Optimize send_call_function_single_ipi()
Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG
to avoid sending IPIs to idle CPUs.

[ mingo: Fix UP build bug. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.953304789@infradead.org
2020-05-28 10:54:15 +02:00
Peter Zijlstra 19a1f5ec69 sched: Fix smp_call_function_single_async() usage for ILB
The recent commit: 90b5363acd ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in kick_ilb() got this wrong.

While on first reading kick_ilb() has an atomic test-and-set that
appears to serialize the use, the matching 'release' is not in the
right place to actually guarantee this serialization.

Rework the nohz_idle_balance() trigger so that the release is in the
IPI callback and thus guarantees the required serialization for the
CSD.

Fixes: 90b5363acd ("sched: Clean up scheduler_ipi()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20200526161907.778543557@infradead.org
2020-05-28 10:54:15 +02:00
Ingo Molnar 58ef57b16d Merge branch 'core/rcu' into sched/core, to pick up dependency
We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:

  806f04e9fd2c: ("rcu: Allow for smp_call_function() running callbacks from idle")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:52:53 +02:00
Mel Gorman 2ebb177175 sched/core: Offload wakee task activation if it the wakee is descheduling
The previous commit:

  c6e7bd7afaeb: ("sched/core: Optimize ttwu() spinning on p->on_cpu")

avoids spinning on p->on_rq when the task is descheduling, but only if the
wakee is on a CPU that does not share cache with the waker.

This patch offloads the activation of the wakee to the CPU that is about to
go idle if the task is the only one on the runqueue. This potentially allows
the waker task to continue making progress when the wakeup is not strictly
synchronous.

This is very obvious with netperf UDP_STREAM running on localhost. The
waker is sending packets as quickly as possible without waiting for any
reply. It frequently wakes the server for the processing of packets and
when netserver is using local memory, it quickly completes the processing
and goes back to idle. The waker often observes that netserver is on_rq
and spins excessively leading to a drop in throughput.

This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
localwakelist-v1r2 respectively.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups, it only applies
to the case where the waker often spins on p->on_rq.

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size:

   -----------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
   -----------------------------------------------------------------------------------------------------------------

  vanilla
    netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
    netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

  localwakelist-v1r2
    netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
    netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
   -----------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla
kernel and with the two patches applied. However, there are significant
outliers with the vanilla kernel with the maximum one measured as 1615
milliseconds with a vanilla kernel but never worse than 0.362 ms with
both patches applied and a much higher rate of context switching.

Similarly a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up() with almost half of the cycles spent
on spinning on p->on_rq. With the two patches, the percentage of cycles
spent in try_to_wake_up() drops to 1.13%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net
2020-05-25 07:04:10 +02:00
Peter Zijlstra c6e7bd7afa sched/core: Optimize ttwu() spinning on p->on_cpu
Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared, this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
2020-05-25 07:01:44 +02:00
Huaixin Chang d505b8af58 sched: Defend cfs and rt bandwidth quota against overflow
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.

to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19 20:34:14 +02:00
Thomas Gleixner 1ed0948eea Merge tag 'noinstr-lds-2020-05-19' into core/rcu
Get the noinstr section and annotation markers to base the RCU parts on.
2020-05-19 15:50:34 +02:00
Will Deacon 88485be531 scs: Move scs_overflow_check() out of architecture code
There is nothing architecture-specific about scs_overflow_check() as
it's just a trivial wrapper around scs_corrupted().

For parity with task_stack_end_corrupted(), rename scs_corrupted() to
task_scs_end_corrupted() and call it from schedule_debug() when
CONFIG_SCHED_STACK_END_CHECK_is enabled, which better reflects its
purpose as a debug feature to catch inadvertent overflow of the SCS.
Finally, remove the unused scs_overflow_check() function entirely.

This has absolutely no impact on architectures that do not support SCS
(currently arm64 only).

Tested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18 17:47:40 +01:00
Sami Tolvanen d08b9f0ca6 scs: Add support for Clang's Shadow Call Stack (SCS)
This change adds generic support for Clang's Shadow Call Stack,
which uses a shadow stack to protect return addresses from being
overwritten by an attacker. Details are available here:

  https://clang.llvm.org/docs/ShadowCallStack.html

Note that security guarantees in the kernel differ from the ones
documented for user space. The kernel must store addresses of
shadow stacks in memory, which means an attacker capable reading
and writing arbitrary memory may be able to locate them and hijack
control flow by modifying the stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[will: Numerous cosmetic changes]
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:45 +01:00
Thomas Gleixner 2a0a24ebb4 sched: Make scheduler_ipi inline
Now that the scheduler IPI is trivial and simple again there is no point to
have the little function out of line. This simplifies the effort of
constraining the instrumentation nicely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134058.453581595@linutronix.de
2020-05-12 17:10:49 +02:00
Peter Zijlstra (Intel) 90b5363acd sched: Clean up scheduler_ipi()
The scheduler IPI has grown weird and wonderful over the years, time
for spring cleaning.

Move all the non-trivial stuff out of it and into a regular smp function
call IPI. This then reduces the schedule_ipi() to most of it's former NOP
glory and ensures to keep the interrupt vector lean and mean.

Aside of that avoiding the full irq_enter() in the x86 IPI implementation
is incorrect as scheduler_ipi() can be instrumented. To work around that
scheduler_ipi() had an irq_enter/exit() hack when heavy work was
pending. This is gone now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134058.361859938@linutronix.de
2020-05-12 17:10:48 +02:00
Wei Yang b1d1779e5e sched/core: Simplify sched_init()
Currently root_task_group.shares and cfs_bandwidth are initialized for
each online cpu, which not necessary.

Let's take it out to do it only once.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200423214443.29994-1-richard.weiyang@gmail.com
2020-04-30 20:14:42 +02:00
Peter Zijlstra bf2c59fce4 sched/core: Fix illegal RCU from offline CPUs
In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.

Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

  WARNING: suspicious RCU usage
  -----------------------------
  kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

  other info that might help us debug this:

  RCU used illegally from offline CPU!
  Call Trace:
   dump_stack+0xf4/0x164 (unreliable)
   lockdep_rcu_suspicious+0x140/0x164
   get_work_pool+0x110/0x150
   __queue_work+0x1bc/0xca0
   queue_work_on+0x114/0x120
   css_release+0x9c/0xc0
   percpu_ref_put_many+0x204/0x230
   free_pcp_prepare+0x264/0x570
   free_unref_page+0x38/0xf0
   __mmdrop+0x21c/0x2c0
   idle_task_exit+0x170/0x1b0
   pnv_smp_cpu_kill_self+0x38/0x2e0
   cpu_die+0x48/0x64
   arch_cpu_idle_dead+0x30/0x50
   do_idle+0x2f4/0x470
   cpu_startup_entry+0x38/0x40
   start_secondary+0x7a8/0xa80
   start_secondary_resume+0x10/0x14

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
2020-04-30 20:14:41 +02:00
Chen Yu 457d1f4657 sched: Extract the task putting code from pick_next_task()
Introduce a new function put_prev_task_balance() to do the balance
when necessary, and then put previous task back to the run queue.
This function is extracted from pick_next_task() to prepare for
future usage by other type of task picking logic.

No functional change.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/5a99860cf66293db58a397d6248bcb2eee326776.1587464698.git.yu.c.chen@intel.com
2020-04-30 20:14:40 +02:00
Daniel Borkmann 0b54142e4b Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28 21:23:38 +02:00
Paul E. McKenney 2beaf3280e sched/core: Add function to sample state of locked-down task
A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be done
immediately and directly.  In practice, the task might start running at
any time, including during the sampling period.  Gaining a consistent
sample of a not-running task therefore requires that something be done
to lock down the target task's state.

This commit therefore adds a try_invoke_on_locked_down_task() function
that invokes a specified function if the specified task can be locked
down, returning true if successful and if the specified function returns
true.  Otherwise this function simply returns false.  Given that the
function passed to try_invoke_on_nonrunning_task() might be invoked with
a runqueue lock held, that function had better be quite lightweight.

The function is passed the target task's task_struct pointer and the
argument passed to try_invoke_on_locked_down_task(), allowing easy access
to task state and to a location for further variables to be passed in
and out.

Note that the specified function will be called even if the specified
task is currently running.  The function can use ->on_rq and task_curr()
to quickly and easily determine the task's state, and can return false
if this state is not to the function's liking.  The caller of the
try_invoke_on_locked_down_task() would then see the false return value,
and could take appropriate action, for example, trying again later or
sending an IPI if matters are more urgent.

It is expected that use cases such as the RCU CPU stall warning code will
simply return false if the task is currently running.  However, there are
use cases involving nohz_full CPUs where the specified function might
instead fall back to an alternative sampling scheme that relies on heavier
synchronization (such as memory barriers) in the target task.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
[ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Christoph Hellwig 32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Quentin Perret eaf5a92ebd sched/core: Fix reset-on-fork from RT with uclamp
uclamp_fork() resets the uclamp values to their default when the
reset-on-fork flag is set. It also checks whether the task has a RT
policy, and sets its uclamp.min to 1024 accordingly. However, during
reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
hence leading to an erroneous uclamp.min setting for the new task if it
was forked from RT.

Fix this by removing the unnecessary check on rt_task() in
uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
set.

Fixes: 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks")
Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
2020-04-22 23:10:13 +02:00
Vincent Donnefort 275b2f6723 sched/core: Remove unused rq::last_load_update_tick
The following commit:

  5e83eafbfd ("sched/fair: Remove the rq->cpu_load[] update code")

eliminated the last use case for rq->last_load_update_tick, so remove
the field as well.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1584710495-308969-1-git-send-email-vincent.donnefort@arm.com
2020-04-08 11:35:23 +02:00
Sebastian Andrzej Siewior 62849a9612 workqueue: Remove the warning in wq_worker_sleeping()
The kernel test robot triggered a warning with the following race:
   task-ctx A                            interrupt-ctx B
 worker
  -> process_one_work()
    -> work_item()
      -> schedule();
         -> sched_submit_work()
           -> wq_worker_sleeping()
             -> ->sleeping = 1
               atomic_dec_and_test(nr_running)
         __schedule();                *interrupt*
                                       async_page_fault()
                                       -> local_irq_enable();
                                       -> schedule();
                                          -> sched_submit_work()
                                            -> wq_worker_sleeping()
                                               -> if (WARN_ON(->sleeping)) return
                                          -> __schedule()
                                            ->  sched_update_worker()
                                              -> wq_worker_running()
                                                 -> atomic_inc(nr_running);
                                                 -> ->sleeping = 0;

      ->  sched_update_worker()
        -> wq_worker_running()
          if (!->sleeping) return

In this context the warning is pointless everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 > 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) and incremented once (only by B, A
will skip it). This is the case until the ->sleeping is zeroed again in
wq_worker_running().

Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.

Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
2020-04-08 11:35:20 +02:00
Valentin Schneider d76343c6b2 sched/fair: Align rq->avg_idle and rq->avg_scan_cost
sched/core.c uses update_avg() for rq->avg_idle and sched/fair.c uses an
open-coded version (with the exact same decay factor) for
rq->avg_scan_cost. On top of that, select_idle_cpu() expects to be able to
compare these two fields.

The only difference between the two is that rq->avg_scan_cost is computed
using a pure division rather than a shift. Turns out it actually matters,
first of all because the shifted value can be negative, and the standard
has this to say about it:

  """
  The result of E1 >> E2 is E1 right-shifted E2 bit positions. [...] If E1
  has a signed type and a negative value, the resulting value is
  implementation-defined.
  """

Not only this, but (arithmetic) right shifting a negative value (using 2's
complement) is *not* equivalent to dividing it by the corresponding power
of 2. Let's look at a few examples:

  -4      -> 0xF..FC
  -4 >> 3 -> 0xF..FF == -1 != -4 / 8

  -8      -> 0xF..F8
  -8 >> 3 -> 0xF..FF == -1 == -8 / 8

  -9      -> 0xF..F7
  -9 >> 3 -> 0xF..FE == -2 != -9 / 8

Make update_avg() use a division, and export it to the private scheduler
header to reuse it where relevant. Note that this still lets compilers use
a shift here, but should prevent any unwanted surprise. The disassembly of
select_idle_cpu() remains unchanged on arm64, and ttwu_do_wakeup() gains 2
instructions; the diff sort of looks like this:

  - sub x1, x1, x0
  + subs x1, x1, x0 // set condition codes
  + add x0, x1, #0x7
  + csel x0, x0, x1, mi // x0 = x1 < 0 ? x0 : x1
    add x0, x3, x0, asr #3

which does the right thing (i.e. gives us the expected result while still
using an arithmetic shift)

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200330090127.16294-1-valentin.schneider@arm.com
2020-04-08 11:35:18 +02:00
Linus Torvalds 992a1a3b45 CPU (hotplug) updates:
- Support for locked CSD objects in smp_call_function_single_async()
     which allows to simplify callsites in the scheduler core and MIPS
 
   - Treewide consolidation of CPU hotplug functions which ensures the
     consistency between the sysfs interface and kernel state. The low level
     functions cpu_up/down() are now confined to the core code and not
     longer accessible from random code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B9VQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodCyD/0WFYAe7LkOfNjkbLa0IeuyLjF9rnCi
 ilcSXMLpaVwwoQvm7MopwkXUDdmEIyeJ0B641j3mC3AKCRap4+O36H2IEg2byrj7
 twOvQNCfxpVVmCCD11FTH9aQa74LEB6AikTgjevhrRWj6eHsal7c2Ak26AzCgrt+
 0eEkOAOWJbLAlbIiPdHlCZ3TMldcs3gg+lRSYd5QCGQVkZFnwpXzyOvpyJEUGGbb
 R/JuvwJoLhRMiYAJDILoQQQg/J07ODuivse/R8PWaH2djkn+2NyRGrD794PhyyOg
 QoTU0ZrYD3Z48ACXv+N3jLM7wXMcFzjYtr1vW1E3O/YGA7GVIC6XHGbMQ7tEihY0
 ajtwq8DcnpKtuouviYnf7NuKgqdmJXkaZjz3Gms6n8nLXqqSVwuQELWV2CXkxNe6
 9kgnnKK+xXMOGI4TUhN8bejvkXqRCmKMeQJcWyf+7RA9UIhAJw5o7WGo8gXfQWUx
 tazCqDy/inYjqGxckW615fhi2zHfemlYTbSzIGOuMB1TEPKFcrgYAii/VMsYHQVZ
 5amkYUXGQ5brlCOzOn38lzp5OkALBnFzD7xgvOcQgWT3ynVpdqADfBytXiEEHh4J
 KSkSgSSRcS58397nIxnDcJgJouHLvAWYyPZ4UC6mfynuQIic31qMHGVqwdbEKMY3
 4M5dGgqIfOBgYw==
 =jwCg
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core SMP updates from Thomas Gleixner:
 "CPU (hotplug) updates:

   - Support for locked CSD objects in smp_call_function_single_async()
     which allows to simplify callsites in the scheduler core and MIPS

   - Treewide consolidation of CPU hotplug functions which ensures the
     consistency between the sysfs interface and kernel state. The low
     level functions cpu_up/down() are now confined to the core code and
     not longer accessible from random code"

* tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()
  cpu/hotplug: Hide cpu_up/down()
  cpu/hotplug: Move bringup of secondary CPUs out of smp_init()
  torture: Replace cpu_up/down() with add/remove_cpu()
  firmware: psci: Replace cpu_up/down() with add/remove_cpu()
  xen/cpuhotplug: Replace cpu_up/down() with device_online/offline()
  parisc: Replace cpu_up/down() with add/remove_cpu()
  sparc: Replace cpu_up/down() with add/remove_cpu()
  powerpc: Replace cpu_up/down() with add/remove_cpu()
  x86/smp: Replace cpu_up/down() with add/remove_cpu()
  arm64: hibernate: Use bringup_hibernate_cpu()
  cpu/hotplug: Provide bringup_hibernate_cpu()
  arm64: Use reboot_cpu instead of hardconding it to 0
  arm64: Don't use disable_nonboot_cpus()
  ARM: Use reboot_cpu instead of hardcoding it to 0
  ARM: Don't use disable_nonboot_cpus()
  ia64: Replace cpu_down() with smp_shutdown_nonboot_cpus()
  cpu/hotplug: Create a new function to shutdown nonboot cpus
  cpu/hotplug: Add new {add,remove}_cpu() functions
  sched/core: Remove rq.hrtick_csd_pending
  ...
2020-03-30 18:06:39 -07:00
Johannes Weiner b05e75d611 psi: Fix cpu.pressure for cpu.max and competing cgroups
For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

	NR_RUNNING > 1

to

	NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
2020-03-20 13:06:18 +01:00
Paul Turner 46a87b3851 sched/core: Distribute tasks within affinity masks
Currently, when updating the affinity of tasks via either cpusets.cpus,
or, sched_setaffinity(); tasks not currently running within the newly
specified mask will be arbitrarily assigned to the first CPU within the
mask.

This (particularly in the case that we are restricting masks) can
result in many tasks being assigned to the first CPUs of their new
masks.

This:
 1) Can induce scheduling delays while the load-balancer has a chance to
    spread them between their new CPUs.
 2) Can antogonize a poor load-balancer behavior where it has a
    difficult time recognizing that a cross-socket imbalance has been
    forced by an affinity mask.

This change adds a new cpumask interface to allow iterated calls to
distribute within the intersection of the provided masks.

The cases that this mainly affects are:
 - modifying cpuset.cpus
 - when tasks join a cpuset
 - when modifying a task's affinity via sched_setaffinity(2)

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/20200311010113.136465-1-joshdon@google.com
2020-03-20 13:06:18 +01:00
Ingo Molnar 14533a16c4 thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
drivers/base/arch_topology.c is only built if CONFIG_GENERIC_ARCH_TOPOLOGY=y,
resulting in such build failures:

  cpufreq_cooling.c:(.text+0x1e7): undefined reference to `arch_set_thermal_pressure'

Move it to sched/core.c instead, and keep it enabled on x86 despite
us not having a arch_scale_thermal_pressure() facility there, to
build-test this thing.

Cc: Thara Gopinath <thara.gopinath@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 14:26:31 +01:00
Peter Xu fd3eafda8f sched/core: Remove rq.hrtick_csd_pending
Now smp_call_function_single_async() provides the protection that
we'll return with -EBUSY if the csd object is still pending, then we
don't need the rq.hrtick_csd_pending any more.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191216213125.9536-4-peterx@redhat.com
2020-03-06 13:42:28 +01:00