License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2017-02-01 18:36:40 +03:00
|
|
|
#ifndef _LINUX_SCHED_TOPOLOGY_H
|
|
|
|
#define _LINUX_SCHED_TOPOLOGY_H
|
|
|
|
|
2017-02-06 12:01:09 +03:00
|
|
|
#include <linux/topology.h>
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
#include <linux/sched/idle.h>
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
/*
|
|
|
|
* sched-domains (multiprocessor balancing) declarations:
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
2020-08-17 14:29:49 +03:00
|
|
|
/* Generate SD flag indexes */
|
2020-08-17 14:29:50 +03:00
|
|
|
#define SD_FLAG(name, mflags) __##name,
|
2020-08-17 14:29:49 +03:00
|
|
|
enum {
|
|
|
|
#include <linux/sched/sd_flags.h>
|
|
|
|
__SD_FLAG_CNT,
|
|
|
|
};
|
|
|
|
#undef SD_FLAG
|
|
|
|
/* Generate SD flag bits */
|
2020-08-17 14:29:50 +03:00
|
|
|
#define SD_FLAG(name, mflags) name = 1 << __##name,
|
2020-08-17 14:29:49 +03:00
|
|
|
enum {
|
|
|
|
#include <linux/sched/sd_flags.h>
|
|
|
|
};
|
|
|
|
#undef SD_FLAG
|
2017-02-01 18:36:40 +03:00
|
|
|
|
2020-08-17 14:29:50 +03:00
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
2020-08-25 16:32:15 +03:00
|
|
|
|
|
|
|
struct sd_flag_debug {
|
2020-08-17 14:29:50 +03:00
|
|
|
unsigned int meta_flags;
|
|
|
|
char *name;
|
|
|
|
};
|
2020-08-25 16:32:15 +03:00
|
|
|
extern const struct sd_flag_debug sd_flag_debug[];
|
|
|
|
|
2020-08-17 14:29:50 +03:00
|
|
|
#endif
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
#ifdef CONFIG_SCHED_SMT
|
|
|
|
static inline int cpu_smt_flags(void)
|
|
|
|
{
|
|
|
|
return SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
sched: Add cluster scheduler level in core and related Kconfig for ARM64
This patch adds scheduler level for clusters and automatically enables
the load balance among clusters. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.
Testing has widely been done in two different hardware configurations of
Kunpeng920:
24 cores in one NUMA(6 clusters in each NUMA node);
32 cores in one NUMA(8 clusters in each NUMA node)
Workload is running on either one NUMA node or four NUMA nodes, thus,
this can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.
* Stream benchmark:
4threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%)
MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%)
MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%)
MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%)
6threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%)
MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%)
MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%)
MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%)
12threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%)
MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%)
MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%)
MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%)
Thus, it could help memory-bound workload especially under medium load.
Similar improvement is also seen in lkp-pbzip2:
* lkp-pbzip2 benchmark
2-96 threads (on 4NUMA * 24cores = 96cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%*
Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%*
Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%*
Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%*
Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%*
Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%*
Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%*
Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%*
Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%*
2-24 threads (on 1NUMA * 24cores = 24cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%*
Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%*
Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%*
Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%*
Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%*
Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%)
Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%*
In the case of 6 threads and 8 threads, we see the greatest performance
improvement.
Similar improvement can be seen on lkp-pixz though the improvement is
smaller:
* lkp-pixz benchmark
2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
lkp-pixz lkp-pixz
w/o patch w/ patch
Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%*
Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%)
Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%*
Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%*
Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%*
Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%*
Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%*
* SPECrate benchmark
4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
Base Base
Run Time Rate
------- ---------
4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3)
8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%)
16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6)
32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 584 87.2 *
502.gcc_r 32 503 90.2 *
505.mcf_r 32 745 69.4 *
520.omnetpp_r 32 1031 40.7 *
523.xalancbmk_r 32 597 56.6 *
525.x264_r 1 -- CE
531.deepsjeng_r 32 336 109 *
541.leela_r 32 556 95.4 *
548.exchange2_r 32 513 163 *
557.xz_r 32 530 65.2 *
Est. SPECrate2017_int_base 80.3
[w/ patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 580 87.8 (+0.688%) *
502.gcc_r 32 477 95.1 (+5.432%) *
505.mcf_r 32 644 80.3 (+13.574%) *
520.omnetpp_r 32 942 44.6 (+9.58%) *
523.xalancbmk_r 32 560 60.4 (+6.714%%) *
525.x264_r 1 -- CE
531.deepsjeng_r 32 337 109 (+0.000%) *
541.leela_r 32 554 95.6 (+0.210%) *
548.exchange2_r 32 515 163 (+0.000%) *
557.xz_r 32 524 66.0 (+1.227%) *
Est. SPECrate2017_int_base 83.7 (+4.062%)
On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:
* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
kernbench kernbench
w/o cluster w/ cluster
Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%)
Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%)
Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%)
Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%)
Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%)
Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%)
Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%)
Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%)
Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%)
Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%*
Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%*
Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%)
Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%*
Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%*
Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%*
Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%)
Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%*
Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%)
Note this patch isn't an universal win, it might hurt those workload
which can benefit from packing. Though tasks which want to take
advantages of lower communication latency of one cluster won't
necessarily been packed in one cluster while kernel is not aware of
clusters, they have some chance to be randomly packed. But this
patch will make them more likely spread.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-09-24 11:51:03 +03:00
|
|
|
#ifdef CONFIG_SCHED_CLUSTER
|
|
|
|
static inline int cpu_cluster_flags(void)
|
|
|
|
{
|
|
|
|
return SD_SHARE_PKG_RESOURCES;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
#ifdef CONFIG_SCHED_MC
|
|
|
|
static inline int cpu_core_flags(void)
|
|
|
|
{
|
|
|
|
return SD_SHARE_PKG_RESOURCES;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
static inline int cpu_numa_flags(void)
|
|
|
|
{
|
|
|
|
return SD_NUMA;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
extern int arch_asym_cpu_priority(int cpu);
|
|
|
|
|
|
|
|
struct sched_domain_attr {
|
|
|
|
int relax_domain_level;
|
|
|
|
};
|
|
|
|
|
|
|
|
#define SD_ATTR_INIT (struct sched_domain_attr) { \
|
|
|
|
.relax_domain_level = -1, \
|
|
|
|
}
|
|
|
|
|
|
|
|
extern int sched_domain_level_max;
|
|
|
|
|
|
|
|
struct sched_group;
|
|
|
|
|
|
|
|
struct sched_domain_shared {
|
|
|
|
atomic_t ref;
|
|
|
|
atomic_t nr_busy_cpus;
|
|
|
|
int has_idle_cores;
|
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:
@usecs:
[0] 533 | |
[1] 5495 | |
[2, 4) 12008 | |
[4, 8) 239252 | |
[8, 16) 4041924 |@@@@@@@@@@@@@@ |
[16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 4507667 |@@@@@@@@@@@@@@@ |
[512, 1K) 2600472 |@@@@@@@@@ |
[1K, 2K) 927912 |@@@ |
[2K, 4K) 218720 | |
[4K, 8K) 98161 | |
[8K, 16K) 37722 | |
[16K, 32K) 6715 | |
[32K, 64K) 477 | |
[64K, 128K) 7 | |
netperf latency usecs:
=======
case load Lat_99th std%
TCP_RR thread-224 257.39 ( 0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:
SIS Search Efficiency(se_eff%):
A ratio expressed as a percentage of runqueues scanned versus
idle CPUs found. A 100% efficiency indicates that the target,
prev or recent CPU of a task was idle at wakeup. The lower the
efficiency, the more runqueues were scanned before an idle CPU
was found.
SIS Domain Search Efficiency(dom_eff%):
Similar, except only for the slower SIS
patch.
SIS Fast Success Rate(fast_rate%):
Percentage of SIS that used target, prev or
recent CPUs.
SIS Success rate(success_rate%):
Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
TCP_RR 28 threads 99.978 18.535 99.995 100.000
TCP_RR 56 threads 99.397 5.671 99.964 100.000
TCP_RR 84 threads 21.721 6.818 73.632 100.000
TCP_RR 112 threads 12.500 5.533 59.000 100.000
TCP_RR 140 threads 8.524 4.535 49.020 100.000
TCP_RR 168 threads 6.438 3.945 40.309 99.999
TCP_RR 196 threads 5.397 3.718 32.320 99.982
TCP_RR 224 threads 4.874 3.661 25.775 99.767
UDP_RR 28 threads 99.988 17.704 99.997 100.000
UDP_RR 56 threads 99.528 5.977 99.970 100.000
UDP_RR 84 threads 24.219 6.992 76.479 100.000
UDP_RR 112 threads 13.907 5.706 62.538 100.000
UDP_RR 140 threads 9.408 4.699 52.519 100.000
UDP_RR 168 threads 7.095 4.077 44.352 100.000
UDP_RR 196 threads 5.757 3.775 35.764 99.991
UDP_RR 224 threads 5.124 3.704 28.748 99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
normal 1 mthread 99.152 6.400 99.941 100.000
normal 2 mthreads 97.844 4.003 99.908 100.000
normal 3 mthreads 96.395 2.118 99.917 99.998
normal 4 mthreads 55.288 1.451 98.615 99.804
normal 5 mthreads 7.004 1.870 45.597 61.036
normal 6 mthreads 3.354 1.346 20.777 34.230
normal 7 mthreads 2.183 1.028 11.257 21.055
normal 8 mthreads 1.653 0.825 7.849 15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
process-pipe 1 group 99.991 7.692 99.999 100.000
process-pipe 2 groups 99.934 4.615 99.997 100.000
process-pipe 3 groups 99.597 3.198 99.987 100.000
process-pipe 4 groups 98.378 2.464 99.958 100.000
process-pipe 5 groups 27.474 3.653 89.811 99.800
process-pipe 6 groups 20.201 4.098 82.763 99.570
process-pipe 7 groups 16.423 4.156 77.398 99.316
process-pipe 8 groups 13.165 3.920 72.232 98.828
process-sockets 1 group 99.977 5.882 99.999 100.000
process-sockets 2 groups 99.927 5.505 99.996 100.000
process-sockets 3 groups 99.397 3.250 99.980 100.000
process-sockets 4 groups 79.680 4.258 98.864 99.998
process-sockets 5 groups 7.673 2.503 63.659 92.115
process-sockets 6 groups 4.642 1.584 58.946 88.048
process-sockets 7 groups 3.493 1.379 49.816 81.164
process-sockets 8 groups 3.015 1.407 40.845 75.500
threads-pipe 1 group 99.997 0.000 100.000 100.000
threads-pipe 2 groups 99.894 2.932 99.997 100.000
threads-pipe 3 groups 99.611 4.117 99.983 100.000
threads-pipe 4 groups 97.703 2.624 99.937 100.000
threads-pipe 5 groups 22.919 3.623 87.150 99.764
threads-pipe 6 groups 18.016 4.038 80.491 99.557
threads-pipe 7 groups 14.663 3.991 75.239 99.247
threads-pipe 8 groups 12.242 3.808 70.651 98.644
threads-sockets 1 group 99.990 6.667 99.999 100.000
threads-sockets 2 groups 99.940 5.114 99.997 100.000
threads-sockets 3 groups 99.469 4.115 99.977 100.000
threads-sockets 4 groups 87.528 4.038 99.400 100.000
threads-sockets 5 groups 6.942 2.398 59.244 88.337
threads-sockets 6 groups 4.359 1.954 49.448 87.860
threads-sockets 7 groups 2.845 1.345 41.198 77.102
threads-sockets 8 groups 2.871 1.404 38.512 74.312
schedstat_parse.py -f tbench_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
loopback 28 threads 99.976 18.369 99.995 100.000
loopback 56 threads 99.222 7.799 99.934 100.000
loopback 84 threads 19.723 6.819 70.215 100.000
loopback 112 threads 11.283 5.371 55.371 99.999
loopback 140 threads 0.000 0.000 0.000 0.000
loopback 168 threads 0.000 0.000 0.000 0.000
loopback 196 threads 0.000 0.000 0.000 0.000
loopback 224 threads 0.000 0.000 0.000 0.000
According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.
[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.
In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.
Introduce the quadratic function:
y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE
x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:
nr_scan = llc_weight * y'
Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
and the number of CPUs to be scanned. Use heuristic scan for now.
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...
For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.
When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.
This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.
[Test result]
netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.
The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.
Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40)
TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20)
TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40)
TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22)
TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19)
TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18)
TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43)
TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85)
UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33)
UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21)
UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32)
UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61)
UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04)
UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16)
UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93)
UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62)
Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:
vanilla patched
4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
There is -87.9% less CPU scans after patched, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement - Because this patch allows
the waking task to remain on the previous CPU, rather than grabbing other CPUs'
lock.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47)
process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81)
process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02)
process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02)
process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13)
process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14)
process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26)
process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03)
threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14)
threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74)
threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13)
threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06)
threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53)
threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26)
threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24)
threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11)
Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case load baseline(std%) compare%( std%)
loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09)
loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17)
loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13)
loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03)
loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19)
loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27)
loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17)
loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22)
Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case load baseline(std%) compare%( std%)
normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)*
normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79)
normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64)
normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28)
*Consider the Standard Deviation, this -7.36% regression might not be valid.
Also, a OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.
There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
util_avg 1024 1024 1024 1024 1024 1024 1024 0
Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
But according to Mel, it is unlikely the CPU7 will be idle all the time
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.
lkp(kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
might introduce more context switch, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that, not all benchmarks
in every scenario benefit from idle CPU scan limit, and it needs further
investigation.
Besides, there is slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs. Because the cost to search for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:
if (llc_weight <= 16)
nr_scan = nr_scan * 32 / llc_weight;
For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
The smaller the CPU number this LLC domain has, the larger nr_scan will be
expanded. This needs further investigation.
There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. And it could be a following-up
optimization on top of this change.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
2022-06-12 19:34:28 +03:00
|
|
|
int nr_idle_scan;
|
2017-02-01 18:36:40 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
struct sched_domain {
|
|
|
|
/* These fields must be setup */
|
2019-03-21 03:34:24 +03:00
|
|
|
struct sched_domain __rcu *parent; /* top domain must be null terminated */
|
|
|
|
struct sched_domain __rcu *child; /* bottom domain must be null terminated */
|
2017-02-01 18:36:40 +03:00
|
|
|
struct sched_group *groups; /* the balancing groups of the domain */
|
|
|
|
unsigned long min_interval; /* Minimum balance interval ms */
|
|
|
|
unsigned long max_interval; /* Maximum balance interval ms */
|
|
|
|
unsigned int busy_factor; /* less balancing by factor if busy */
|
|
|
|
unsigned int imbalance_pct; /* No balance until over watermark */
|
|
|
|
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
|
2022-02-08 12:43:34 +03:00
|
|
|
unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */
|
2017-02-01 18:36:40 +03:00
|
|
|
|
|
|
|
int nohz_idle; /* NOHZ IDLE status */
|
|
|
|
int flags; /* See SD_* */
|
|
|
|
int level;
|
|
|
|
|
|
|
|
/* Runtime fields. */
|
|
|
|
unsigned long last_balance; /* init to jiffies. units in jiffies */
|
|
|
|
unsigned int balance_interval; /* initialise to 1. units in ms. */
|
|
|
|
unsigned int nr_balance_failed; /* initialise to 0 */
|
|
|
|
|
|
|
|
/* idle_balance() stats */
|
|
|
|
u64 max_newidle_lb_cost;
|
2021-10-19 15:35:35 +03:00
|
|
|
unsigned long last_decay_max_lb_cost;
|
2017-02-01 18:36:40 +03:00
|
|
|
|
|
|
|
u64 avg_scan_cost; /* select_idle_sibling */
|
|
|
|
|
|
|
|
#ifdef CONFIG_SCHEDSTATS
|
|
|
|
/* load_balance() stats */
|
|
|
|
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
|
|
|
|
unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
|
|
|
|
|
|
|
|
/* Active load balancing */
|
|
|
|
unsigned int alb_count;
|
|
|
|
unsigned int alb_failed;
|
|
|
|
unsigned int alb_pushed;
|
|
|
|
|
|
|
|
/* SD_BALANCE_EXEC stats */
|
|
|
|
unsigned int sbe_count;
|
|
|
|
unsigned int sbe_balanced;
|
|
|
|
unsigned int sbe_pushed;
|
|
|
|
|
|
|
|
/* SD_BALANCE_FORK stats */
|
|
|
|
unsigned int sbf_count;
|
|
|
|
unsigned int sbf_balanced;
|
|
|
|
unsigned int sbf_pushed;
|
|
|
|
|
|
|
|
/* try_to_wake_up() stats */
|
|
|
|
unsigned int ttwu_wake_remote;
|
|
|
|
unsigned int ttwu_move_affine;
|
|
|
|
unsigned int ttwu_move_balance;
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
char *name;
|
|
|
|
#endif
|
|
|
|
union {
|
|
|
|
void *private; /* used during construction */
|
|
|
|
struct rcu_head rcu; /* used during destruction */
|
|
|
|
};
|
|
|
|
struct sched_domain_shared *shared;
|
|
|
|
|
|
|
|
unsigned int span_weight;
|
|
|
|
/*
|
|
|
|
* Span of all CPUs in this domain.
|
|
|
|
*
|
|
|
|
* NOTE: this field is variable length. (Allocated dynamically
|
|
|
|
* by attaching extra space to the end of the structure,
|
|
|
|
* depending on how many CPUs the kernel has booted up with)
|
|
|
|
*/
|
2020-03-24 03:14:37 +03:00
|
|
|
unsigned long span[];
|
2017-02-01 18:36:40 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
|
|
|
|
{
|
|
|
|
return to_cpumask(sd->span);
|
|
|
|
}
|
|
|
|
|
2019-07-19 16:59:53 +03:00
|
|
|
extern void partition_sched_domains_locked(int ndoms_new,
|
|
|
|
cpumask_var_t doms_new[],
|
|
|
|
struct sched_domain_attr *dattr_new);
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
|
|
|
struct sched_domain_attr *dattr_new);
|
|
|
|
|
|
|
|
/* Allocate an array of sched domains, for partition_sched_domains(). */
|
|
|
|
cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
|
|
|
|
void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
|
|
|
|
|
|
|
|
bool cpus_share_cache(int this_cpu, int that_cpu);
|
|
|
|
|
|
|
|
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
|
|
|
|
typedef int (*sched_domain_flags_f)(void);
|
|
|
|
|
|
|
|
#define SDTL_OVERLAP 0x01
|
|
|
|
|
|
|
|
struct sd_data {
|
2019-01-18 17:49:36 +03:00
|
|
|
struct sched_domain *__percpu *sd;
|
|
|
|
struct sched_domain_shared *__percpu *sds;
|
|
|
|
struct sched_group *__percpu *sg;
|
|
|
|
struct sched_group_capacity *__percpu *sgc;
|
2017-02-01 18:36:40 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
struct sched_domain_topology_level {
|
|
|
|
sched_domain_mask_f mask;
|
|
|
|
sched_domain_flags_f sd_flags;
|
|
|
|
int flags;
|
|
|
|
int numa_level;
|
|
|
|
struct sd_data data;
|
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
char *name;
|
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
|
|
|
extern void set_sched_topology(struct sched_domain_topology_level *tl);
|
|
|
|
|
|
|
|
#ifdef CONFIG_SCHED_DEBUG
|
|
|
|
# define SD_INIT_NAME(type) .name = #type
|
|
|
|
#else
|
|
|
|
# define SD_INIT_NAME(type)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#else /* CONFIG_SMP */
|
|
|
|
|
|
|
|
struct sched_domain_attr;
|
|
|
|
|
2019-07-19 16:59:53 +03:00
|
|
|
static inline void
|
|
|
|
partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
|
|
|
|
struct sched_domain_attr *dattr_new)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
static inline void
|
|
|
|
partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
|
|
|
struct sched_domain_attr *dattr_new)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool cpus_share_cache(int this_cpu, int that_cpu)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2019-06-17 18:00:17 +03:00
|
|
|
#endif /* !CONFIG_SMP */
|
|
|
|
|
2020-10-27 21:07:11 +03:00
|
|
|
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
|
|
|
extern void rebuild_sched_domains_energy(void);
|
|
|
|
#else
|
|
|
|
static inline void rebuild_sched_domains_energy(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-12-03 12:56:14 +03:00
|
|
|
#ifndef arch_scale_cpu_capacity
|
2020-07-31 22:20:14 +03:00
|
|
|
/**
|
|
|
|
* arch_scale_cpu_capacity - get the capacity scale factor of a given CPU.
|
|
|
|
* @cpu: the CPU in question.
|
|
|
|
*
|
|
|
|
* Return: the CPU scale factor normalized against SCHED_CAPACITY_SCALE, i.e.
|
|
|
|
*
|
|
|
|
* max_perf(cpu)
|
|
|
|
* ----------------------------- * SCHED_CAPACITY_SCALE
|
|
|
|
* max(max_perf(c) : c \in CPUs)
|
|
|
|
*/
|
2018-12-03 12:56:14 +03:00
|
|
|
static __always_inline
|
2019-06-17 18:00:17 +03:00
|
|
|
unsigned long arch_scale_cpu_capacity(int cpu)
|
2018-12-03 12:56:14 +03:00
|
|
|
{
|
|
|
|
return SCHED_CAPACITY_SCALE;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-02-22 03:52:06 +03:00
|
|
|
#ifndef arch_scale_thermal_pressure
|
|
|
|
static __always_inline
|
|
|
|
unsigned long arch_scale_thermal_pressure(int cpu)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-11-09 22:57:10 +03:00
|
|
|
#ifndef arch_update_thermal_pressure
|
|
|
|
static __always_inline
|
|
|
|
void arch_update_thermal_pressure(const struct cpumask *cpus,
|
|
|
|
unsigned long capped_frequency)
|
|
|
|
{ }
|
|
|
|
#endif
|
|
|
|
|
2017-02-06 12:01:09 +03:00
|
|
|
static inline int task_node(const struct task_struct *p)
|
|
|
|
{
|
|
|
|
return cpu_to_node(task_cpu(p));
|
|
|
|
}
|
|
|
|
|
2017-02-01 18:36:40 +03:00
|
|
|
#endif /* _LINUX_SCHED_TOPOLOGY_H */
|