cpufreq: intel_pstate: Do not use PID-based P-state selection
All systems with a defined ACPI preferred profile that are not "servers" have been using the load-based P-state selection algorithm in intel_pstate since 4.12-rc1 (mobile systems and laptops have been using it since 4.10-rc1) and no problems with it have been reported to date. In particular, no regressions with respect to the PID-based P-state selection have been reported. Also testing indicates that the P-state selection algorithm based on CPU load is generally on par with the PID-based algorithm performance-wise, and for some workloads it turns out to be better than the other one, while being more straightforward and easier to understand at the same time. Moreover, the PID-based P-state selection algorithm in intel_pstate is known to be unstable in some situation and generally problematic, the issues with it are hard to address and it has become a significant maintenance burden. For these reasons, make intel_pstate use the "powersave" P-state selection algorithm based on CPU load in the active mode on all systems and drop the PID-based P-state selection code along with all things related to it from the driver. Also update the documentation accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This commit is contained in:
Родитель
520eccdfe1
Коммит
9d0ef7af1f
|
@ -167,35 +167,17 @@ is set.
|
|||
``powersave``
|
||||
.............
|
||||
|
||||
Without HWP, this P-state selection algorithm generally depends on the
|
||||
processor model and/or the system profile setting in the ACPI tables and there
|
||||
are two variants of it.
|
||||
|
||||
One of them is used with processors from the Atom line and (regardless of the
|
||||
processor model) on platforms with the system profile in the ACPI tables set to
|
||||
"mobile" (laptops mostly), "tablet", "appliance PC", "desktop", or
|
||||
"workstation". It is also used with processors supporting the HWP feature if
|
||||
that feature has not been enabled (that is, with the ``intel_pstate=no_hwp``
|
||||
argument in the kernel command line). It is similar to the algorithm
|
||||
Without HWP, this P-state selection algorithm is similar to the algorithm
|
||||
implemented by the generic ``schedutil`` scaling governor except that the
|
||||
utilization metric used by it is based on numbers coming from feedback
|
||||
registers of the CPU. It generally selects P-states proportional to the
|
||||
current CPU utilization, so it is referred to as the "proportional" algorithm.
|
||||
current CPU utilization.
|
||||
|
||||
The second variant of the ``powersave`` P-state selection algorithm, used in all
|
||||
of the other cases (generally, on processors from the Core line, so it is
|
||||
referred to as the "Core" algorithm), is based on the values read from the APERF
|
||||
and MPERF feedback registers and the previously requested target P-state.
|
||||
It does not really take CPU utilization into account explicitly, but as a rule
|
||||
it causes the CPU P-state to ramp up very quickly in response to increased
|
||||
utilization which is generally desirable in server environments.
|
||||
|
||||
Regardless of the variant, this algorithm is run by the driver's utilization
|
||||
update callback for the given CPU when it is invoked by the CPU scheduler, but
|
||||
not more often than every 10 ms (that can be tweaked via ``debugfs`` in `this
|
||||
particular case <Tuning Interface in debugfs_>`_). Like in the ``performance``
|
||||
case, the hardware configuration is not touched if the new P-state turns out to
|
||||
be the same as the current one.
|
||||
This algorithm is run by the driver's utilization update callback for the
|
||||
given CPU when it is invoked by the CPU scheduler, but not more often than
|
||||
every 10 ms. Like in the ``performance`` case, the hardware configuration
|
||||
is not touched if the new P-state turns out to be the same as the current
|
||||
one.
|
||||
|
||||
This is the default P-state selection algorithm if the
|
||||
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
|
||||
|
@ -720,34 +702,7 @@ P-state is called, the ``ftrace`` filter can be set to to
|
|||
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
|
||||
|
||||
Tuning Interface in ``debugfs``
|
||||
-------------------------------
|
||||
|
||||
The ``powersave`` algorithm provided by ``intel_pstate`` for `the Core line of
|
||||
processors in the active mode <powersave_>`_ is based on a `PID controller`_
|
||||
whose parameters were chosen to address a number of different use cases at the
|
||||
same time. However, it still is possible to fine-tune it to a specific workload
|
||||
and the ``debugfs`` interface under ``/sys/kernel/debug/pstate_snb/`` is
|
||||
provided for this purpose. [Note that the ``pstate_snb`` directory will be
|
||||
present only if the specific P-state selection algorithm matching the interface
|
||||
in it actually is in use.]
|
||||
|
||||
The following files present in that directory can be used to modify the PID
|
||||
controller parameters at run time:
|
||||
|
||||
| ``deadband``
|
||||
| ``d_gain_pct``
|
||||
| ``i_gain_pct``
|
||||
| ``p_gain_pct``
|
||||
| ``sample_rate_ms``
|
||||
| ``setpoint``
|
||||
|
||||
Note, however, that achieving desirable results this way generally requires
|
||||
expert-level understanding of the power vs performance tradeoff, so extra care
|
||||
is recommended when attempting to do that.
|
||||
|
||||
|
||||
.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
|
||||
.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
|
||||
.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
|
||||
.. _PID controller: https://en.wikipedia.org/wiki/PID_controller
|
||||
|
|
|
@ -172,28 +172,6 @@ struct vid_data {
|
|||
int32_t ratio;
|
||||
};
|
||||
|
||||
/**
|
||||
* struct _pid - Stores PID data
|
||||
* @setpoint: Target set point for busyness or performance
|
||||
* @integral: Storage for accumulated error values
|
||||
* @p_gain: PID proportional gain
|
||||
* @i_gain: PID integral gain
|
||||
* @d_gain: PID derivative gain
|
||||
* @deadband: PID deadband
|
||||
* @last_err: Last error storage for integral part of PID calculation
|
||||
*
|
||||
* Stores PID coefficients and last error for PID controller.
|
||||
*/
|
||||
struct _pid {
|
||||
int setpoint;
|
||||
int32_t integral;
|
||||
int32_t p_gain;
|
||||
int32_t i_gain;
|
||||
int32_t d_gain;
|
||||
int deadband;
|
||||
int32_t last_err;
|
||||
};
|
||||
|
||||
/**
|
||||
* struct global_params - Global parameters, mostly tunable via sysfs.
|
||||
* @no_turbo: Whether or not to use turbo P-states.
|
||||
|
@ -223,7 +201,6 @@ struct global_params {
|
|||
* @last_update: Time of the last update.
|
||||
* @pstate: Stores P state limits for this CPU
|
||||
* @vid: Stores VID limits for this CPU
|
||||
* @pid: Stores PID parameters for this CPU
|
||||
* @last_sample_time: Last Sample time
|
||||
* @aperf_mperf_shift: Number of clock cycles after aperf, merf is incremented
|
||||
* This shift is a multiplier to mperf delta to
|
||||
|
@ -258,7 +235,6 @@ struct cpudata {
|
|||
|
||||
struct pstate_data pstate;
|
||||
struct vid_data vid;
|
||||
struct _pid pid;
|
||||
|
||||
u64 last_update;
|
||||
u64 last_sample_time;
|
||||
|
@ -283,28 +259,6 @@ struct cpudata {
|
|||
|
||||
static struct cpudata **all_cpu_data;
|
||||
|
||||
/**
|
||||
* struct pstate_adjust_policy - Stores static PID configuration data
|
||||
* @sample_rate_ms: PID calculation sample rate in ms
|
||||
* @sample_rate_ns: Sample rate calculation in ns
|
||||
* @deadband: PID deadband
|
||||
* @setpoint: PID Setpoint
|
||||
* @p_gain_pct: PID proportional gain
|
||||
* @i_gain_pct: PID integral gain
|
||||
* @d_gain_pct: PID derivative gain
|
||||
*
|
||||
* Stores per CPU model static PID configuration data.
|
||||
*/
|
||||
struct pstate_adjust_policy {
|
||||
int sample_rate_ms;
|
||||
s64 sample_rate_ns;
|
||||
int deadband;
|
||||
int setpoint;
|
||||
int p_gain_pct;
|
||||
int d_gain_pct;
|
||||
int i_gain_pct;
|
||||
};
|
||||
|
||||
/**
|
||||
* struct pstate_funcs - Per CPU model specific callbacks
|
||||
* @get_max: Callback to get maximum non turbo effective P state
|
||||
|
@ -333,15 +287,6 @@ struct pstate_funcs {
|
|||
};
|
||||
|
||||
static struct pstate_funcs pstate_funcs __read_mostly;
|
||||
static struct pstate_adjust_policy pid_params __read_mostly = {
|
||||
.sample_rate_ms = 10,
|
||||
.sample_rate_ns = 10 * NSEC_PER_MSEC,
|
||||
.deadband = 0,
|
||||
.setpoint = 97,
|
||||
.p_gain_pct = 20,
|
||||
.d_gain_pct = 0,
|
||||
.i_gain_pct = 0,
|
||||
};
|
||||
|
||||
static int hwp_active __read_mostly;
|
||||
static bool per_cpu_limits __read_mostly;
|
||||
|
@ -509,56 +454,6 @@ static inline void intel_pstate_exit_perf_limits(struct cpufreq_policy *policy)
|
|||
}
|
||||
#endif
|
||||
|
||||
static signed int pid_calc(struct _pid *pid, int32_t busy)
|
||||
{
|
||||
signed int result;
|
||||
int32_t pterm, dterm, fp_error;
|
||||
int32_t integral_limit;
|
||||
|
||||
fp_error = pid->setpoint - busy;
|
||||
|
||||
if (abs(fp_error) <= pid->deadband)
|
||||
return 0;
|
||||
|
||||
pterm = mul_fp(pid->p_gain, fp_error);
|
||||
|
||||
pid->integral += fp_error;
|
||||
|
||||
/*
|
||||
* We limit the integral here so that it will never
|
||||
* get higher than 30. This prevents it from becoming
|
||||
* too large an input over long periods of time and allows
|
||||
* it to get factored out sooner.
|
||||
*
|
||||
* The value of 30 was chosen through experimentation.
|
||||
*/
|
||||
integral_limit = int_tofp(30);
|
||||
if (pid->integral > integral_limit)
|
||||
pid->integral = integral_limit;
|
||||
if (pid->integral < -integral_limit)
|
||||
pid->integral = -integral_limit;
|
||||
|
||||
dterm = mul_fp(pid->d_gain, fp_error - pid->last_err);
|
||||
pid->last_err = fp_error;
|
||||
|
||||
result = pterm + mul_fp(pid->integral, pid->i_gain) + dterm;
|
||||
result = result + (1 << (FRAC_BITS-1));
|
||||
return (signed int)fp_toint(result);
|
||||
}
|
||||
|
||||
static inline void intel_pstate_pid_reset(struct cpudata *cpu)
|
||||
{
|
||||
struct _pid *pid = &cpu->pid;
|
||||
|
||||
pid->p_gain = percent_fp(pid_params.p_gain_pct);
|
||||
pid->d_gain = percent_fp(pid_params.d_gain_pct);
|
||||
pid->i_gain = percent_fp(pid_params.i_gain_pct);
|
||||
pid->setpoint = int_tofp(pid_params.setpoint);
|
||||
pid->last_err = pid->setpoint - int_tofp(100);
|
||||
pid->deadband = int_tofp(pid_params.deadband);
|
||||
pid->integral = 0;
|
||||
}
|
||||
|
||||
static inline void update_turbo_state(void)
|
||||
{
|
||||
u64 misc_en;
|
||||
|
@ -911,82 +806,6 @@ static void intel_pstate_update_policies(void)
|
|||
cpufreq_update_policy(cpu);
|
||||
}
|
||||
|
||||
/************************** debugfs begin ************************/
|
||||
static int pid_param_set(void *data, u64 val)
|
||||
{
|
||||
unsigned int cpu;
|
||||
|
||||
*(u32 *)data = val;
|
||||
pid_params.sample_rate_ns = pid_params.sample_rate_ms * NSEC_PER_MSEC;
|
||||
for_each_possible_cpu(cpu)
|
||||
if (all_cpu_data[cpu])
|
||||
intel_pstate_pid_reset(all_cpu_data[cpu]);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int pid_param_get(void *data, u64 *val)
|
||||
{
|
||||
*val = *(u32 *)data;
|
||||
return 0;
|
||||
}
|
||||
DEFINE_SIMPLE_ATTRIBUTE(fops_pid_param, pid_param_get, pid_param_set, "%llu\n");
|
||||
|
||||
static struct dentry *debugfs_parent;
|
||||
|
||||
struct pid_param {
|
||||
char *name;
|
||||
void *value;
|
||||
struct dentry *dentry;
|
||||
};
|
||||
|
||||
static struct pid_param pid_files[] = {
|
||||
{"sample_rate_ms", &pid_params.sample_rate_ms, },
|
||||
{"d_gain_pct", &pid_params.d_gain_pct, },
|
||||
{"i_gain_pct", &pid_params.i_gain_pct, },
|
||||
{"deadband", &pid_params.deadband, },
|
||||
{"setpoint", &pid_params.setpoint, },
|
||||
{"p_gain_pct", &pid_params.p_gain_pct, },
|
||||
{NULL, NULL, }
|
||||
};
|
||||
|
||||
static void intel_pstate_debug_expose_params(void)
|
||||
{
|
||||
int i;
|
||||
|
||||
debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
|
||||
if (IS_ERR_OR_NULL(debugfs_parent))
|
||||
return;
|
||||
|
||||
for (i = 0; pid_files[i].name; i++) {
|
||||
struct dentry *dentry;
|
||||
|
||||
dentry = debugfs_create_file(pid_files[i].name, 0660,
|
||||
debugfs_parent, pid_files[i].value,
|
||||
&fops_pid_param);
|
||||
if (!IS_ERR(dentry))
|
||||
pid_files[i].dentry = dentry;
|
||||
}
|
||||
}
|
||||
|
||||
static void intel_pstate_debug_hide_params(void)
|
||||
{
|
||||
int i;
|
||||
|
||||
if (IS_ERR_OR_NULL(debugfs_parent))
|
||||
return;
|
||||
|
||||
for (i = 0; pid_files[i].name; i++) {
|
||||
debugfs_remove(pid_files[i].dentry);
|
||||
pid_files[i].dentry = NULL;
|
||||
}
|
||||
|
||||
debugfs_remove(debugfs_parent);
|
||||
debugfs_parent = NULL;
|
||||
}
|
||||
|
||||
/************************** debugfs end ************************/
|
||||
|
||||
/************************** sysfs begin ************************/
|
||||
#define show_one(file_name, object) \
|
||||
static ssize_t show_##file_name \
|
||||
|
@ -1661,44 +1480,6 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
|
|||
return target;
|
||||
}
|
||||
|
||||
static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
|
||||
{
|
||||
int32_t perf_scaled, max_pstate, current_pstate, sample_ratio;
|
||||
u64 duration_ns;
|
||||
|
||||
/*
|
||||
* perf_scaled is the ratio of the average P-state during the last
|
||||
* sampling period to the P-state requested last time (in percent).
|
||||
*
|
||||
* That measures the system's response to the previous P-state
|
||||
* selection.
|
||||
*/
|
||||
max_pstate = cpu->pstate.max_pstate_physical;
|
||||
current_pstate = cpu->pstate.current_pstate;
|
||||
perf_scaled = mul_ext_fp(cpu->sample.core_avg_perf,
|
||||
div_fp(100 * max_pstate, current_pstate));
|
||||
|
||||
/*
|
||||
* Since our utilization update callback will not run unless we are
|
||||
* in C0, check if the actual elapsed time is significantly greater (3x)
|
||||
* than our sample interval. If it is, then we were idle for a long
|
||||
* enough period of time to adjust our performance metric.
|
||||
*/
|
||||
duration_ns = cpu->sample.time - cpu->last_sample_time;
|
||||
if ((s64)duration_ns > pid_params.sample_rate_ns * 3) {
|
||||
sample_ratio = div_fp(pid_params.sample_rate_ns, duration_ns);
|
||||
perf_scaled = mul_fp(perf_scaled, sample_ratio);
|
||||
} else {
|
||||
sample_ratio = div_fp(100 * (cpu->sample.mperf << cpu->aperf_mperf_shift),
|
||||
cpu->sample.tsc);
|
||||
if (sample_ratio < int_tofp(1))
|
||||
perf_scaled = 0;
|
||||
}
|
||||
|
||||
cpu->sample.busy_scaled = perf_scaled;
|
||||
return cpu->pstate.current_pstate - pid_calc(&cpu->pid, perf_scaled);
|
||||
}
|
||||
|
||||
static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
|
||||
{
|
||||
int max_pstate = intel_pstate_get_base_pstate(cpu);
|
||||
|
@ -1741,23 +1522,6 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
|
|||
fp_toint(cpu->iowait_boost * 100));
|
||||
}
|
||||
|
||||
static void intel_pstate_update_util_pid(struct update_util_data *data,
|
||||
u64 time, unsigned int flags)
|
||||
{
|
||||
struct cpudata *cpu = container_of(data, struct cpudata, update_util);
|
||||
u64 delta_ns = time - cpu->sample.time;
|
||||
|
||||
if ((s64)delta_ns < pid_params.sample_rate_ns)
|
||||
return;
|
||||
|
||||
if (intel_pstate_sample(cpu, time)) {
|
||||
int target_pstate;
|
||||
|
||||
target_pstate = get_target_pstate_use_performance(cpu);
|
||||
intel_pstate_adjust_pstate(cpu, target_pstate);
|
||||
}
|
||||
}
|
||||
|
||||
static void intel_pstate_update_util(struct update_util_data *data, u64 time,
|
||||
unsigned int flags)
|
||||
{
|
||||
|
@ -1792,7 +1556,7 @@ static struct pstate_funcs core_funcs = {
|
|||
.get_turbo = core_get_turbo_pstate,
|
||||
.get_scaling = core_get_scaling,
|
||||
.get_val = core_get_val,
|
||||
.update_util = intel_pstate_update_util_pid,
|
||||
.update_util = intel_pstate_update_util,
|
||||
};
|
||||
|
||||
static const struct pstate_funcs silvermont_funcs = {
|
||||
|
@ -1825,7 +1589,7 @@ static const struct pstate_funcs knl_funcs = {
|
|||
.get_aperf_mperf_shift = knl_get_aperf_mperf_shift,
|
||||
.get_scaling = core_get_scaling,
|
||||
.get_val = core_get_val,
|
||||
.update_util = intel_pstate_update_util_pid,
|
||||
.update_util = intel_pstate_update_util,
|
||||
};
|
||||
|
||||
static const struct pstate_funcs bxt_funcs = {
|
||||
|
@ -1879,8 +1643,6 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
|
|||
{}
|
||||
};
|
||||
|
||||
static bool pid_in_use(void);
|
||||
|
||||
static int intel_pstate_init_cpu(unsigned int cpunum)
|
||||
{
|
||||
struct cpudata *cpu;
|
||||
|
@ -1911,8 +1673,6 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
|
|||
intel_pstate_disable_ee(cpunum);
|
||||
|
||||
intel_pstate_hwp_enable(cpu);
|
||||
} else if (pid_in_use()) {
|
||||
intel_pstate_pid_reset(cpu);
|
||||
}
|
||||
|
||||
intel_pstate_get_cpu_pstates(cpu);
|
||||
|
@ -2270,12 +2030,6 @@ static struct cpufreq_driver intel_cpufreq = {
|
|||
|
||||
static struct cpufreq_driver *default_driver = &intel_pstate;
|
||||
|
||||
static bool pid_in_use(void)
|
||||
{
|
||||
return intel_pstate_driver == &intel_pstate &&
|
||||
pstate_funcs.update_util == intel_pstate_update_util_pid;
|
||||
}
|
||||
|
||||
static void intel_pstate_driver_cleanup(void)
|
||||
{
|
||||
unsigned int cpu;
|
||||
|
@ -2310,9 +2064,6 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
|
|||
|
||||
global.min_perf_pct = min_perf_pct_min();
|
||||
|
||||
if (pid_in_use())
|
||||
intel_pstate_debug_expose_params();
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -2321,9 +2072,6 @@ static int intel_pstate_unregister_driver(void)
|
|||
if (hwp_active)
|
||||
return -EBUSY;
|
||||
|
||||
if (pid_in_use())
|
||||
intel_pstate_debug_hide_params();
|
||||
|
||||
cpufreq_unregister_driver(intel_pstate_driver);
|
||||
intel_pstate_driver_cleanup();
|
||||
|
||||
|
@ -2391,24 +2139,6 @@ static int __init intel_pstate_msrs_not_valid(void)
|
|||
return 0;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ACPI
|
||||
static void intel_pstate_use_acpi_profile(void)
|
||||
{
|
||||
switch (acpi_gbl_FADT.preferred_profile) {
|
||||
case PM_MOBILE:
|
||||
case PM_TABLET:
|
||||
case PM_APPLIANCE_PC:
|
||||
case PM_DESKTOP:
|
||||
case PM_WORKSTATION:
|
||||
pstate_funcs.update_util = intel_pstate_update_util;
|
||||
}
|
||||
}
|
||||
#else
|
||||
static void intel_pstate_use_acpi_profile(void)
|
||||
{
|
||||
}
|
||||
#endif
|
||||
|
||||
static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
|
||||
{
|
||||
pstate_funcs.get_max = funcs->get_max;
|
||||
|
@ -2420,8 +2150,6 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
|
|||
pstate_funcs.get_vid = funcs->get_vid;
|
||||
pstate_funcs.update_util = funcs->update_util;
|
||||
pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
|
||||
|
||||
intel_pstate_use_acpi_profile();
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ACPI
|
||||
|
|
Загрузка…
Ссылка в новой задаче