New HW support:
- New driver for the Nuvoton WPCM450 interrupt controller
- New driver for the IDT 79rc3243x interrupt controller
- Add support for interrupt trigger configuration to the MStar irqchip
- Add more external interrupt support to the STM32 irqchip
- Add new compatible strings for QCOM SC7280 to the qcom-pdc binding
Fixes and cleanups:
- Drop irq_create_strict_mappings() and irq_create_identity_mapping()
from the irqdomain API, with cleanups in a couple of drivers
- Fix nested NMI issue with spurious interrupts on GICv3
- Don't allow GICv4.1 vSGIs when the CPU doesn't support them
- Various cleanups and minor fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCD5kwPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDCWsQAL5yHXtApf4l3F0W99SJIooumrQh3UR6nENG
2WVR66g+MiuZ/JQcHAojdLQ6W6K9W8eTcY3hRNFCqlI1lrKffz6ovstuYg3Wphog
JX1gQYcpqt67WYtb/TVw3JM5D3NLU4XKPKZPhRzSHv5G9utI2QeAv13EBcPoHxZd
UBRAEdUrv90KIFDe2CxWo8B5ra07xfgOpDvlYYKlee+jQLtf6i4Kj7Tm0XoK3hoW
w0Mo//5r2SggdXfFLW1sm0BGs0bpJMSNixKCZWRfXbnZLAYIaBynSoLT9XoYT/uC
FDegtFZ9IG/5NXJ1d3Yl0RjsPp+iPUOOTq/5gAoXI0hRCLZ1f8G1IuDEoIf8ElOg
kxA1JpYE1fewxNt7oh48BAs3Qa3fdjJ1+k6gFlau4ctJBjxTHMz7v7lr7PmjhPz7
HgcmzFCu9Wb8pj1IDHMINkOMmAiQhgr3N0WK372wQyNE8Z8iB0ZeCYX9jAV5YTK6
eQdsDgNW18rv1ks/f7vzJw4EHRUM2tzSYimgf3oW+EJq6xKacMHfDMp9ERtHcnfJ
+4CCEEafrSOj/KsNpNnA7Bq3Qjh+RdRXDtCPsoGQ3LS1L5/JOaUoSmrCkWNNfXuZ
kUKTrNzopmMPvvwx6Q1YUypMbKCloNvlO3IgKalKNVP5drWA184abOIU2MGp+yI1
LAA8SFYU
=RqVj
-----END PGP SIGNATURE-----
Merge tag 'irqchip-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip and irqdomain updates from Marc Zyngier:
New HW support:
- New driver for the Nuvoton WPCM450 interrupt controller
- New driver for the IDT 79rc3243x interrupt controller
- Add support for interrupt trigger configuration to the MStar irqchip
- Add more external interrupt support to the STM32 irqchip
- Add new compatible strings for QCOM SC7280 to the qcom-pdc binding
Fixes and cleanups:
- Drop irq_create_strict_mappings() and irq_create_identity_mapping()
from the irqdomain API, with cleanups in a couple of drivers
- Fix nested NMI issue with spurious interrupts on GICv3
- Don't allow GICv4.1 vSGIs when the CPU doesn't support them
- Various cleanups and minor fixes
Link: https://lore.kernel.org/r/20210424094640.1731920-1-maz@kernel.org
reg->type is enforced by check_reg_type() and map should never be NULL
(it would already have been dereferenced anyway) so these checks are
unnecessary.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-3-revest@chromium.org
In check_bpf_snprintf_call(), a map_direct_value_addr() of the fmt map
should never fail because it has already been checked by
ARG_PTR_TO_CONST_STR. But if it ever fails, it's better to error out
with an explicit debug message rather than silently fail.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-2-revest@chromium.org
Add the missing TRAP_PERF case in siginfo_layout() for interpreting the
layout correctly as SIL_PERF_EVENT instead of just SIL_FAULT. This
ensures the si_perf field is copied and not just the si_addr field.
This was caught and tested by running the perf_events/sigtrap_threads
kselftest as a 32-bit binary with a 64-bit kernel.
Fixes: fb6cc127e0 ("signal: Introduce TRAP_PERF si_code and si_perf to siginfo")
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210422191823.79012-2-elver@google.com
Printing a 'long' variable using the '%d' format string is wrong
and causes a warning from gcc:
kernel/kcsan/kcsan_test.c: In function 'nthreads_gen_params':
include/linux/kern_levels.h:5:25: error: format '%d' expects argument of type 'int', but argument 3 has type 'long int' [-Werror=format=]
Use the appropriate format modifier.
Fixes: f6a1491403 ("kcsan: Switch to KUNIT_CASE_PARAM for parameterized tests")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210421135059.3371701-1-arnd@kernel.org
Vincent reported that for states with a NULL startup/teardown function
we do not call cpuhp_invoke_callback() (because there is none) and as
such we'll not update the cpu_dying() state.
The stale cpu_dying() can eventually lead to triggering BUG().
Rectify this by updating cpu_dying() in the exact same places the
hotplug machinery tracks its directional state, namely
cpuhp_set_state() and cpuhp_reset_state().
Reported-by: Vincent Donnefort <vincent.donnefort@arm.com>
Suggested-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YH7r+AoQEReSvxBI@hirez.programming.kicks-ass.net
The kthread_is_per_cpu() construct relies on only being called on
PF_KTHREAD tasks (per the WARN in to_kthread). This gives rise to the
following usage pattern:
if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
However, as reported by syzcaller, this is broken. The scenario is:
CPU0 CPU1 (running p)
(p->flags & PF_KTHREAD) // true
begin_new_exec()
me->flags &= ~(PF_KTHREAD|...);
kthread_is_per_cpu(p)
to_kthread(p)
WARN(!(p->flags & PF_KTHREAD) <-- *SPLAT*
Introduce __to_kthread() that omits the WARN and is sure to check both
values.
Use this to remove the problematic pattern for kthread_is_per_cpu()
and fix a number of other kthread_*() functions that have similar
issues but are currently not used in ways that would expose the
problem.
Notably kthread_func() is only ever called on 'current', while
kthread_probe_data() is only used for PF_WQ_WORKER, which implies the
task is from kthread_create*().
Fixes: ac687e6e8c ("kthread: Extract KTHREAD_IS_PER_CPU")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/YH6WJc825C4P0FCK@hirez.programming.kicks-ass.net
The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other mean).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.
Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupt disabled for the whole duration of the calls to
print_*_stats() and print_rq() which could last for the quite some time
if the information dump happens on the serial console.
If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actually serial console implementation of the
system.
The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't
need serialization from sched_debug_lock.
Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible because of the limited size of the kernel stack.
The solution implemented in this patch is to allow only one caller at a
time to use the full size group_path[], while other simultaneous callers
will have to use shorter stack buffers with the possibility of path
name truncation. A "..." suffix will be printed if truncation may have
happened. The cgroup path name is provided for informational purpose
only, so occasional path name truncation should not be a big problem.
Fixes: efe25c2c7b ("sched: Reinstate group names in /proc/sched_debug")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com
psi_group_cpu->tasks, represented by the unsigned int, stores the
number of tasks that could be stalled on a psi resource(io/mem/cpu).
Decrementing these counters at zero leads to wrapping which further
leads to the psi_group_cpu->state_mask is being set with the
respective pressure state. This could result into the unnecessary time
sampling for the pressure state thus cause the spurious psi events.
This can further lead to wrong actions being taken at the user land
based on these psi events.
Though psi_bug is set under these conditions but that just for debug
purpose. Fix it by decrementing the ->tasks count only when it is
non-zero.
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.
This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.
This feature (LATENCY_WARN) is default disabled.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
Masami added a wrapper to be able to unhash trace event pointers
as they are only read by root anyway, and they can also be extracted
by the raw trace data buffers. But this wrapper utilized the iterator
to have a temporary buffer to manipulate the text with.
tp_printk is a kernel command line option that will send the trace
output of a trace event to the console on boot up (useful when the
system crashes before finishing the boot). But the code used the same
wrapper that Masami added, and its iterator did not have a buffer,
and this caused the system to crash.
Have the wrapper just print the trace event normally if the iterator
has no temporary buffer.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYH8exRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlvdAP436vts0GOM8UM0y9IAPNrSG5OSvjNe
v5C5UOpnIxjlrgEArBtLLDaByLSBDQj+Vx0LmDW5uMOps49mo3A35TcIEAM=
=GfL1
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix tp_printk command line and trace events
Masami added a wrapper to be able to unhash trace event pointers as
they are only read by root anyway, and they can also be extracted by
the raw trace data buffers. But this wrapper utilized the iterator to
have a temporary buffer to manipulate the text with.
tp_printk is a kernel command line option that will send the trace
output of a trace event to the console on boot up (useful when the
system crashes before finishing the boot). But the code used the same
wrapper that Masami added, and its iterator did not have a buffer, and
this caused the system to crash.
Have the wrapper just print the trace event normally if the iterator
has no temporary buffer"
* tag 'trace-v5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix checking event hash pointer logic when tp_printk is enabled
cap_setfcap is required to create file capabilities.
Since commit 8db6c34f1d ("Introduce v3 namespaced file capabilities"),
a process running as uid 0 but without cap_setfcap is able to work
around this as follows: unshare a new user namespace which maps parent
uid 0 into the child namespace.
While this task will not have new capabilities against the parent
namespace, there is a loophole due to the way namespaced file
capabilities are represented as xattrs. File capabilities valid in
userns 1 are distinguished from file capabilities valid in userns 2 by
the kuid which underlies uid 0. Therefore the restricted root process
can unshare a new self-mapping namespace, add a namespaced file
capability onto a file, then use that file capability in the parent
namespace.
To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the
capability for setting file capabilities.
As a further wrinkle: a task can unshare its user namespace, then open
its uid_map file itself, and map (only) its own uid. In this case we do
not have the credential from before unshare, which was potentially more
restricted. So, when creating a user namespace, we record whether the
creator had CAP_SETFCAP. Then we can use that during map_write().
With this patch:
1. Unprivileged user can still unshare -Ur
ubuntu@caps:~$ unshare -Ur
root@caps:~# logout
2. Root user can still unshare -Ur
ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout
3. Root user without CAP_SETFCAP cannot unshare -Ur:
root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted
Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities. This approach can be seen at [1].
Background history: commit 95ebabde38 ("capabilities: Don't allow
writing ambiguous v3 file capabilities") tried to fix the issue by
preventing v3 fscaps to be written to disk when the root uid would map
to the same uid in nested user namespaces. This led to regressions for
various workloads. For example, see [2]. Ultimately this is a valid
use-case we have to support meaning we had to revert this change in
3b0c2d3eaa ("Revert 95ebabde38 ("capabilities: Don't allow writing
ambiguous v3 file capabilities")").
Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1]
Link: https://github.com/containers/buildah/issues/3071 [2]
Signed-off-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointers in events that are printed are unhashed if the flags allow it,
and the logic to do so is called before processing the event output from
the raw ring buffer. In most cases, this is done when a user reads one of
the trace files.
But if tp_printk is added on the kernel command line, this logic is done
for trace events when they are triggered, and their output goes out via
printk. The unhash logic (and even the validation of the output) did not
support the tp_printk output, and would crash.
Link: https://lore.kernel.org/linux-tegra/9835d9f1-8d3a-3440-c53f-516c2606ad07@nvidia.com/
Fixes: efbbdaa22b ("tracing: Show real address for trace event arguments")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When !CONFIG_NO_HZ_COMMON we get this new GCC warning:
kernel/sched/fair.c:8398:13: warning: ‘update_nohz_stats’ defined but not used [-Wunused-function]
Move update_nohz_stats() to an already existing CONFIG_NO_HZ_COMMON #ifdef
block.
Beyond fixing the GCC warning, this also simplifies the update_nohz_stats() function.
[ mingo: Rewrote the changelog. ]
Fixes: 0826530de3 ("sched/fair: Remove update of blocked load from newidle_balance")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210329144029.29200-1-yuehaibing@huawei.com
Verifier can constrain the min/max bounds of bpf_get_task_stack's return
value more tightly than the default tnum_unknown. Like bpf_get_stack,
return value is num bytes written into a caller-supplied buf, or error,
so do_refine_retval_range will work.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210416204704.2816874-2-davemarchevsky@fb.com
The implementation takes inspiration from the existing bpf_trace_printk
helper but there are a few differences:
To allow for a large number of format-specifiers, parameters are
provided in an array, like in bpf_seq_printf.
Because the output string takes two arguments and the array of
parameters also takes two arguments, the format string needs to fit in
one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
a zero-terminated read-only map so we don't need a format string length
arg.
Because the format-string is known at verification time, we also do
a first pass of format string validation in the verifier logic. This
makes debugging easier.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
This type provides the guarantee that an argument is going to be a const
pointer to somewhere in a read-only map value. It also checks that this
pointer is followed by a zero character before the end of the map value.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
Two helpers (trace_printk and seq_printf) have very similar
implementations of format string parsing and a third one is coming
(snprintf). To avoid code duplication and make the code easier to
maintain, this moves the operations associated with format string
parsing (validation and argument sanitization) into one generic
function.
The implementation of the two existing helpers already drifted quite a
bit so unifying them entailed a lot of changes:
- bpf_trace_printk always expected fmt[fmt_size] to be the terminating
NULL character, this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formating fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- argument casting on 32 bit has been simplified into one macro and
using an enum instead of obscure int increments.
- bpf_seq_printf now uses bpf_trace_copy_string instead of
strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.
- both were changed to use a global per-cpu tmp buffer instead of one
stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflict, these helpers disable
preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.
The implementation is also moved from bpf_trace.c to helpers.c because
the upcoming bpf_snprintf helper will be made available to all BPF
programs and will need it.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
This reverts commit 04c53de57c.
Nathan Chancellor points out that it should not have been merged into
mainline by itself. It was a fix for "gcov: use kvmalloc()", which is
still in -mm/-next. Merging it alone has broken the build.
Link: https://github.com/ClangBuiltLinux/continuous-integration2/runs/2384465683?check_suite_focus=true
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current Hardware events and Hardware cache events have special perf
types, PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. The two types don't
pass the PMU type in the user interface. For a hybrid system, the perf
subsystem doesn't know which PMU the events belong to. The first capable
PMU will always be assigned to the events. The events never get a chance
to run on the other capable PMUs.
Extend the two types to become PMU aware types. The PMU type ID is
stored at attr.config[63:32].
Add a new PMU capability, PERF_PMU_CAP_EXTENDED_HW_TYPE, to indicate a
PMU which supports the extended PERF_TYPE_HARDWARE and
PERF_TYPE_HW_CACHE.
The PMU type is only required when searching a specific PMU. The PMU
specific codes will only be interested in the 'real' config value, which
is stored in the low 32 bit of the event->attr.config. Update the
event->attr.config in the generic code, so the PMU specific codes don't
need to calculate it separately.
If a user specifies a PMU type, but the PMU doesn't support the extended
type, error out.
If an event cannot be initialized in a PMU specified by a user, error
out immediately. Perf should not try to open it on other PMUs.
The new PMU capability is only set for the X86 hybrid PMUs for now.
Other architectures, e.g., ARM, may need it as well. The support on ARM
may be implemented later separately.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-22-git-send-email-kan.liang@linux.intel.com
Commit 40607ee97e ("preempt/dynamic: Provide irqentry_exit_cond_resched()
static call") tried to provide irqentry_exit_cond_resched() static call
in irqentry_exit, but has a typo in macro conditional statement.
Fixes: 40607ee97e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210410073523.5493-1-zhouzhouyi@gmail.com
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
and bpf. BPF verifier changes stand out, otherwise things have
slowed down.
Current release - regressions:
- gro: ensure frag0 meets IP header alignment
- Revert "net: stmmac: re-init rx buffers when mac resume back"
- ethernet: macb: fix the restore of cmp registers
Previous releases - regressions:
- ixgbe: Fix NULL pointer dereference in ethtool loopback test
- ixgbe: fix unbalanced device enable/disable in suspend/resume
- phy: marvell: fix detection of PHY on Topaz switches
- make tcp_allowed_congestion_control readonly in non-init netns
- xen-netback: Check for hotplug-status existence before watching
Previous releases - always broken:
- bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window
- sctp: fix race condition in sctp_destroy_sock
- sit, ip6_tunnel: Unregister catch-all devices
- netfilter: nftables: clone set element expression template
- netfilter: flowtable: fix NAT IPv6 offload mangling
- net: geneve: check skb is large enough for IPv4/IPv6 header
- netlink: don't call ->netlink_bind with table lock held
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmB6aBQACgkQMUZtbf5S
Iruu2BAAqHKdB5Qd1iBGaA1md8f+elErsotzjONz+eh2yqDKRaOW84+Fo9TPKgu6
se0WmAY1HMUO3TEVdFeBsrgrs+bTY1E1OdfoZ39PFNkMdKMM80Ks1rn94nrPOohy
q1uoNxe9jjT3nRQBTKHWdB3ZC3Jetwf3LP7G2b8SoA+gNd9xl+b1H/drmv7WdE/n
pY7/GND7wd4qqidLRDgAaavaiGIdqym8V0bZEpz7cZtjT/U6RhjkBLKSB8JFGUxP
PQ1NFrYKmLDM1zYTSObLOrKUmEaWzPPSsXmWqGkCE4qjJ8euX0e+5EbxF98JHdYW
O+HMtdgr4UJGWAoxyGaxk7h9w0ydVyC1+Xgi6jAFWdXP7wgvXXQrldLnO44pX/6I
dYlIM+Br/5VmnKiS1i1gBUURREBRSEy7ZYxtREjGC7dFSUn9RPm+0s0x/DCRBS9/
MtNo0lCiuWsyaZ2v57aEKLX4YvGpilzg4UU3/45RNW6OnFzQubvjMBJPfap6EUAC
Ii8uUc/vX0Jq4nZVZzDZ7vlkRcJTQgUqKrzgamUuwJmyPqzefkDcbSZub3tM8G39
eetiHS1nqe3QwuP+TYM3MaBjw0bdgNz9Wt3xmY3Ehnf3pujMR5fbAsCbcdowV5/+
OI2ZcTUZculeAW2q9DgsOCtyS/1huwMHG0zO32TgadbFv45UCS0=
=LN+J
-----END PGP SIGNATURE-----
Merge tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.12-rc8, including fixes from netfilter, and
bpf. BPF verifier changes stand out, otherwise things have slowed
down.
Current release - regressions:
- gro: ensure frag0 meets IP header alignment
- Revert "net: stmmac: re-init rx buffers when mac resume back"
- ethernet: macb: fix the restore of cmp registers
Previous releases - regressions:
- ixgbe: Fix NULL pointer dereference in ethtool loopback test
- ixgbe: fix unbalanced device enable/disable in suspend/resume
- phy: marvell: fix detection of PHY on Topaz switches
- make tcp_allowed_congestion_control readonly in non-init netns
- xen-netback: Check for hotplug-status existence before watching
Previous releases - always broken:
- bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window
- sctp: fix race condition in sctp_destroy_sock
- sit, ip6_tunnel: Unregister catch-all devices
- netfilter: nftables: clone set element expression template
- netfilter: flowtable: fix NAT IPv6 offload mangling
- net: geneve: check skb is large enough for IPv4/IPv6 header
- netlink: don't call ->netlink_bind with table lock held"
* tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
netlink: don't call ->netlink_bind with table lock held
MAINTAINERS: update my email
bpf: Update selftests to reflect new error states
bpf: Tighten speculative pointer arithmetic mask
bpf: Move sanitize_val_alu out of op switch
bpf: Refactor and streamline bounds check into helper
bpf: Improve verifier error messages for users
bpf: Rework ptr_limit into alu_limit and add common error path
bpf: Ensure off_reg has no mixed signed bounds for all types
bpf: Move off_reg into sanitize_ptr_alu
bpf: Use correct permission flag for mixed signed bounds arithmetic
ch_ktls: do not send snd_una update to TCB in middle
ch_ktls: tcb close causes tls connection failure
ch_ktls: fix device connection close
ch_ktls: Fix kernel panic
i40e: fix the panic when running bpf in xdpdrv mode
net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
net/mlx5e: Fix setting of RS FEC mode
net/mlx5: Fix setting of devlink traps in switchdev mode
Revert "net: stmmac: re-init rx buffers when mac resume back"
...
The return value on success (>= 0) is overwritten by the return value of
put_old_timex32(). That works correct in the fault case, but is wrong for
the success case where put_old_timex32() returns 0.
Just check the return value of put_old_timex32() and return -EFAULT in case
it is not zero.
[ tglx: Massage changelog ]
Fixes: 3a4d44b616 ("ntp: Move adjtimex related compat syscalls to native counterparts")
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com
While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock. The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange is completed
successfully which isn’t ordered. This exposes the window between the
acquire and the cmpxchg to an A-B-A problem which allows reads
following the lock acquisition to observe values speculatively before
the write lock is truly acquired.
We've seen a problem in epoll where the reader does a xchg while
holding the read lock, but the writer can see a value change out from
under it.
Writer | Reader
--------------------------------------------------------------------------------
ep_scan_ready_list() |
|- write_lock_irq() |
|- queued_write_lock_slowpath() |
|- atomic_cond_read_acquire() |
| read_lock_irqsave(&ep->lock, flags);
--> (observes value before unlock) | chain_epi_lockless()
| | epi->next = xchg(&ep->ovflist, epi);
| | read_unlock_irqrestore(&ep->lock, flags);
| |
| atomic_cmpxchg_relaxed() |
|-- READ_ONCE(ep->ovflist); |
A core can order the read of the ovflist ahead of the
atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire
semantics addresses this issue at which point the atomic_cond_read can
be switched to use relaxed semantics.
Fixes: b519b56e37 ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi <alisaidi@amazon.com>
[peterz: use try_cmpxchg()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Steve Capper <steve.capper@arm.com>
CONFIG_SCHED_DEBUG is the build-time Kconfig knob, the boot param
sched_debug and the /debug/sched/debug_enabled knobs control the
sched_debug_enabled variable, but what they really do is make
SCHED_DEBUG more verbose, so rename the lot.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
With clang-11+, the code is broken due to my kvmalloc() conversion
(which predated the clang-11 support code) leaving one vmalloc() in
place. Fix that.
Link: https://lkml.kernel.org/r/20210412214210.6e1ecca9cdc5.I24459763acf0591d5e6b31c7e3a59890d802f79c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf 2021-04-17
The following pull-request contains BPF updates for your *net* tree.
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 8 files changed, 175 insertions(+), 111 deletions(-).
The main changes are:
1) Fix a potential NULL pointer dereference in libbpf's xsk
umem handling, from Ciara Loftus.
2) Mitigate a speculative oob read of up to map value size by
tightening the masking window, from Daniel Borkmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work tightens the offset mask we use for unprivileged pointer arithmetic
in order to mitigate a corner case reported by Piotr and Benedict where in
the speculative domain it is possible to advance, for example, the map value
pointer by up to value_size-1 out-of-bounds in order to leak kernel memory
via side-channel to user space.
Before this change, the computed ptr_limit for retrieve_ptr_limit() helper
represents largest valid distance when moving pointer to the right or left
which is then fed as aux->alu_limit to generate masking instructions against
the offset register. After the change, the derived aux->alu_limit represents
the largest potential value of the offset register which we mask against which
is just a narrower subset of the former limit.
For minimal complexity, we call sanitize_ptr_alu() from 2 observation points
in adjust_ptr_min_max_vals(), that is, before and after the simulated alu
operation. In the first step, we retieve the alu_state and alu_limit before
the operation as well as we branch-off a verifier path and push it to the
verification stack as we did before which checks the dst_reg under truncation,
in other words, when the speculative domain would attempt to move the pointer
out-of-bounds.
In the second step, we retrieve the new alu_limit and calculate the absolute
distance between both. Moreover, we commit the alu_state and final alu_limit
via update_alu_sanitation_state() to the env's instruction aux data, and bail
out from there if there is a mismatch due to coming from different verification
paths with different states.
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Benedict Schlueter <benedict.schlueter@rub.de>
Add a small sanitize_needed() helper function and move sanitize_val_alu()
out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu()
as well out of its opcode switch so this helps to streamline both.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Move the bounds check in adjust_ptr_min_max_vals() into a small helper named
sanitize_check_bounds() in order to simplify the former a bit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Consolidate all error handling and provide more user-friendly error messages
from sanitize_ptr_alu() and sanitize_val_alu().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Small refactor with no semantic changes in order to consolidate the max
ptr_limit boundary check.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
The mixed signed bounds check really belongs into retrieve_ptr_limit()
instead of outside of it in adjust_ptr_min_max_vals(). The reason is
that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer
types that we handle in retrieve_ptr_limit() and given errors from the latter
propagate back to adjust_ptr_min_max_vals() and lead to rejection of the
program, it's a better place to reside to avoid anything slipping through
for future types. The reason why we must reject such off_reg is that we
otherwise would not be able to derive a mask, see details in 9d7eceede7
("bpf: restrict unknown scalars of mixed signed bounds for unprivileged").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can
use off_reg for generalizing some of the checks for all pointer types.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
We forbid adding unknown scalars with mixed signed bounds due to the
spectre v1 masking mitigation. Hence this also needs bypass_spec_v1
flag instead of allow_ptr_leaks.
Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
If delayacct is disabled, then delayacct_is_task_waiting_on_io()
always returns false, which causes the statistical value to be
wrong. Perhaps tsk->in_iowait is better.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The broadcast device is switched to oneshot mode when the system switches
to oneshot mode. If a broadcast clock event device is registered after the
system switched to oneshot mode, it will stay in periodic mode forever.
Ensure that a late registered device which is selected as broadcast device
is initialized in oneshot mode when the system already uses oneshot mode.
[ tglx: Massage changelog ]
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210331083318.21794-1-jindong.yue@nxp.com
The function tick_check_replacement() is the combination of
tick_check_percpu() and tick_check_preferred(), but tick_check_new_device()
has the same logic open coded.
Use the helper to simplify the code.
[ tglx: Massage changelog ]
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210326022328.3266-1-wangwensheng4@huawei.com
The current sched_slice() seems to have issues; there's two possible
things that could be improved:
- the 'nr_running' used for __sched_period() is daft when cgroups are
considered. Using the RQ wide h_nr_running seems like a much more
consistent number.
- (esp) cgroups can slice it real fine, which makes for easy
over-scheduling, ensure min_gran is what the name says.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org
Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to
collect all the debugfs bits.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
Stop polluting sysctl with undocumented knobs that really are debug
only, move them all to /debug/sched/ along with the existing
/debug/sched_* files that already exist.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, it is inconsistent
to have the sysctl depend on it.
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.161151631@infradead.org
The ability to enable/disable NUMA balancing is not a debugging feature
and should not depend on CONFIG_SCHED_DEBUG. For example, machines within
a HPC cluster may disable NUMA balancing temporarily for some jobs and
re-enable it for other jobs without needing to reboot.
This patch removes the dependency on CONFIG_SCHED_DEBUG for
kernel.numa_balancing sysctl. The other numa balancing related sysctls
are left as-is because if they need to be tuned then it is more likely
that NUMA balancing needs to be fixed instead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210324133916.GQ15768@suse.de
Use the new cpu_dying() state to simplify and fix the balance_push()
vs CPU hotplug rollback state.
Specifically, we currently rely on notifiers sched_cpu_dying() /
sched_cpu_activate() to terminate balance_push, however if the
cpu_down() fails when we're past sched_cpu_deactivate(), it should
terminate balance_push at that point and not wait until we hit
sched_cpu_activate().
Similarly, when cpu_up() fails and we're going back down, balance_push
should be active, where it currently is not.
So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE
(when !cpu_active()), and gate it's utility with cpu_dying().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
Introduce a cpumask that indicates (for each CPU) what direction the
CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
an up fails and we do a roll-back down, it will accurately reflect the
direction.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
Adds bit perf_event_attr::sigtrap, which can be set to cause events to
send SIGTRAP (with si_code TRAP_PERF) to the task where the event
occurred. The primary motivation is to support synchronous signals on
perf events in the task where an event (such as breakpoints) triggered.
To distinguish perf events based on the event type, the type is set in
si_errno. For events that are associated with an address, si_addr is
copied from perf_sample_data.
The new field perf_event_attr::sig_data is copied to si_perf, which
allows user space to disambiguate which event (of the same type)
triggered the signal. For example, user space could encode the relevant
information it cares about in sig_data.
We note that the choice of an opaque u64 provides the simplest and most
flexible option. Alternatives where a reference to some user space data
is passed back suffer from the problem that modification of referenced
data (be it the event fd, or the perf_event_attr) can race with the
signal being delivered (of course, the same caveat applies if user space
decides to store a pointer in sig_data, but the ABI explicitly avoids
prescribing such a design).
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.net/
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
Adds bit perf_event_attr::remove_on_exec, to support removing an event
from a task on exec.
This option supports the case where an event is supposed to be
process-wide only, and should not propagate beyond exec, to limit
monitoring to the original process image only.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210408103605.1676875-5-elver@google.com
Adds bit perf_event_attr::inherit_thread, to restricting inheriting
events only if the child was cloned with CLONE_THREAD.
This option supports the case where an event is supposed to be
process-wide only (including subthreads), but should not propagate
beyond the current process's shared environment.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/YBvj6eJR%2FDY2TsEB@hirez.programming.kicks-ass.net/
As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up
handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children.
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lkml.kernel.org/r/20210408103605.1676875-3-elver@google.com
Make perf_event_exit_event() more robust, such that we can use it from
other contexts. Specifically the up and coming remove_on_exec.
For this to work we need to address a few issues. Remove_on_exec will
not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to
disable event_function_call() and we thus have to use
perf_remove_from_context().
When using perf_remove_from_context(), there's two races to consider.
The first is against close(), where we can have concurrent tear-down
of the event. The second is against child_list iteration, which should
not find a half baked event.
To address this, teach perf_remove_from_context() to special case
!ctx->is_active and about DETACH_CHILD.
[ elver@google.com: fix racing parent/child exit in sync_child_event(). ]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210408103605.1676875-2-elver@google.com
Currently, we start allocating AUX pages half the size of the total
requested AUX buffer size, ignoring the attr.aux_watermark setting. This,
in turn, makes intel_pt driver disregard the watermark also, as it uses
page order for its SG (ToPA) configuration.
Now, this can be fixed in the intel_pt PMU driver, but seeing as it's the
only one currently making use of high order allocations, there is no
reason not to fix the allocator instead. This way, any other driver
wishing to add this support would not have to worry about this.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210414154955.49603-2-alexander.shishkin@linux.intel.com
Commit ec9c82e03a ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.
Using copy_from_user() and clear_user() for 64bit values
is suboptimal.
We can use faster put_user() and get_user() on 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.dumazet@gmail.com
After commit 8f28177014 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()
Mathieu pointed out the same cleanup can be done
in rseq_syscall().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.dumazet@gmail.com
Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.
This removes one stac/clac pair on x86 in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.dumazet@gmail.com
The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite some of these
application scenarios a control tasks signals workers to start their
computations. There is usually only one signal per worker on flight. This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.
To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.
The task cache is not preallocated so the first signal sent to a task goes
always to the cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.
After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org
One concern leading to the original limitation was to avoid a large amount
of pointlessly cached sigqueues in alive tasks. The other concern was
vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.
The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are less than 10 sigqueues cached in ~1500 tasks.
In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.
If this should be limited then this would need an extra counter in struct
user, more atomic instructions and a seperate rlimit. Yet another tunable
which is mostly unused.
The caching is actually used. After boot and a full kernel compile on a
64CPU machine with make -j128 the number of 'allocations' looks like this:
From slab: 23996
From task cache: 52223
I.e. it reduces the number of slab cache operations by ~68%.
A typical pattern there is:
<...>-58490 __sigqueue_alloc: for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free: cache ffff8881132df460
<...>-58488 __sigqueue_alloc: for 1149 from cache ffff8881103dc550
bash-1149 exit_task_sighand: free ffff8881132df460
bash-1149 __sigqueue_free: cache ffff8881103dc550
The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit and bash sticks it back into it's own
cache. Lather, rinse and repeat.
The caching is probably not noticable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches are
usually just serving from the fast path the slab merging (default) can
depending on the usage pattern of the merged slabs cause occasional slow
path allocations.
The time spared per cached entry is a few micro seconds per signal which is
not relevant for e.g. a kernel build, but for signal heavy workloads it's
measurable.
As there is no real downside of this caching mechanism making it
unconditionally available is preferred over more conditional code or new
magic tunables.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
There is no point in having the conditional at the callsite.
Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
initialize sigqueue::flags.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
Add two new kdb environment access methods as kdb_setenv() and
kdb_printenv() in order to abstract out environment access code
from kdb command functions.
Also, replace (char *)0 with NULL as an initializer for environment
variables array.
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/1612771342-16883-1-git-send-email-sumit.garg@linaro.org
[daniel.thompson@linaro.org: Replaced (char *)0/NULL initializers with
an array size]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
allnoconfig_y is an ugly hack that sets a symbol to 'y' by allnoconfig.
allnoconfig does not mean a minimal set of CONFIG options because a
bunch of prompts are hidden by 'if EMBEDDED' or 'if EXPERT', but I do
not like to hack Kconfig this way.
Use the pre-existing feature, KCONFIG_ALLCONFIG, to provide a one
liner config fragment. CONFIG_EMBEDDED=y is still forced when
allnoconfig is invoked as a part of tinyconfig.
No change in the .config file produced by 'make tinyconfig'.
The output of 'make allnoconfig' will be changed; we will get
CONFIG_EMBEDDED=n because allnoconfig literally sets all symbols to n.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
An error path exited the function before freeing the allocated
"argv" variable.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYHY3LRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qigOAPwOvbUI9PQTW3hs16XHDGbgtzdzX6A7
kF7GlId5tXbZDwD/bW2gilFCjULCEPDuqsDy5EXrbZ7V7kulOfIw2e8CAQM=
=HwKu
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix a memory link in dyn_event_release().
An error path exited the function before freeing the allocated 'argv'
variable"
* tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/dynevent: Fix a memory leak in an error handling path
There is currently no way to discover the target of a tracing program
attachment after the fact. Add this information to bpf_link_info and return
it when querying the bpf_link fd.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.
The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Change hierachy to hierarchy and unrechable to unreachable,
no functionality changed.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull ARM cpufreq updates for v5.13 from Viresh Kumar:
"- Fix typos in s5pv210 cpufreq driver (Bhaskar Chowdhury).
- Armada 37xx: Fix cpufreq changing base CPU speed to 800 MHz from
1000 MHz (Pali Rohár and Marek Behún).
- cpufreq-dt: Return -EPROBE_DEFER on failure to add table (Quanyang
Wang).
- Minor cleanup in cppc driver (Tom Saeger).
- Add frequency invariance support for CPPC driver and generalize
freq invariance support arch-topology driver (Viresh Kumar)."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: armada-37xx: Fix module unloading
cpufreq: armada-37xx: Remove cur_frequency variable
cpufreq: armada-37xx: Fix determining base CPU frequency
cpufreq: armada-37xx: Fix driver cleanup when registration failed
clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
cpufreq: armada-37xx: Fix the AVS value for load L1
clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
cpufreq: armada-37xx: Fix setting TBG parent for load levels
cpufreq: dt: dev_pm_opp_of_cpumask_add_table() may return -EPROBE_DEFER
cpufreq: cppc: simplify default delay_us setting
cpufreq: Rudimentary typos fix in the file s5pv210-cpufreq.c
cpufreq: CPPC: Add support for frequency invariance
arch_topology: Export arch_freq_scale and helpers
arch_topology: Allow multiple entities to provide sched_freq_tick() callback
arch_topology: Rename freq_scale as arch_freq_scale
The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where func matches the queued
work item. This is a bit too coarse for some use cases. Add a
task_work_cancel_match() that allows to more specifically target
individual work items outside of purely the callback function used.
task_work_cancel() can be trivially implemented on top of that, hence do
so.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
fork() fails if signal_pending() is true, but there are two conditions
that can lead to that:
1) An actual signal is pending. We want fork to fail for that one, like
we always have.
2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
We don't need to make it fail for that case.
Allow fork() to proceed if just task_work is pending, by changing the
signal_pending() check to task_sigpending().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ambiguous/confusing kernel log message.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmBy4N4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hvVg/7BWV2CDlG2g19KASGgWQo53+JO40E5j1j
dG6h5u709TAYKtr2eckP0fC2Eah6Y3CXbZ2P2JMAwrO4KUETAyzyjlupFDBeiqap
SI6XX6alir6M0p6uy0UaEjzqbRJHSMJJKYPyA3YoewsQiKVCbYf6xYQJ0cA5VHaQ
ZcAof/lvmH7uMG3rDWbHDE3G1A67fK3FUp8AxR2i+7IAcG25q0Ov2gYWNOTM3Eh2
vSAuot/HPXIFjZDakfnx+iC2bIRZ3D6jgwlZNPIyzPFXB7A4+fyuFoJ8VbKyNRUa
38f0gQVYad7AnhriDjeaAngcPeHaBEWyWGnQXX99mZy7jqa/HbFZUI6Btxb+Ertg
rGZ7XyfOalZfaD6UBfc8Pr0fvn7Ci6chww5XO6F/A4P02lWwcQreNLMJQ8Jzujvf
FBADbHT7QvTm48JHJqfID/jNAp9/u2jjHq3I3B8k0gHkGsnVwliyHinxoaE8XyzG
vXDk/C0RvdywJBAz5H3VbR1Q6NeTCl/IIzGf4e7XjxH6Y3tyHeDJI3EdSojUrnGk
V74CzRnqsnC+kfvU92Ms5+daeRMOyctZTpOeIGSjD3AYxo7D7FqFTw3M1L2vnqDn
oe6qrUDLHNITPzouuDDGDS9c0aHtRydfPSEz3NH3WLgVyfJM6rgZ5BX9ItZKFgMR
ZCeoY21JXeA=
=y7/k
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixlets from Ingo Molnar:
"Two minor fixes: one for a Clang warning, the other improves an
ambiguous/confusing kernel log message"
* tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Address clang -Wformat warning printing for %hd
lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message
Pull RCU changes from Paul E. McKenney:
- Bitmap support for "N" as alias for last bit
- kvfree_rcu updates
- mm_dump_obj() updates. (One of these is to mm, but was suggested by Andrew Morton.)
- RCU callback offloading update
- Polling RCU grace-period interfaces
- Realtime-related RCU updates
- Tasks-RCU updates
- Torture-test updates
- Torture-test scripting updates
- Miscellaneous fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
note_interrupt() increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.
Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.
This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.
Cédric reports the result of his IPI throughput test:
Millions of IPIs/s
----------- --------------------------------------
upstream upstream patched
chips cpus default noirqdebug default (irqdebug)
----------- -----------------------------------------
1 0-15 4.061 4.153 4.084
0-31 7.937 8.186 8.158
0-47 11.018 11.392 11.233
0-63 11.460 13.907 14.022
2 0-79 8.376 18.105 18.084
0-95 7.338 22.101 22.266
0-111 6.716 25.306 25.473
0-127 6.223 27.814 28.029
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210402132037.574661-1-npiggin@gmail.com
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.
[ 148.133411][ T5509] =====================================================
[ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
[ 148.137819][ T5509]
[ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
[ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.143823][ T5509] =====================================================
Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: mm (kasan, gup, pagecache,
and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
kfence, x86: fix preemptible warning on KPTI-enabled systems
lib/test_kasan_module.c: suppress unused var warning
kasan: fix conflict with page poisoning
fs: direct-io: fix missing sdio->boundary
ia64: fix user_stack_pointer() for ptrace()
ocfs2: fix deadlock between setattr and dio_end_io_write
gcov: re-fix clang-11+ support
nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
mm/gup: check page posion status for coredump.
.mailmap: fix old email addresses
mailmap: update email address for Jordan Crouse
treewide: change my e-mail address, fix my name
MAINTAINERS: update CZ.NIC's Turris information
mac80211, wireless, and bpf trees. No scary regressions here
or in the works, but small fixes for 5.12 changes keep coming.
Current release - regressions:
- virtio: do not pull payload in skb->head
- virtio: ensure mac header is set in virtio_net_hdr_to_skb()
- Revert "net: correct sk_acceptq_is_full()"
- mptcp: revert "mptcp: provide subflow aware release function"
- ethernet: lan743x: fix ethernet frame cutoff issue
- dsa: fix type was not set for devlink port
- ethtool: remove link_mode param and derive link params
from driver
- sched: htb: fix null pointer dereference on a null new_q
- wireless: iwlwifi: Fix softirq/hardirq disabling in
iwl_pcie_enqueue_hcmd()
- wireless: iwlwifi: fw: fix notification wait locking
- wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding
the rtnl dependency
Current release - new code bugs:
- napi: fix hangup on napi_disable for threaded napi
- bpf: take module reference for trampoline in module
- wireless: mt76: mt7921: fix airtime reporting and related
tx hangs
- wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
config command
Previous releases - regressions:
- rfkill: revert back to old userspace API by default
- nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
- let skb_orphan_partial wake-up waiters
- xfrm/compat: Cleanup WARN()s that can be user-triggered
- vxlan, geneve: do not modify the shared tunnel info when PMTU
triggers an ICMP reply
- can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
- can: uapi: mark union inside struct can_frame packed
- sched: cls: fix action overwrite reference counting
- sched: cls: fix err handler in tcf_action_init()
- ethernet: mlxsw: fix ECN marking in tunnel decapsulation
- ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
- ethernet: i40e: fix receiving of single packets in xsk zero-copy
mode
- ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic
Previous releases - always broken:
- bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
- bpf: Refcount task stack in bpf_get_task_stack
- bpf, x86: Validate computation of branch displacements
- ieee802154: fix many similar syzbot-found bugs
- fix NULL dereferences in netlink attribute handling
- reject unsupported operations on monitor interfaces
- fix error handling in llsec_key_alloc()
- xfrm: make ipv4 pmtu check honor ip header df
- xfrm: make hash generation lock per network namespace
- xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
offload
- ethtool: fix incorrect datatype in set_eee ops
- xdp: fix xdp_return_frame() kernel BUG throw for page_pool
memory model
- openvswitch: fix send of uninitialized stack memory in ct limit
reply
Misc:
- udp: add get handling for UDP_GRO sockopt
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmBwyfAACgkQMUZtbf5S
IruJ/BAAnjghw2kWXRCKK3Tkm0pi0zjaKvTS30AcKCW2+GnqSxTdiWNv+mxqFgnm
YdduPKiGwLoDkA2i2d4EF8/HK6m+Q6bHcUbZ2npEm1ElkKfxCYGmocor8n2kD+a9
je94VGYV7zytnxXw85V6/jFLDqOXXwhBfHhlDMVBZP8OyzUfbDKGorWmyGuy9GJp
81bvzqN2bHUGIM0cDr+ol3eYw2ituGWgiqNfnq7z+/NVcYmD0EPChDRbp0jtH1ng
dcoONI6YlymDEDpu/9GmyKL1ken9lcWoVdvv/aDGtP62x6SYDt5HKe3wAtJ+Kjbq
jIPADxPx5BymYIZRBtdNR0rP66LycA7hDtM/C/h1WoihDXwpGeNUU4g0aJ+hsP5Q
ldwJI1DJo79VbwM2c3Kg73PaphLcPD4RdwF0/ovFsl0+bTDfj8i93ah4Wnzj0Qli
EMiSDEDNb51e9nkW+xu+FjLWmxHJvLOL/+VgHV5bPJJBob2fqnjAMj2PkPEuEtXY
TPWEh9y3zaEyp/9tNx0cstGOt6Gf5DQ5Nk6tX6hMpJT/BeL8mju1jm0yPLZhMJjF
LlTrJgXftfP/cjltdSm4aVqSU5okjHNYDhmHlNgvzih5mt+NVslRJfzwq62Vudqy
C0kpmVdQNFkOB0UcqQihevZg9mvem3m/dYl+v/MV7Uq6r4s4M2A=
=SHL0
-----END PGP SIGNATURE-----
Merge tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.12-rc7, including fixes from can, ipsec,
mac80211, wireless, and bpf trees.
No scary regressions here or in the works, but small fixes for 5.12
changes keep coming.
Current release - regressions:
- virtio: do not pull payload in skb->head
- virtio: ensure mac header is set in virtio_net_hdr_to_skb()
- Revert "net: correct sk_acceptq_is_full()"
- mptcp: revert "mptcp: provide subflow aware release function"
- ethernet: lan743x: fix ethernet frame cutoff issue
- dsa: fix type was not set for devlink port
- ethtool: remove link_mode param and derive link params from driver
- sched: htb: fix null pointer dereference on a null new_q
- wireless: iwlwifi: Fix softirq/hardirq disabling in
iwl_pcie_enqueue_hcmd()
- wireless: iwlwifi: fw: fix notification wait locking
- wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding the
rtnl dependency
Current release - new code bugs:
- napi: fix hangup on napi_disable for threaded napi
- bpf: take module reference for trampoline in module
- wireless: mt76: mt7921: fix airtime reporting and related tx hangs
- wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
config command
Previous releases - regressions:
- rfkill: revert back to old userspace API by default
- nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
- let skb_orphan_partial wake-up waiters
- xfrm/compat: Cleanup WARN()s that can be user-triggered
- vxlan, geneve: do not modify the shared tunnel info when PMTU
triggers an ICMP reply
- can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
- can: uapi: mark union inside struct can_frame packed
- sched: cls: fix action overwrite reference counting
- sched: cls: fix err handler in tcf_action_init()
- ethernet: mlxsw: fix ECN marking in tunnel decapsulation
- ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
- ethernet: i40e: fix receiving of single packets in xsk zero-copy
mode
- ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic
Previous releases - always broken:
- bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
- bpf: Refcount task stack in bpf_get_task_stack
- bpf, x86: Validate computation of branch displacements
- ieee802154: fix many similar syzbot-found bugs
- fix NULL dereferences in netlink attribute handling
- reject unsupported operations on monitor interfaces
- fix error handling in llsec_key_alloc()
- xfrm: make ipv4 pmtu check honor ip header df
- xfrm: make hash generation lock per network namespace
- xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
offload
- ethtool: fix incorrect datatype in set_eee ops
- xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory
model
- openvswitch: fix send of uninitialized stack memory in ct limit
reply
Misc:
- udp: add get handling for UDP_GRO sockopt"
* tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (182 commits)
net: fix hangup on napi_disable for threaded napi
net: hns3: Trivial spell fix in hns3 driver
lan743x: fix ethernet frame cutoff issue
net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
net: dsa: lantiq_gswip: Don't use PHY auto polling
net: sched: sch_teql: fix null-pointer dereference
ipv6: report errors for iftoken via netlink extack
net: sched: fix err handler in tcf_action_init()
net: sched: fix action overwrite reference counting
Revert "net: sched: bump refcount for new action in ACT replace mode"
ice: fix memory leak of aRFS after resuming from suspend
i40e: Fix sparse warning: missing error code 'err'
i40e: Fix sparse error: 'vsi->netdev' could be null
i40e: Fix sparse error: uninitialized symbol 'ring'
i40e: Fix sparse errors in i40e_txrx.c
i40e: Fix parameters in aq_get_phy_register()
nl80211: fix beacon head validation
bpf, x86: Validate computation of branch displacements for x86-32
bpf, x86: Validate computation of branch displacements for x86-64
...
LLVM changed the expected function signature for llvm_gcda_emit_function()
in the clang-11 release. Users of clang-11 or newer may have noticed
their kernels producing invalid coverage information:
$ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno
1 <func>: checksum mismatch, \
(<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>)
2 Invalid .gcda File!
...
Fix up the function signatures so calling this function interprets its
parameters correctly and computes the correct cfg checksum. In
particular, in clang-11, the additional checksum is no longer optional.
Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d
Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com
Reported-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass
group_smaller_max_cpu_capacity(<candidate group>, <local group>);
which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.
Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
Also note that comparing capacity extrema of local and source sched_group's
doesn't make much sense when at the day of the day the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().
This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.
Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Cleanup the LBF_DST_PINNED active balance special case.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.
In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).
It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.
Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:
DIE [ ]
MC [ ][ ]
0 1 2 3
L L B B
arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upgmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
Mel Gorman did some nice work in 9fe1f127b9 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.
The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.
This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.
On our web serving workload, that effect is usually negligible.
It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.
The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.
This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.
With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
static_call_update() had stronger type requirements than regular C,
relax them to match. Instead of requiring the @func argument has the
exact matching type, allow any type which C is willing to promote to the
right (function) pointer type. Specifically this allows (void *)
arguments.
This cleans up a bunch of static_call_update() callers for
PREEMPT_DYNAMIC and should get around silly GCC11 warnings for free.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YFoN7nCl8OfGtpeh@hirez.programming.kicks-ass.net
Here is the warning converted as error and reported by GCC:
kernel/static_call.c: In function ‘__static_call_update’:
kernel/static_call.c:153:18: error: unused variable ‘mod’ [-Werror=unused-variable]
153 | struct module *mod = site_mod->mod;
| ^~~
cc1: all warnings being treated as errors
make[1]: *** [scripts/Makefile.build:271: kernel/static_call.o] Error 1
This is simply because since recently, we no longer use 'mod' variable
elsewhere if MODULE is unset.
When using 'make tinyconfig' to generate the default kconfig, MODULE is
unset.
There are different ways to fix this warning. Here I tried to minimised
the number of modified lines and not add more #ifdef. We could also move
the declaration of the 'mod' variable inside the if-statement or
directly use site_mod->mod.
Fixes: 698bacefe9 ("static_call: Align static_call_is_init() patching condition")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210326105023.2058860-1-matthieu.baerts@tessares.net
With CONFIG_CFI_CLANG and ThinLTO, Clang appends a hash to the names
of all static functions not marked __used. This can break userspace
tools that don't expect the function name to change, so strip out the
hash from the output.
Suggested-by: Jack Pham <jackp@codeaurora.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-8-samitolvanen@google.com
With CONFIG_CFI_CLANG, a callback function passed to
__kthread_queue_delayed_work from a module points to a jump table
entry defined in the module instead of the one used in the core
kernel, which breaks function address equality in this check:
WARN_ON_ONCE(timer->function != ktead_delayed_work_timer_fn);
Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-7-samitolvanen@google.com
With CONFIG_CFI_CLANG, a callback function passed to
__queue_delayed_work from a module points to a jump table entry
defined in the module instead of the one used in the core kernel,
which breaks function address equality in this check:
WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-6-samitolvanen@google.com
This change adds support for Clang’s forward-edge Control Flow
Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
injects a runtime check before each indirect function call to ensure
the target is a valid function with the correct static type. This
restricts possible call targets and makes it more difficult for
an attacker to exploit bugs that allow the modification of stored
function pointers. For more details, see:
https://clang.llvm.org/docs/ControlFlowIntegrity.html
Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
visibility to possible call targets. Kernel modules are supported
with Clang’s cross-DSO CFI mode, which allows checking between
independently compiled components.
With CFI enabled, the compiler injects a __cfi_check() function into
the kernel and each module for validating local call targets. For
cross-module calls that cannot be validated locally, the compiler
calls the global __cfi_slowpath_diag() function, which determines
the target module and calls the correct __cfi_check() function. This
patch includes a slowpath implementation that uses __module_address()
to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
shadow map that speeds up module look-ups by ~3x.
Clang implements indirect call checking using jump tables and
offers two methods of generating them. With canonical jump tables,
the compiler renames each address-taken function to <function>.cfi
and points the original symbol to a jump table entry, which passes
__cfi_check() validation. This isn’t compatible with stand-alone
assembly code, which the compiler doesn’t instrument, and would
result in indirect calls to assembly code to fail. Therefore, we
default to using non-canonical jump tables instead, where the compiler
generates a local jump table entry <function>.cfi_jt for each
address-taken function, and replaces all references to the function
with the address of the jump table entry.
Note that because non-canonical jump table addresses are local
to each component, they break cross-module function address
equality. Specifically, the address of a global function will be
different in each module, as it's replaced with the address of a local
jump table entry. If this address is passed to a different module,
it won’t match the address of the same function taken there. This
may break code that relies on comparing addresses passed from other
components.
CFI checking can be disabled in a function with the __nocfi attribute.
Additionally, CFI can be disabled for an entire compilation unit by
filtering out CC_FLAGS_CFI.
By default, CFI failures result in a kernel panic to stop a potential
exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
kernel prints out a rate-limited warning instead, and allows execution
to continue. This option is helpful for locating type mismatches, but
should only be enabled during development.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
Change "occured" to "occurred" in kernel/power/autosleep.c.
Change "consiting" to "consisting" in kernel/power/snapshot.c.
Change "avaiable" to "available" in kernel/power/swap.c.
No functionality changed.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make the tick_nohz_get_next_hrtimer() kerneldoc comment state clearly
that the function may return negative numbers.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
bpf_preload_lock is already defined with DEFINE_MUTEX(). There is no
need to initialize it again. Remove the extraneous initialization.
Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210405194904.GA148013@LEGION