Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-06-21
We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 181 insertions(+), 15 deletions(-).
The main changes are:
1) Fix a verifier id tracking issue with scalars upon spill,
from Maxim Mikityanskiy.
2) Fix NULL dereference if an exception is generated while a BPF
subprogram is running, from Krister Johansen.
3) Fix a BTF verification failure when compiling kernel with LLVM_IAS=0,
from Florent Revest.
4) Fix expected_attach_type enforcement for kprobe_multi link,
from Jiri Olsa.
5) Fix a bpf_jit_dump issue for x86_64 to pick the correct JITed image,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
bpf/btf: Accept function names that contain dots
selftests/bpf: add a test for subprogram extables
bpf: ensure main program has an extable
bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable.
selftests/bpf: Add test cases to assert proper ID tracking on spill
bpf: Fix verifier id tracking of scalars on spill
====================
Link: https://lore.kernel.org/r/20230621101116.16122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We currently allow creating a perf link for a program with
expected_attach_type == BPF_TRACE_KPROBE_MULTI.
This will cause a crash when we call helpers like get_attach_cookie or
get_func_ip in such a program, because it will call the kprobe_multi
version (current->bpf_ctx context setup) of those helpers while it
expects the perf_link current->bpf_ctx context setup.
Make sure that we use the BPF_TRACE_KPROBE_MULTI expected_attach_type
only for programs attaching through the kprobe_multi link.
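A minimal sketch of the kind of check this implies (simplified; the
actual placement and variable names in the patch may differ):

	/* a kprobe program with the kprobe_multi attach type must only
	 * attach through the kprobe_multi link, so the helper bodies
	 * match the current->bpf_ctx setup at runtime */
	if (prog->expected_attach_type == BPF_TRACE_KPROBE_MULTI &&
	    attach_type != BPF_TRACE_KPROBE_MULTI)
		return -EINVAL;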
Fixes: ca74823c6e ("bpf: Add cookie support to programs attached with kprobe multi link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org
When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM
leaves DWARF tags for the "asan.module_ctor" and related symbols. In
turn, pahole creates BTF_KIND_FUNC entries for these, and this makes the
BTF metadata validation fail because they contain a dot.
In a dramatic turn of events, this BTF verification failure can cause
the netfilter_bpf initialization to fail, causing netfilter_core to
free the netfilter_helper hashmap and netfilter_ftp to trigger a
use-after-free. The risk of use-after-free in netfilter will be
addressed separately, but the existence of "asan.module_ctor" debug info
under some build conditions sounds like a good enough reason to accept
functions that contain dots in BTF.
Although using only LLVM=1 is the recommended way to compile clang-based
kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still
try to support that combination according to Nick. To clarify:
- On kernels > v5.10, LLVM=1 (LLVM_IAS=0 is not the default) is
  recommended, but users can still pass LLVM=1 LLVM_IAS=0 and trigger
  the issue
- On kernels <= v5.10, LLVM=1 (LLVM_IAS=0 is the default) is
  recommended, in which case GNU as will be used
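For illustration, a standalone sketch of an identifier check that
optionally tolerates dots (a hypothetical helper, not the kernel's
actual BTF code):

	#include <ctype.h>
	#include <stdbool.h>

	/* accept [A-Za-z_][A-Za-z0-9_]*, plus '.' when allow_dots is
	 * set, as needed for names like "asan.module_ctor" */
	static bool btf_name_valid(const char *name, bool allow_dots)
	{
		if (!name || !*name)
			return false;
		if (!isalpha((unsigned char)*name) && *name != '_')
			return false;
		for (name++; *name; name++) {
			if (isalnum((unsigned char)*name) || *name == '_')
				continue;
			if (allow_dots && *name == '.')
				continue;
			return false;
		}
		return true;
	}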
Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Yonghong Song <yhs@meta.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org
When subprograms are in use, the main program is not jit'd after the
subprograms because jit_subprogs sets a value for prog->bpf_func upon
success. Subsequent calls to the JIT are bypassed when this value is
non-NULL. This leads to a situation where the main program and its
func[0] counterpart are both in the bpf kallsyms tree, but only func[0]
has an extable. Extables are only created during JIT. Now there are
two nearly identical program ksym entries in the tree, but only one has
an extable. Depending upon how the entries are placed, there's a chance
that a fault will call search_extable on the aux with the NULL entry.
Since jit_subprogs already copies state from func[0] to the main
program, include the extable pointer in this state duplication.
Additionally, ensure that the copy of the main program in func[0] is not
added to the bpf_prog_kallsyms table. Instead, let the main program get
added later in bpf_prog_load(). This ensures there is only a single
copy of the main program in the kallsyms table, and that its tag matches
the tag observed by tooling like bpftool.
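In jit_subprogs terms, the duplication amounts to roughly this (field
names as in struct bpf_prog_aux; simplified from the actual patch):

	/* the main program reuses func[0]'s JITed image, so it must
	 * also reuse its extable */
	prog->aux->extable = func[0]->aux->extable;
	prog->aux->num_exentries = func[0]->aux->num_exentries;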
Cc: stable@vger.kernel.org
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. 14 are cc:stable and the remainder address issues which
were introduced during this development cycle or which were considered
inappropriate for a backport"
* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
zswap: do not shrink if cgroup may not zswap
page cache: fix page_cache_next/prev_miss off by one
ocfs2: check new file size on fallocate call
mailmap: add entry for John Keeping
mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
epoll: ep_autoremove_wake_function should use list_del_init_careful
mm/gup_test: fix ioctl fail for compat task
nilfs2: reject devices with insufficient block count
ocfs2: fix use-after-free when unmounting read-only filesystem
lib/test_vmalloc.c: avoid garbage in page array
nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
riscv/purgatory: remove PGO flags
powerpc/purgatory: remove PGO flags
x86/purgatory: remove PGO flags
kexec: support purgatories with .text.hot sections
mm/uffd: allow vma to merge as much as possible
mm/uffd: fix vma operation where start addr cuts part of vma
radix-tree: move declarations to header
nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7.
When uprevving LLVM, I realised that kexec stopped working on my test
platform.
The reason seems to be that, due to PGO, there are multiple .text
sections in the purgatory, and kexec does not support that.
This patch (of 4):
Clang16 links the purgatory text in two sections when PGO is in use:
[ 1] .text PROGBITS 0000000000000000 00000040
00000000000011a1 0000000000000000 AX 0 0 16
[ 2] .rela.text RELA 0000000000000000 00003498
0000000000000648 0000000000000018 I 24 1 8
...
[17] .text.hot. PROGBITS 0000000000000000 00003220
000000000000020b 0000000000000000 AX 0 0 1
[18] .rela.text.hot. RELA 0000000000000000 00004428
0000000000000078 0000000000000018 I 24 17 8
Both of them have their range [sh_addr ... sh_addr+sh_size] over the
area pointed to by e_entry.
This causes image->start to be calculated twice, once for .text and
again for .text.hot. The second calculation leaves image->start
in a random location.
Because of this, the system crashes immediately after:
kexec_core: Starting new kernel
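The problematic pattern is, in essence (a simplified sketch of the
relocation loop, not the exact kernel code): both .text and .text.hot.
satisfy the range check, so image->start is rebased twice:

	if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
	    pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
	    pi->ehdr->e_entry < sechdrs[i].sh_addr + sechdrs[i].sh_size) {
		/* rebase the entry point into the section's final
		 * location; doing this twice corrupts image->start */
		kbuf->image->start -= sechdrs[i].sh_addr;
		kbuf->image->start += kbuf->mem + offset;
	}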
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org
Fixes: 930457057a ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Philipp Rudo <prudo@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio bug fixes from Michael Tsirkin:
"A bunch of fixes all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
tools/virtio: use canonical ftrace path
vhost_vdpa: support PACKED when setting-getting vring_base
vhost: support PACKED when setting-getting vring_base
vhost: Fix worker hangs due to missed wake up calls
vhost: Fix crash during early vhost_transport_send_pkt calls
vhost_net: revert upend_idx only on retriable error
vhost_vdpa: tell vqs about the negotiated
vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister
tools/virtio: Add .gitignore for ringtest
tools/virtio: Fix arm64 ringtest compilation error
vduse: avoid empty string for dev name
vhost: use kzalloc() instead of kmalloc() followed by memset()
Merge tag 'cgroup-for-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix css_set reference leaks on fork failures
- Fix CPU hotplug locking in cgroup_transfer_tasks() which is used by
cgroup1 cpuset
- Doc update
* tag 'cgroup-for-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Documentation: Clarify usage of memory limits
cgroup: always put cset in cgroup_css_set_put_fork
cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
We can race when work has been added to the work_list but vhost_task_fn
has already passed that check and has not yet set itself to
TASK_INTERRUPTIBLE. wake_up_process() will then see the task in
TASK_RUNNING and just return.
This bug was introduced in commit f9010dbdce ("fork, vhost: Use
CLONE_THREAD to fix freezer/ps regression") when I moved the setting
of TASK_INTERRUPTIBLE to simplify the code and avoid get_signal from
logging warnings about being in the wrong state. This moves the setting
of TASK_INTERRUPTIBLE back to before we test if we need to stop the
task to avoid a possible race there as well. We then have vhost_worker
set TASK_RUNNING if it finds work, similar to before.
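Roughly, the fixed ordering looks like this sketch (simplified;
vhost_task_should_stop is an assumed name):

	for (;;) {
		/* set the state before checking for stop or work, so a
		 * wake_up_process() racing with new work is not lost */
		set_current_state(TASK_INTERRUPTIBLE);

		if (vhost_task_should_stop(vtsk)) {
			__set_current_state(TASK_RUNNING);
			break;
		}

		if (!vtsk->fn(vtsk->data))	/* worker found no work */
			schedule();		/* else it set TASK_RUNNING */
	}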
Fixes: f9010dbdce ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230607192338.6041-3-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The following scenario describes a bug in the verifier where it
incorrectly concludes that scalar IDs are equivalent, which could lead
to a verifier bypass in privileged mode:
1. Prepare a 32-bit rogue number.
2. Put the rogue number into the upper half of a 64-bit register, and
roll a random (unknown to the verifier) bit in the lower half. The
rest of the bits should be zero (although variations are possible).
3. Assign an ID to the register by MOVing it to another arbitrary
register.
4. Perform a 32-bit spill of the register, then perform a 32-bit fill to
another register. Due to a bug in the verifier, the ID will be
preserved, although the new register will contain only the lower 32
bits, i.e. all zeros except one random bit.
At this point there are two registers with different values but the same
ID, which means the integrity of the verifier state has been corrupted.
5. Compare the new 32-bit register with 0. In the branch where it's
equal to 0, the verifier will believe that the original 64-bit
register is also 0, because it has the same ID, but its actual value
still contains the rogue number in the upper half.
Some optimizations of the verifier prevent the actual bypass, so
extra care is needed: the comparison must be between two registers,
and both branches must be reachable (this is why one random bit is
needed). Both branches are still suitable for the bypass.
6. Right shift the original register by 32 bits to pop the rogue number.
7. Use the rogue number as an offset with any pointer. The verifier will
believe that the offset is 0, while in reality it's the given number.
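The whole sequence, as a BPF inline-assembly sketch in the style of the
verifier selftests (register choices are illustrative, r7 is assumed to
hold a map value pointer, and the __imm/__clobber_all macros come from
the selftests' bpf_misc.h; this is not the actual test):

	asm volatile (
	"call %[bpf_get_prandom_u32];"
	"r0 &= 1;"			/* one unknown bit (step 2) */
	"r1 = 0xcafe;"			/* 32-bit rogue number (step 1) */
	"r1 <<= 32;"			/* ...into the upper half */
	"r1 |= r0;"
	"r6 = r1;"			/* MOV assigns a shared ID (step 3) */
	"*(u32 *)(r10 - 8) = r1;"	/* 32-bit spill (step 4)... */
	"r2 = *(u32 *)(r10 - 8);"	/* ...and the fill keeps the ID */
	"if r2 != 0 goto l0_%=;"	/* step 5: here the verifier also */
	"r6 >>= 32;"			/* believes r6 == 0, yet r6 still */
	"r7 += r6;"			/* holds the rogue offset (6-7) */
	"l0_%=:"
	"r0 = 0;"
	"exit;"
	:: __imm(bpf_get_prandom_u32)
	: __clobber_all);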
The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for
SCALAR_VALUE. If the spill is narrowing the actual register value, don't
keep the ID, make sure it's reset to 0.
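That is, roughly (a simplified sketch of the spill-path hunk):

	/* a narrowing spill truncates the value, so the stack slot must
	 * not stay linked to the full-width register via the ID */
	if (size != BPF_REG_SIZE)
		state->stack[spi].spilled_ptr.id = 0;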
Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # Checked veristat delta
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230607123951.558971-2-maxtram95@gmail.com
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-06-07
We've added 7 non-merge commits during the last 7 day(s) which contain
a total of 12 files changed, 112 insertions(+), 7 deletions(-).
The main changes are:
1) Fix a use-after-free in BPF's task local storage, from KP Singh.
2) Make struct path handling more robust in bpf_d_path, from Jiri Olsa.
3) Fix a syzbot NULL-pointer dereference in sockmap, from Eric Dumazet.
4) UAPI fix for BPF_NETFILTER before final kernel ships,
from Florian Westphal.
5) Fix map-in-map array_map_gen_lookup code generation where elem_size was
not being set for inner maps, from Rhys Rustad-Elliott.
6) Fix sockopt_sk selftest's NETLINK_LIST_MEMBERSHIPS assertion,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add extra path pointer check to d_path helper
selftests/bpf: Fix sockopt_sk selftest
bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
selftests/bpf: Add access_inner_map selftest
bpf: Fix elem_size not being set for inner maps
bpf: Fix UAF in task local storage
bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()
====================
Link: https://lore.kernel.org/r/20230607220514.29698-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Anastasios reported a crash on a stable 5.15 kernel with the following
BPF program attached to an LSM hook:
SEC("lsm.s/bprm_creds_for_exec")
int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm)
{
struct path *path = &bprm->executable->f_path;
char p[128] = { 0 };
bpf_d_path(path, p, 128);
return 0;
}
But bprm->executable can be NULL, so the bpf_d_path call will crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
...
RIP: 0010:d_path+0x22/0x280
...
Call Trace:
<TASK>
bpf_d_path+0x21/0x60
bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99
bpf_trampoline_6442506293_0+0x55/0x1000
bpf_lsm_bprm_creds_for_exec+0x5/0x10
security_bprm_creds_for_exec+0x29/0x40
bprm_execve+0x1c1/0x900
do_execveat_common.isra.0+0x1af/0x260
__x64_sys_execve+0x32/0x40
It's a problem for all stable trees with the bpf_d_path helper, which
was added in 5.9.
This issue is fixed in the current bpf code, where we identify and mark
trusted pointers, so the above code would fail even to load.
For the sake of the stable trees, and to work around a potentially
broken verifier in the future, add code that reads the path object from
the passed pointer and verifies that it's valid in kernel space.
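Sketched, the added check amounts to this (simplified from the patch):

	struct path copy;

	/* the path pointer is expected to be trusted, but double check
	 * that it points to readable kernel memory before d_path() */
	if (copy_from_kernel_nofault(&copy, path, sizeof(*path)))
		return -EFAULT;
	p = d_path(&copy, buf, sz);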
Fixes: 6e22ab9da7 ("bpf: Add d_path helper")
Reported-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230606181714.532998-1-jolsa@kernel.org
Andrii Nakryiko writes:
And we currently don't have an attach type for NETLINK BPF link.
Thankfully it's not too late to add it. I see that link_create() in
kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
have done that. Instead we need to add BPF_NETLINK attach type to enum
bpf_attach_type. And wire all that properly throughout the kernel and
libbpf itself.
This adds BPF_NETFILTER and uses it. This breaks UAPI, but this
wasn't in any non-rc release yet, so it should be fine.
v2: check the link attach prog type in link_create too
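Sketched, the enforcement is of this shape (simplified; the exact
placement in kernel/bpf/syscall.c may differ):

	case BPF_PROG_TYPE_NETFILTER:
		/* netfilter programs must use the new attach type */
		if (attach_type != BPF_NETFILTER)
			return -EINVAL;
		break;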
Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
Merge tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Return NULL if the trace_probe list on trace_probe_event is empty
- selftests/ftrace: Choose testing symbol name for filtering feature
from sample data instead of fixed symbol
* tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Choose target function for filter test from samples
tracing/probe: trace_probe_primary_from_call(): checked list_first_entry
Commit d937bc3449 ("bpf: make uniform use of array->elem_size
everywhere in arraymap.c") changed array_map_gen_lookup to use
array->elem_size instead of round_up(map->value_size, 8) as the element
size when generating code to access a value in an array map.
array->elem_size, however, is not set by bpf_map_meta_alloc when
initializing a BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS.
This results in array_map_gen_lookup incorrectly outputting code that
always accesses index 0 in the array (as the index will be calculated
via a multiplication with the element size, which is incorrectly set to
0).
Set elem_size on the bpf_array object when allocating an array or hash
of maps to fix this.
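Sketched (simplified; helper placement and names may differ from the
patch):

	/* in bpf_map_meta_alloc(): mirror the inner array's elem_size so
	 * array_map_gen_lookup() no longer multiplies the index by 0 */
	if (inner_map->map_type == BPF_MAP_TYPE_ARRAY)
		container_of(inner_map_meta, struct bpf_array, map)->elem_size =
			container_of(inner_map, struct bpf_array, map)->elem_size;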
Fixes: d937bc3449 ("bpf: make uniform use of array->elem_size everywhere in arraymap.c")
Signed-off-by: Rhys Rustad-Elliott <me@rhysre.net>
Link: https://lore.kernel.org/r/20230602190110.47068-2-me@rhysre.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
When task local storage was generalized for tracing programs, the
bpf_task_local_storage callback was moved from a BPF LSM hook
callback for the security_task_free LSM hook to its own callback. But a
failure case in bad_fork_cleanup_security was missed which, when
triggered, led to a dangling task owner pointer and a subsequent
use-after-free. Move the bpf_task_storage_free to the very end of
free_task to handle all failure cases.
This issue was noticed when a BPF LSM program was attached to the
task_alloc hook on a kernel with KASAN enabled. The program used
bpf_task_storage_get to copy the task local storage from the current
task to the new task being created.
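In kernel/fork.c terms, the move is roughly (simplified):

	void free_task(struct task_struct *tsk)
	{
		...
		/* last, so every bad_fork_cleanup_* path that ends up in
		 * free_task has already run, and no dangling owner
		 * pointer can survive */
		bpf_task_storage_free(tsk);
		free_task_struct(tsk);
	}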
Fixes: a10787e6d5 ("bpf: Enable task local storage for tracing programs")
Reported-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602002612.1117381-1-kpsingh@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Merge tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules fix from Luis Chamberlain:
"A zstd fix by lucas as he tested zstd decompression support"
* tag 'modules-6.4-rc5-second-pull' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module/decompress: Fix error checking on zstd decompression
While implementing support for in-kernel decompression in kmod,
finit_module() was returning a very suspicious value:
finit_module(3, "", MODULE_INIT_COMPRESSED_FILE) = 18446744072717407296
It turns out the check for module_get_next_page() failing is wrong,
and hence the decompression was not really taking place. Invert
the condition to fix it.
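The inverted check, in essence (simplified; module_get_next_page()
returns an ERR_PTR on failure):

	page = module_get_next_page(info);
	if (IS_ERR(page)) {		/* was: if (!IS_ERR(page)) */
		retval = PTR_ERR(page);
		goto out;
	}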
Fixes: 169a58ad82 ("module/decompress: Support zstd in-kernel decompression")
Cc: stable@kernel.org
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
When switching from kthreads to vhost_tasks, two bugs were added:
1. The vhost worker tasks now show up as processes, so scripts doing ps
   or ps a would now incorrectly detect the vhost task as another
   process.
2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost tasks
   didn't disable it or add support for freezing.
To fix both bugs, this switches the vhost task to be a thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND, which requires those two signals to be supported.
This is a modified version of the patch written by Mike Christie
<michael.christie@oracle.com>, which was a modified version of a patch
originally written by Linus.
Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER,
including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.
The vhost_task abstraction has been tidied up so that the definition of
vhost_task only needs to be visible inside of vhost_task.c, making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn. vhost_worker now returns true if work was done.
The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit. This collection clears
__fatal_signal_pending. This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.
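Condensed, the signal handling added to the worker loop is shaped like
this (a simplified sketch, not the exact kernel code):

	if (!dead && signal_pending(current)) {
		struct ksignal ksig;

		/* handles SIGSTOP/freezing and collects the fatal signal
		 * sent at process exit; it clears __fatal_signal_pending
		 * but not necessarily signal_pending(), hence the
		 * explicit clear so the later schedule() sleeps */
		dead = get_signal(&ksig);
		if (dead)
			clear_thread_flag(TIF_SIGPENDING);
	}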
For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file. To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec, the
coredump code and de_thread have been modified to ignore vhost threads.
Removing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.
Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.
Fixes: 6e890c5d50 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Co-developed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers of trace_probe_primary_from_call() check that the return
value is non-NULL. However, the function returns
list_first_entry(&tpe->probes, ...), which can never be NULL.
Additionally, it does not check whether the list is possibly empty,
possibly causing a type confusion on empty lists.
Use list_first_entry_or_null(), which solves both problems.
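That is, in essence:

	/* yields NULL for an empty list instead of a bogus pointer */
	return list_first_entry_or_null(&tpe->probes,
					struct trace_probe, list);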
Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Merge tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules fix from Luis Chamberlain:
"A fix is provided for ia64. Even though ia64 is on life support it
helps to fix issues if we can. Thanks to Linus for doing tons of the
ia64 debugging"
* tag 'modules-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: fix module load for ia64
Merge tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"User events:
- Use long instead of int for storing the enable set/clear bit, as it
was found that big endian machines could end up using the wrong
bits.
- Split allocating mm and attaching it. This keeps the allocation
separate from the registration and avoids various races.
- Remove RCU locking around pin_user_pages_remote() as that can
schedule. The RCU protection is no longer needed with the above
split of mm allocation and attaching.
- Rename the "link" fields of the various structs to something more
meaningful.
- Add comments around user_event_mm struct usage and locking
requirements.
Timerlat tracer:
- Fix missed wakeup of timerlat thread caused by the timerlat
interrupt triggering when tracing is off. The timer interrupt
handler needs to always wake up the timerlat thread regardless if
tracing is enabled or not, otherwise, it will never wake up.
Histograms:
- Fix regression of breaking the "stacktrace" modifier for variables.
That modifier cannot be used for values, but can be used for
variables that are passed from one histogram to the next. This was
broken when adding the restriction to values as the variable logic
used the same code.
- Rename the special field "stacktrace" to "common_stacktrace".
Special fields (that are not actually part of the event, but can
act just like event fields, like 'comm' and 'timestamp') should be
prefixed with 'common_' for consistency. To keep backward
compatibility, 'stacktrace' can still be used (as with the special
field 'cpu'), but can be overridden if the event has a field called
'stacktrace'.
- Update the synthetic event selftests to use the new name (synthetic
events are created by histograms)
Tracing bootup selftests:
- Reorganize the code to keep artifacts of the selftests not compiled
in when selftests are not configured.
- Add various cond_resched() around the selftest code, as the
softlock watchdog was triggering much more often. It appears that
the kernel runs slower now with full debugging enabled.
- While debugging ftrace with ftrace (using an instance ring buffer
instead of the top level one), I found that the selftests were
disabling prints to the debug instance.
This should not happen, as the selftests only disable printing to
the main buffer as the selftests examine the main buffer to see if
it has what it expects, and prints can make the tests fail.
Make the selftests only disable printing to the toplevel buffer,
and leave the instance buffers alone"
* tag 'trace-v6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have function_graph selftest call cond_resched()
tracing: Only make selftest conditionals affect the global_trace
tracing: Make tracing_selftest_running/delete nops when not used
tracing: Have tracer selftests call cond_resched() before running
tracing: Move setting of tracing_selftest_running out of register_tracer()
tracing/selftests: Update synthetic event selftest to use common_stacktrace
tracing: Rename stacktrace field to common_stacktrace
tracing/histograms: Allow variables to have some modifiers
tracing/user_events: Document user_event_mm one-shot list usage
tracing/user_events: Rename link fields for clarity
tracing/user_events: Remove RCU lock while pinning pages
tracing/user_events: Split up mm alloc and attach
tracing/timerlat: Always wakeup the timerlat thread
tracing/user_events: Use long vs int for atomic bit ops
This reverts commit 9828ed3f69.
Sadly, it does seem to cause failures to load modules. Johan Hovold reports:
"This change breaks module loading during boot on the Lenovo Thinkpad
X13s (aarch64).
Specifically it results in indefinite probe deferral of the display
and USB (ethernet) which makes it a pain to debug. Typing in the dark
to acquire some logs reveals that other modules are missing as well"
Since this was applied late as a "let's try this", I'm reverting it
asap, and we can try to figure out what goes wrong later. The excessive
parallel module loading problem is annoying, but not noticeable in
normal situations, and this was only meant as an optimistic workaround
for a user-space bug.
One possible solution may be to do the optimistic exclusive open first,
and then use a lock to serialize loading if that fails.
Reported-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/lkml/ZHRpH-JXAxA6DnzR@hovoldconsulting.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When all kernel debugging is enabled (lockdep, KASAN, etc), the function
graph enabling and disabling can take several seconds to complete. The
function_graph selftest enables and disables function graph tracing
several times. With full debugging enabled, the soft lockup watchdog was
triggering because the selftest was running without ever scheduling.
Add cond_resched() throughout the test to make sure it does not trigger
the soft lockup detector.
Link: https://lkml.kernel.org/r/20230528051742.1325503-6-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_selftest_running and tracing_selftest_disabled variables were
to keep trace_printk() and other writes from affecting the tracing
selftests, as the tracing selftests would examine the ring buffer to see
if it contained what it expected or not. trace_printk() and friends could
add to the ring buffer and cause the selftests to fail (and then disable
the tracer that was being tested). To keep that from happening, these
variables were added and would keep trace_printk() and friends from
writing to the ring buffer while the tests were going on.
But this was only the top level ring buffer (owned by the global_trace
instance). There is no reason to prevent writing into ring buffers of
other instances via the trace_array_printk() and friends. For the
functions that could be used by other instances, check if the global_trace
is the tracer instance that is being written to before deciding to not
allow the write.
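The shape of that check (a simplified sketch):

	/* only the top-level buffer is examined by the selftests, so
	 * only writes to it need to be suppressed */
	if (tracing_selftest_running && tr == &global_trace)
		return 0;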
Link: https://lkml.kernel.org/r/20230528051742.1325503-5-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's no reason to test the condition variables tracing_selftest_running
or tracing_selftest_delete when tracing selftests are not enabled. Make
them be defined as 0 when the selftests are not configured in.
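That is, something along these lines (sketch; shown for one of the two
variables):

	#ifdef CONFIG_FTRACE_STARTUP_TEST
	extern bool tracing_selftest_running;
	#else
	#define tracing_selftest_running	0
	#endif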
Link: https://lkml.kernel.org/r/20230528051742.1325503-4-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As there are more and more internal selftests being added to the Linux
kernel (KASAN, lockdep, etc), the selftests are taking longer to run when
these are enabled. Add a cond_resched() to the calling of
do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set,
otherwise the soft lockup watchdog may trigger on boot up.
Link: https://lkml.kernel.org/r/20230528051742.1325503-3-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The variables tracing_selftest_running and tracing_selftest_disabled are
only used for when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only
visible within the selftest code. The setting of those variables are in
the register_tracer() call, and set in a location where they do not need
to be. Create a wrapper around run_tracer_selftest() called
do_run_tracer_selftest() which sets those variables, and have
register_tracer() call that instead.
Having those variables only set within the CONFIG_FTRACE_STARTUP_TEST
scope gets rid of them (and also the ability to remove testing against
them) when the startup tests are not enabled (most cases).
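A sketch of the wrapper as described (simplified):

	static int do_run_tracer_selftest(struct tracer *type)
	{
		int ret;

		tracing_selftest_running = true;
		ret = run_tracer_selftest(type);
		tracing_selftest_running = false;

		return ret;
	}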
Link: https://lkml.kernel.org/r/20230528051742.1325503-2-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fixes from Thomas Gleixner:
"Two fixes for debugobjects:
- Prevent the allocation path from waking up kswapd.
That's a long standing issue due to the GFP_ATOMIC allocation flag.
As debug objects can be invoked from pretty much any context waking
kswapd can end up in arbitrary lock chains versus the waitqueue
lock
- Correct the explicit lockdep wait-type violation in
debug_object_fill_pool()"
* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Don't wake up kswapd from fill_pool()
debugobjects,locking: Annotate debug_object_fill_pool() wait type violation
Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a double free fix in the Xen pvcalls backend driver
- a fix for a regression causing the MSI related sysfs entries to not
being created in Xen PV guests
- a fix in the Xen blkfront driver for handling insane input data
better
* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pci/xen: populate MSI sysfs entries
xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
xen/blkfront: Only check REQ_FUA for writes
It turns out that udev under certain circumstances will concurrently try
to load the same modules over-and-over excessively. This isn't a kernel
bug, but it ends up affecting the kernel, to the point that under
certain circumstances we can fail to boot, because the kernel uses a lot
of memory to read all the module data all at once.
Note that it isn't a memory leak, it's just basically a thundering herd
problem happening at bootup with a lot of CPUs, with the worst cases
then being pretty bad.
Admittedly the worst situations are somewhat contrived: lots and lots of
CPUs, not a lot of memory, and KASAN enabled to make it all slower and
as such (unintentionally) exacerbate the problem.
Luis explains: [1]
"My best assessment of the situation is that each CPU in udev ends up
triggering a load of duplicate set of modules, not just one, but *a
lot*. Not sure what heuristics udev uses to load a set of modules per
CPU."
Petr Pavlu chimes in: [2]
"My understanding is that udev workers are forked. An initial kmod
context is created by the main udevd process but no sharing happens
after the fork. It means that the mentioned memory pool logic doesn't
really kick in.
Multiple parallel load requests come from multiple udev workers, for
instance, each handling an udev event for one CPU device and making
the exactly same requests as all others are doing at the same time.
The optimization idea would be to recognize these duplicate requests
at the udevd/kmod level and converge them"
Note that module loading has tried to mitigate this issue before, see
for example commit 064f4536d1 ("module: avoid allocation if module is
already present and ready"), which has a few ASCII graphs on memory use
due to this same issue.
However, while that noticed that the module was already loaded, and
exited with an error early before spending any more time on setting up
the module, it didn't handle the case of multiple concurrent module
loads all being active - but not complete - at the same time.
Yes, one of them will eventually win the race and finalize its copy, and
the others will then notice that the module already exists and error
out, but while this all happens, we have tons of unnecessary concurrent
work being done.
Again, the real fix is for udev to not do that (maybe it should use
threads instead of fork, and have actual shared data structures and not
cause duplicate work). That real fix is apparently not trivial.
But it turns out that the kernel already has a pretty good model for
dealing with concurrent access to the same file: the i_writecount of the
inode.
In fact, the module loading already indirectly uses 'i_writecount',
because 'kernel_read_file()' will in fact do
ret = deny_write_access(file);
if (ret)
return ret;
...
allow_write_access(file);
around the read of the file data. We do not allow concurrent writes to
the file, and return -ETXTBUSY if the file was open for writing at the
same time as the module data is loaded from it.
And the solution to the reader concurrency problem is to simply extend
this "no concurrent writers" logic to simply be "exclusive access".
Note that "exclusive" in this context isn't really some absolute thing:
it's only exclusion from writers and from other "special readers" that
do this writer denial. So we simply introduce a variation of that
"deny_write_access()" logic that not only denies write access, but also
requires that this is the _only_ such access that denies write access.
Which means that you can't start loading a module that is already being
loaded as a module by somebody else, or you will get the same -ETXTBSY
error that you would get if there were writers around.
[ It also means that you can't try to load a currently executing
executable as a module, for the same reason: executables do that same
"deny_write_access()" thing, and that's obviously where the whole
ETXTBSY logic traditionally came from.
This is not a problem for kernel modules, since the set of normal
executable files and kernel module files is entirely disjoint. ]
This new function is called "exclusive_deny_write_access()", and the
implementation is trivial, in that it's just an atomic decrement of
i_writecount if it was 0 before.
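As a sketch (simplified; the real helper would live next to
deny_write_access() in the VFS headers):

	static inline int exclusive_deny_write_access(struct file *file)
	{
		int old = 0;

		/* succeed only if i_writecount was exactly 0: no writers
		 * and no other write-denying readers */
		return atomic_try_cmpxchg(&file_inode(file)->i_writecount,
					  &old, -1) ? 0 : -ETXTBSY;
	}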
To use that new exclusivity check, all we then do is wrap the module
loading with that exclusive_deny_write_access() / allow_write_access()
pair. The actual patch is a bit bigger than that, because we want to
surround not just the "load file data" part, but the whole module setup,
to get maximum exclusion.
So this ends up splitting up "finit_module()" into a few helper
functions to make it all very clear and legible.
In Luis' test-case (bringing up 255 vCPUs in a virtual machine [3]),
the "wasted vmalloc" space (i.e. module data read into a vmalloc'ed area
in order to be loaded as a module, but then discarded because somebody
else loaded the same module instead) dropped from 1.8GiB to 474kB. Yes,
that's gigabytes to kilobytes.
It doesn't drop completely to zero, because even with this change, you
can still end up having completely serial pointless module loads, where
one udev process has loaded a module fully (and thus the kernel has
released that exclusive lock on the module file), and then another udev
process tries to load the same module again.
So while we cannot fully get rid of the fundamental bug in user space,
we _can_ get rid of the excessive concurrent thundering herd effect.
A couple of final side notes on this all:
- This tweak only affects the "finit_module()" system call, which gives
the kernel a file descriptor with the module data.
You can also just feed the module data as raw data from user space
with "init_module()" (note the lack of 'f' at the beginning), and
obviously for that case we do _not_ have any "exclusive read" logic.
So if you absolutely want to do things wrong in user space, and try
to load the same module multiple times, and error out only later when
the kernel ends up saying "you can't load the same module name
twice", you can still do that.
And in fact, some distros will do exactly that, because they will
uncompress the kernel module data in user space before feeding it to
the kernel (mainly because they haven't started using the new kernel
side decompression yet).
So this is not some absolute "you can't do concurrent loads of the
same module". It's literally just a very simple heuristic that will
catch it early in case you try to load the exact same module file at
the same time, and in that case avoid a potentially nasty situation.
- There is another user of "deny_write_access()": the verity code that
enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).
If you use fs-verity and you care about verifying the kernel modules
(which does make sense), you should do it *before* loading said
kernel module. That may sound obvious, but now the implementation
basically requires it. Because if you try to do it concurrently, the
kernel may refuse to load the module file that is being set up by the
fs-verity code.
- This all will obviously mean that if you insist on loading the same
module in parallel, only one module load will succeed, and the others
will return with an error.
That was true before too, but what is different is that the -ETXTBSY
error can be returned *before* the success case of another process
fully loading and instantiating the module.
Again, that might sound obvious, and it is indeed the whole point of
the whole change: we are much quicker to notice the whole "you're
already in the process of loading this module".
So it's very much intentional, but it does mean that if you just
spray the kernel with "finit_module()", and expect that the module is
immediately loaded afterwards without checking the return value, you
are doing something horribly horribly wrong.
I'd like to say that that would never happen, but the whole _reason_
for this commit is that udev is currently doing something horribly
horribly wrong, so ...
Link: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ [1]
Link: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ [2]
Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%2FAAUi5z@bombadil.infradead.org/ [3]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth and bpf.
Current release - regressions:
- net: fix skb leak in __skb_tstamp_tx()
- eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
Current release - new code bugs:
- handshake:
- fix sock->file allocation
- fix handshake_dup() ref counting
- bluetooth:
- fix potential double free caused by hci_conn_unlink
- fix UAF in hci_conn_hash_flush
Previous releases - regressions:
- core: fix stack overflow when LRO is disabled for virtual
interfaces
- tls: fix strparser rx issues
- bpf:
- fix many sockmap/TCP related issues
- fix a memory leak in the LRU and LRU_PERCPU hash maps
- init the offload table earlier
- eth: mlx5e:
- do as little as possible in napi poll when budget is 0
- fix using eswitch mapping in nic mode
- fix deadlock in tc route query code
Previous releases - always broken:
- udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()
- raw: fix output xfrm lookup wrt protocol
- smc: reset connection when trying to use SMCRv2 fails
- phy: mscc: enable VSC8501/2 RGMII RX clock
- eth: octeontx2-pf: fix TSOv6 offload
- eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"
* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
net: phy: mscc: enable VSC8501/2 RGMII RX clock
net: phy: mscc: remove unnecessary phydev locking
net: phy: mscc: add support for VSC8501
net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
net/handshake: Enable the SNI extension to work properly
net/handshake: Unpin sock->file if a handshake is cancelled
net/handshake: handshake_genl_notify() shouldn't ignore @flags
net/handshake: Fix uninitialized local variable
net/handshake: Fix handshake_dup() ref counting
net/handshake: Remove unneeded check from handshake_dup()
ipv6: Fix out-of-bounds access in ipv6_find_tlv()
net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
docs: netdev: document the existence of the mail bot
net: fix skb leak in __skb_tstamp_tx()
r8169: Use a raw_spinlock_t for the register locks.
page_pool: fix inconsistency for page_pool_ring_[un]lock()
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
...
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-05-24
We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).
The main changes are:
1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
from John Fastabend.
2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
from Anton Protopopov.
3) Init the BPF offload table earlier than just late_initcall,
from Jakub Kicinski.
4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
from Will Deacon.
5) Remove a now unsupported __fallthrough in BPF samples,
from Andrii Nakryiko.
6) Fix a typo in pkg-config call for building sign-file,
from Jeremy Sowden.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
bpf, sockmap: Build helper to create connected socket pair
bpf, sockmap: Pull socket helpers out of listen test for general use
bpf, sockmap: Incorrectly handling copied_seq
bpf, sockmap: Wake up polling after data copy
bpf, sockmap: TCP data stall on recv before accept
bpf, sockmap: Handle fin correctly
bpf, sockmap: Improved check for empty queue
bpf, sockmap: Reschedule is now done through backlog
bpf, sockmap: Convert schedule_work into delayed_work
bpf, sockmap: Pass skb ownership through read_skb
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
samples/bpf: Drop unnecessary fallthrough
bpf: netdev: init the offload table earlier
selftests/bpf: Fix pkg-config call building sign-file
====================
Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit bf5e758f02 ("genirq/msi: Simplify sysfs handling") reworked the
creation of sysfs entries for MSI IRQs. The creation used to be in
msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs.
Then it moved into __msi_domain_alloc_irqs(), which is one implementation of
domain_alloc_irqs. However, Xen comes with the only other implementation
of domain_alloc_irqs and hence no longer runs the sysfs population code.
Commit 6c796996ee ("x86/pci/xen: Fixup fallout from the PCI/MSI
overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info,
but that doesn't actually have an effect because Xen uses its own
domain_alloc_irqs implementation.
Fix this by making use of the fallback functions for sysfs population.
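As a rough illustration of the idea (the wrapper shape and the
xen_do_alloc_irqs() name below are assumptions for the sketch, not the
actual patch), Xen's own allocation path calls the sysfs fallback helper
itself, since it never goes through __msi_domain_alloc_irqs():

static int xen_msi_domain_alloc_irqs(struct irq_domain *domain,
				     struct device *dev, int nvec)
{
	int ret;

	ret = xen_do_alloc_irqs(domain, dev, nvec);	/* Xen's existing path */
	if (ret)
		return ret;

	/* __msi_domain_alloc_irqs() would have done this for us */
	return msi_device_populate_sysfs(dev);
}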
Fixes: bf5e758f02 ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de
Signed-off-by: Juergen Gross <jgross@suse.com>
The histogram and synthetic events can use a pseudo event called
"stacktrace" that will create a stacktrace at the time of the event and
use it just as if it were a normal field. We have other pseudo events
such as "common_cpu" and "common_timestamp". To stay consistent with
those, convert "stacktrace" to "common_stacktrace". As "stacktrace" was
used in older kernels, to keep backward compatibility it will act just
like "common_cpu" does with "cpu": "cpu" is the same as "common_cpu"
unless the event has its own "cpu" field, in which case the event's
field is used. The same is true for "stacktrace".
Also update the documentation to reflect this change.
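The precedence rule, sketched (pseudo_stacktrace_field and the function
shape are illustrative assumptions; trace_find_event_field() is the real
lookup helper, but the actual parsing code differs): an event's own
field always wins over the pseudo event of the same name:

static struct ftrace_event_field *resolve_hist_field(struct trace_event_file *file,
						     char *name)
{
	struct ftrace_event_field *field;

	field = trace_find_event_field(file->event_call, name);
	if (field)
		return field;		/* the event's real field wins */

	if (strcmp(name, "common_stacktrace") == 0 ||
	    strcmp(name, "stacktrace") == 0)	/* legacy alias */
		return &pseudo_stacktrace_field;

	return NULL;
}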
Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Modifiers are used to change the behavior of keys. For instance, a key
can be grouped into buckets, converted to a syscall name (from the
syscall identifier), shown as the task->comm of the current pid, treated
as an array of longs that represent a stacktrace, and more.
It was found that nothing stopped a value from taking a modifier, even
though values are simple counters. If this happened, code that was not
expecting a modifier would be called and crash the kernel. This was
fixed by having the __create_val_field() function test if a modifier was
present and fail if one was.
Now there's a problem with variables. Variables are used to pass fields
from one event to another. Variables are allowed to have some modifiers,
as the processing may need to happen at the time of the event (like
stacktraces and comm names of the current pid). The issue is that
variables also use __create_val_field(), and now that it fails on
modifiers, variables can no longer use them (this is a regression).
As not all modifiers are valid for variables, have variables use a
separate check.
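A minimal sketch of the separated checks, assuming made-up flag names
(HIST_FIELD_FL_MODIFIER_MASK and HIST_VAR_ALLOWED_MODIFIERS are
illustrative; the real logic lives in kernel/trace/trace_events_hist.c):

static int check_val_modifiers(unsigned long flags)
{
	/* plain values are bare counters: reject every modifier */
	return (flags & HIST_FIELD_FL_MODIFIER_MASK) ? -EINVAL : 0;
}

static int check_var_modifiers(unsigned long flags)
{
	/* variables may keep modifiers that are resolved at event
	 * time, e.g. a stacktrace or the current task's comm */
	return (flags & ~HIST_VAR_ALLOWED_MODIFIERS) ? -EINVAL : 0;
}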
Link: https://lore.kernel.org/linux-trace-kernel/20230523221108.064a5d82@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: e0213434fe ("tracing: Do not let histogram values have some modifiers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
During 6.4 development it became clear that the one-shot list used by
the user_event_mm's next field was confusing to others. It is not clear
how this list is protected or what the next field is used for unless
you are familiar with the code.
Add comments into the user_event_mm struct indicating lock requirement
and usage. Also document how and why this approach was used via comments
in both user_event_enabler_update() and user_event_mm_get_all() and the
rules to properly use it.
Link: https://lkml.kernel.org/r/20230519230741.669-5-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently most list_head fields of various structs within user_events
are simply named link. This causes folks to keep additional context in
their head when working with the code, which can be confusing.
Instead of using link, describe what the actual link is, for example:
list_del_rcu(&mm->link);
Changes into:
list_del_rcu(&mm->mms_link);
The reader now is given a hint the link is to the mms global list
instead of having to remember or spot check within the code.
Link: https://lkml.kernel.org/r/20230519230741.669-4-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
pin_user_pages_remote() can reschedule, which means we cannot hold any
RCU lock while using it. Now that enablers are not exposed to the
tracing register callbacks during fork(), there is clearly no need to
require the RCU lock, as event_mutex is enough to protect changes.
Remove the unneeded RCU usages when pinning pages and walking enablers
with event_mutex held. Clean up a misleading "safe" list walk that is
not needed. During fork() duplication, remove the unneeded RCU list add,
since the list is not exposed yet.
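A sketch of the resulting pattern, with illustrative struct and field
names (only event_mutex is held, so the sleeping
pin_user_pages_remote() call is legal inside the walk):

static int update_enablers(struct user_event_mm *mm)
{
	struct user_event_enabler *enabler;
	struct page *page = NULL;
	long ret = 0;

	lockdep_assert_held(&event_mutex);	/* no rcu_read_lock() */

	list_for_each_entry(enabler, &mm->enablers, mm_enablers_link) {
		/* may reschedule, which a mutex (unlike RCU) allows */
		ret = pin_user_pages_remote(mm->mm, enabler->addr, 1,
					    FOLL_WRITE, &page, NULL, NULL);
		if (ret <= 0)
			break;
		unpin_user_page(page);
	}

	return ret < 0 ? ret : 0;
}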
Link: https://lkml.kernel.org/r/20230519230741.669-3-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wiiBfT4zNS29jA0XEsy8EmbqTH1hAPdRJCDAJMD8Gxt5A@mail.gmail.com/
Fixes: 7235759084 ("tracing/user_events: Use remote writes for event enablement")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When a new mm is being created in a fork() path, it is currently
allocated and then attached in one go. This leaves the mm exposed to
the tracing register callbacks while any parent enabler locations are
copied in. This should not happen.
Split up mm alloc and attach as unique operations. When duplicating
enablers, first alloc, then duplicate, and only upon success, attach.
This prevents any timing window outside of the event_reg mutex for
enablement walking, and allows dropping the RCU requirement for
enablement walking in later patches.
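The fork path then looks roughly like this (a sketch; the alloc/attach
helper names follow the changelog, but copy_enablers() and
user_event_mm_destroy() are simplifications):

static int user_event_mm_dup(struct task_struct *t,
			     struct user_event_mm *old_mm)
{
	/* allocated but deliberately not attached yet */
	struct user_event_mm *mm = user_event_mm_alloc(t);

	if (!mm)
		return -ENOMEM;

	/* plain list adds: nothing can see this mm yet */
	if (copy_enablers(mm, old_mm)) {
		user_event_mm_destroy(mm);
		return -ENOMEM;
	}

	user_event_mm_attach(mm, t);	/* publish in one step */
	return 0;
}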
Link: https://lkml.kernel.org/r/20230519230741.669-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=whTBvXJuoi_kACo3qi5WZUmRrhyA-_=rRFsycTytmB6qw@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
While testing rtla timerlat auto analysis, I reached a condition where
the interface was not receiving tracing data. I was able to manually
reproduce the problem with these steps:
# echo 0 > tracing_on # disable trace
# echo 1 > osnoise/stop_tracing_us # stop trace if timerlat irq > 1 us
# echo timerlat > current_tracer # enable timerlat tracer
# sleep 1 # wait... that is the time when rtla
# apply configs like prio or cgroup
# echo 1 > tracing_on # start tracing
# cat trace
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# ||||| ACTIVATION
# TASK-PID CPU# ||||| TIMESTAMP ID CONTEXT LATENCY
# | | | ||||| | | | |
NOTHING!
Then, trying to enable tracing again with echo 1 > tracing_on resulted
in no change: the trace was still not tracing.
This problem happens because the timerlat IRQ hits the stop-tracing
condition while tracing is off and does not wake up the timerlat thread,
so the timerlat threads are kept sleeping forever, resulting in no
trace, even after re-enabling the tracer.
Avoid this condition by always waking up the threads, even after
stopping tracing, allowing the tracer to resume normal operation after
tracing is turned back on.
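The shape of the fix in the timerlat IRQ handler, as a sketch (names
recalled from the osnoise/timerlat code; not a verbatim diff):

	if (osnoise_data.stop_tracing &&
	    time_to_us(diff) >= osnoise_data.stop_tracing) {
		osnoise_stop_tracing();
		notify_new_max_latency(diff);
	}

	/* previously skipped when the stop condition fired: wake the
	 * thread unconditionally so a later tracing_on finds it runnable */
	wake_up_process(tlat->kthread);

	return HRTIMER_NORESTART;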
Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a955d7eac1 ("trace: Add timerlat tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Each event stores an int to track which bit to set/clear when enablement
changes. On big-endian 64-bit configurations, this can cause memory
corruption when the int is used for atomic bit operations, which act on
a full unsigned long.
Use unsigned long for enablement values to ensure such corruption cannot
occur. Downcast to int after masking to obtain the bit target.
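For illustration, the hazard and the fix look roughly like this (the
struct and the ENABLE_BIT_MASK name are made up for the sketch):

struct enabler {
	unsigned long values;	/* was: int values; */
};

static void set_enable_bit(struct enabler *e, unsigned long bit)
{
	/* set_bit() reads and writes a whole unsigned long; on a
	 * 64-bit big-endian kernel, pointing it at a 32-bit int
	 * would also clobber the 4 bytes next to that int */
	set_bit(bit, &e->values);
}

static int enable_bit(struct enabler *e)
{
	return (int)(e->values & ENABLE_BIT_MASK);	/* downcast after mask */
}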
Link: https://lore.kernel.org/all/6f758683-4e5e-41c3-9b05-9efc703e827c@kili.mountain/
Link: https://lore.kernel.org/linux-trace-kernel/20230505205855.6407-1-beaub@linux.microsoft.com
Fixes: dcb8177c13 ("tracing/user_events: Add ioctl for disabling addresses")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A successful call to cgroup_css_set_fork() will always have taken
a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always
do a corresponding put in cgroup_css_set_put_fork().
Without this, a cset and its contained css structures will be
leaked for some fork failures. The following script reproduces
the leak for a fork failure due to exceeding pids.max in the
pids controller. A similar thing can happen if we jump to the
bad_fork_cancel_cgroup label in copy_process().
[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
PID_ROOT=$1
CGROUP=$PID_ROOT/foo
[ -e $CGROUP ] && rmdir $CGROUP
mkdir $CGROUP
echo 5 > $CGROUP/pids.max
echo $$ > $CGROUP/cgroup.procs
fork_bomb()
{
set -e
for i in $(seq 10); do
/bin/sleep 3600 &
done
}
(fork_bomb) &
wait
echo $$ > $PID_ROOT/cgroup.procs
kill $(cat $CGROUP/cgroup.procs)
rmdir $CGROUP
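A minimal sketch of the resulting put side (simplified: the real
cgroup_css_set_put_fork() also deals with the CLONE_INTO_CGROUP
locking):

static void cgroup_css_set_put_fork(struct kernel_clone_args *kargs)
{
	struct css_set *cset = kargs->cset;

	kargs->cset = NULL;
	if (cset)
		put_css_set(cset);	/* mirror the unconditional get */
}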
Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Smatch warns:
kernel/module/stats.c:394 read_file_mod_stats()
warn: passing freed memory 'buf'
We are passing 'buf' to simple_read_from_buffer() after freeing it.
Fix this by changing the order of 'simple_read_from_buffer' and 'kfree'.
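The corrected order, sketched (MAX_BUF_LEN and the stats formatting are
illustrative, not the real function body):

static ssize_t read_file_mod_stats(struct file *file, char __user *user_buf,
				   size_t count, loff_t *ppos)
{
	char *buf;
	ssize_t len, ret;

	buf = kzalloc(MAX_BUF_LEN, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	len = scnprintf(buf, MAX_BUF_LEN, "%lu\n", mod_stat_count);

	ret = simple_read_from_buffer(user_buf, count, ppos, buf, len);
	kfree(buf);	/* free only after the copy to user space */
	return ret;
}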
Fixes: df3e764d8e ("module: add debug stats to help identify memory pressure")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Commit 4f7e723643 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock()
deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and
cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing
cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks()
was missed, which causes the following warning:
WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: events cpuset_hotplug_workfn
RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
<...>
Call Trace:
<TASK>
cpuset_attach+0x40/0x240
cgroup_migrate_execute+0x452/0x5e0
? _raw_spin_unlock_irq+0x28/0x40
cgroup_transfer_tasks+0x1f3/0x360
? find_held_lock+0x32/0x90
? cpuset_hotplug_workfn+0xc81/0xed0
cpuset_hotplug_workfn+0xcb1/0xed0
? process_one_work+0x248/0x5b0
process_one_work+0x2b9/0x5b0
worker_thread+0x56/0x3b0
? process_one_work+0x5b0/0x5b0
kthread+0xf1/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
So just use the cgroup_attach_{lock,unlock}() helpers to fix it.
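Sketched, the locking now wrapping the migration in
cgroup_transfer_tasks() (a sketch of the call pattern, not the literal
diff):

	cgroup_attach_lock(true);	/* cpus_read_lock() + threadgroup rwsem */

	/* ... migrate tasks, ending in cgroup_migrate_execute() ... */

	cgroup_attach_unlock(true);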
Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Fixes: 05c7b7a92c ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
Signed-off-by: Tejun Heo <tj@kernel.org>
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that, the maps try to lock the bucket.
If this fails, the maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of the free lists, and it doesn't belong to the hash table, so it can't
be re-used; this eventually leads to a permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
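The shape of the fix, sketched with helper names as found in
kernel/bpf/hashtab.c (details simplified):

	l_new = prealloc_lru_pop(htab, key, hash);
	if (!l_new)
		return -ENOMEM;

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret) {
		/* previously leaked: hand the element back instead */
		htab_lru_push_free(htab, l_new);
		return ret;
	}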
Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.
In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPF_AND operation, which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:
0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
results in the following code generation for a 64-bit field:
ldr x7, [x7] // 64-bit load
mov x10, #0xffffffffffffffff
and x7, x7, x10
Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:
ldr x7, [x7] // 64-bit load
mov w10, #0xffffffff
and w7, w7, w10
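In verifier terms, a sketch of the change is to emit the mask as a
32-bit ALU operation (illustrative, following the insn-rewrite pattern
in kernel/bpf/verifier.c; here mask is (1 << size * 8) - 1 for a narrow
load of size bytes):

	/* before: 64-bit AND, so a 0xffffffff immediate gets
	 * sign-extended to all-ones and masks nothing */
	insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, dst_reg, mask);

	/* after: 32-bit AND keeps the mask at 0xffffffff */
	insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, dst_reg, mask);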
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Initialize the 'ret' local variable in fprobe_handler() to fix the smatch
warning. The uninitialized value caused the fprobe function exit handler
to randomly fail to run.
- Fix to use preempt_enable/disable_notrace for rethook handler to
prevent recursive call of fprobe exit handler (which is based on
rethook)
- Fix recursive call issue on fprobe_kprobe_handler().
- Fix to detect recursive call on fprobe_exit_handler().
- Fix to make all arch-dependent rethook code notrace.
(the arch-independent code is already notrace)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmRmKgQACgkQ2/sHvwUr
PxvlCgf+OJk5O9IJlTgqDV6JNPsTzFS7qqyAyQmZW9Bj8STfWAIRxa0zeGbZE58K
5LwgzAj+SqzYRwIvzzZ3xsA5j7f1Wj7wG0TQgmpnIW+hprwDrLsUhoZ5s1D/Ojel
A4rAnqCrgnh5m5SenU2QCUngGKn004j4RASaZvRELDyvyIkBSqNhswCH8ZWGPror
KuCu5AmEnFagYl0lmNL3H2aCITAg3QEK+fE6iR+lYsqfR3xbs4YAcqiylHBdY0wX
ssK7LVdRmv7O6TxSj4P2ohDvLJP3eL9bVirsJpg0OVbqWJCs65T2rJJjXiKojYXf
vSVWFJFK5oV98ZHfXTG9R7x0DEwc+g==
=jO68
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Initialize the 'ret' local variable in fprobe_handler() to fix the
smatch warning. The uninitialized value caused the fprobe function
exit handler to randomly fail to run.
- Fix to use preempt_enable/disable_notrace for rethook handler to
prevent recursive call of fprobe exit handler (which is based on
rethook)
- Fix recursive call issue on fprobe_kprobe_handler()
- Fix to detect recursive call on fprobe_exit_handler()
- Fix to make all arch-dependent rethook code notrace (the
arch-independent code is already notrace)"
* tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rethook, fprobe: do not trace rethook related functions
fprobe: add recursion detection in fprobe_exit_handler
fprobe: make fprobe_kprobe_handler recursion free
rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler
tracing: fprobe: Initialize ret variable to fix smatch error