WSL2-Linux-Kernel

History

Daniel Borkmann a5d1d35222 bpf: Fix toctou on read-only map's constant scalar tracking

[ Upstream commit `353050be4c` ]

Commit `a23740ec43` ("bpf: Track contents of read-only maps as scalars") is checking whether maps are read-only both from BPF program side and user space side, and then, given their content is constant, reading out their data via map->ops->map_direct_value_addr() which is then subsequently used as known scalar value for the register, that is, it is marked as __mark_reg_known() with the read value at verification time. Before `a23740ec43`, the register content was marked as an unknown scalar so the verifier could not make any assumptions about the map content.

The current implementation however is prone to a TOCTOU race, meaning, the value read as known scalar for the register is not guaranteed to be exactly the same at a later point when the program is executed, and as such, the prior made assumptions of the verifier with regards to the program will be invalid which can cause issues such as OOB access, etc.

While the BPF_F_RDONLY_PROG map flag is always fixed and required to be specified at map creation time, the map->frozen property is initially set to false for the map given the map value needs to be populated, e.g. for global data sections. Once complete, the loader "freezes" the map from user space such that no subsequent updates/deletes are possible anymore. For the rest of the lifetime of the map, this freeze one-time trigger cannot be undone anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_* cmd calls which would update/delete map entries will be rejected with -EPERM since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also means that pending update/delete map entries must still complete before this guarantee is given. This corner case is not an issue for loaders since they create and prepare such program private map in successive steps.

However, a malicious user is able to trigger this TOCTOU race in two different ways: i) via userfaultfd, and ii) via batched updates.

For i) userfaultfd is used to expand the competition interval, so that map_update_elem() can modify the contents of the map after map_freeze() and bpf_prog_load() were executed. This works, because userfaultfd halts the parallel thread which triggered a map_update_elem() at the time where we copy key/value from the user buffer and this already passed the FMODE_CAN_WRITE capability test given at that time the map was not "frozen". Then, the main thread performs the map_freeze() and bpf_prog_load(), and once that had completed successfully, the other thread is woken up to complete the pending map_update_elem() which then changes the map content.

For ii) the idea of the batched update is similar, meaning, when there are a large number of updates to be processed, it can increase the competition interval between the two. It is therefore possible in practice to modify the contents of the map after executing map_freeze() and bpf_prog_load().

One way to fix both i) and ii) at the same time is to expand the use of the map's map->writecnt. The latter was introduced in `fc9702273e` ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") and further refined in `1f6cb19be2` ("bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with the rationale to make a writable mmap()'ing of a map mutually exclusive with read-only freezing. The counter indicates writable mmap() mappings and then prevents/fails the freeze operation. Its semantics can be expanded beyond just mmap() by generally indicating ongoing write phases. This would essentially span any parallel regular and batched flavor of update/delete operation and then also have map_freeze() fail with -EBUSY.

For the check_mem_access() in the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all last pending writes have completed via bpf_map_write_active() test. Once the map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0 only then we are really guaranteed to use the map's data as known constants. For map->frozen being set and pending writes in process of still being completed we fall back to marking that register as unknown scalar so we don't end up making assumptions about it. With this, both TOCTOU reproducers from i) and ii) are fixed.

Note that the map->writecnt has been converted into a atomic64 in the fix in order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning the freeze_mutex over entire map update/delete operations in syscall side would not be possible due to then causing everything to be serialized. Similarly, something like synchronize_rcu() after setting map->frozen to wait for update/deletes to complete is not possible either since it would also have to span the user copy which can sleep.

On the libbpf side, this won't break `d66562fba1` ("libbpf: Add BPF object skeleton support") as the anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed mmap()-ed memory where for .rodata it's non-writable.

Fixes: `a23740ec43` ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>		2021-11-25 09:48:36 +01:00
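The write-counter scheme described in the commit message above can be illustrated with a small userspace model. This is only a sketch and not the kernel code: `struct toy_map` and every `toy_map_*` function are hypothetical names invented for this example, C11 atomics stand in for the kernel's atomic64 type, and the freeze_mutex is left out because the model is meant to be stepped through on one thread. An update is split into a "begin" and an "end" half so that a write stalled in the user copy (as in the userfaultfd case) can be simulated.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct bpf_map fields discussed above. */
struct toy_map {
	bool rdonly_prog;             /* models BPF_F_RDONLY_PROG, fixed at creation */
	atomic_bool frozen;           /* models map->frozen */
	atomic_int_fast64_t writecnt; /* models map->writecnt (ongoing write phases) */
	int64_t value;                /* the single "map entry" of this toy map */
};

static void toy_map_init(struct toy_map *m, bool rdonly_prog)
{
	m->rdonly_prog = rdonly_prog;
	atomic_init(&m->frozen, false);
	atomic_init(&m->writecnt, 0);
	m->value = 0;
}

/* Start of an update: announce the write phase first, then check permission. */
static int toy_map_write_begin(struct toy_map *m)
{
	atomic_fetch_add(&m->writecnt, 1);
	if (atomic_load(&m->frozen)) {
		/* Frozen maps reject new writes; undo the announcement. */
		atomic_fetch_sub(&m->writecnt, 1);
		return -EPERM;
	}
	return 0;
}

/* End of an update: store the value and close the write phase. */
static void toy_map_write_end(struct toy_map *m, int64_t value)
{
	int_fast64_t prev;

	m->value = value;
	prev = atomic_fetch_sub(&m->writecnt, 1);
	assert(prev > 0); /* every end must pair with a successful begin */
}

/* Freeze fails with -EBUSY while any write phase is still open. */
static int toy_map_freeze(struct toy_map *m)
{
	if (atomic_load(&m->writecnt) != 0)
		return -EBUSY;
	if (atomic_load(&m->frozen))
		return -EBUSY;
	atomic_store(&m->frozen, true);
	return 0;
}

/*
 * Verifier-side test: the content may be treated as a known constant only
 * if the map is read-only for programs, frozen, and has no pending writes.
 */
static bool toy_map_is_constant(struct toy_map *m)
{
	return m->rdonly_prog &&
	       atomic_load(&m->frozen) &&
	       atomic_load(&m->writecnt) == 0;
}
```

In this model, calling `toy_map_freeze()` between `toy_map_write_begin()` and `toy_map_write_end()` returns -EBUSY, which corresponds to the stalled-writer scenario from i). The extra `writecnt` test in `toy_map_is_constant()` covers the remaining interleaving in a concurrent setting, where a writer passes the frozen check just before the freeze takes effect: the map is then frozen but still has an open write phase, so its content is not yet treated as constant.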
..
bpf	bpf: Fix toctou on read-only map's constant scalar tracking	2021-11-25 09:48:36 +01:00
cgroup	cgroup: Fix rootcg cpu.stat guest double counting	2021-11-18 19:16:45 +01:00
configs	…
debug	kdb: Adopt scheduler's task classification	2021-11-18 19:17:06 +01:00
dma	dma-mapping fixes for Linux 5.15	2021-10-20 10:16:51 -10:00
entry	entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()	2021-09-22 10:24:01 -04:00
events	perf/core: Avoid put_page() when GUP fails	2021-11-21 13:44:14 +01:00
gcov	…
irq	PCI/MSI: Move non-mask check back into low level accessors	2021-11-18 19:17:14 +01:00
kcsan	…
livepatch	…
locking	lockdep: Let lock_is_held_type() detect recursive read as read	2021-11-18 19:16:23 +01:00
power	PM: hibernate: fix sparse warnings	2021-11-18 19:16:38 +01:00
printk	memblock: introduce saner 'memblock_free_ptr()' interface	2021-09-14 13:23:22 -07:00
rcu	rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr	2021-11-18 19:16:30 +01:00
sched	sched/fair: Prevent dead task groups from regaining cfs_rq's	2021-11-25 09:48:32 +01:00
time	posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()	2021-11-18 19:17:14 +01:00
trace	tracing: Add length protection to histogram string copies	2021-11-25 09:48:34 +01:00
.gitignore	…
Kconfig.freezer	…
Kconfig.hz	…
Kconfig.locks	…
Kconfig.preempt	…
Makefile	…
acct.c	kernel/acct.c: use dedicated helper to access rlimit values	2021-09-08 11:50:26 -07:00
async.c	…
audit.c	…
audit.h	…
audit_fsnotify.c	…
audit_tree.c	…
audit_watch.c	…
auditfilter.c	…
auditsc.c	audit: fix possible null-pointer dereference in audit_filter_rules	2021-10-18 18:27:47 -04:00
backtracetest.c	…
bounds.c	…
capability.c	…
cfi.c	…
compat.c	arch: remove compat_alloc_user_space	2021-09-08 15:32:35 -07:00
configs.c	…
context_tracking.c	…
cpu.c	…
cpu_pm.c	…
crash_core.c	…
crash_dump.c	…
cred.c	ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring	2021-10-20 10:34:20 -05:00
delayacct.c	…
dma.c	…
exec_domain.c	…
exit.c	…
extable.c	…
fail_function.c	…
fork.c	posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()	2021-11-18 19:17:14 +01:00
freezer.c	…
futex.c	futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic()	2021-09-03 23:00:22 +02:00
gen_kheaders.sh	…
groups.c	…
hung_task.c	…
iomem.c	…
irq_work.c	…
jump_label.c	…
kallsyms.c	…
kcmp.c	…
kcov.c	…
kexec.c	kexec: avoid compat_alloc_user_space	2021-09-08 15:32:34 -07:00
kexec_core.c	…
kexec_elf.c	…
kexec_file.c	…
kexec_internal.h	…
kheaders.c	…
kmod.c	…
kprobes.c	kprobes: Do not use local variable when creating debugfs file	2021-11-18 19:16:29 +01:00
ksysfs.c	…
kthread.c	…
latencytop.c	…
module-internal.h	…
module.c	module: fix clang CFI with MODULE_UNLOAD=n	2021-09-28 12:56:26 +02:00
module_signature.c	…
module_signing.c	…
notifier.c	…
nsproxy.c	…
padata.c	…
panic.c	…
params.c	…
pid.c	…
pid_namespace.c	…
profile.c	profiling: fix shift-out-of-bounds bugs	2021-09-08 11:50:26 -07:00
ptrace.c	…
range.c	…
reboot.c	…
regset.c	…
relay.c	…
resource.c	…
resource_kunit.c	…
rseq.c	KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest	2021-09-22 10:24:01 -04:00
scftorture.c	…
scs.c	scs: Release kasan vmalloc poison in scs_free process	2021-11-18 19:16:29 +01:00
seccomp.c	…
signal.c	signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed	2021-11-18 19:16:01 +01:00
smp.c	…
smpboot.c	…
smpboot.h	…
softirq.c	…
stackleak.c	…
stacktrace.c	…
static_call.c	…
stop_machine.c	…
sys.c	Merge branch 'akpm' (patches from Andrew)	2021-09-08 12:55:35 -07:00
sys_ni.c	compat: remove some compat entry points	2021-09-08 15:32:35 -07:00
sysctl-test.c	…
sysctl.c	…
task_work.c	…
taskstats.c	…
test_kprobes.c	…
torture.c	…
tracepoint.c	…
tsacct.c	…
ucount.c	ucounts: Fix signal ucount refcounting	2021-10-18 16:02:30 -05:00
uid16.c	…
uid16.h	…
umh.c	…
up.c	…
user-return-notifier.c	…
user.c	fs/epoll: use a per-cpu counter for user's watches count	2021-09-08 11:50:27 -07:00
user_namespace.c	…
usermode_driver.c	…
utsname.c	…
utsname_sysctl.c	…
watch_queue.c	…
watchdog.c	…
watchdog_hld.c	…
workqueue.c	workqueue: make sysfs of unbound kworker cpumask more clever	2021-11-18 19:16:17 +01:00
workqueue_internal.h	…