WSL2-Linux-Kernel/drivers
Andrii Nakryiko 1e0bd5a091 bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.

But having a failing refcounting operations are problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.

One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.

Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This utlimately causes memory leak, but prevents use after free.

But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-18 11:41:59 +01:00
..
accessibility
acpi Power management fix for 5.4-rc6 2019-11-01 09:30:48 -07:00
amba ARM updates for 5.4-rc: 2019-10-23 06:26:33 -04:00
android binder: Don't modify VMA bounds in ->mmap handler 2019-10-17 05:58:44 -07:00
ata ata: libahci_platform: Fix regulator_get_optional() misuse 2019-10-25 14:22:20 -06:00
atm atm: remove unneeded semicolon 2019-10-28 16:47:22 -07:00
auxdisplay It's a somewhat calmer cycle for docs this time, as the churn of the mass 2019-09-17 16:22:26 -07:00
base PM: QoS: Drop frequency QoS types from device PM QoS 2019-10-21 02:05:21 +02:00
bcma bcma: make arrays pwr_info_offset and sprom_sizes static const, shrinks object size 2019-09-13 16:44:49 +03:00
block nbd: verify socket is supported during setup 2019-10-25 14:37:21 -06:00
bluetooth Bluetooth: hci_bcm: Fix RTS handling during startup 2019-10-21 17:05:14 +02:00
bus Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
cdrom
char char/random: Add a newline at the end of the file 2019-10-02 13:49:43 -07:00
clk Merge tag 'fix-missing-panels' into fixes 2019-10-04 09:06:41 -07:00
clocksource timer-of: don't use conditional expression with mixed 'void' types 2019-10-02 16:16:07 -07:00
connector
counter
cpufreq cpufreq: Cancel policy update work scheduled before freeing 2019-10-22 18:07:30 +02:00
cpuidle cpuidle: haltpoll: Take 'idle=' override into account 2019-10-22 11:43:17 +02:00
crypto Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
dax
dca
devfreq PM / devfreq: passive: fix compiler warning 2019-08-26 21:37:37 +09:00
dio
dma dmaengine: cppi41: Fix cppi41_dma_prep_slave_sg() when idle 2019-10-23 21:15:21 +05:30
dma-buf dma-buf/resv: fix exclusive fence get 2019-10-10 17:05:20 +02:00
edac EDAC/ghes: Fix Use after free in ghes_edac remove path 2019-10-17 11:27:05 +02:00
eisa
extcon chrome platform changes for v5.4 2019-09-19 14:14:28 -07:00
firewire
firmware Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
fpga Char/Misc driver patches for 5.4-rc1 2019-09-18 11:14:31 -07:00
fsi fsi: scom: Don't abort operations for minor errors 2019-08-28 22:59:18 +02:00
gnss
gpio gpio: lynxpoint: set default handler to be handle_bad_irq() 2019-10-15 01:19:05 +02:00
gpu Merge tag 'drm-fixes-5.4-2019-10-30' of git://people.freedesktop.org/~agd5f/linux into drm-fixes 2019-11-01 11:27:39 +10:00
greybus staging: greybus: move es2 to drivers/greybus/ 2019-08-27 19:03:08 +02:00
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid 2019-10-28 14:26:33 +01:00
hsi HSI changes for the 5.4 series 2019-09-22 12:02:21 -07:00
hv Drivers: hv: vmbus: Fix harmless building warnings without CONFIG_PM_SLEEP 2019-10-01 14:49:45 -04:00
hwmon hwmon: (nct7904) Add array fan_alarm and vsen_alarm to store the alarms in nct7904_data struct. 2019-10-02 06:42:48 -07:00
hwspinlock
hwtracing Char/Misc driver patches for 5.4-rc1 2019-09-18 11:14:31 -07:00
i2c i2c: stm32f7: remove warning when compiling with W=1 2019-10-24 20:52:21 +02:00
i3c i3c: master: Use dev_to_i3cmaster() 2019-08-27 09:43:59 +02:00
ide
idle x86/intel: Aggregate microserver naming 2019-08-28 11:29:32 +02:00
iio First set of IIO fixes for the 5.4 cycle. 2019-10-10 11:18:37 +02:00
infiniband RDMA/hns: Prevent memory leaks of eq->buf_list 2019-10-28 15:06:38 -03:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2019-10-25 17:31:53 -04:00
interconnect
iommu iommu/vt-d: Fix panic after kexec -p for kdump 2019-10-30 10:30:22 +01:00
ipack
irqchip irqchip updates for 5.4, take 2 2019-10-25 14:25:15 +02:00
isdn mISDN: remove unused variable 'faxmodulation_s' 2019-11-03 19:10:30 -08:00
leds leds: lm3532: Fix optional led-max-microamp prop error handling 2019-09-12 20:45:52 +02:00
lightnvm lightnvm: print error when target is not found 2019-09-05 13:17:01 -06:00
macintosh cpufreq: Use per-policy frequency QoS 2019-10-21 02:05:21 +02:00
mailbox mailbox: qcom-apcs: fix max_register value 2019-09-17 00:54:29 -05:00
mcb
md for-linus-2019-10-18 2019-10-18 22:29:36 -04:00
media media: stkwebcam: fix runtime PM after driver unbind 2019-10-04 14:38:46 +02:00
memory iommu/mediatek: Clean up struct mtk_smi_iommu 2019-08-30 15:57:27 +02:00
memstick memstick: jmb38x_ms: Fix an error handling path in 'jmb38x_ms_probe()' 2019-10-09 11:08:03 +02:00
message
mfd mfd: mt6397: Fix probe after changing mt6397-core 2019-10-24 08:49:25 +01:00
misc misc: fastrpc: prevent memory leak in fastrpc_dma_buf_attach 2019-10-04 18:22:14 +02:00
mmc mmc: mxs: fix flags passed to dmaengine_prep_slave_sg 2019-10-21 16:16:38 +02:00
mtd mtd: rawnand: au1550nd: Fix au_read_buf16() prototype 2019-10-07 09:56:36 +02:00
mux
net bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails 2019-11-18 11:41:59 +01:00
nfc nfc: pn532_uart: Make use of pn532 autopoll 2019-10-29 21:05:26 -07:00
ntb NTB: fix IDT Kconfig typos/spellos 2019-09-23 17:20:40 -04:00
nubus
nvdimm libnvdimm fixes v5.4-rc1 2019-09-29 10:33:41 -07:00
nvme Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-01 17:48:11 -07:00
nvmem Char/Misc driver patches for 5.4-rc1 2019-09-18 11:14:31 -07:00
of of: reserved_mem: add missing of_node_put() for proper ref-counting 2019-10-23 15:15:05 -05:00
opp opp: Reinitialize the list_kref before adding the static OPPs again 2019-10-23 10:58:44 +05:30
oprofile
parisc parisc: Remove 32-bit DMA enforcement from sba_iommu 2019-10-14 21:44:26 +02:00
parport Char/Misc driver patches for 5.4-rc1 2019-09-18 11:14:31 -07:00
pci PCI: PM: Fix pci_power_up() 2019-10-15 23:51:36 +02:00
pcmcia Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
perf Merge branches 'for-next/52-bit-kva', 'for-next/cpu-topology', 'for-next/error-injection', 'for-next/perf', 'for-next/psci-cpuidle', 'for-next/rng', 'for-next/smpboot', 'for-next/tbi' and 'for-next/tlbi' into for-next/core 2019-08-30 12:46:12 +01:00
phy pci-v5.4-changes 2019-09-23 19:16:01 -07:00
pinctrl pinctrl: aspeed-g6: Rename SD3 to EMMC and rework pin groups 2019-10-16 15:58:27 +02:00
platform platform/x86: i2c-multi-instantiate: Fail the probe if no IRQ provided 2019-10-14 15:31:50 +03:00
pnp
power power supply and reset changes for the v5.4 series 2019-09-22 12:04:59 -07:00
powercap Power management updates for 5.4-rc1 2019-09-17 19:15:14 -07:00
pps
ps3
ptp ptp: Add a ptp clock driver for IDT ClockMatrix. 2019-11-03 17:35:40 -08:00
pwm pwm: Changes for v5.4-rc1 2019-09-27 12:19:47 -07:00
rapidio
ras
regulator regulator: Fixes for v5.4 2019-10-23 15:31:17 -04:00
remoteproc remoteproc updates for v5.4 2019-09-22 10:55:08 -07:00
reset ARM: SoC fixes 2019-09-30 10:04:28 -07:00
rpmsg rpmsg: glink-smem: Name the edge based on parent remoteproc 2019-09-17 15:33:31 -07:00
rtc RTC for 5.4 2019-09-22 11:05:43 -07:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
sbus
scsi SCSI fixes on 20191025 2019-10-25 20:11:33 -04:00
sfi
sh
siox
slimbus
soc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
soundwire soundwire updates for v5.4-rc1 2019-09-22 10:52:23 -07:00
spi spi: Add a PTP system timestamp to the transfer structure 2019-10-08 17:38:15 +01:00
spmi
ssb ssb: make array pwr_info_offset static const, makes object smaller 2019-09-13 17:23:18 +03:00
staging Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
target SCSI fixes on 20191025 2019-10-25 20:11:33 -04:00
tc
tee tee/shm: untag user pointers in tee_shm_register 2019-09-25 17:51:41 -07:00
thermal cpufreq: Use per-policy frequency QoS 2019-10-21 02:05:21 +02:00
thunderbolt thunderbolt: Add support for Intel Ice Lake 2019-08-26 12:15:06 +03:00
tty 8250-men-mcb: fix error checking when get_num_ports returns -ENODEV 2019-10-15 21:38:41 +02:00
uio Char/Misc driver patches for 5.4-rc1 2019-09-18 11:14:31 -07:00
usb usb: cdns3: Error out if USB_DR_MODE_UNKNOWN in cdns3_core_init_role() 2019-10-18 12:00:15 -07:00
vfio vfio/type1: Initialize resv_msi_base 2019-10-15 14:07:01 -06:00
vhost vringh: fix copy direction of vringh_iov_push_kern() 2019-10-28 04:25:04 -04:00
video video/logo: do not generate unneeded logo C files 2019-10-05 15:29:49 +09:00
virt virt: vbox: fix memory leak in hgcm_call_preprocess_linaddr 2019-10-10 14:50:32 +02:00
virtio virtio_ring: fix stalls for packed rings 2019-10-28 04:24:46 -04:00
visorbus
vlynq
vme
w1 w1: ds250x: Fix build error without CRC16 2019-10-10 15:35:41 +02:00
watchdog linux-watchdog 5.4-rc1 tag 2019-09-27 11:17:38 -07:00
xen Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-10-19 17:09:11 -04:00
zorro
Kconfig Staging/IIO driver patches for 5.4-rc1 2019-09-18 11:05:34 -07:00
Makefile Staging/IIO driver patches for 5.4-rc1 2019-09-18 11:05:34 -07:00