WSL2-Linux-Kernel/net/core
Jakub Kicinski 7f9eee196e Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux
Pavel Begunkov says:

====================
io_uring zerocopy send

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers, mixing is allowed but not recommended. Apart from usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below),
which is delivered into io_uring's Completion Queue. Those "buffer-free"
notifications are not necessarily per request, but the userspace has control
over it and should explicitly attaching a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. "+ flush" column
posts one additional "buffer-free" notification per request, and just "zc"
doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that was omitted from
the series for simplicity and as they don't change the picture drastically,
they will be sent as follow up, as well as flushing optimisations closing the
performance gap b/w two last columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
omitted optimisations will somewhat help, should look better for 4000,
but couldn't test properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches based:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces an io_uring concept of notifactors. From the userspace
  perspective it's an entity to which it can bind one or more requests and then
  requesting to flush it. Flushing a notifier makes it impossible to attach new
  requests to it, and instructs the notifier to post a completion once all
  requests attached to it are completed and the kernel doesn't need the buffers
  anymore.

  Notifications are stored in notification slots, which should be registered as
  an array in io_uring. Each slot stores only one notifier at any particular
  moment. Flushing removes it from the slot and the slot automatically replaces
  it with a new notifier. All operations with notifiers are done by specifying
  an index of a slot it's currently in.

  When registering a notification the userspace specifies a u64 tag for each
  slot, which will be copied in notification completion entries as
  cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32
  sequence number counting notifiers of a slot.

====================

Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19 14:22:41 -07:00
..
.gitignore net: skb: use auto-generation to convert skb drop reason to string 2022-06-07 12:51:41 +02:00
Makefile net: skb: use auto-generation to convert skb drop reason to string 2022-06-07 12:51:41 +02:00
bpf_sk_storage.c bpf: Compute map_btf_id during build time 2022-04-26 11:35:21 -07:00
datagram.c Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
dev.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
dev.h net: add skb_defer_max sysctl 2022-05-16 11:33:59 +01:00
dev_addr_lists.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
dev_addr_lists_test.c net: kunit: add a test for dev_addr_lists 2021-11-20 12:25:57 +00:00
dev_ioctl.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
devlink.c net: devlink: remove unused locked functions 2022-07-18 20:10:48 -07:00
drop_monitor.c drop_monitor: adopt u64_stats_t 2022-06-09 21:53:12 -07:00
dst.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
dst_cache.c wireguard: device: reset peer src endpoint when netns exits 2021-11-29 19:50:45 -08:00
failover.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
fib_notifier.c net: fib_notifier: propagate extack down to the notifier block callback 2019-10-04 11:10:56 -07:00
fib_rules.c fib: expand fib_rule_policy 2021-12-16 07:18:35 -08:00
filter.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
flow_dissector.c flow_dissector: Add number of vlan tags dissector 2022-04-20 11:09:13 +01:00
flow_offload.c net: extract port range fields from fl_flow_key 2022-07-13 12:16:56 +01:00
gen_estimator.c net: sched: Remove Qdisc::running sequence counter 2021-10-18 12:54:41 +01:00
gen_stats.c net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding. 2021-10-21 12:47:56 +01:00
gro.c net: allow gro_max_size to exceed 65536 2022-05-16 10:18:56 +01:00
gro_cells.c net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
hwbm.c net: hwbm: Make the hwbm_pool lock a mutex 2019-06-09 19:40:10 -07:00
link_watch.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
lwt_bpf.c bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook 2022-04-22 17:45:25 +02:00
lwtunnel.c lwtunnel: Validate RTA_ENCAP_TYPE attribute length 2021-12-31 14:31:59 +00:00
neighbour.c net, neigh: introduce interval_probe_time_ms for periodic probe 2022-06-30 13:14:35 +02:00
net-procfs.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
net-sysfs.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-06-23 12:33:24 -07:00
net-sysfs.h net-sysfs: add netdev_change_owner() 2020-02-26 20:07:25 -08:00
net-traces.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
net_namespace.c net: initialize init_net earlier 2022-02-06 11:04:29 +00:00
netclassid_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
netevent.c net: core: Correct function name netevent_unregister_notifier() in the kerneldoc 2021-03-28 17:56:56 -07:00
netpoll.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
netprio_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
of_net.c Revert "of: net: support NVMEM cells with MAC in text format" 2022-01-12 14:14:36 +00:00
page_pool.c net: page_pool: optimize page pool page allocation in NUMA scenario 2022-07-07 17:03:16 -07:00
pktgen.c net: rename reference+tracking helpers 2022-06-09 21:52:55 -07:00
ptp_classifier.c ptp: Add generic PTP is_sync() function 2022-03-07 11:31:34 +00:00
request_sock.c tcp: add rcu protection around tp->fastopen_rsk 2019-10-13 10:13:08 -07:00
rtnetlink.c net: allow gro_max_size to exceed 65536 2022-05-16 10:18:56 +01:00
scm.c memcg: enable accounting for scm_fp_list objects 2021-07-20 06:00:38 -07:00
secure_seq.c tcp: resalt the secret every 10 seconds 2022-05-04 19:22:21 -07:00
selftests.c net: core: constify mac addrs in selftests 2021-10-24 13:59:44 +01:00
skbuff.c Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux 2022-07-19 14:22:41 -07:00
skmsg.c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-07-09 12:24:16 -07:00
sock.c tls: rx: periodically flush socket backlog 2022-07-06 12:56:35 +01:00
sock_destructor.h skb_expand_head() adjust skb->truesize incorrectly 2021-10-22 12:35:51 -07:00
sock_diag.c net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
sock_map.c bpf: Fix sockmap calling sleepable function in teardown path 2022-06-28 09:30:03 +02:00
sock_reuseport.c tcp: Add stats for socket migration. 2021-06-23 12:56:08 -07:00
stream.c net: use WARN_ON_ONCE() in sk_stream_kill_queues() 2022-06-09 21:53:55 -07:00
sysctl_net_core.c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-05-23 16:07:14 -07:00
timestamping.c net: Introduce a new MII time stamping interface. 2019-12-25 19:51:33 -08:00
tso.c net: tso: add UDP segmentation support 2020-06-18 20:46:23 -07:00
utils.c net: core: Use csum_replace_by_diff() and csum_sub() instead of opencoding 2022-02-21 11:40:44 +00:00
xdp.c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-03-22 11:18:49 -07:00