WSL2-Linux-Kernel/net/ipv4
Eric Dumazet 68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
..
bpfilter
netfilter netfilter: nft_fib: reverse path filter for policy-based routing on iif 2022-04-11 12:10:09 +02:00
Kconfig fou: Remove XRFM from NET_FOU Kconfig 2022-04-12 14:56:33 -07:00
Makefile
af_inet.c ipv4: Avoid using RTO_ONLINK with ip_route_connect(). 2022-04-22 13:06:03 +01:00
ah4.c
arp.c arp: fix unused variable warnning when CONFIG_PROC_FS=n 2022-04-25 11:50:18 +01:00
bpf_tcp_ca.c
cipso_ipv4.c
datagram.c ipv4: Avoid using RTO_ONLINK with ip_route_connect(). 2022-04-22 13:06:03 +01:00
devinet.c
esp4.c esp: limit skb_page_frag_refill use to a single page 2022-04-13 10:16:11 +02:00
esp4_offload.c net: Fix esp GSO on inter address family tunnels. 2022-03-07 13:14:04 +01:00
fib_frontend.c net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
fib_lookup.h
fib_notifier.c
fib_rules.c
fib_semantics.c ipv4: Use dscp_t in struct fib_rt_info 2022-04-11 17:37:50 -07:00
fib_trie.c ipv4: Use dscp_t in struct fib_entry_notifier_info 2022-04-11 17:37:50 -07:00
fou.c fou: Remove XRFM from NET_FOU Kconfig 2022-04-12 14:56:33 -07:00
gre_demux.c
gre_offload.c
icmp.c net: icmp: add skb drop reasons to icmp protocol 2022-04-11 10:38:37 +01:00
igmp.c
inet_connection_sock.c
inet_diag.c
inet_fragment.c net: ip: Handle delivery_time in ip defrag 2022-03-03 14:38:48 +00:00
inet_hashtables.c
inet_timewait_sock.c
inetpeer.c
ip_forward.c net: ip: add skb drop reasons to ip forwarding 2022-04-13 13:09:57 +01:00
ip_fragment.c net: ip: Handle delivery_time in ip defrag 2022-03-03 14:38:48 +00:00
ip_gre.c net: Handle l3mdev in ip_tunnel_init_flow 2022-04-15 14:27:30 -07:00
ip_input.c net-core: rx_otherhost_dropped to core_stats 2022-04-07 20:32:49 -07:00
ip_options.c
ip_output.c net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() 2022-03-03 14:38:48 +00:00
ip_sockglue.c
ip_tunnel.c net: Handle l3mdev in ip_tunnel_init_flow 2022-04-15 14:27:30 -07:00
ip_tunnel_core.c
ip_vti.c
ipcomp.c
ipconfig.c
ipip.c
ipmr.c
ipmr_base.c
metrics.c
netfilter.c
netlink.c
nexthop.c
ping.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
proc.c
protocol.c
raw.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
raw_diag.c
route.c ipv4: Initialise ->flowi4_scope properly in ICMP handlers. 2022-04-22 13:06:03 +01:00
syncookies.c
sysctl_net_ipv4.c tcp: adjust TSO packet sizes based on min_rtt 2022-03-09 20:05:44 -08:00
tcp.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
tcp_bbr.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_bic.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_bpf.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
tcp_cdg.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_cong.c tcp: Add tracepoint for tcp_set_ca_state 2022-04-07 20:33:15 -07:00
tcp_cubic.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_dctcp.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_dctcp.h
tcp_diag.c
tcp_fastopen.c
tcp_highspeed.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_htcp.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_hybla.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_illinois.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_input.c tcp: fix signed/unsigned comparison 2022-04-18 10:28:40 +01:00
tcp_ipv4.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
tcp_lp.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_metrics.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_minisocks.c
tcp_nv.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_offload.c
tcp_output.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_rate.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_recovery.c
tcp_scalable.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_timer.c
tcp_ulp.c
tcp_vegas.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_vegas.h
tcp_veno.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_westwood.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tcp_yeah.c tcp: add accessors to read/set tp->snd_cwnd 2022-04-06 12:05:41 -07:00
tunnel4.c
udp.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_bpf.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_diag.c
udp_impl.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_offload.c
udp_tunnel_core.c
udp_tunnel_nic.c
udp_tunnel_stub.c
udplite.c
xfrm4_input.c
xfrm4_output.c
xfrm4_policy.c net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
xfrm4_protocol.c
xfrm4_state.c
xfrm4_tunnel.c