random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.
The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.
Stop👏making👏bespoke👏random👏number👏generators👏.
Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.
The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.
If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).
This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.
Fixes: dd8bcd1768 ("mptcp: validate the data checksum")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.
As a result, the wire encoding for little endian host is
wrong, and that causes interoperabilty interoperability
issues with other implementation or host with different endianness.
Address the issue writing in the packet the unmodified __sum16 value.
MPTCP checksum is disabled by default, interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275
Fixes: c5b39e26d0 ("mptcp: send out checksum for DSS")
Fixes: 390b95a5fb ("mptcp: receive checksum for DSS")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-05-18
1) Fix "disable_policy" flag use when arriving from different devices.
From Eyal Birger.
2) Fix error handling of pfkey_broadcast in function pfkey_process.
From Jiasheng Jiang.
3) Check the encryption module availability consistency in pfkey.
From Thomas Bartschies.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel
produces invalid pfkey acquire messages, when these encryption modules are disabled. This
happens because the availability of the algos wasn't checked in all necessary functions.
This patch adds these checks.
Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If skb_clone() returns null pointer, pfkey_broadcast() will
return error.
Therefore, it should be better to check the return value of
pfkey_broadcast() and return error if fails.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:
nf_conntrack_netlink.c:1717 warning: 'ctnetlink_dump_one_entry' defined but not used
Fixes: 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
There are sleep in atomic context bugs when the request to secure
element of st-nci is timeout. The root cause is that nci_skb_alloc
with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is
a timer handler. The call paths that could trigger bugs are shown below:
(interrupt context 1)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_skb_alloc(..., GFP_KERNEL) //may sleep
(interrupt context 2)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_send_data
nci_queue_tx_data_frags
nci_skb_alloc(..., GFP_KERNEL) //may sleep
This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to
GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: ed06aeefda ("nfc: st-nci: Rename st21nfcb to st-nci")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220517012530.75714-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.2 us 10.6 us (-0.6 us)
2B 11.2 us 10.7 us (-0.5 us)
4B 11.3 us 10.7 us (-0.6 us)
8B 11.2 us 10.6 us (-0.6 us)
16B 11.3 us 10.7 us (-0.6 us)
32B 11.3 us 10.6 us (-0.7 us)
64B 11.2 us 11.2 us (0 us)
128B 11.2 us 11.2 us (0 us)
256B 11.2 us 11.2 us (0 us)
512B 11.4 us 11.3 us (-0.1 us)
1KB 11.4 us 11.5 us (0.1 us)
2KB 11.5 us 11.5 us (0 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.9 us 11.2 us (-0.7 us)
2B 11.7 us 11.2 us (-0.5 us)
4B 11.7 us 11.3 us (-0.4 us)
8B 11.6 us 11.2 us (-0.4 us)
16B 11.7 us 11.3 us (-0.4 us)
32B 11.7 us 11.3 us (-0.4 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.4 us (-0.4 us)
1KB 11.9 us 11.4 us (-0.5 us)
2KB 12.1 us 11.5 us (-0.6 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The global blackhole_netdev has replaced pernet loopback_dev to become the
one given to the object that holds an netdev when ifdown in many places of
ipv4 and ipv6 since commit 8d7017fd62 ("blackhole_netdev: use
blackhole_netdev to invalidate dst entries").
Especially after commit faab39f63c ("net: allow out-of-order netdev
unregistration"), it's no longer safe to use loopback_dev that may be
freed before other netdev.
This patch is to set dst dev to blackhole_netdev instead of loopback_dev
in ifdown.
v1->v2:
- add Fixes tag as Eric suggested.
Fixes: faab39f63c ("net: allow out-of-order netdev unregistration")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/e8c87482998ca6fcdab214f5a9d582899ec0c648.1652665047.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmKCr/0THG1rbEBwZW5n
dXRyb25peC5kZQAKCRCtfkuQ2KDTXbVWCAClh9ywNEPiSxXsB2k9EdVoijz0hv6I
y6YLFPfHxc8w0ewOnBxJOBeLhVjIE2K3n81NaEudheBBImWV8CGlFPHu5sI4S01b
aFcAuLT6SOat52xzkh44WGGjW517+EZM1Of1xzPhKdnR0nLvva52SpJlYwgLrkCN
O6r30cDuU558H4g2BNxavbbNFPzL7HhkBpHmgWQg4q2p7LvLuTE6sN2CvNEgUWYM
nIyK6P/FIyy3l7M7E873WhIEGNGRnAXRBEPXivRmKTlw0wizcvPxlWKm4YxTA0WL
30fw58Hy4x6SZD+7+sP0QhdjiYOofgMtZFZn/LaJYE+rfaYLsFUH3A/2
=tUrE
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2022-05-16
the first 2 patches are by me and target the CAN raw protocol. The 1st
removes an unneeded assignment, the other one adds support for
SO_TXTIME/SCM_TXTIME.
Oliver Hartkopp contributes 2 patches for the ISOTP protocol. The 1st
adds support for transmission without flow control, the other let's
bind() return an error on incorrect CAN ID formatting.
Geert Uytterhoeven contributes a patch to clean up ctucanfd's Kconfig
file.
Vincent Mailhol's patch for the slcan driver uses the proper function
to check for invalid CAN frames in the xmit callback.
The next patch is by Geert Uytterhoeven and makes the interrupt-names
of the renesas,rcar-canfd dt bindings mandatory.
A patch by my update the ctucanfd dt bindings to include the common
CAN controller bindings.
The last patch is by Akira Yokosawa and fixes a breakage the
ctucanfd's documentation.
* tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure'
dt-bindings: can: ctucanfd: include common CAN controller bindings
dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required
can: slcan: slc_xmit(): use can_dropped_invalid_skb() instead of manual check
can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD
can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
can: isotp: add support for transmission without flow control
can: raw: add support for SO_TXTIME/SCM_TXTIME
can: raw: raw_sendmsg(): remove not needed setting of skb->sk
====================
Link: https://lore.kernel.org/r/20220516202625.1129281-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While preparing for Clang randstruct support (which duplicated many of
the warnings the randstruct GCC plugin warned about), one strange one
remained only for the randstruct GCC plugin. Eliminating this rids
the plugin of the last exception.
It seems the plugin is happy to dereference individual members of
a cross-struct cast, but it is upset about casting to a whole object
pointer. This only manifests in one place in the kernel, so just replace
the variable with individual member accesses. There is no change in
executable instruction output.
Drop the last exception from the randstruct GCC plugin.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: netdev@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/lkml/20220511022217.58586-1-kuniyu@amazon.co.jp
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/lkml/20220511151542.4cb3ff17@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Support this via passthrough to the underlying tcp listener socket.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/271
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 4293248c67.
Additional locks are not needed, all the touched sections
are already under mptcp socket lock protection.
Fixes: 4293248c67 ("mptcp: add data lock for sk timers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 3ea566422c ("can: isotp: sanitize CAN ID checks in
isotp_bind()") checks the given CAN ID address information by
sanitizing the input values.
This check (silently) removes obsolete bits by masking the given CAN
IDs.
Derek Will suggested to give a feedback to the application programmer
when the 'sanitizing' was actually needed which means the programmer
provided CAN ID content in a wrong format (e.g. SFF CAN IDs with a CAN
ID > 0x7FF).
Link: https://lore.kernel.org/all/20220515181633.76671-1-socketcan@hartkopp.net
Suggested-by: Derek Will <derekrobertwill@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Usually the ISO 15765-2 protocol is a point-to-point protocol to transfer
segmented PDUs to a dedicated receiver. This receiver sends a flow control
message to specify protocol options and timings (e.g. block size / STmin).
The so called functional addressing communication allows a 1:N
communication but is limited to a single frame length.
This new CAN_ISOTP_CF_BROADCAST allows an unconfirmed 1:N communication
with PDU length that would not fit into a single frame. This feature is
not covered by the ISO 15765-2 standard.
Link: https://lore.kernel.org/all/20220507115558.19065-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch calls into sock_cmsg_send() to parse the user supplied
control information into a struct sockcm_cookie. Then assign the
requested transmit time to the skb.
This makes it possible to use the Earliest TXTIME First (ETF) packet
scheduler with the CAN_RAW protocol. The user can send a CAN_RAW frame
with a TXTIME and the kernel (with the ETF scheduler) will take care
of sending it to the network interface.
Link: https://lore.kernel.org/all/20220502091946.1916211-3-mkl@pengutronix.de
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The skb in raw_sendmsg() is allocated with sock_alloc_send_skb(),
which subsequently calls sock_alloc_send_pskb() -> skb_set_owner_w(),
which assigns "skb->sk = sk".
This patch removes the not needed setting of skb->sk.
Link: https://lore.kernel.org/all/20220502091946.1916211-2-mkl@pengutronix.de
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
They are either obsolete or useless.
Those in the normal processing path cannot be enabled on a production
system; they generate too much noise.
One pr_debug call resides in an error path and does provide useful info,
merge it with the existing nf_log_invalid().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently nf_conncount can trigger garbage collection (GC)
at multiple places. Each GC process takes a spin_lock_bh
to traverse the nf_conncount_list. We found that when testing
port scanning use two parallel nmap, because the number of
connection increase fast, the nf_conncount_count and its
subsequent call to __nf_conncount_add take too much time,
causing several CPU lockup. This happens when user set the
conntrack limit to +20,000, because the larger the limit,
the longer the list that GC has to traverse.
The patch mitigate the performance issue by avoiding unnecessary
GC with a timestamp. Whenever nf_conncount has done a GC,
a timestamp is updated, and beforce the next time GC is
triggered, we make sure it's more than a jiffies.
By doin this we can greatly reduce the CPU cycles and
avoid the softirq lockup.
To reproduce it in OVS,
$ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000
$ ovs-appctl dpctl/ct-get-limits
At another machine, runs two nmap
$ nmap -p1- <IP>
$ nmap -p1- <IP>
Signed-off-by: William Tu <u9012063@gmail.com>
Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reported-by: Greg Rose <gvrose8192@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 40867d74c3 ("net: Add l3mdev index to flow struct and avoid oif
reset for port devices") introduces a flow key specific for layer 3
domains, such as a VRF master device. This allows for explicit VRF domain
selection instead of abusing the oif flow key.
Update ip[6]_route_me_harder() to make use of that new key when re-routing
mangled packets within VRFs instead of setting the flow oif, making it
consistent with other users.
Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When running a combination of PPPoE on top of a VLAN, we need to set
info->outdev to the PPPoE device, otherwise PPPoE encap is skipped
during software offload.
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx
Fixes: f6efc675c9 ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dst entry does not contain a valid hardware address, so skip the lookup
in order to avoid running into errors here.
The proper hardware address is filled in from nft_dev_path_info
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If a flow cannot be offloaded, the code currently repeatedly tries again as
quickly as possible, which can significantly increase system load.
Fix this by limiting flow timeout update and hardware offload retry to once
per second.
Fixes: c07531c01d ("netfilter: flowtable: Remove redundant hw refresh bit")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
syzbot was able to trigger an Out-of-Bound on the pedit action:
UBSAN: shift-out-of-bounds in net/sched/act_pedit.c:238:43
shift exponent 1400735974 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 3606 Comm: syz-executor151 Not tainted 5.18.0-rc5-syzkaller-00165-g810c2f0a3f86 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
ubsan_epilogue+0xb/0x50 lib/ubsan.c:151
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x187 lib/ubsan.c:322
tcf_pedit_init.cold+0x1a/0x1f net/sched/act_pedit.c:238
tcf_action_init_1+0x414/0x690 net/sched/act_api.c:1367
tcf_action_init+0x530/0x8d0 net/sched/act_api.c:1432
tcf_action_add+0xf9/0x480 net/sched/act_api.c:1956
tc_ctl_action+0x346/0x470 net/sched/act_api.c:2015
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5993
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:725
____sys_sendmsg+0x6e2/0x800 net/socket.c:2413
___sys_sendmsg+0xf3/0x170 net/socket.c:2467
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fe36e9e1b59
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffef796fe88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe36e9e1b59
RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003
RBP: 00007fe36e9a5d00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe36e9a5d90
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The 'shift' field is not validated, and any value above 31 will
trigger out-of-bounds. The issue predates the git history, but
syzbot was able to trigger it only after the commit mentioned in
the fixes tag, and this change only applies on top of such commit.
Address the issue bounding the 'shift' value to the maximum allowed
by the relevant operator.
Reported-and-tested-by: syzbot+8ed8fc4c57e9dcf23ca6@syzkaller.appspotmail.com
Fixes: 8b796475fd ("net/sched: act_pedit: really ensure the skb is writable")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_defer_free_flush() can consume cpu cycles,
it seems better to call it in the inner loop:
- Potentially frees page/skb that will be reallocated while hot.
- Account for the cpu cycles in the @time_limit determination.
- Keep softnet_data.defer_count small to reduce chances for
skb_attempt_defer_free() to send an IPI.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 68822bdf76 ("net: generalize skb freeing
deferral to per-cpu lists") added another per-cpu
cache of skbs. It was expected to be small,
and an IPI was forced whenever the list reached 128
skbs.
We might need to be able to control more precisely
queue capacity and added latency.
An IPI is generated whenever queue reaches half capacity.
Default value of the new limit is 64.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_defer_free_flush() runs from softirq context,
we have the opportunity to refill the napi_alloc_cache,
and/or use kmem_cache_free_bulk() when this cache is full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async()
Problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.
Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.
This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.
I triggered this issue while experimenting smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.
Commit 5a18ceca63 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.
Fixes: 68822bdf76 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().
And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.
Fixes: 1330b6ef33 ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after
we make it the return value of the functions with return type of enum
skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can
be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason().
So we check the range of drop reason in kfree_skb_reason() with
DEBUG_NET_WARN_ON_ONCE().
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not
small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(),
but it can't avoid it to be 0, which is invalid and can cause NULL
pointer in drop_reasons.
Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Connect with O_NONBLOCK will not be completed immediately
and returns -EINPROGRESS. It is possible to use selector/poll
for completion by selecting the socket for writing. After select
indicates writability, a second connect function call will return
0 to indicate connected successfully as TCP does, but smc returns
-EISCONN. Use socket state for smc to indicate connect state, which
can help smc aligning the connect behaviour with TCP.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is no longer a macro, but an inlined function.
INET_MATCH() -> inet_match()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
INET6_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET6_MATCH() an inline function
to ease our changes.
v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
Change the name to inet6_match(), this is no longer a macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use READ_ONCE() in paths not holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for
unlocked sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reading listener sk->sk_bound_dev_if locklessly,
we must use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
because other cpus/threads might locklessly read this field.
sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
because they run without socket lock held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket
is not locked. Another cpu could change this field under us.
Fixes: 0fd9a65a76 ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.
Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.
BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg
write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
__ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0xffffff9b
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.
Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.
This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the gro_max_size to exceed a value larger than 65536.
There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.
[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
BIG TCP ipv6 packets (bigger than 64K).
This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that resulting packet can go through IPv6/TCP stacks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.
If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.
v2: perform HBH removal from ipv6_gso_segment() instead of
skb_segment() (Alexander feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hystart_ack_delay() had the assumption that a TSO packet
would not be bigger than GSO_MAX_SIZE.
This will no longer be true.
We should use sk->sk_gso_max_size instead.
This reduces chances of spurious Hystart ACK train detections.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS
are used to report to user-space the device TSO limits.
ip -d link sh dev eth1
...
tso_max_size 65536 tso_max_segs 65535
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.
The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.
1) Add a dedicated list for conntrack event redelivery.
2) Include event redelivery list in conntrack dumps of dying type.
3) Remove per-cpu dying list for event redelivery, not used anymore.
4) Add netns .pre_exit to cttimeout to zap timeout objects before
synchronize_rcu() call.
5) Remove nf_ct_unconfirmed_destroy.
6) Add generation id for conntrack extensions for conntrack
timeout and helpers.
7) Detach timeout policy from conntrack on cttimeout module removal.
8) Remove __nf_ct_unconfirmed_destroy.
9) Remove unconfirmed list.
10) Remove unconditional local_bh_disable in init_conntrack().
11) Consolidate conntrack iterator nf_ct_iterate_cleanup().
12) Detect if ctnetlink listeners exist to short-circuit event
path early.
13) Un-inline nf_ct_ecache_ext_add().
14) Add nf_conntrack_events autodetect ctnetlink listener mode
and make it default.
15) Add nf_ct_ecache_exist() to check for event cache extension.
16) Extend flowtable reverse route lookup to include source, iif,
tos and mark, from Sven Auhagen.
17) Do not verify zero checksum UDP packets in nf_reject,
from Kevin Mitchell.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the new struct ieee80211_rate_status and its
annotation in struct ieee80211_tx_status in minstrel_ht.
In minstrel_ht_tx_status, a check for the presence of instances of the
new struct in ieee80211_tx_status is added. Based on this, minstrel_ht
then gets and updates internal rate stats with either struct
ieee80211_rate_status or ieee80211_tx_info->status.rates.
Adjusted variants of minstrel_ht_txstat_valid, minstrel_ht_get_stats,
minstrel_{ht/vht}_get_group_idx are added which use struct
ieee80211_rate_status and struct rate_info instead of the legacy structs.
struct rate_info from cfg80211.h does not provide whether short preamble
was used for the transmission. So we retrieve this information from VIF
and STA configuration and cache it in a new flag in struct minstrel_ht_sta
per rate control instance.
Compile-Tested: current wireless-next tree with all flags on
Tested-on: Xiaomi 4A Gigabit (MediaTek MT7603E, MT7612E) with OpenWrt
Linux 5.10.113
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Link: https://lore.kernel.org/r/20220509173958.1398201-3-jelonek.jonas@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds the new struct ieee80211_rate_status and replaces
'struct rate_info *rate' in ieee80211_tx_status with pointer and length
annotation.
The struct ieee80211_rate_status allows to:
(1) receive tx power status feedback for transmit power control (TPC)
per packet or packet retry
(2) dynamic mapping of wifi chip specific multi-rate retry (mrr)
chains with different lengths
(3) increase the limit of annotatable rate indices to support
IEEE802.11ac rate sets and beyond
ieee80211_tx_info, control and status buffer, and ieee80211_tx_rate
cannot be used to achieve these goals due to fixed size limitations.
Our new struct contains a struct rate_info to annotate the rate that was
used, retry count of the rate and tx power. It is intended for all
information related to RC and TPC that needs to be passed from driver to
mac80211 and its RC/TPC algorithms like Minstrel_HT. It corresponds to
one stage in an mrr. Multiple subsequent instances of this struct can be
included in struct ieee80211_tx_status via a pointer and a length variable.
Those instances can be allocated on-stack. The former reference to a single
instance of struct rate_info is replaced with our new annotation.
An extension is introduced to struct ieee80211_hw. There are two new
members called 'tx_power_levels' and 'max_txpwr_levels_idx' acting as a
tx power level table. When a wifi device is registered, the driver shall
supply all supported power levels in this list. This allows to support
several quirks like differing power steps in power level ranges or
alike. TPC can use this for algorithm and thus be designed more abstract
instead of handling all possible step widths individually.
Further mandatory changes in status.c, mt76 and ath11k drivers due to the
removal of 'struct rate_info *rate' are also included.
status.c already uses the information in ieee80211_tx_status->rate in
radiotap, this is now changed to use ieee80211_rate_status->rate_idx.
mt76 driver already uses struct rate_info to pass the tx rate to status
path. The new members of the ieee80211_tx_status are set to NULL and 0
because the previously passed rate is not relevant to rate control and
accurate information is passed via tx_info->status.rates.
For ath11k, the txrate can be passed via this struct because ath11k uses
firmware RC and thus the information does not interfere with software RC.
Compile-Tested: current wireless-next tree with all flags on
Tested-on: Xiaomi 4A Gigabit (MediaTek MT7603E, MT7612E) with OpenWrt
Linux 5.10.113
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Link: https://lore.kernel.org/r/20220509173958.1398201-2-jelonek.jonas@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fill all requested rates (in case of ath9k 4 rate slots are
available, so fill all 4 instead of only 3), improves throughput in
noisy environment.
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Link: https://lore.kernel.org/r/20220402153014.31332-2-ps.report@gmx.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
AP may run out of BSS color after color collision
detection event from driver.
Disable BSS color collision detection if no free colors are
available based on bss color disabled bit sent as a part of
NL80211_ATTR_HE_BSS_COLOR attribute sent in
NL80211_CMD_SET_BEACON.
It can be reenabled once new color is available.
Signed-off-by: Lavanya Suresh <lavaks@codeaurora.org>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://lore.kernel.org/r/1649867295-7204-3-git-send-email-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
NL80211_ATTR_HE_BSS_COLOR attribute can be included in both
NL80211_CMD_START_AP and NL80211_CMD_SET_BEACON commands.
Move he_bss_color from cfg80211_ap_settings to cfg80211_beacon_data
and parse NL80211_ATTR_HE_BSS_COLOR as a part of nl80211_parse_beacon()
to have bss color settings parsed for both start ap and set beacon
commands.
Add a new flag he_bss_color_valid to indicate whether
NL80211_ATTR_HE_BSS_COLOR attribute is included.
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://lore.kernel.org/r/1649867295-7204-2-git-send-email-quic_ramess@quicinc.com
[fix build ...]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In IPv4 setting the "disable_policy" flag on a device means no policy
should be enforced for traffic originating from the device. This was
implemented by seting the DST_NOPOLICY flag in the dst based on the
originating device.
However, dsts are cached in nexthops regardless of the originating
devices, in which case, the DST_NOPOLICY flag value may be incorrect.
Consider the following setup:
+------------------------------+
| ROUTER |
+-------------+ | +-----------------+ |
| ipsec src |----|-|ipsec0 | |
+-------------+ | |disable_policy=0 | +----+ |
| +-----------------+ |eth1|-|-----
+-------------+ | +-----------------+ +----+ |
| noipsec src |----|-|eth0 | |
+-------------+ | |disable_policy=1 | |
| +-----------------+ |
+------------------------------+
Where ROUTER has a default route towards eth1.
dst entries for traffic arriving from eth0 would have DST_NOPOLICY
and would be cached and therefore can be reused by traffic originating
from ipsec0, skipping policy check.
Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead
of the DST in IN/FWD IPv4 policy checks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We currently track whether we're associated and which the
BSS is in the same variable (ifmgd->associated), but for
MLD we'll need to move the BSS pointer to be per link,
while the question whether we're associated or not is for
the whole interface.
Add ifmgd->assoc_bss that stores the pointer and change
ifmgd->associated to be just a bool, so the question of
whether we're associated can continue working after MLD
rework, without requiring changes, while the BSS pointer
will have to be changed/used checked per link.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We don't need to copy this locally, we now only use the
variable to print before doing other things.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We never use the bssid argument to ieee80211_sta_connection_lost()
so we might as well just remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no need to look it up from the ifmgd->associated
BSS configuration, we already maintain a local copy since
commit b0140fda62 ("mac80211: mlme: save ssid info to
ieee80211_bss_conf while assoc").
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since we always track the BSSID there when we get associated,
these are equivalent, but ifmgd->bssid saves a dereference and
thus makes the code a bit smaller, and more readable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This code is tightly coupled to the sdata->u.mgd data
structure, so there's no reason for it to be in utils.
Move it to mlme.c.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Ping-Ke's previous patch adjusted the CCMP AAD construction
to properly take the order bit into account, but failed to
update the (identical) GCMP AAD construction as well.
Unify the AAD construction between the two cases.
Reported-by: Jouni Malinen <j@w1.fi>
Link: https://lore.kernel.org/r/20220506105150.51d66e2a6f3c.I65f12be82c112365169e8a9f48c7a71300e814b9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the PM closes a fully established MPJ subflow or the subflow
creation errors out in it's early stage the subflows counter is
not bumped accordingly.
This change adds the missing accounting, additionally taking care
of updating accordingly the 'accept_subflow' flag.
Fixes: a88c9e4969 ("mptcp: do not block subflows creation on errors")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Highlights include:
Stable fixes:
- SUNRPC: Ensure that the gssproxy client can start in a connected state
Bugfixes:
- Revert "SUNRPC: Ensure gss-proxy connects on setup"
- nfs: fix broken handling of the softreval mount option
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJ+j5kACgkQZwvnipYK
APIQag/+JXtDt9xo0SsGTMJ2PJArnRoyd3QcZjJtoabaZTylZyILZ20sxvt/jgby
miOKrI+bbsykbrwQijRLF/Yys1G+iMBzSWy4lpJ9eH4AVkfjr5qqauRbPobo/ZiI
i5fDLR84FlzKYWBY1Nv6YE5VREukIlXbrq7KWe/HoS7/hSAamhkUv+a0M8iNLGO4
QtH7+M0iBZTI9yM4gFAcMAANV21SxvjqP1z62kbCp00qO2mL0PgF/2pxC9WgX24/
EZX037ykzKFkjgWfzT8+/eIfCQGIPi/9e6Ir4Bc99psVFOYd1YxkTLBycNwm1VOz
5RLORbURDVMPQH2/qZ57u7B0gJF76UZM4pH3gv+i9nFhoUqf3kFZAOy48MZEFz3X
sPiQZLck63mvO9Bd3QX6pFZc0datiYmhuXRknjxV6Bz/Y41y3NzeJeX4r88s4q7l
tisZDmaIm0Q+H07QOTL0aCk456amP6XLnO1+nu/PSR/3ImwJSaOpypHx/BxDJUTG
TvpY1ouBZ8irfT6JrbfVnSdbedIZx4c+6btVw6mlT40edZF6M3r+3s8s2TU7x+co
uiBMB4Qj/C19zqcf/DziiL1PZEJPm3lk0fBHc6JIWCV4I3eYxQ5J4nFJsy8/jPm1
lTgreoHWfnYC3WhlxUk+N3+9X/9+iFEJUxqGYfANf8GFaGqcYVQ=
=/Yg+
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"One more pull request. There was a bug in the fix to ensure that gss-
proxy continues to work correctly after we fixed the AF_LOCAL socket
leak in the RPC code. This therefore reverts that broken patch, and
replaces it with one that works correctly.
Stable fixes:
- SUNRPC: Ensure that the gssproxy client can start in a connected
state
Bugfixes:
- Revert "SUNRPC: Ensure gss-proxy connects on setup"
- nfs: fix broken handling of the softreval mount option"
* tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: fix broken handling of the softreval mount option
SUNRPC: Ensure that the gssproxy client can start in a connected state
Revert "SUNRPC: Ensure gss-proxy connects on setup"
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-05-13
1) Cleanups for the code behind the XFRM offload API. This is a
preparation for the extension of the API for policy offload.
From Leon Romanovsky.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: drop not needed flags variable in XFRM offload struct
net/mlx5e: Use XFRM state direction instead of flags
netdevsim: rely on XFRM state direction instead of flags
ixgbe: propagate XFRM offload state direction instead of flags
xfrm: store and rely on direction to construct offload flags
xfrm: rename xfrm_state_offload struct to allow reuse
xfrm: delete not used number of external headers
xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================
Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.
Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.
Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When creating a flow table entry, the reverse route is looked
up based on the current packet.
There can be scenarios where the user creates a custom ip rule
to route the traffic differently.
In order to support those scenarios, the lookup needs to add
more information based on the current packet.
The patch adds multiple new information to the route lookup.
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds the new nf_conntrack_events=2 mode and makes it the
default.
This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.
This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.
The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.
Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.
Add from data path (xt_CT, nft_ct) is unchanged.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
At this time, every new conntrack gets the 'event cache extension'
enabled for it.
This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults
to 1.
Changing the default to 0 means that commands that rely on the event
notification extension, e.g. 'conntrack -E' or conntrackd, stop working.
We COULD detect if there is a listener by means of
'nfnetlink_has_listeners()' and only add the extension if this is true.
The downside is a dependency from conntrack module to nfnetlink module.
This adds a different way: inc/dec a counter whenever a ctnetlink group
is being (un)subscribed and toggle a flag in struct net.
Next patches will take advantage of this and will only add the event
extension if the flag is set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.
struct nf_ct_iter_data {
struct net *net;
void *data;
u32 portid;
int report;
};
There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that the conntrack entry isn't placed on the pcpu list anymore the
bh only needs to be disabled in the 'expectation present' case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its not needed anymore:
A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.
B. If entry was allocated before, but is not yet in the hash table
(uncofirmed case), genid gets incremented and synchronize_rcu() call
makes sure access has completed.
C. Next attempt to peek at extension area will fail for unconfirmed
conntracks, because ext->genid != genid.
D. Conntracks in the hash are iterated as before.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Increment the extid on module removal; this makes sure that even
in extreme cases any old uncofirmed entry that happened to be kept
e.g. on nfnetlink_queue list will not trip over a stale timeout
reference.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Multiple netfilter extensions store pointers to external data
in their extension area struct.
Examples:
1. Timeout policies
2. Connection tracking helpers.
No references are taken for these.
When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.
Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.
On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.
The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.
Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.
For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.
Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.
If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.
After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.
This allows removal of the percpu unconfirmed list.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This helper tags connections not yet in the conntrack table as
dying. These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.
After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.
Its enough to walk the hashtable instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make it so netns pre_exit unlinks the objects from the pernet list, so
they cannot be found anymore.
netns core issues a synchronize_rcu() before calling the exit hooks so
any the time the exit hooks run unconfirmed nf_conn entries have been
free'd or they have been committed to the hashtable.
The exit hook still tags unconfirmed entries as dying, this can
now be removed in a followup change.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.
The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).
The IPS_CONFIRMED bit now needs to be set as soon as the entry has been
unlinked from the unconfirmed list, else the destroy function may
attempt to unlink it a second time.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The new pernet dying list includes conntrack entries that await
delivery of the 'destroy' event via ctnetlink.
The old percpu dying list will be removed soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This disentangles event redelivery and the percpu dying list.
Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.
The 'struct net' back-pointer can be removed as well.
The pcpu dying list will be removed eventually, it has no functionality.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commits:
0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()")
d507204d3c ("tcp/dccp: add tw->tw_bslot")
As Leonard pointed out, a newly allocated netns can happen
to reuse a freed 'struct net'.
While TCP TW timers were covered by my patches, other things were not:
1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
at 4-tuple plus the 'struct net' pointer.
2) /proc/net/tcp[6] and inet_diag, same reason.
3) hashinfo->bhash[], same reason.
Fixing all this seems risky, lets instead revert.
In the future, we might have a per netns tcp hash table, or
a per netns list of timewait sockets...
Fixes: 0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When suspending the passive scanning _must_ have its filter_policy set
to 0x01 to use the accept list otherwise _any_ advertise report would
end up waking up the system.
In order to fix the filter_policy the code now checks for
hdev->suspended && HCI_CONN_FLAG_REMOTE_WAKEUP
first, since the MGMT_OP_SET_DEVICE_FLAGS will reject any attempt to
set HCI_CONN_FLAG_REMOTE_WAKEUP when it cannot be programmed in the
acceptlist, so it can return success causing the proper filter_policy
to be used.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215768
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
HCI_CONN_FLAG_REMOTE_WAKEUP can only be set if device can be programmed
in the allowlist which in case of device using RPA requires LL Privacy
support to be enabled.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215768
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
INET_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET_MATCH() an inline function
to ease our changes.
v2:
We remove the 32bit version of it, as modern compilers
should generate the same code really, no need to
try to be smarter.
Also make 'struct net *net' the first argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move power_on work cancel to hci_dev_close_sync to ensure that power_on
work is canceled after HCI interface down, power off, rfkill, etc.
For example, if
hciconfig hci0 down
is done early enough during boot, it may run before power_on work.
Then, power_on work will actually bring up interface despite above
hciconfig command.
Signed-off-by: Vasyl Vavrychuk <vasyl.vavrychuk@opensynergy.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Concurrent operations from events on le_{accept,resolv}_list are
currently unprotected by hdev->lock.
Most existing code do already protect the lists with that lock.
This can be observed in hci_debugfs and hci_sync.
Add the protection for these events too.
Fixes: b950aa8863 ("Bluetooth: Add definitions and track LE resolve list modification")
Fixes: 0f36b589e4 ("Bluetooth: Track LE white list modification via HCI commands")
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
All accesses (both reads and modifications) to
hdev->{accept,reject}_list are protected by hdev lock,
except the ones in hci_conn_request_evt. This can cause a race
condition in the form of a list corruption.
The solution is to protect these lists in hci_conn_request_evt as well.
I was unable to find the exact commit that introduced the issue for the
reject list, I was only able to find it for the accept list.
Fixes: a55bd29d52 ("Bluetooth: Add white list lookup for incoming connection requests")
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
hci_is_adv_monitoring's function documentation states that it must be
called under the hdev lock. Paths that leads to an unlocked call are:
discov_update => start_discovery => interleaved_discov => active_scan
and: discov_update => start_discovery => active_scan
The solution is to take the lock in active_scan during the duration of
the call to hci_is_adv_monitoring.
Fixes: c32d624640 ("Bluetooth: disable filter dup when scan for adv monitor")
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This prints warnings for controllers setting broken quirks to increase
their visibility and warn about broken controllers firmware that
probably needs updates to behave properly.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This adds HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk which can be
used to mark HCI_Enhanced_Setup_Synchronous_Connection as broken even
if its support command bit are set since some controller report it as
supported but the command don't work properly with some configurations
(e.g. BT_VOICE_TRANSPARENT/mSBC).
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Small change to add new commands to tail of the list, and find/remove them
from the head of the list.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Connecting the same socket twice consecutively in sco_sock_connect()
could lead to a race condition where two sco_conn objects are created
but only one is associated with the socket. If the socket is closed
before the SCO connection is established, the timer associated with the
dangling sco_conn object won't be canceled. As the sock object is being
freed, the use-after-free problem happens when the timer callback
function sco_sock_timeout() accesses the socket. Here's the call trace:
dump_stack+0x107/0x163
? refcount_inc+0x1c/
print_address_description.constprop.0+0x1c/0x47e
? refcount_inc+0x1c/0x7b
kasan_report+0x13a/0x173
? refcount_inc+0x1c/0x7b
check_memory_region+0x132/0x139
refcount_inc+0x1c/0x7b
sco_sock_timeout+0xb2/0x1ba
process_one_work+0x739/0xbd1
? cancel_delayed_work+0x13f/0x13f
? __raw_spin_lock_init+0xf0/0xf0
? to_kthread+0x59/0x85
worker_thread+0x593/0x70e
kthread+0x346/0x35a
? drain_workqueue+0x31a/0x31a
? kthread_bind+0x4b/0x4b
ret_from_fork+0x1f/0x30
Link: https://syzkaller.appspot.com/bug?extid=2bef95d3ab4daa10155b
Reported-by: syzbot+2bef95d3ab4daa10155b@syzkaller.appspotmail.com
Fixes: e1dee2c1de ("Bluetooth: fix repeated calls to sco_sock_kill")
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Reviewed-by: Joseph Hwang <josephsih@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently If use page pool allocation stats to analysis a RX performance
degradation problem. These stats only count for pages allocate from
page_pool_alloc_pages. But nic drivers such as hns3 use
page_pool_dev_alloc_frag to allocate pages, so page stats in this API
should also be counted.
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).
After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.
The commit 05c0b35709 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.
There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call. This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443). [ On top of the slowness in
adding to the tail in the IPv6 case ]. The latter patch has a
selftest to demonstrate this case.
This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.
Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches. The
changes in this patch is highlighted here for the review
purpose.
1. Because of the listening_hash removal, lhash2 can use the
sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
This will also keep the sk_unhashed() check to work as is
after stop adding sk to listening_hash.
The union is removed from inet_listen_hashbucket because
only nulls_head is needed.
2. icsk->icsk_listen_portaddr_node and its helpers are removed.
3. The current lhash2 users needs to iterate with sk_nulls_node
instead of icsk_listen_portaddr_node.
One case is in the inet[6]_lhash2_lookup().
Another case is the seq_file iterator in tcp_ipv4.c.
One thing to note is sk_nulls_next() is needed
because the old inet_lhash2_for_each_icsk_continue()
does a "next" first before iterating.
4. Move the remaining listening_hash usage to lhash2
inet_reuseport_add_sock() which this series is
trying to improve.
inet_diag.c and mptcp_diag.c are the final two
remaining use cases and is moved to lhash2 now also.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch folds lhash2 related functions into __inet_hash and
inet_unhash. This will make the removal of the listening_hash
in a latter patch easier to review.
First, this patch folds inet_hash2 into __inet_hash.
For unhash, the current call sequence is like
inet_unhash() => __inet_unhash() => inet_unhash2().
The specific testing cases in __inet_unhash() are mostly related
to TCP_LISTEN sk and its caller inet_unhash() already has
the TCP_LISTEN test, so this patch folds both __inet_unhash() and
inet_unhash2() into inet_unhash().
Note that all listening_hash users also have lhash2 initialized,
so the !h->lhash2 check is no longer needed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
the count is no longer used. This patch removes it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DSA has not supported (and probably will not support in the future
either) independent tagging protocols per CPU port.
Different switch drivers have different requirements, some may need to
replicate some settings for each CPU port, some may need to apply some
settings on a single CPU port, while some may have to configure some
global settings and then some per-CPU-port settings.
In any case, the current model where DSA calls ->change_tag_protocol for
each CPU port turns out to be impractical for drivers where there are
global things to be done. For example, felix calls dsa_tag_8021q_register(),
which makes no sense per CPU port, so it suppresses the second call.
Let drivers deal with replication towards all CPU ports, and remove the
CPU port argument from the function prototype.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the time - commit 7569459a52 ("net: dsa: manage flooding on the CPU
ports") - not introducing a dedicated switch callback for host flooding
made sense, because for the only user, the felix driver, there was
nothing different to do for the CPU port than set the flood flags on the
CPU port just like on any other bridge port.
There are 2 reasons why this approach is not good enough, however.
(1) Other drivers, like sja1105, support configuring flooding as a
function of {ingress port, egress port}, whereas the DSA
->port_bridge_flags() function only operates on an egress port.
So with that driver we'd have useless host flooding from user ports
which don't need it.
(2) Even with the felix driver, support for multiple CPU ports makes it
difficult to piggyback on ->port_bridge_flags(). The way in which
the felix driver is going to support host-filtered addresses with
multiple CPU ports is that it will direct these addresses towards
both CPU ports (in a sort of multicast fashion), then restrict the
forwarding to only one of the two using the forwarding masks.
Consequently, flooding will also be enabled towards both CPU ports.
However, ->port_bridge_flags() gets passed the index of a single CPU
port, and that leaves the flood settings out of sync between the 2
CPU ports.
This is to say, it's better to have a specific driver method for host
flooding, which takes the user port as argument. This solves problem (1)
by allowing the driver to do different things for different user ports,
and problem (2) by abstracting the operation and letting the driver do
whatever, rather than explicitly making the DSA core point to the CPU
port it thinks needs to be touched.
This new method also creates a problem, which is that cross-chip setups
are not handled. However I don't have hardware right now where I can
test what is the proper thing to do, and there isn't hardware compatible
with multi-switch trees that supports host flooding. So it remains a
problem to be tackled in the future.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
and bluetooth. No outstanding fires.
Current release - regressions:
- eth: atlantic: always deep reset on pm op, fix null-deref
Current release - new code bugs:
- rds: use maybe_get_net() when acquiring refcount on TCP sockets
[refinement of a previous fix]
- eth: ocelot: mark traps with a bool instead of guessing type based
on list membership
Previous releases - regressions:
- net: fix skipping features in for_each_netdev_feature()
- phy: micrel: fix null-derefs on suspend/resume and probe
- bcmgenet: check for Wake-on-LAN interrupt probe deferral
Previous releases - always broken:
- ipv4: drop dst in multicast routing path, prevent leaks
- ping: fix address binding wrt vrf
- net: fix wrong network header length when BPF protocol translation
is used on skbs with a fraglist
- bluetooth: fix the creation of hdev->name
- rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
- wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing
- wifi: ath11k: reduce the wait time of 11d scan and hw scan while
adding an interface
- mac80211: fix rx reordering with non explicit / psmp ack policy
- mac80211: reset MBSSID parameters upon connection
- nl80211: fix races in nl80211_set_tx_bitrate_mask()
- tls: fix context leak on tls_device_down
- sched: act_pedit: really ensure the skb is writable
- batman-adv: don't skb_split skbuffs with frag_list
- eth: ocelot: fix various issues with TC actions (null-deref; bad
stats; ineffective drops; ineffective filter removal)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmJ9VBwACgkQMUZtbf5S
IrumIg/+M4ZTUjz/34v2djiwivBHjfn3Aa2SepJ3516AqxauLUr4h4yfIcFXW8vA
81PyZkpqLBdgGklt+1GQfNCGrJKgC4+sM0eyRATMl1nf5zgrc0bZfp+ZN1bZeuoy
iRB7vwCnAPdLg997inxQeG11RYS+DUDDz68MJReqnqB6o5+xX87vcDEpyKaISTVo
SlUe/SYEk7S/Erk37l4wlz8bpr3EoIy1pYc4fDZOhBa8gKOVmmXidI5pwcjZXSo5
RQ9oH74/aaI7dgEQ4UvyAksCUrj17GRygBDyLE/p0eMc/2oV0lO6DQ3RGvh6ucPj
wU4AAOI5SRrXiqSzbDFXLFvtN7wLuXU+s4a/xq0z/Z4w7CQb8HVH1S92lYIA+jkA
Q8RCT3XY523Me8JWAyGJGl2PnbLqlGhFdqucmvHpxXOKy2F+QC69Z44kGWIaBliR
C3zoUuMnAV41DIp7YXbxvxjMk8GtNUXU4Vdg7+v1DlZxxe+DDH4cxB5YmReq+bNC
60p79/vpNzJgYuZgtlk4xi1YNcOywVJb0L6zXus5p4SU/Ud63H7NHufndCnf/X6Q
9W2vWrTJb9r1+cWRBKvxyWdqyUT68UgBAbhY2WvPdmjbP5UkW7YLCMfBOsAg7vPi
GZeroxYIg/NwyqWm/aUuPdiNgD73Qq4Zvan5JmavP/PwX8AP9eA=
=jt7U
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, and bluetooth.
No outstanding fires.
Current release - regressions:
- eth: atlantic: always deep reset on pm op, fix null-deref
Current release - new code bugs:
- rds: use maybe_get_net() when acquiring refcount on TCP sockets
[refinement of a previous fix]
- eth: ocelot: mark traps with a bool instead of guessing type based
on list membership
Previous releases - regressions:
- net: fix skipping features in for_each_netdev_feature()
- phy: micrel: fix null-derefs on suspend/resume and probe
- bcmgenet: check for Wake-on-LAN interrupt probe deferral
Previous releases - always broken:
- ipv4: drop dst in multicast routing path, prevent leaks
- ping: fix address binding wrt vrf
- net: fix wrong network header length when BPF protocol translation
is used on skbs with a fraglist
- bluetooth: fix the creation of hdev->name
- rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
- wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing
- wifi: ath11k: reduce the wait time of 11d scan and hw scan while
adding an interface
- mac80211: fix rx reordering with non explicit / psmp ack policy
- mac80211: reset MBSSID parameters upon connection
- nl80211: fix races in nl80211_set_tx_bitrate_mask()
- tls: fix context leak on tls_device_down
- sched: act_pedit: really ensure the skb is writable
- batman-adv: don't skb_split skbuffs with frag_list
- eth: ocelot: fix various issues with TC actions (null-deref; bad
stats; ineffective drops; ineffective filter removal)"
* tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
tls: Fix context leak on tls_device_down
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
mlxsw: Avoid warning during ip6gre device removal
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
Bluetooth: Fix the creation of hdev->name
i40e: i40e_main: fix a missing check on list iterator
net/sched: act_pedit: really ensure the skb is writable
s390/lcs: fix variable dereferenced before check
s390/ctcm: fix potential memory leak
s390/ctcm: fix variable dereferenced before check
net: atlantic: verify hw_head_ lies within TX buffer ring
net: atlantic: add check for MAX_SKB_FRAGS
net: atlantic: reduce scope of is_rsc_complete
net: atlantic: fix "frag[0] not initialized"
net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
net: phy: micrel: Fix incorrect variable type in micrel
decnet: Use container_of() for struct dn_neigh casts
...
The commit cited below claims to fix a use-after-free condition after
tls_device_down. Apparently, the description wasn't fully accurate. The
context stayed alive, but ctx->netdev became NULL, and the offload was
torn down without a proper fallback, so a bug was present, but a
different kind of bug.
Due to misunderstanding of the issue, the original patch dropped the
refcount_dec_and_test line for the context to avoid the alleged
premature deallocation. That line has to be restored, because it matches
the refcount_inc_not_zero from the same function, otherwise the contexts
that survived tls_device_down are leaked.
This patch fixes the described issue by restoring refcount_dec_and_test.
After this change, there is no leak anymore, and the fallback to
software kTLS still works.
Fixes: c55dcdd435 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.
Fixes: 846e344eb7 ("net/smc: add receive timeout check")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When calling ndo_set_vf_rate() the max_tx_rate parameter may be zero,
in which case the setting is cleared, or it must be greater or equal to
min_tx_rate.
Enforce this requirement on all calls to ndo_set_vf_rate via a wrapper
which also only calls ndo_set_vf_rate() if defined by the driver.
Based on work by Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Bin Chen <bin.chen@corigine.com>
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Fix the creation of hdev->name when index is greater than 9999
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmJ8U9YZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKRpdD/4sYxNpqdhUuScsTJ2oZzjT
dLE8REPwE/Rvwcw7+eBtnce3L+/UlQQwxJrGbvsBtGsbLxqdT3YGHlTi9MQrty9I
IDXxLvYRKs39YjumlpzNgoLJmRguIAZaGg+n/LdIqhh+AhXx6mKWD4Fp5ovoFOWW
Z9vNqKWLRhTHHm88xNazOIdP5jw2YEZYaR6sVYfIF89gp2W4gIU+POMFo1DAhw1Q
jtf+CrJFtkUiLxGz4iTKKJZcnJHItOQrrsigHgWBlCSRhZLKEXMjwMKQp9uVYcC9
WxqHh0TmfcjNA/E4k0i6fjXcd7zx38JTAtTQIkGyNrBaBHAxIZrbWeIQgJ/9gxpQ
JuWGubpYeRFPYf9TvO5WdnMSLVibTqPf+LQYzicqNnIpsD8NUcno4AV9gtHC7GaE
hbcOL7z/zOEr1cIkFis2bfNhYrzBgv2OJW0VWREt/OdV8bI0/ABXKlRqZIheC9+y
LKxb+tTDCV4w2E9TkCeuehk/Pki4F7sJ5zRNlg/zrzkGZVp7909i5hy6lGESqOAI
fs0J9dNXralljsEjOGEnJgyQFbdSHBlpE07NFWOo+Tt3l664pVglmmAz8SrMElUI
F7PBPO2lwKUB2V2OM87bzeNBh2KnOkfZkJrFh3knXFc9jW32WAMORQb7ukW/E8le
gOO8aZtRtEXtSEm8NB8ilw==
=MbKC
-----END PGP SIGNATURE-----
Merge tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix the creation of hdev->name when index is greater than 9999
* tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: Fix the creation of hdev->name
====================
Link: https://lore.kernel.org/r/20220512002901.823647-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Second set of fixes for v5.18 and hopefully the last one. We have a
new iwlwifi maintainer, a fix to rfkill ioctl interface and important
fixes to both stack and two drivers.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmJ72akRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZso0wf+PmQev6QTWG/LPBfcIp7H6upRMS09/+Se
SEhS9UAE/qh4GJM0Mn1XE6T5mcokQZ/Ck5uaWT3Be9Dwbbk/ucAvYGEvf4OxmUEM
wwtDaA0BaFwS417iW5FLLAsu2ascN8yeje/+yK+Uu9DpB2KxXSIQB7OJpy3/HVAj
jEavgZN/fQEiTba9/JDa6DBMm2RVAZrmc+1sB5FakUocVTuN2pZAkM+lOBXvlHS4
4jd/KEFDyto2BMOR46IOwXTNKgBk2UovqeYFrTdonMz7W7nhzWJcguFU0e6rej5q
MCWxT8PryZ5yD5wl7pfOKZRTqFf+Mb+Up+yFipEEgd2SYnwxjplkaQ==
=qa3F
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v5.18
Second set of fixes for v5.18 and hopefully the last one. We have a
new iwlwifi maintainer, a fix to rfkill ioctl interface and important
fixes to both stack and two drivers.
* tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
nl80211: fix locking in nl80211_set_tx_bitrate_mask()
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
mac80211_hwsim: fix RCU protected chanctx access
mailmap: update Kalle Valo's email
mac80211: Reset MBSSID parameters upon connection
cfg80211: retrieve S1G operating channel number
nl80211: validate S1G channel width
mac80211: fix rx reordering with non explicit / psmp ack policy
ath11k: reduce the wait time of 11d scan and hw scan while add interface
MAINTAINERS: update iwlwifi driver maintainer
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
====================
Link: https://lore.kernel.org/r/20220511154535.A1A12C340EE@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set a size limit of 8 bytes of the written buffer to "hdev->name"
including the terminating null byte, as the size of "hdev->name" is 8
bytes. If an id value which is greater than 9999 is allocated,
then the "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)"
function call would lead to a truncation of the id value in decimal
notation.
Set an explicit maximum id parameter in the id allocation function call.
The id allocation function defines the maximum allocated id value as the
maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined
as 10000.
Signed-off-by: Itay Iellin <ieitayie@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In an effort to actually test the refcounting logic at runtime, add a
refcount_t member to prog_test_ref_kfunc and use it in selftests to
verify and test the whole logic more exhaustively.
The kfunc calls for prog_test_member do not require runtime refcounting,
as they are only used for verifier selftests, not during runtime
execution. Hence, their implementation now has a WARN_ON_ONCE as it is
not meant to be reachable code at runtime. It is strictly used in tests
triggering failure cases in the verifier. bpf_kfunc_call_memb_release is
called from map free path, since prog_test_member is embedded in map
value for some verifier tests, so we skip WARN_ON_ONCE for it.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220511194654.765705-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.
The above causes some sporadic MPTCP self-test failures, due to
this code:
tc -n $ns2 filter add dev ns2eth$i egress \
protocol ip prio 1000 \
handle 42 fw \
action pedit munge offset 148 u8 invert \
pipe csum tcp \
index 100
The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.
This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.
Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.
Fixes: db2c24175d ("act_pedit: access skb->data safely")
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a followup of previous patch.
Dumping the stack trace is a good start, but printing
basic skb information is probably better.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I have a syzbot report that managed to get a crash in skb_checksum_help()
If syzbot can trigger these BUG(), it makes sense to replace
them with more friendly WARN_ON_ONCE() since skb_checksum_help()
can instead return an error code.
Note that syzbot will still crash there, until real bug is fixed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This config option enables network debugging checks.
This patch adds DEBUG_NET_WARN_ON_ONCE(cond)
Note that this is not a replacement for WARN_ON_ONCE(cond)
as (cond) is not evaluated if CONFIG_DEBUG_NET is not set.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect
struct bpf_tramp_link(s) for a trampoline. struct bpf_tramp_link
extends bpf_link to act as a linked list node.
arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to
collects all bpf_tramp_link(s) that a trampoline should call.
Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links
instead of bpf_tramp_progs.
Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220510205923.3206889-2-kuifeng@fb.com
A user told me that bpf_jit_enable can be disabled on one system, but he
failed to disable bpf_jit_enable on the other system:
# echo 0 > /proc/sys/net/core/bpf_jit_enable
bash: echo: write error: Invalid argument
No useful info is available through the dmesg log, a quick analysis shows
that the issue is related with CONFIG_BPF_JIT_ALWAYS_ON.
When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set
to 1 and setting any other value than that will return failure.
It is better to print some info to tell the user if disable bpf_jit_enable
failed.
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1652153703-22729-3-git-send-email-yangtiezhu@loongson.cn
Clang's structure layout randomization feature gets upset when it sees
struct neighbor (which is randomized) cast to struct dn_neigh:
net/decnet/dn_route.c:1123:15: error: casting from randomized structure pointer type 'struct neighbour *' to 'struct dn_neigh *'
gateway = ((struct dn_neigh *)neigh)->addr;
^
Update all the open-coded casts to use container_of() to do the conversion
instead of depending on strict member ordering.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202205041247.WKBEHGS5-lkp@intel.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Zheng Yongjun <zhengyongjun3@huawei.com>
Cc: Bill Wendling <morbo@google.com>
Cc: linux-decnet-user@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220508102217.2647184-1-keescook@chromium.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pointer dev is being assigned a value that is never used, the assignment
and the variable are redundant and can be removed. Also replace null check
with the preferred !ptr idiom.
Cleans up clang scan warning:
net/x25/x25_proc.c:94:26: warning: Although the value stored to 'dev' is
used in the enclosing expression, the value is never actually read
from 'dev' [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220508214500.60446-1-colin.i.king@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If a physical clock supports a free running cycle counter, then
timestamps shall be based on this time too. For TX it is known in
advance before the transmission if a timestamp based on the free running
cycle counter is needed. For RX it is impossible to know which timestamp
is needed before the packet is received and assigned to a socket.
Support late timestamp determination by a network device. Therefore, an
address/cookie is stored within the new netdev_data field of struct
skb_shared_hwtstamps. This address/cookie is provided to a new network
device function called ndo_get_tstamp(), which returns a timestamp based
on the normal/adjustable time or based on the free running cycle
counter. If function is not supported, then timestamp handling is not
changed.
This mechanism is intended for RX, but TX use is also possible.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ptp_convert_timestamp() converts only the timestamp hwtstamp, which is
a field of the argument with the type struct skb_shared_hwtstamps *. So
a pointer to the hwtstamp field of this structure is sufficient.
Rework ptp_convert_timestamp() to use an argument of type ktime_t *.
This allows to add additional timestamp manipulation stages before the
call of ptp_convert_timestamp().
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The free running cycle counter of physical clocks called cycles shall be
used for hardware timestamps to enable synchronisation.
Introduce new flag SKBTX_HW_TSTAMP_USE_CYCLES, which signals driver to
provide a TX timestamp based on cycles if cycles are supported.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmJ3whYWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoYV9D/wKIqVyKZrMd+O7lougzQJt8ME5
J0mSlBCNoxP7d9WuMdnfjXH7O2hq323Qtw/x47dv0ln+H69XMdXrsiuQxJk2OELk
/2CkswpOVEwGaxZ93FM7vIHwmR0Rw/bd3rC+d83YzCzfKg5QDXNDoxQAMBX45hVv
7DHJRql6sEAAcoGQleM6LmtHN66sxXEcwPnhDvEvrSDDPW6YD7tRvkPPWAPledO7
EdZmr8WWWl4jgL8sPaIGIKgz+XGECMbLO1n7PtUtQ1IH9Xib8FoN9g6JzUUFXA0I
v23zWqTb75xjZvz8gJQUoVpLxEsTM33ai0S6Q26YAUkqKP/70yzjYPGHu+jpPj8I
SLmwzY8NHrfCyAkMxeIChDYQTK8nMFHWDA+lq1BsduBeb/oBNO9WFfuQOopUgSmo
5tYNlBb6Jar7UC17C/bGgV4y8UaTLSygwj/566Ty1b40bqh/+oycjt5rEXFqZN/3
A1doCLKC1pVl9anOyEo70w71VlhH1IKXBIqUF+UDFGiF/zvImfkAvPHSZXFEB13l
rF9+O+J824+yTOoo3BWFjX+A+xrS8rATGt5I6SqrhfCtuRTUCiNj/uAhmDC3Cziu
nR775J2r81y6iyCfY8GOOyXpNjLQR/2ApOa6mmliSCCym3aHBl2A/Dpc3hopcqHN
UX/7ovSwCzZcYhlvew==
=S0xN
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
* tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge:
batman-adv: Don't skb_split skbuffs with frag_list
====================
Link: https://lore.kernel.org/r/20220508132110.20451-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a race between switchdev_bridge_port_offload() and the
dsa_port_switchdev_sync_attrs() call right below it.
When switchdev_bridge_port_offload() finishes, FDB entries have been
replayed by the bridge, but are scheduled for deferred execution later.
However dsa_port_switchdev_sync_attrs -> dsa_port_can_apply_vlan_filtering()
may impose restrictions on the vlan_filtering attribute and refuse
offloading.
When this happens, the delayed FDB entries will dereference dp->bridge,
which is a NULL pointer because we have stopped the process of
offloading this bridge.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
pc : dsa_port_bridge_host_fdb_del+0x64/0x100
lr : dsa_slave_switchdev_event_work+0x130/0x1bc
Call trace:
dsa_port_bridge_host_fdb_del+0x64/0x100
dsa_slave_switchdev_event_work+0x130/0x1bc
process_one_work+0x294/0x670
worker_thread+0x80/0x460
---[ end trace 0000000000000000 ]---
Error: dsa_core: Must first remove VLAN uppers having VIDs also present in bridge.
Fix the bug by doing what we do on the normal bridge leave path as well,
which is to wait until the deferred FDB entries complete executing, then
exit.
The placement of dsa_flush_workqueue() after switchdev_bridge_port_unoffload()
guarantees that both the FDB additions and deletions on rollback are waited for.
Fixes: d7d0d423db ("net: dsa: flush switchdev workqueue when leaving the bridge")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220507134550.1849834-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add extack support to .ndo_fdb_del in netdevice.h and
all related methods.
Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When clatd starts with ebpf offloaing, and NETIF_F_GRO_FRAGLIST is enable,
several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's
ipv6 header will be changed to ipv4 after bpf_skb_proto_6_to_4,
network_header\transport_header\mac_header have been updated as ipv4 acts,
but other skbs in frag_list didnot update anything, just ipv6 packets.
udp_queue_rcv_skb will call skb_segment_list to traverse other skbs in
frag_list and make sure right udp payload is delivered to user space.
Unfortunately, other skbs in frag_list who are still ipv6 packets are
updated like the first skb and will have wrong transport header length.
e.g.before bpf_skb_proto_6_to_4,the first skb and other skbs in frag_list
has the same network_header(24)& transport_header(64), after
bpf_skb_proto_6_to_4, ipv6 protocol has been changed to ipv4, the first
skb's network_header is 44,transport_header is 64, other skbs in frag_list
didnot change.After skb_segment_list, the other skbs in frag_list has
different network_header(24) and transport_header(44), so there will be 20
bytes different from original,that is difference between ipv6 header and
ipv4 header. Just change transport_header to be the same with original.
Actually, there are two solutions to fix it, one is traversing all skbs
and changing every skb header in bpf_skb_proto_6_to_4, the other is
modifying frag_list skb's header in skb_segment_list. Considering
efficiency, adopt the second one--- when the first skb and other skbs in
frag_list has different network_header length, restore them to make sure
right udp payload is delivered to user space.
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- remove unnecessary type castings, by Yu Zhe
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmJ3xIgWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoa6GD/wPif4zkO407J0jU9xnf8F5w8Mz
5flPXyFu/YlecXAZjLb2BTl+6xI3KLf6lbDIRvcJ3rga2P69IL2JQvqNf9r8HtpP
zl95frZaxKT3S3DoMBdPyU9HgtrmYVnUXtrQGuL0Sm9CukW0p08SlhqlRgp1NVRy
XqNhmYhPJhhHk38rwOnNvMdUPsjzOCECV2Lbe4e1lX/503Q17DcYEL+Y+0GpQZwW
tDSwAUYT0Us0HJSEMwouag+x7uom/BlZml2iLuSTNKNYxoPb3JIOZiVyJ+2fWN0C
qlxy8+mOVJoKPRBlzRvX/Dr5GFJbav+Sg1tTR3gDYQvPB4jIPaZcAnI0OuPuRER+
FdJPk9VqEHTVuaMljpRJJVDTHX084IjgFRuF7xhrZf/oB076GoGzXG6XPJuOWcoN
BbX5ai4Vtj2aHq5RdbAaxHdrUoGM5+XKHvSXJmp4MgTC2dcEWlKqZMI39t/6BYBH
n9UHUMDgy1NL/C1olVpfTN9OUYwWec971/cQmU4mmuHpybVL1gACEO+vFx9jabO1
w5lmCbvMcl6SiN/pxi/+tsOOU5TB5PH8M5oAvbinBU9kCqi6g1ldj0jc5TUBttsj
YFYDpqnn/TnEfVjnqMsKEX9E30WevledW8Nms5jzxIzncjWGEse2l7CEwjEcNTtI
qkukIJYrbzvwR4OGDg==
=Cg7C
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20220508' of git://git.open-mesh.org/linux-merge
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- remove unnecessary type castings, by Yu Zhe
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that the gssproxy client connects to the server from the gssproxy
daemon process context so that the AF_LOCAL socket connection is done
using the correct path and namespaces.
Fixes: 1d658336b0 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit 892de36fd4.
The gssproxy server is unresponsive when it calls into the kernel to
start the upcall service, so it will not reply to our RPC ping at all.
Reported-by: "J.Bruce Fields" <bfields@fieldses.org>
Fixes: 892de36fd4 ("SUNRPC: Ensure gss-proxy connects on setup")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
All the users of these functions are gone, delete them before they gain
new ones.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_recvmsg() does not need to change transport header.
If transport header was needed, it should have been reset
by the producer (netlink_dump()), not the consumer(s).
The following trace probably happened when multiple threads
were using MSG_PEEK.
BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg
write to 0xffff88811e9f15b2 of 2 bytes by task 32012 on cpu 1:
skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
sock_recvmsg_nosec net/socket.c:948 [inline]
sock_recvmsg net/socket.c:966 [inline]
__sys_recvfrom+0x204/0x2c0 net/socket.c:2097
__do_sys_recvfrom net/socket.c:2115 [inline]
__se_sys_recvfrom net/socket.c:2111 [inline]
__x64_sys_recvfrom+0x74/0x90 net/socket.c:2111
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff88811e9f15b2 of 2 bytes by task 32005 on cpu 0:
skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
____sys_recvmsg+0x162/0x2f0
___sys_recvmsg net/socket.c:2674 [inline]
__sys_recvmsg+0x209/0x3f0 net/socket.c:2704
__do_sys_recvmsg net/socket.c:2714 [inline]
__se_sys_recvmsg net/socket.c:2711 [inline]
__x64_sys_recvmsg+0x42/0x50 net/socket.c:2711
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xffff -> 0x0000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 32005 Comm: syz-executor.4 Not tainted 5.18.0-rc1-syzkaller-00328-ge1f700ebd6be-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220505161946.2867638-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Highlights include:
Stable fixes:
- Fix a socket leak when setting up an AF_LOCAL RPC client
- Ensure that knfsd connects to the gss-proxy daemon on setup
Bugfixes:
- Fix a refcount leak when migrating a task off an offlined transport
- Don't gratuitously invalidate inode attributes on delegation return
- Don't leak sockets in xs_local_connect()
- Ensure timely close of disconnected AF_LOCAL sockets
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJ1f3wACgkQZwvnipYK
APKKIRAAmVUswcfRQ9wSz5wW6DCFU9hdsN9JD4pAcPvWAYGo8fqmn3I3qe/iaBCf
rrJF38SfQVygtthmAY4CBBwOiVxm2fvqanML2lta+ZUU15MqoH2px3kMYemRyulJ
9/2yP25AUSgkmwdEmm69hIXJkEJa3dsjg+LajQZ5X01DgKSfpObS5s9t/upM9kve
Wqz5QRr+aJnZuuYYJWxNmXZ4XQEkzHccg3aSswB6bEsEGNXKo8NnWryrSMnWTW1y
rQCb0e+gxpoVFgV3ngP1r9xT2l2ISbJIIhTPoj5hSjSVlFvQlIEyHtGA2vuIEZH9
hPJAnaSc7Xb+QER6XfZkTxjW+jtMl5OmMKkWUcUmHiYv2KIM8dUAd3ANnbDBCUvw
C5bGF907Qjqs5d2VdfsbisT9ikyn+xw6SFxcr9HYyH2T3dIsC1A8P9uUvn/afwUQ
EPfQIsIEDeufo6O8KLfF+gCO9kbk9rdaP8Bv3B2H94aRs1yYde9bJpa7QABncGbA
otWehkX/AbrIa4Zjp1ELzcVJxlIl+/AtxzCdGY2me1Ds388U/RKsyDWwXuGynLP6
98ycdtHWVyoJ48L5kZowuj8/3tEB998En5hh0HSuAd0DYkAuGxaSGb+iuwKi/M0H
+D1wZxef49r2ggQkEOsllTEjJKSHcq1+vCVASZ8ITEbcVUSiO90=
=LSoH
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a socket leak when setting up an AF_LOCAL RPC client
- Ensure that knfsd connects to the gss-proxy daemon on setup
Bugfixes:
- Fix a refcount leak when migrating a task off an offlined transport
- Don't gratuitously invalidate inode attributes on delegation return
- Don't leak sockets in xs_local_connect()
- Ensure timely close of disconnected AF_LOCAL sockets"
* tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: attempt AF_LOCAL connect on setup"
SUNRPC: Ensure gss-proxy connects on setup
SUNRPC: Ensure timely close of disconnected AF_LOCAL sockets
SUNRPC: Don't leak sockets in xs_local_connect()
NFSv4: Don't invalidate inode attributes on delegation return
SUNRPC release the transport of a relocated task with an assigned transport
These are now internal to the core, no need to expose them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers should call the TSO setting helper, GSO is controllable
by user space.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
be set on device creation") the gso_max_segs and gso_max_size
of a device were not controlled from user space.
The quoted commit added the ability to control them because of
the following setup:
netns A | netns B
veth<->veth eth0
If eth0 has TSO limitations and user wants to efficiently forward
traffic between eth0 and the veths they should copy the TSO
limitations of eth0 onto the veths. This would happen automatically
for macvlans or ipvlan but veth users are not so lucky (given the
loose coupling).
Unfortunately the commit in question allowed users to also override
the limits on real HW devices.
It may be useful to control the max GSO size and someone may be using
that ability (not that I know of any user), so create a separate set
of knobs to reliably record the TSO limitations. Validate the user
requests.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make later patches smaller create a helper for inheriting
the TSO limitations of a lower device. The TSO in the name
is not an accident, subsequent patches will replace GSO
with TSO in more names.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After drivers were converted to rely on direction, the flags is not
used anymore and can be removed.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
XFRM state doesn't need anything from flags except to understand
direction, so store it separately. For future patches, such change
will allow us to reuse xfrm_dev_offload for policy offload too, which
has three possible directions instead of two.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The struct xfrm_state_offload has all fields needed to hold information
for offloaded policies too. In order to do not create new struct with
same fields, let's rename existing one and reuse it later.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
num_exthdrs is set but never used, so delete it.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>