When CONFIG_IP_MULTICAST is not set and multicast ip is added to the device
with autojoin flag or when multicast ip is deleted kernel will crash.
steps to reproduce:
ip addr add 224.0.0.0/32 dev eth0
ip addr del 224.0.0.0/32 dev eth0
or
ip addr add 224.0.0.0/32 dev eth0 autojoin
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088
pc : _raw_write_lock_irqsave+0x1e0/0x2ac
lr : lock_sock_nested+0x1c/0x60
Call trace:
_raw_write_lock_irqsave+0x1e0/0x2ac
lock_sock_nested+0x1c/0x60
ip_mc_config.isra.28+0x50/0xe0
inet_rtm_deladdr+0x1a8/0x1f0
rtnetlink_rcv_msg+0x120/0x350
netlink_rcv_skb+0x58/0x120
rtnetlink_rcv+0x14/0x20
netlink_unicast+0x1b8/0x270
netlink_sendmsg+0x1a0/0x3b0
____sys_sendmsg+0x248/0x290
___sys_sendmsg+0x80/0xc0
__sys_sendmsg+0x68/0xc0
__arm64_sys_sendmsg+0x20/0x30
el0_svc_common.constprop.2+0x88/0x150
do_el0_svc+0x20/0x80
el0_sync_handler+0x118/0x190
el0_sync+0x140/0x180
Fixes: 93a714d6b5 ("multicast: Extend ip address command to enable multicast group join/leave on")
Signed-off-by: Taras Chornyi <taras.chornyi@plvision.eu>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rds_free_mr(), it calls rds_destroy_mr(mr) directly. But this
defeats the purpose of reference counting and makes MR free handling
impossible. It means that holding a reference does not guarantee that
it is safe to access some fields. For example, In
rds_cmsg_rdma_dest(), it increases the ref count, unlocks and then
calls mr->r_trans->sync_mr(). But if rds_free_mr() (and
rds_destroy_mr()) is called in between (there is no lock preventing
this to happen), r_trans_private is set to NULL, causing a panic.
Similar issue is in rds_rdma_unuse().
Reported-by: zerons <sironhide0null@gmail.com>
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And removed rds_mr_put().
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the local node id(qrtr_local_nid) is not modified after its
initialization, it equals to the broadcast node id(QRTR_NODE_BCAST).
So the messages from local node should not be taken as broadcast
and keep the process going to send them out anyway.
The definitions are as follow:
static unsigned int qrtr_local_nid = NUMA_NO_NODE;
Fixes: fdf5fd3975 ("net: qrtr: Broadcast messages only from control port")
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- support for asynchronous create and unlink (Jeff Layton). Creates
and unlinks are satisfied locally, without waiting for a reply from
the MDS, provided the client has been granted appropriate caps (new
in v15.y.z ("Octopus") release). This can be a big help for metadata
heavy workloads such as tar and rsync. Opt-in with the new nowsync
mount option.
- multiple blk-mq queues for rbd (Hannes Reinecke and myself). When
the driver was converted to blk-mq, we settled on a single blk-mq
queue because of a global lock in libceph and some other technical
debt. These have since been addressed, so allocate a queue per CPU
to enhance parallelism.
- don't hold onto caps that aren't actually needed (Zheng Yan). This
has been our long-standing behavior, but it causes issues with some
active/standby applications (synchronous I/O, stalls if the standby
goes down, etc).
- .snap directory timestamps consistent with ceph-fuse (Luis Henriques)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl6OEO4THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi0XNB/wItYipkjlL5fIUBqiRWzYai72DWdPp
CnOZo8LB+O0MQDomPT6DdpU1OMlWZi5HF7zklrZ35LTm21UkRNC9zvccjs9l66PJ
qo9cKJbxxju+hgzIvgvK9PjlDlaiFAc/pkF8lZ/NaOnSsM1vvsFL9IuY2LXS38MY
A/uUTZNUnFy5udam8TPuN+gWwZcUIH48lRWQLWe2I/hNJSweX1l8OHvecOBg+cYH
G+8vb7mLU2V9ky0YT5JJmVxUV3CWA5wH6ZrWWy1ofVDdeSFLPrhgWX6IMjaNq+Gd
xPfxmly47uBviSqON9dMkiThgy0Qj7yi0Pvx+1sAZbD7aj/6A4qg3LX5
=GIX0
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main items are:
- support for asynchronous create and unlink (Jeff Layton).
Creates and unlinks are satisfied locally, without waiting for a
reply from the MDS, provided the client has been granted
appropriate caps (new in v15.y.z ("Octopus") release). This can be
a big help for metadata heavy workloads such as tar and rsync.
Opt-in with the new nowsync mount option.
- multiple blk-mq queues for rbd (Hannes Reinecke and myself).
When the driver was converted to blk-mq, we settled on a single
blk-mq queue because of a global lock in libceph and some other
technical debt. These have since been addressed, so allocate a
queue per CPU to enhance parallelism.
- don't hold onto caps that aren't actually needed (Zheng Yan).
This has been our long-standing behavior, but it causes issues with
some active/standby applications (synchronous I/O, stalls if the
standby goes down, etc).
- .snap directory timestamps consistent with ceph-fuse (Luis
Henriques)"
* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
ceph: fix snapshot directory timestamps
ceph: wait for async creating inode before requesting new max size
ceph: don't skip updating wanted caps when cap is stale
ceph: request new max size only when there is auth cap
ceph: cleanup return error of try_get_cap_refs()
ceph: return ceph_mdsc_do_request() errors from __get_parent()
ceph: check all mds' caps after page writeback
ceph: update i_requested_max_size only when sending cap msg to auth mds
ceph: simplify calling of ceph_get_fmode()
ceph: remove delay check logic from ceph_check_caps()
ceph: consider inode's last read/write when calculating wanted caps
ceph: always renew caps if mds_wanted is insufficient
ceph: update dentry lease for async create
ceph: attempt to do async create when possible
ceph: cache layout in parent dir on first sync create
ceph: add new MDS req field to hold delegated inode number
ceph: decode interval_sets for delegated inos
ceph: make ceph_fill_inode non-static
ceph: perform asynchronous unlink if we have sufficient caps
ceph: don't take refs to want mask unless we have all bits
...
In testing, we found that for request sockets the sk->sk_reuseport field
may yet be uninitialized, which caused bpf_sk_assign() to randomly
succeed or return -ESOCKTNOSUPPORT when handling the forward ACK in a
three-way handshake.
Fix it by only applying the reuseport check for full sockets.
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200408033540.10339-1-joe@wand.net.nz
Building with some experimental patches, I came across a warning
in the tls code:
include/linux/compiler.h:215:30: warning: assignment discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
215 | *(volatile typeof(x) *)&(x) = (val); \
| ^
net/tls/tls_main.c:650:4: note: in expansion of macro 'smp_store_release'
650 | smp_store_release(&saved_tcpv4_prot, prot);
This appears to be a legitimate warning about assigning a const pointer
into the non-const 'saved_tcpv4_prot' global. Annotate both the ipv4 and
ipv6 pointers 'const' to make the code internally consistent.
Fixes: 5bb4c45d46 ("net/tls: Read sk_prot once when building tls proto ops")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Creation and management of L2TPv3 tunnels and session through netlink
requires CAP_NET_ADMIN. However, a process with CAP_NET_ADMIN in a
non-initial user namespace gets an EPERM due to the use of the
genetlink GENL_ADMIN_PERM flag. Thus, management of L2TP VPNs inside
an unprivileged container won't work.
We replaced the GENL_ADMIN_PERM by the GENL_UNS_ADMIN_PERM flag
similar to other network modules which also had this problem, e.g.,
openvswitch (commit 4a92602aa1 "openvswitch: allow management from
inside user namespaces") and nl80211 (commit 5617c6cd6f "nl80211:
Allow privileged operations from user namespaces").
I tested this in the container runtime trustm3 (trustm3.github.io)
and was able to create l2tp tunnels and sessions in unpriviliged
(user namespaced) containers using a private network namespace.
For other runtimes such as docker or lxc this should work, too.
Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the kernel specifies binutils 2.23 as the minimum version, we
can remove ifdefs for AVX2 and ADX throughout.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Doing this probing inside of the Makefiles means we have a maze of
ifdefs inside the source code and child Makefiles that need to make
proper decisions on this too. Instead, we do it at Kconfig time, like
many other compiler and assembler options, which allows us to set up the
dependencies normally for full compilation units. In the process, the
ADX test changes to use %eax instead of %r10 so that it's valid in both
32-bit and 64-bit mode.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
In the current hsr code, only 0 and 1 protocol versions are valid.
But current hsr code doesn't check the version, which is received by
userspace.
Test commands:
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 version 4
In the test commands, version 4 is invalid.
So, the command should be failed.
After this patch, following error will occur.
"Error: hsr: Only versions 0..1 are supported."
Fixes: ee1c279772 ("net/hsr: Added support for HSR v1")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After driver sets the missed chain on the tc skb extension it is
consumed (deleted) by tc_classify_ingress and tc jumps to that chain.
If tc now misses on this chain (either no match, or no goto action),
then last executed chain remains 0, and the skb extension is not re-added,
and the next datapath (ovs) will start from 0.
Fix that by setting last executed chain to the chain read from the skb
extension, so if there is a miss, we set it back.
Fixes: af699626ee ("net: sched: Support specifying a starting chain via tc skb ext")
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly
because next timer interrupt could come shortly after starting softirq.
For commonly used CONFIG_HZ=1000 nothing changes.
Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
Reported-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit fac6fce9bd ("net: icmp6: provide input address for
traceroute6") ICMPv6 errors have source addresses from the ingress
interface. However, this overrides when source address selection is
influenced by setting preferred source addresses on routes.
This can result in ICMP errors being lost to upstream BCP38 filters
when the wrong source addresses are used, breaking path MTU discovery
and traceroute.
This patch sets the modified source address selection to only take place
when the route used has no prefsrc set.
It can be tested with:
ip link add v1 type veth peer name v2
ip netns add test
ip netns exec test ip link set lo up
ip link set v2 netns test
ip link set v1 up
ip netns exec test ip link set v2 up
ip addr add 2001:db8::1/64 dev v1 nodad
ip addr add 2001:db8::3 dev v1 nodad
ip netns exec test ip addr add 2001:db8::2/64 dev v2 nodad
ip netns exec test ip route add unreachable 2001:db8:1::1
ip netns exec test ip addr add 2001:db8:100::1 dev lo
ip netns exec test ip route add 2001:db8::1 dev v2 src 2001:db8:100::1
ip route add 2001:db8:1000::1 via 2001:db8::2
traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute6 -s 2001:db8::3 2001:db8:1000::1
ip netns delete test
Output before:
$ traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
1 2001:db8::2 (2001:db8::2) 0.843 ms !N 0.396 ms !N 0.257 ms !N
$ traceroute6 -s 2001:db8::3 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
1 2001:db8::2 (2001:db8::2) 0.772 ms !N 0.257 ms !N 0.357 ms !N
After:
$ traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
1 2001:db8:100::1 (2001:db8:100::1) 8.885 ms !N 0.310 ms !N 0.174 ms !N
$ traceroute6 -s 2001:db8::3 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
1 2001:db8::2 (2001:db8::2) 1.403 ms !N 0.205 ms !N 0.313 ms !N
Fixes: fac6fce9bd ("net: icmp6: provide input address for traceroute6")
Signed-off-by: Tim Stallard <code@timstallard.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net, they are:
1) Fix spurious overlap condition in the rbtree tree, from Stefano Brivio.
2) Fix possible uninitialized pointer dereference in nft_lookup.
3) IDLETIMER v1 target matches the Android layout, from
Maciej Zenczykowski.
4) Dangling pointer in nf_tables_set_alloc_name, from Eric Dumazet.
5) Fix RCU warning splat in ipset find_set_type(), from Amol Grover.
6) Report EOPNOTSUPP on unsupported set flags and object types in sets.
7) Add NFT_SET_CONCAT flag to provide consistent error reporting
when users defines set with ranges in concatenations in old kernels.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies when
possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
/MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
=YUcP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies
when possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by
NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays"
* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
NFS: Clean up process of marking inode stale.
SUNRPC: Don't start a timer on an already queued rpc task
NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
NFS: Beware when dereferencing the delegation cred
NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
NFS: finish_automount() requires us to hold 2 refs to the mount record
NFS: Fix a few constant_table array definitions
NFS: Try to join page groups before an O_DIRECT retransmission
NFS: Refactor nfs_lock_and_join_requests()
NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
NFS: Clean up nfs_lock_and_join_requests()
NFS: Remove the redundant function nfs_pgio_has_mirroring()
NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
NFS: Fix use-after-free issues in nfs_pageio_add_request()
NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
...
Pull networking fixes from David Miller:
1) Slave bond and team devices should not be assigned ipv6 link local
addresses, from Jarod Wilson.
2) Fix clock sink config on some at803x PHY devices, from Oleksij
Rempel.
3) Uninitialized stack space transmitted in slcan frames, fix from
Richard Palethorpe.
4) Guard HW VLAN ops properly in stmmac driver, from Jose Abreu.
5) "=" --> "|=" fix in aquantia driver, from Colin Ian King.
6) Fix TCP fallback in mptcp, from Florian Westphal. (accessing a plain
tcp_sk as if it were an mptcp socket).
7) Fix cavium driver in some configurations wrt. PTP, from Yue Haibing.
8) Make ipv6 and ipv4 consistent in the lower bound allowed for
neighbour entry retrans_time, from Hangbin Liu.
9) Don't use private workqueue in pegasus usb driver, from Petko
Manolov.
10) Fix integer overflow in mlxsw, from Colin Ian King.
11) Missing refcnt init in cls_tcindex, from Cong Wang.
12) One too many loop iterations when processing cmpri entries in ipv6
rpl code, from Alexander Aring.
13) Disable SG and TSO by default in r8169, from Heiner Kallweit.
14) NULL deref in macsec, from Davide Caratti.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
macsec: fix NULL dereference in macsec_upd_offload()
skbuff.h: Improve the checksum related comments
net: dsa: bcm_sf2: Ensure correct sub-node is parsed
qed: remove redundant assignment to variable 'rc'
wimax: remove some redundant assignments to variable result
mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_VLAN_MANGLE
mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_PRIORITY
r8169: change back SG and TSO to be disabled by default
net: dsa: bcm_sf2: Do not register slave MDIO bus with OF
ipv6: rpl: fix loop iteration
tun: Don't put_page() for all negative return values from XDP program
net: dsa: mt7530: fix null pointer dereferencing in port5 setup
mptcp: add some missing pr_fmt defines
net: phy: micrel: kszphy_resume(): add delay after genphy_resume() before accessing PHY registers
net_sched: fix a missing refcnt in tcindex_init()
net: stmmac: dwmac1000: fix out-of-bounds mac address reg setting
mlxsw: spectrum_trap: fix unintention integer overflow on left shift
pegasus: Remove pegasus' own workqueue
neigh: support smaller retrans_time settting
net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry
...
Stefano originally proposed to introduce this flag, users hit EOPNOTSUPP
in new binaries with old kernels when defining a set with ranges in
a concatenation.
Fixes: f3a2181e16 ("netfilter: nf_tables: Support for sets with multiple ranged fields")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
EINVAL should be used for malformed netlink messages. New userspace
utility and old kernels might easily result in EINVAL when exercising
new set features, which is misleading.
Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
first_len is the remainder of the first page we're copying.
If this size is larger, then out of page boundary write will
otherwise happen.
Fixes: c05cd36458 ("xsk: add support to allow unaligned chunk placement")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1585813930-19712-1-git-send-email-lirongqing@baidu.com
This patch fix the loop iteration by not walking over the last
iteration. The cmpri compressing value exempt the last segment. As the
code shows the last iteration will be overwritten by cmpre value
handling which is for the last segment.
I think this doesn't end in any bufferoverflows because we work on worst
case temporary buffer sizes but it ends in not best compression settings
in some cases.
Fixes: 8610c7c6e3 ("net: ipv6: add support for rpl sr exthdr")
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix read with O_NONBLOCK to allow incomplete read and return immediately
- Rest is just cleanup (indent, unused field in struct, extra semicolon)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl6LCyEACgkQq06b7GqY
5nAAQg//ThE/rojrB74S8yaD3sJlfsT+BoMzGDeos9FGdsHpO8t1ISUiheAcUn/Y
a6igfJxTefxEO5SWqvsGEousRl0sRTyLo8RuQ2CC7+zKN8Ou+eG+4s2QXfDxBD+p
xJGJrsWD2SkQ98/hkxDY6d31BM9iQGvpvVp/c9qLwtvTlnvEMXll0JiACcnmiPRt
z92bx8QkaPlt399qv7LltfHftxCHOMcieQlYAF3wDjLauiEdDWysiMn0NIC2izwY
hPlsVCiV504yhvQFcXVnmbWVJe88THlBVVjuVAi9IwpU8D2+1UK1kM/yiyVkYml+
U4JC7ZBQoqw2obSLPk5mZ6E7nGWfWnIBpt/B9dO7akP8pS9bv+uCjwg65tFEop5/
z9nuyjoQT/dEoThveeoEoxpIbyd690n1vbhcTdGDFDX5nfglrxj0S6bNVW0k6e41
9aPwWzAfU9Ip9GHwE6Ewf/y88mMI7rCP+o/pC8XkUUGiV+foFt3oO5wCcvGbxwgJ
nnZ5oY+Ren/QXOmq/cN3RUWJvTRbc8TDBm6Fa0iGkk7d1OUBuRTWuBUXvxHMpO3A
xHapDRddPcrZ9QQSJZGvMSYwG0ZUZ6MRL2qTU4X/3sIi0giApFkRMpmmo0HjvMAf
PIFl2MG2Ok4A8yWQN98AMXy8vgs6ql6+Wb3W0b5dNFsAGEk7f9U=
=wAe5
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.7' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Not much new, but a few patches for this cycle:
- Fix read with O_NONBLOCK to allow incomplete read and return
immediately
- Rest is just cleanup (indent, unused field in struct, extra
semicolon)"
* tag '9p-for-5.7' of git://github.com/martinetd/linux:
net/9p: remove unused p9_req_t aux field
9p: read only once on O_NONBLOCK
9pnet: allow making incomplete read requests
9p: Remove unneeded semicolon
9p: Fix Kconfig indentation
ip_set_type_list is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of ip_set_type_mutex.
Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Android has long had an extension to IDLETIMER to send netlink
messages to userspace, see:
https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/include/uapi/linux/netfilter/xt_IDLETIMER.h#42
Note: this is idletimer target rev 1, there is no rev 0 in
the Android common kernel sources, see registration at:
https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/net/netfilter/xt_IDLETIMER.c#483
When we compare that to upstream's new idletimer target rev 1:
https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/tree/include/uapi/linux/netfilter/xt_IDLETIMER.h#n46
We immediately notice that these two rev 1 structs are the
same size and layout, and that while timer_type and send_nl_msg
are differently named and serve a different purpose, they're
at the same offset.
This makes them impossible to tell apart - and thus one cannot
know in a mixed Android/vanilla environment whether one means
timer_type or send_nl_msg.
Since this is iptables/netfilter uapi it introduces a problem
between iptables (vanilla vs Android) userspace and kernel
(vanilla vs Android) if the two don't match each other.
Additionally when at some point in the future Android picks up
5.7+ it's not at all clear how to resolve the resulting merge
conflict.
Furthermore, since upgrading the kernel on old Android phones
is pretty much impossible there does not seem to be an easy way
out of this predicament.
The only thing I've been able to come up with is some super
disgusting kernel version >= 5.7 check in the iptables binary
to flip between different struct layouts.
By adding a dummy field to the vanilla Linux kernel header file
we can force the two structs to be compatible with each other.
Long term I think I would like to deprecate send_nl_msg out of
Android entirely, but I haven't quite been able to figure out
exactly how we depend on it. It seems to be very similar to
sysfs notifications but with some extra info.
Currently it's actually always enabled whenever Android uses
the IDLETIMER target, so we could also probably entirely
remove it from the uapi in favour of just always enabling it,
but again we can't upgrade old kernels already in the field.
(Also note that this doesn't change the structure's size,
as it is simply fitting into the pre-existing padding, and
that since 5.7 hasn't been released yet, there's still time
to make this uapi visible change)
Cc: Manoj Basapathi <manojbm@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Initialize set lookup matching element to NULL. Otherwise, the
NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to
deference an uninitialized pointer to the matching element. Make sure
element data area and stateful expression are accessed if there is a
matching set element.
This patch undoes 24791b9aa1 ("netfilter: nft_set_bitmap: initialize set
element extension in lookups") which is not required anymore.
Fixes: 339706bc21 ("netfilter: nft_lookup: update element stateful expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Case a1. for overlap detection in __nft_rbtree_insert() is not a valid
one: start-after-start is not needed to detect any type of interval
overlap and it actually results in a false positive if, while
descending the tree, this is the only step we hit after starting from
the root.
This introduced a regression, as reported by Pablo, in Python tests
cases ip/ip.t and ip/numgen.t:
ip/ip.t: ERROR: line 124: add rule ip test-ip4 input ip hdrlength vmap { 0-4 : drop, 5 : accept, 6 : continue } counter: This rule should not have failed.
ip/numgen.t: ERROR: line 7: add rule ip test-ip4 pre dnat to numgen inc mod 10 map { 0-5 : 192.168.10.100, 6-9 : 192.168.20.200}: This rule should not have failed.
Drop case a1. and renumber others, so that they are a bit clearer. In
order for these diagrams to be readily understandable, a bigger rework
is probably needed, such as an ASCII art of the actual rbtree (instead
of a flattened version).
Shell script test sets/0044interval_overlap_0 should cover all
possible cases for false negatives, so I consider that test case still
sufficient after this change.
v2: Fix comments for cases a3. and b3.
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the test for whether a task is already queued to prevent
corruption of the timer list in __rpc_sleep_on_priority_timeout().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl6AiV4ACgkQ+7dXa6fL
C2uCDg/+NevZHjgevGpFS9WByBtzFXKbewSSO0VZqS8RcFcy7+jRdTkcP66HJkGD
3JxfI4wQm2LOiaX8tHDlmWIPfp3G5Tnjae8peOEVBrCZ5WZMcD3CzquMv18kd8KK
5iiFzsTYDCywP/BwHCJgVQjPNpSp9drNCL5T+oql0nWeEUVAmiVziTIZgM9cAiyj
S/sfe76KxdNzaEEvpL5Mg/ieq1es/ssGA1jZ8jZI+YfN8mtBBH8KebckskKSgVTK
OjBLQAEanCbq3UsEqSsvEqbBpK7JkQJPOE153VRr6Nq/0MDSniwZjqIYOkgeB9pR
YStIrK9LyL/D0aMe8A52I7Ml1zuUUXb9zVAo3yLgubWPDYXvLs8n8Cgl8ZCQBFXy
0t86rlSq9SooGf3M2ket8Gk/XgymcRNP9LHr6MUGEO23l2ELBoO8hsWvUTAHoKVx
kn27S4YceW4+5UYfYm87ZQpUTbKPNATuBkts+QxiSrZMCnk82G6keA8JqNgKCe4d
xvUJew/JEGx2J0T8vYiBfpB1zEbtYdluglYyyzHpl0wkafYGRj1tTvwRZwQNhxw5
IFQvxEK2kVFTKcoLBGq909udb4QlQOfMAYal0u/4iCNiipNgxZawcWSfIMGN4qbG
tPvvBL2ocRmycmoMS7EHwWUSWIzwggi7zosDsM8jfLHBshBLeAE=
=vvN4
-----END PGP SIGNATURE-----
Merge tag 'keys-fixes-20200329' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull keyrings fixes from David Howells:
"Here's a couple of patches that fix a circular dependency between
holding key->sem and mm->mmap_sem when reading data from a key.
One potential issue is that a filesystem looking to use a key inside,
say, ->readpages() could deadlock if the key being read is the key
that's required and the buffer the key is being read into is on a page
that needs to be fetched.
The case actually detected is a bit more involved - with a filesystem
calling request_key() and locking the target keyring for write - which
could be being read"
* tag 'keys-fixes-20200329' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
KEYS: Avoid false positive ENOMEM error on key read
KEYS: Don't write out to userspace while holding key semaphore
- Fix EXCHANGE_ID response when NFSD runs in a container
- A battery of new static trace points
- Socket transports now use bio_vec to send Replies
- NFS/RDMA now supports filesystems with no .splice_read method
- Favor memcpy() over DMA mapping for small RPC/RDMA Replies
- Add pre-requisites for supporting multiple Write chunks
- Numerous minor fixes and clean-ups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJegj9pAAoJEDNqszNvZn+XNGgP/RsRul/UGe70YoPS6AwxI+c1
2JVni5LV83aVGSN1df/xRNdugWh4j8e8stBIJPCnWFzUERFvrzVeVyW0/dlIy37l
SRL1L62EzFUejAL45O+CkF5+KI2kAWMgDCv+rPnFnIuXVa/sThx63F1AJikVMPjB
7We3vd5Kh/CrMeMflebJYuY12xE6di2b3ifkZRO0/yuMaAuqJrDreYf4L6xpA4rC
QnKQcNl7LGlOwGSI2WvDrCLE056PJFhTuzTawI80NKnkXMMFNc6/7NoXJqasVlHG
fiki2mHbJrbYd8isIm3Vl/QkFsM8QjijtpVxC9gd151w0P7DfpMYmSzlZL7nvq/R
Nt6IIqbaxWSS1VULsuS7rDtBwwZpW/LRWaUhEvMwimR2jeOxcwtlDVTX/dRH2mxq
Ume64Hn8xMEhhx9tHCPQ+Rgjqv5m+ZEAvmV6B7RM9nT2z2MSzQQESeMB14VZZmF/
2oH1dDCVdCmb4ZOcD5yxL6Y1hijn45s+YHdts9uIsCudKYPI906vPhogFC+PVJv+
MrOiUf8d40H0ra8VAUFCjAceOulkv90aLhBjoHbPsP4SQOTsRuUXnsKESZpSHY72
nT/uPM23ULv4kQ6tHB8yQ3ordjCBRgb4zIKtotc3Wpi7dhO8u6ptPj4soiflRShO
8/3N5dYfqdt9FRyr7Z8/
=o5G0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfsd updates from Chuck Lever:
- Fix EXCHANGE_ID response when NFSD runs in a container
- A battery of new static trace points
- Socket transports now use bio_vec to send Replies
- NFS/RDMA now supports filesystems with no .splice_read method
- Favor memcpy() over DMA mapping for small RPC/RDMA Replies
- Add pre-requisites for supporting multiple Write chunks
- Numerous minor fixes and clean-ups
[ Chuck is filling in for Bruce this time while he and his family settle
into a new house ]
* tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6: (39 commits)
svcrdma: Fix leak of transport addresses
SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
SUNRPC/cache: don't allow invalid entries to be flushed
nfsd: fsnotify on rmdir under nfsd/clients/
nfsd4: kill warnings on testing stateids with mismatched clientids
nfsd: remove read permission bit for ctl sysctl
NFSD: Fix NFS server build errors
sunrpc: Add tracing for cache events
SUNRPC/cache: Allow garbage collection of invalid cache entries
nfsd: export upcalls must not return ESTALE when mountd is down
nfsd: Add tracepoints for update of the expkey and export cache entries
nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
nfsd: Add tracing to nfsd_set_fh_dentry()
nfsd: Don't add locks to closed or closing open stateids
SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
SUNRPC: Refactor xs_sendpages()
svcrdma: Avoid DMA mapping small RPC Replies
svcrdma: Fix double sync of transport header buffer
svcrdma: Refactor chunk list encoders
SUNRPC: Add encoders for list item discriminators
...
The initial refcnt of struct tcindex_data should be 1,
it is clear that I forgot to set it to 1 in tcindex_init().
This leads to a dec-after-zero warning.
Reported-by: syzbot+8325e509a1bf83ec741d@syzkaller.appspotmail.com
Fixes: 304e024216 ("net_sched: add a temporary refcnt for struct tcindex_data")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here are 3 SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your current
tree, that should be trivial to handle (one file modified by two things,
one file deleted.)
All 3 of these have been in linux-next for a while, with no reported
issues other than the merge conflict.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
m5AuCzO3Azt9KBi7NL+L
=2Lm5
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
"Here are three SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your
current tree, that should be trivial to handle (one file modified by
two things, one file deleted.)
All three of these have been in linux-next for a while, with no
reported issues other than the merge conflict"
* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
ASoC: MT6660: make spdxcheck.py happy
.gitignore: add SPDX License Identifier
.gitignore: remove too obvious comments
Currently, we limited the retrans_time to be greater than HZ/2. i.e.
setting retrans_time less than 500ms will not work. This makes the user
unable to achieve a more accurate control for bonding arp fast failover.
Update the sanity check to HZ/100, which is 10ms, to let users have more
ability on the retrans_time control.
v3: sync the behavior with IPv6 and update all the timer handler
v2: use HZ instead of hard code number
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct sw_flow is protected by RCU, when traversing them,
use hlist_for_each_entry_rcu.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a
non-root user to bind a socket to an interface if it is not already
bound. This is useful to allow an application to bind itself to a
specific VRF for outgoing or incoming connections. Currently, an
application wanting to manage connections through several VRF need to
be privileged.
Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for
Wine (76e21053b5 and c4062dfc42) specifically for use by
non-root processes. However, they are restricted to sendmsg() and not
usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to
get the same privilege. As for TCP servers, outside the VRF use case,
SO_BINDTODEVICE would only further restrict connections a server could
accept.
When an application is restricted to a VRF (with `ip vrf exec`), the
socket is bound to an interface at creation and therefore, a
non-privileged call to SO_BINDTODEVICE to escape the VRF fails.
When an application bound a socket to SO_BINDTODEVICE and transmit it
to a non-privileged process through a Unix socket, a tentative to
change the bound device also fails.
Before:
>>> import socket
>>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
PermissionError: [Errno 1] Operation not permitted
After:
>>> import socket
>>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
>>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
PermissionError: [Errno 1] Operation not permitted
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sparse reports an error due to use of RCU_INIT_POINTER helper to assign to
sk_user_data pointer, which is not tagged with __rcu:
net/core/sock.c:1875:25: error: incompatible types in comparison expression (different address spaces):
net/core/sock.c:1875:25: void [noderef] <asn:4> *
net/core/sock.c:1875:25: void *
... and rightfully so. sk_user_data is not always treated as a pointer to
an RCU-protected data. When it is used to point at an RCU-protected object,
we access it with __sk_user_data to inform sparse about it.
In this case, when the child socket does not inherit sk_user_data from the
parent, there is no reason to treat it as an RCU-protected pointer.
Use a regular assignment to clear the pointer value.
Fixes: f1ff5ce2cd ("net, sk_msg: Clear sk_user_data pointer on clone if tagged")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200402125524.851439-1-jakub@cloudflare.com
Obtained with:
$ make W=1 net/mptcp/token.o
net/mptcp/token.c:53: warning: Function parameter or member 'req' not described in 'mptcp_token_new_request'
net/mptcp/token.c:98: warning: Function parameter or member 'sk' not described in 'mptcp_token_new_connect'
net/mptcp/token.c:133: warning: Function parameter or member 'conn' not described in 'mptcp_token_new_accept'
net/mptcp/token.c:178: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy_request'
net/mptcp/token.c:191: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy'
Fixes: 79c0949e9a (mptcp: Add key generation and token tree)
Fixes: 58b0991962 (mptcp: create msk early)
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
mptcp_subflow_data_available() is commonly called via
ssk->sk_data_ready(), in this case the mptcp socket lock
cannot be acquired.
Therefore, while we can safely discard subflow data that
was already received up to msk->ack_seq, we cannot be sure
that 'subflow->data_avail' will still be valid at the time
userspace wants to read the data -- a previous read on a
different subflow might have carried this data already.
In that (unlikely) event, msk->ack_seq will have been updated
and will be ahead of the subflow dsn.
We can check for this condition and skip/resync to the expected
sequence number.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed at least until proper MPTCP-Level fin/reset
signalling gets added:
We wake parent when a subflow changes, but we should do this only
when all subflows have closed, not just one.
Schedule the mptcp worker and tell it to check eof state on all
subflows.
Only flag mptcp socket as closed and wake userspace processes blocking
in poll if all subflows have closed.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Paasch reports following crash:
general protection fault [..]
CPU: 0 PID: 2874 Comm: syz-executor072 Not tainted 5.6.0-rc5 #62
RIP: 0010:__pv_queued_spin_lock_slowpath kernel/locking/qspinlock.c:471
[..]
queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
do_raw_spin_lock include/linux/spinlock.h:181 [inline]
spin_lock_bh include/linux/spinlock.h:343 [inline]
__mptcp_flush_join_list+0x44/0xb0 net/mptcp/protocol.c:278
mptcp_shutdown+0xb3/0x230 net/mptcp/protocol.c:1882
[..]
Problem is that mptcp_shutdown() socket isn't an mptcp socket,
its a plain tcp_sk. Thus, trying to access mptcp_sk specific
members accesses garbage.
Root cause is that accept() returns a fallback (tcp) socket, not an mptcp
one. There is code in getpeername to detect this and override the sockets
stream_ops. But this will only run when accept() caller provided a
sockaddr struct. "accept(fd, NULL, 0)" will therefore result in
mptcp stream ops, but with sock->sk pointing at a tcp_sk.
Update the existing fallback handling to detect this as well.
Moreover, mptcp_shutdown did not have fallback handling, and
mptcp_poll did it too late so add that there as well.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable err is being initialized with a value that is never
read and it is being updated later with a new value. The initialization
is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: f41071407c85 ("net: dsa: implement auto-normalization of MTU for bridge hardware datapath")
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding slave and team port devices should not have link-local addresses
automatically added to them, as it can interfere with openvswitch being
able to properly add tc ingress.
Basic reproducer, courtesy of Marcelo:
$ ip link add name bond0 type bond
$ ip link set dev ens2f0np0 master bond0
$ ip link set dev ens2f1np2 master bond0
$ ip link set dev bond0 up
$ ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens2f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
mq master bond0 state UP group default qlen 1000
link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
5: ens2f1np2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc
mq master bond0 state DOWN group default qlen 1000
link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP group default qlen 1000
link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
inet6 fe80::20f:53ff:fe2f:ea40/64 scope link
valid_lft forever preferred_lft forever
(above trimmed to relevant entries, obviously)
$ sysctl net.ipv6.conf.ens2f0np0.addr_gen_mode=0
net.ipv6.conf.ens2f0np0.addr_gen_mode = 0
$ sysctl net.ipv6.conf.ens2f1np2.addr_gen_mode=0
net.ipv6.conf.ens2f1np2.addr_gen_mode = 0
$ ip a l ens2f0np0
2: ens2f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
mq master bond0 state UP group default qlen 1000
link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
valid_lft forever preferred_lft forever
$ ip a l ens2f1np2
5: ens2f1np2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc
mq master bond0 state DOWN group default qlen 1000
link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
valid_lft forever preferred_lft forever
Looks like addrconf_sysctl_addr_gen_mode() bypasses the original "is
this a slave interface?" check added by commit c2edacf80e, and
results in an address getting added, while w/the proposed patch added,
no address gets added. This simply adds the same gating check to another
code path, and thus should prevent the same devices from erroneously
obtaining an ipv6 link-local address.
Fixes: d35a00b8e3 ("net/ipv6: allow sysctl to change link-local address generation mode")
Reported-by: Moshe Levi <moshele@mellanox.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Marcelo Ricardo Leitner <mleitner@redhat.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although we intentionally use an ordered workqueue for all tc
filter works, the ordering is not guaranteed by RCU work,
given that tcf_queue_work() is esstenially a call_rcu().
This problem is demostrated by Thomas:
CPU 0:
tcf_queue_work()
tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
-> Migration to CPU 1
CPU 1:
tcf_queue_work(&p->rwork, tcindex_destroy_work);
so the 2nd work could be queued before the 1st one, which leads
to a free-after-free.
Enforcing this order in RCU work is hard as it requires to change
RCU code too. Fortunately we can workaround this problem in tcindex
filter by taking a temporary refcnt, we only refcnt it right before
we begin to destroy it. This simplifies the code a lot as a full
refcnt requires much more changes in tcindex_set_parms().
Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
Fixes: 3d210534cc ("net_sched: fix a race condition in tcindex_destroy()")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Fix the iwlwifi regression, from Johannes Berg.
2) Support BSS coloring and 802.11 encapsulation offloading in
hardware, from John Crispin.
3) Fix some potential Spectre issues in qtnfmac, from Sergey
Matyukevich.
4) Add TTL decrement action to openvswitch, from Matteo Croce.
5) Allow paralleization through flow_action setup by not taking the
RTNL mutex, from Vlad Buslov.
6) A lot of zero-length array to flexible-array conversions, from
Gustavo A. R. Silva.
7) Align XDP statistics names across several drivers for consistency,
from Lorenzo Bianconi.
8) Add various pieces of infrastructure for offloading conntrack, and
make use of it in mlx5 driver, from Paul Blakey.
9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.
10) Lots of parallelization improvements during configuration changes
in mlxsw driver, from Ido Schimmel.
11) Add support to devlink for generic packet traps, which report
packets dropped during ACL processing. And use them in mlxsw
driver. From Jiri Pirko.
12) Support bcmgenet on ACPI, from Jeremy Linton.
13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
Starovoitov, and your's truly.
14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.
15) Fix sysfs permissions when network devices change namespaces, from
Christian Brauner.
16) Add a flags element to ethtool_ops so that drivers can more simply
indicate which coalescing parameters they actually support, and
therefore the generic layer can validate the user's ethtool
request. Use this in all drivers, from Jakub Kicinski.
17) Offload FIFO qdisc in mlxsw, from Petr Machata.
18) Support UDP sockets in sockmap, from Lorenz Bauer.
19) Fix stretch ACK bugs in several TCP congestion control modules,
from Pengcheng Yang.
20) Support virtual functiosn in octeontx2 driver, from Tomasz
Duszynski.
21) Add region operations for devlink and use it in ice driver to dump
NVM contents, from Jacob Keller.
22) Add support for hw offload of MACSEC, from Antoine Tenart.
23) Add support for BPF programs that can be attached to LSM hooks,
from KP Singh.
24) Support for multiple paths, path managers, and counters in MPTCP.
From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
and others.
25) More progress on adding the netlink interface to ethtool, from
Michal Kubecek"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
cxgb4/chcr: nic-tls stats in ethtool
net: dsa: fix oops while probing Marvell DSA switches
net/bpfilter: remove superfluous testing message
net: macb: Fix handling of fixed-link node
net: dsa: ksz: Select KSZ protocol tag
netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
net: stmmac: create dwmac-intel.c to contain all Intel platform
net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
net: dsa: bcm_sf2: Add support for matching VLAN TCI
net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
net: dsa: bcm_sf2: Disable learning for ASP port
net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
net: dsa: b53: Restore VLAN entries upon (re)configuration
net: dsa: bcm_sf2: Fix overflow checks
hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
...
In case memory resources for buf were allocated, release them before
return.
Addresses-Coverity-ID: 1492011 ("Resource leak")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix an oops in dsa_port_phylink_mac_change() caused by a combination
of a20f997010 ("net: dsa: Don't instantiate phylink for CPU/DSA
ports unless needed") and the net-dsa-improve-serdes-integration
series of patches 65b7a2c8e3 ("Merge branch
'net-dsa-improve-serdes-integration'").
Unable to handle kernel NULL pointer dereference at virtual address 00000124
pgd = c0004000
[00000124] *pgd=00000000
Internal error: Oops: 805 [#1] SMP ARM
Modules linked in: tag_edsa spi_nor mtd xhci_plat_hcd mv88e6xxx(+) xhci_hcd armada_thermal marvell_cesa dsa_core ehci_orion libdes phy_armada38x_comphy at24 mcp3021 sfp evbug spi_orion sff mdio_i2c
CPU: 1 PID: 214 Comm: irq/55-mv88e6xx Not tainted 5.6.0+ #470
Hardware name: Marvell Armada 380/385 (Device Tree)
PC is at phylink_mac_change+0x10/0x88
LR is at mv88e6352_serdes_irq_status+0x74/0x94 [mv88e6xxx]
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A testing message was brought by 13d0f7b814 ("net/bpfilter: fix dprintf
usage for /dev/kmsg") but should've been deleted before patch submission.
Although it doesn't cause any harm to the code or functionality itself, it's
totally unpleasant to have it displayed on every loop iteration with no real
use case. Thus remove it unconditionally.
Fixes: 13d0f7b814 ("net/bpfilter: fix dprintf usage for /dev/kmsg")
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
1) Add support to specify a stateful expression in set definitions,
this allows users to specify e.g. counters per set elements.
2) Flowtable software counter support.
3) Flowtable hardware offload counter support, from wenxu.
3) Parallelize flowtable hardware offload requests, from Paul Blakey.
This includes a patch to add one work entry per offload command.
4) Several patches to rework nf_queue refcount handling, from Florian
Westphal.
4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
information is missing and set up indirect flow block as TC_SETUP_FT,
patch from wenxu.
5) Stricter netlink attribute sanity check on filters, from Romain Bellan
and Florent Fourcot.
5) Annotations to make sparse happy, from Jules Irenge.
6) Improve icmp errors in debugging information, from Haishuang Yan.
7) Fix warning in IPVS icmp error debugging, from Haishuang Yan.
8) Fix endianess issue in tcp extension header, from Sergey Marinkevich.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch allowed device drivers to publish their default
binding between packet trap policers and packet trap groups. However,
some users might not be content with this binding and would like to
change it.
In case user space passed a packet trap policer identifier when setting
a packet trap group, invoke the appropriate device driver callback and
pass the new policer identifier.
v2:
* Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
devlink_trap_group_set() and bail if not present
* Add extack error message in case trap group was partially modified
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet trap groups are used to aggregate logically related packet traps.
Currently, these groups allow user space to batch operations such as
setting the trap action of all member traps.
In order to prevent the CPU from being overwhelmed by too many trapped
packets, it is desirable to bind a packet trap policer to these groups.
For example, to limit all the packets that encountered an exception
during routing to 10Kpps.
Allow device drivers to bind default packet trap policers to packet trap
groups when the latter are registered with devlink.
The next patch will enable user space to change this default binding.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devices capable of offloading the kernel's datapath and perform
functions such as bridging and routing must also be able to send (trap)
specific packets to the kernel (i.e., the CPU) for processing.
For example, a device acting as a multicast-aware bridge must be able to
trap IGMP membership reports to the kernel for processing by the bridge
module.
In most cases, the underlying device is capable of handling packet rates
that are several orders of magnitude higher compared to those that can
be handled by the CPU.
Therefore, in order to prevent the underlying device from overwhelming
the CPU, devices usually include packet trap policers that are able to
police the trapped packets to rates that can be handled by the CPU.
This patch allows capable device drivers to register their supported
packet trap policers with devlink. User space can then tune the
parameters of these policer (currently, rate and burst size) and read
from the device the number of packets that were dropped by the policer,
if supported.
Subsequent patches in the series will allow device drivers to create
default binding between these policers and packet trap groups and allow
user space to change the binding.
v2:
* Add 'strict_start_type' in devlink policy
* Have device drivers provide max/min rate/burst size for each policer.
Use them to check validity of user provided parameters
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid taking a reference on listen sockets by checking the socket type
in the sk_assign and in the corresponding skb_steal_sock() code in the
the transport layer, and by ensuring that the prefetch free (sock_pfree)
function uses the same logic to check whether the socket is refcounted.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
the determination of whether the socket is reference counted in the case
where it is prefetched by earlier logic such as early_demux.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
Add support for TPROXY via a new bpf helper, bpf_sk_assign().
This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (eg by CIDR):
$ ip route add local default dev lo
This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
- Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
Maybe someday we'll get to the end of this stuff...maybe...
- Some organizational work to bring some order to the core-api manual.
- Various new docs and additions to the existing documentation.
- Typo fixes, warning fixes, ...
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl6BLf4PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YLhkIAIhcg6gxp0oZZ3KDfQyhvej0EWQGVDNkmloQ
O1VOSV3RJsZL9HwN9xSNnNfN5+hw5RUYVbn1s201uj6kovZY9qcTpHP2LCizUeGb
eFkSTmzkyAuAbJjuVLgMPDerJPEew0HnudiToeSpQeoIL1WB6YGd4/5H/cN1KLex
8ggjllcY0wOgbiFffmK6+tavDv7vT0lKTdwKRYh2nxu7zrPVVd1ZnW+RtntdTVQt
i+xwV6/YdWtg5C53IwBPpeyubX40vqaIjU8rzpLq5SCVbsZN14sSR709m1AYCOK0
i4VDWEhfA2XBi6Nycl5U0czuGziwoHrTgSCkS1mmSDujnpgfKM8=
=6YOS
-----END PGP SIGNATURE-----
Merge tag 'docs-5.7' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"This has been a busy cycle for documentation work.
Highlights include:
- Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
Maybe someday we'll get to the end of this stuff...maybe...
- Some organizational work to bring some order to the core-api
manual.
- Various new docs and additions to the existing documentation.
- Typo fixes, warning fixes, ..."
* tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits)
Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT
MAINTAINERS: adjust to filesystem doc ReST conversion
docs: deprecated.rst: Add BUG()-family
doc: zh_CN: add translation for virtiofs
doc: zh_CN: index files in filesystems subdirectory
docs: locking: Drop :c:func: throughout
docs: locking: Add 'need' to hardirq section
docs: conf.py: avoid thousands of duplicate label warning on Sphinx
docs: prevent warnings due to autosectionlabel
docs: fix reference to core-api/namespaces.rst
docs: fix pointers to io-mapping.rst and io_ordering.rst files
Documentation: Better document the softlockup_panic sysctl
docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref
docs: perf: imx-ddr.rst: get rid of a warning
docs: filesystems: fuse.rst: supress a Sphinx warning
docs: translations: it: avoid duplicate refs at programming-language.rst
docs: driver.rst: supress two ReSt warnings
docs: trace: events.rst: convert some new stuff to ReST format
Documentation: Add io_ordering.rst to driver-api manual
Documentation: Add io-mapping.rst to driver-api manual
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
xVbzeR9UIw==
=bYLJ
-----END PGP SIGNATURE-----
Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Here are the io_uring changes for this merge window. Light on new
features this time around (just splice + buffer selection), lots of
cleanups, fixes, and improvements to existing support. In particular,
this contains:
- Cleanup fixed file update handling for stack fallback (Hillf)
- Re-work of how pollable async IO is handled, we no longer require
thread offload to handle that. Instead we rely using poll to drive
this, with task_work execution.
- In conjunction with the above, allow expendable buffer selection,
so that poll+recv (for example) no longer has to be a split
operation.
- Make sure we honor RLIMIT_FSIZE for buffered writes
- Add support for splice (Pavel)
- Linked work inheritance fixes and optimizations (Pavel)
- Async work fixes and cleanups (Pavel)
- Improve io-wq locking (Pavel)
- Hashed link write improvements (Pavel)
- SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"
* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
io_uring: cleanup io_alloc_async_ctx()
io_uring: fix missing 'return' in comment
io-wq: handle hashed writes in chains
io-uring: drop 'free_pfile' in struct io_file_put
io-uring: drop completion when removing file
io_uring: Fix ->data corruption on re-enqueue
io-wq: close cancel gap for hashed linked work
io_uring: make spdxcheck.py happy
io_uring: honor original task RLIMIT_FSIZE
io-wq: hash dependent work
io-wq: split hashing and enqueueing
io-wq: don't resched if there is no work
io-wq: remove duplicated cancel code
io_uring: fix truncated async read/readv and write/writev retry
io_uring: dual license io_uring.h uapi header
io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
io_uring: Fix unused function warnings
io_uring: add end-of-bits marker and build time verify it
io_uring: provide means of removing buffers
io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
...
If outer_proto is not set, GCC warning as following:
In file included from net/netfilter/ipvs/ip_vs_core.c:52:
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp':
include/net/ip_vs.h:233:4: warning: 'outer_proto' may be used uninitialized in this function [-Wmaybe-uninitialized]
233 | printk(KERN_DEBUG pr_fmt(msg), ##__VA_ARGS__); \
| ^~~~~~
net/netfilter/ipvs/ip_vs_core.c:1666:8: note: 'outer_proto' was declared here
1666 | char *outer_proto;
| ^~~~~~~~~~~
Fixes: 73348fed35 ("ipvs: optimize tunnel dumps for icmp errors")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I got a problem on MIPS with Big-Endian is turned on: every time when
NF trying to change TCP MSS it returns because of new.v16 was greater
than old.v16. But real MSS was 1460 and my rule was like this:
add rule table chain tcp option maxseg size set 1400
And 1400 is lesser that 1460, not greater.
Later I founded that main causer is cast from u32 to __be16.
Debugging:
In example MSS = 1400(HEX: 0x578). Here is representation of each byte
like it is in memory by addresses from left to right(e.g. [0x0 0x1 0x2
0x3]). LE — Little-Endian system, BE — Big-Endian, left column is type.
LE BE
u32: [78 05 00 00] [00 00 05 78]
As you can see, u32 representation will be casted to u16 from different
half of 4-byte address range. But actually nf_tables uses registers and
store data of various size. Actually TCP MSS stored in 2 bytes. But
registers are still u32 in definition:
struct nft_regs {
union {
u32 data[20];
struct nft_verdict verdict;
};
};
So, access like regs->data[priv->sreg] exactly u32. So, according to
table presents above, per-byte representation of stored TCP MSS in
register will be:
LE BE
(u32)regs->data[]: [78 05 00 00] [05 78 00 00]
^^ ^^
We see that register uses just half of u32 and other 2 bytes may be
used for some another data. But in nft_exthdr_tcp_set_eval() it casted
just like u32 -> __be16:
new.v16 = src
But u32 overfill __be16, so it get 2 low bytes. For clarity draw
one more table(<xx xx> means that bytes will be used for cast).
LE BE
u32: [<78 05> 00 00] [00 00 <05 78>]
(u32)regs->data[]: [<78 05> 00 00] [05 78 <00 00>]
As you can see, for Little-Endian nothing changes, but for Big-endian we
take the wrong half. In my case there is some other data instead of
zeros, so new MSS was wrongly greater.
For shooting this bug I used solution for ports ranges. Applying of this
patch does not affect Little-Endian systems.
Signed-off-by: Sergey Marinkevich <sergey.marinkevich@eltex-co.ru>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2020-03-29
Here are a few more Bluetooth patches for the 5.7 kernel:
- Fix assumption of encryption key size when reading fails
- Add support for DEFER_SETUP with L2CAP Enhanced Credit Based Mode
- Fix issue with auto-connected devices
- Fix suspend handling when entering the state fails
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The approach taken to pass the port policer methods on to drivers is
pragmatic. It is similar to the port mirroring implementation (in that
the DSA core does all of the filter block interaction and only passes
simple operations for the driver to implement) and dissimilar to how
flow-based policers are going to be implemented (where the driver has
full control over the flow_cls_offload data structure).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make room for other actions for the matchall filter by keeping the
mirred argument parsing self-contained in its own function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On low memory system, run time dumps can consume too much memory. Add
administrator ability to disable auto dumps per reporter as part of the
error flow handle routine.
This attribute is not relevant while executing
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET.
By default, auto dump is activated for any reporter that has a dump method,
as part of the reporter registration to devlink.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When health reporter is registered to devlink, devlink will implicitly set
auto recover if and only if the reporter has a recover method. No reason
to explicitly get the auto recover flag from the driver.
Remove this flag from all drivers that called
devlink_health_reporter_create.
All existing health reporters set auto recovery to true if they have a
recover method.
Yet, administrator can unset auto recover via netlink command as prior to
this patch.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rest of the devlink code sets the extack message using
NL_SET_ERR_MSG_MOD. Change the existing appearances of NL_SET_ERR_MSG
to NL_SET_ERR_MSG_MOD.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It may be up to the driver (in case ANY HW stats is passed) to select
which type of HW stats he is going to use. Add an infrastructure to
expose this information to user.
$ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
$ tc -s filter show dev enp3s0np1 ingress
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
eth_type ipv4
dst_ip 192.168.1.1
in_hw in_hw_count 2
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1 installed 10 sec used 10 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats immediate <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to pass value and selector to. The helper packs them
into struct and puts them into netlink message.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2020-03-28
1) Use kmem_cache_zalloc() instead of kmem_cache_alloc()
in xfrm_state_alloc(). From Huang Zijiang.
2) esp_output_fill_trailer() is the same in IPv4 and IPv6,
so share this function to avoide code duplcation.
From Raed Salem.
3) Add offload support for esp beet mode.
From Xin Long.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in preparing the module name in a buffer. The format
string can be passed diectly to 'request_module()'.
This axes a few lines of code and cleans a few things:
- max len for a driver name is MODULE_NAME_LEN wich is ~ 60 chars,
not 128. It would be down-sized in 'request_module()'
- we should pass the total size of the buffer to 'snprintf()', not the
size minus 1
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long says:
On udp rx path udp_rcv_segment() may do segment where the frag skbs
will get the header copied from the head skb in skb_segment_list()
by calling __copy_skb_header(), which could overwrite the frag skbs'
extensions by __skb_ext_copy() and cause a leak.
This issue was found after loading esp_offload where a sec path ext
is set in the skb.
Fix this by discarding head state of the fraglist skb before replacing
its contents.
Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without NAPI_GRO_CB(skb)->is_flist initialized, when the dev doesn't
support NETIF_F_GRO_FRAGLIST, is_flist can still be set and fraglist
will be used in udp_gro_receive().
So fix it by initializing is_flist with 0 in udp_gro_receive.
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Coverity complains about a double write to *p. Don't bother with
osd_instructions and directly skip to the end of redirect reply.
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_monc_handle_map() confuses static checkers which report a
false use-after-free on monc->monmap, missing that monc->monmap and
client->monc.monmap is the same pointer.
Use monc->monmap consistently and get rid of "old", which is redundant.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since these helpers are only used by ceph.ko, move them there and
rename them with _sync_ qualifiers.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates
DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Implement TSINFO_GET request to get timestamping information for a network
device. This is traditionally available via ETHTOOL_GET_TS_INFO ioctl
request.
Move part of ethtool_get_ts_info() into common.c so that ioctl and netlink
code use the same logic to get timestamping information from the device.
v3: use "TSINFO" rather than "TIMESTAMP", suggested by Richard Cochran
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add three string sets related to timestamping information:
ETH_SS_SOF_TIMESTAMPING: SOF_TIMESTAMPING_* flags
ETH_SS_TS_TX_TYPES: timestamping Tx types
ETH_SS_TS_RX_FILTERS: timestamping Rx filters
These will be used for TIMESTAMP_GET request.
v2: avoid compiler warning ("enumeration value not handled in switch")
in net_hwtstamp_validate()
v3: omit dash in Tx type names ("one-step-*" -> "onestep-*"), suggested by
Richard Cochran
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send ETHTOOL_MSG_EEE_NTF notification whenever EEE settings of a network
device are modified using ETHTOOL_MSG_EEE_SET netlink message or
ETHTOOL_SEEE ioctl request.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement EEE_SET netlink request to set EEE settings of a network device.
These are traditionally set with ETHTOOL_SEEE ioctl request.
The netlink interface allows setting the EEE status for all link modes
supported by kernel but only first 32 link modes can be set at the moment
as only those are supported by the ethtool_ops callback.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement EEE_GET request to get EEE settings of a network device. These
are traditionally available via ETHTOOL_GEEE ioctl request.
The netlink interface allows reporting EEE status for all link modes
supported by kernel but only first 32 link modes are provided at the moment
as only those are reported by the ethtool_ops callback and drivers.
v2: fix alignment (whitespace only)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send ETHTOOL_MSG_PAUSE_NTF notification whenever pause parameters of
a network device are modified using ETHTOOL_MSG_PAUSE_SET netlink message
or ETHTOOL_SPAUSEPARAM ioctl request.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement PAUSE_SET netlink request to set pause parameters of a network
device. Thease are traditionally set with ETHTOOL_SPAUSEPARAM ioctl
request.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement PAUSE_GET request to get pause parameters of a network device.
These are traditionally available via ETHTOOL_GPAUSEPARAM ioctl request.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send ETHTOOL_MSG_COALESCE_NTF notification whenever coalescing parameters
of a network device are modified using ETHTOOL_MSG_COALESCE_SET netlink
message or ETHTOOL_SCOALESCE ioctl request.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement COALESCE_SET netlink request to set coalescing parameters of
a network device. These are traditionally set with ETHTOOL_SCOALESCE ioctl
request. This commit adds only support for device coalescing parameters,
not per queue coalescing parameters.
Like the ioctl implementation, the generic ethtool code checks if only
supported parameters are modified; if not, first offending attribute is
reported using extack.
v2: fix alignment (whitespace only)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement COALESCE_GET request to get coalescing parameters of a network
device. These are traditionally available via ETHTOOL_GCOALESCE ioctl
request. This commit adds only support for device coalescing parameters,
not per queue coalescing parameters.
Omit attributes with zero values unless they are declared as supported
(i.e. the corresponding bit in ethtool_ops::supported_coalesce_params is
set).
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew noticed that some handlers for *_SET commands leak a netdev
reference if required ethtool_ops callbacks do not exist. One of them is
ethnl_set_privflags(), a simple reproducer would be e.g.
ip link add veth1 type veth peer name veth2
ethtool --set-priv-flags veth1 foo on
ip link del veth1
Make sure dev_put() is called when ethtool_ops check fails.
Fixes: f265d79959 ("ethtool: set device private flags with PRIVFLAGS_SET request")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds functionality to configure routes for RPL source routing
functionality. There is no IPIP functionality yet implemented which can
be added later when the cases when to use IPv6 encapuslation comes more
clear.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The build_state callback of lwtunnel doesn't contain the net namespace
structure yet. This patch will add it so we can check on specific
address configuration at creation time of rpl source routes.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds rpl source routing receive handling. Everything works
only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly
the same behaviour as IPv6 segmentation routing. To handle compression
and uncompression a rpl.c file is created which contains the necessary
functionality. The receive handling will also care about IPv6
encapsulated so far it's specified as possible nexthdr in RFC 6554.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a functionality to addrconf to check on a specific RPL
address configuration. According to RFC 6554:
To detect loops in the SRH, a router MUST determine if the SRH
includes multiple addresses assigned to any interface on that
router. If such addresses appear more than once and are separated by
at least one address not assigned to that router.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when creating a new ipip interface with no local/remote configuration,
the lookup is done with TUNNEL_NO_KEY flag, making it impossible to
match the new interface (only possible match being fallback or metada
case interface); e.g: `ip link add tunl1 type ipip dev eth0`
To fix this case, adding a flag check before the key comparison so we
permit to match an interface with no local/remote config; it also avoids
breaking possible userland tools relying on TUNNEL_NO_KEY flag and
uninitialised key.
context being on my side, I'm creating an extra ipip interface attached
to the physical one, and moving it to a dedicated namespace.
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose a new netlink family to userspace to control the PM, setting:
- list of local addresses to be signalled.
- list of local addresses used to created subflows.
- maximum number of add_addr option to react
When the msk is fully established, the PM netlink attempts to
announce the 'signal' list via the ADD_ADDR option. Since we
currently lack the ADD_ADDR echo (and related event) only the
first addr is sent.
After exhausting the 'announce' list, the PM tries to create
subflow for each addr in 'local' list, waiting for each
connection to be completed before attempting the next one.
Idea is to add an additional PM hook for ADD_ADDR echo, to allow
the PM netlink announcing multiple addresses, in sequence.
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s"
or "nstat" will show them automatically.
The MPTCP MIB counters are allocated in a distinct pcpu area in order to
avoid bloating/wasting TCP pcpu memory.
Counters are allocated once the first MPTCP socket is created in a
network namespace and free'd on exit.
If no sockets have been allocated, all-zero mptcp counters are shown.
The MIB counter list is taken from the multipath-tcp.org kernel, but
only a few counters have been picked up so far. The counter list can
be increased at any time later on.
v2 -> v3:
- remove 'inline' in foo.c files (David S. Miller)
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add ulp-specific diagnostic functions, so that subflow information can be
dumped to userspace programs like 'ss'.
v2 -> v3:
- uapi: use bit macros appropriate for userspace
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On timeout event, schedule a work queue to do the retransmission.
Retransmission code closely resembles the sendmsg() implementation and
re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
sake - and peeking the relevant dfrag from the rtx head.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>