Now that all devlink health callbacks and related code are in
health.c, move the common health functions and the
devlink_health_reporter struct to be local to health.c.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink health report test callback from leftover.c to health.c. No
functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink health report dump callbacks and related code from
leftover.c to health.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Devlink fmsg (formatted message) is used by devlink health diagnose,
dump and drivers which support these devlink health callbacks.
Therefore, move the devlink fmsg helpers and related code to health.c.
Move devlink health diagnose to health.c as well. No functional change
in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink health report helper and recover callback and related code
from leftover.c to health.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink health get and set callbacks and related code from
leftover.c to health.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
devlink_nl_health_reporter_fill() error flow calls nla_nest_end(). Fix
it to call nla_nest_cancel() instead.
Note the bug is harmless as genlmsg_cancel() cancels the entire message,
so no Fixes tag is added.
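For illustration, the corrected unwinding looks roughly like this
(abridged sketch; attribute names as in devlink, the real function
fills more attributes):

	reporter_attr = nla_nest_start_noflag(msg,
					      DEVLINK_ATTR_HEALTH_REPORTER);
	if (!reporter_attr)
		goto genlmsg_cancel;
	if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME,
			   reporter->ops->name))
		goto reporter_nest_cancel;
	nla_nest_end(msg, reporter_attr);
	genlmsg_end(msg, hdr);
	return 0;

reporter_nest_cancel:
	nla_nest_cancel(msg, reporter_attr);	/* was: nla_nest_end() */
genlmsg_cancel:
	genlmsg_cancel(msg, hdr);
	return -EMSGSIZE;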
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink health reporter create/destroy and related dev code to a
new file, health.c. This file shall include all callbacks and
functionality that are related to devlink health.
In addition, fix kdoc indentation and make the reporter create/destroy
kdoc clearer. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
&xdp_buff and &xdp_frame are bound in a way that

	xdp_buff->data_hard_start == xdp_frame

It's always the case and e.g. xdp_convert_buff_to_frame() relies on
this.
IOW, the following:

	for (u32 i = 0; i < 0xdead; i++) {
		xdpf = xdp_convert_buff_to_frame(&xdp);
		xdp_convert_frame_to_buff(xdpf, &xdp);
	}

shouldn't ever modify @xdpf's contents or the pointer itself.
However, "live packet" code wrongly treats &xdp_frame as part of its
context placed *before* the data_hard_start. With such flow,
data_hard_start is sizeof(*xdpf) off to the right and no longer points
to the XDP frame.
Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
places and praying that there are no more miscalcs left somewhere in the
code, unionize ::frm with ::data in a flex array, so that both start
pointing to the actual data_hard_start and the XDP frame actually starts
being a part of it, i.e. a part of the headroom, not the context.
A nice side effect is that the maximum frame size for this mode gets
increased by 40 bytes, as xdp_buff::frame_sz includes everything from
data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
info.
Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
hardcoded for 64 bit && 4k pages, it can be made more flexible later on.
Minor: align `&head->data` with how `head->frm` is assigned for
consistency.
Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
clarity.
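The resulting layout looks roughly like this (sketch per the
description above; see net/bpf/test_run.c for the authoritative
definition):

	struct xdp_page_head {
		struct xdp_buff orig_ctx;
		struct xdp_buff ctx;
		union {
			/* ::data_hard_start starts here */
			DECLARE_FLEX_ARRAY(struct xdp_frame, frame);
			DECLARE_FLEX_ARRAY(u8, data);
		};
	};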
(was found while testing XDP traffic generator on ice, which calls
xdp_convert_frame_to_buff() for each XDP frame)
Fixes: b530e9e106 ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230215185440.4126672-1-aleksander.lobakin@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This documentation wasn't added in the original patch; add it now.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 6e4c0d0460 ("wifi: mac80211: add a workaround for receiving non-standard mesh A-MSDU")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we have multiple interfaces receiving the same frame,
such as a multicast frame, one interface might have a sta
and the other not. In this case, link_sta would be set but
not cleared again.
Always set link_sta, so we keep an invariant that link_sta
and sta are either both set or both not set.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's at least one case in ieee80211_rx_for_interface()
where we might pass &((struct sta_info *)NULL)->sta to it,
only to then do container_of() and check the result for
NULL; but checking the result of container_of() for NULL
looks really odd.
Fix this by just passing the struct sta_info * instead.
Fixes: e66b7920aa ("wifi: mac80211: fix initialization of rx->link and rx->link_sta")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a connection was established without going through
NL80211_CMD_CONNECT, the ssid was never set in the wireless_dev struct.
Now we set it in __cfg80211_connect_result() when it is not already set.
When using a userspace configuration that does not call
cfg80211_connect() (can be checked with breakpoints in the kernel),
this patch should allow `networkctl status device_name` to output the
SSID instead of null.
Cc: stable@vger.kernel.org
Reported-by: Yohan Prod'homme <kernel@zoddo.fr>
Fixes: 7b0a0e3c3a ("wifi: cfg80211: do some rework towards MLO link APIs")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216711
Signed-off-by: Marc Bornand <dev.mbornand@systemb.ch>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When swap is activated to a file on an NFSv4 mount we arrange that the
state manager thread is always present as starting a new thread requires
memory allocations that might block waiting for swap.
Unfortunately the code for allowing the state manager thread to exit when
swap is disabled was not tested properly and does not work.
This can be seen by examining /proc/fs/nfsfs/servers after disabling swap
and unmounting the filesystem. The servers file will still list one
entry. Also a "ps" listing will show the state manager thread is still
present.
There are two problems.
1/ rpc_clnt_swap_deactivate() doesn't walk up the ->cl_parent list to
find the primary client on which the state manager runs.
2/ The thread is not woken up properly and it immediately goes back to
sleep without checking whether it is really needed. Using
nfs4_schedule_state_manager() ensures a proper wake-up.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
lianhui reports that when MPLS fails to register the sysctl table
under the new location (during a device rename) the old pointers won't
get overwritten and may be freed again (double free).
Handle this gracefully. The best option would be unregistering
the MPLS from the device completely on failure, but unfortunately
mpls_ifdown() can fail. So failing fully is also unreliable.
Another option is to register the new table first and only remove
the old one if the new one succeeds. That requires more code, changes
the order of notifications, and two tables may be visible at the
same time.
The sysctl pointer is not used in the rest of the code - set it to NULL
on failure and skip the unregister if it is already NULL.
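A sketch of the resulting unregister path (simplified; the actual code
lives in net/mpls/af_mpls.c and may differ in detail):

	static void mpls_dev_sysctl_unregister(struct net_device *dev,
					       struct mpls_dev *mdev)
	{
		if (!mdev->sysctl)	/* registration failed earlier */
			return;

		unregister_net_sysctl_table(mdev->sysctl);
		mdev->sysctl = NULL;	/* don't unregister twice */
	}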
Reported-by: lianhui tang <bluetlh@gmail.com>
Fixes: 0fae3bf018 ("mpls: handle device renames for per-device sysctls")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e48c414ee6 ("[INET]: Generalise the TCP sock ID lookup routines")
commented out the definition of SOCK_REFCNT_DEBUG in 2005 and later another
commit 463c84b97f ("[NET]: Introduce inet_connection_sock") removed it.
Since we can track all of this through bpf and kprobe-related tools,
and the feature can print loads of information which might not be that
helpful even under a little bit of pressure, remove the whole feature,
which has been inactive for many years.
Link: https://lore.kernel.org/lkml/20230211065153.54116-1-kerneljasonxing@gmail.com/
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
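For example (hypothetical kobject type; the pattern is the same in the
touched files):

	static const struct kobj_type example_ktype = {
		.release	= example_release,
		.sysfs_ops	= &example_sysfs_ops,
	};

	/* kobject_init_and_add() takes a const struct kobj_type *
	 * since the referenced commit */
	err = kobject_init_and_add(&obj->kobj, &example_ktype, parent, "example");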
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definition to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending a SYN message, this kernel stack trace is observed:
...
[ 13.396352] RIP: 0010:_copy_from_iter+0xb4/0x550
...
[ 13.398494] Call Trace:
[ 13.398630] <TASK>
[ 13.398630] ? __alloc_skb+0xed/0x1a0
[ 13.398630] tipc_msg_build+0x12c/0x670 [tipc]
[ 13.398630] ? shmem_add_to_page_cache.isra.71+0x151/0x290
[ 13.398630] __tipc_sendmsg+0x2d1/0x710 [tipc]
[ 13.398630] ? tipc_connect+0x1d9/0x230 [tipc]
[ 13.398630] ? __local_bh_enable_ip+0x37/0x80
[ 13.398630] tipc_connect+0x1d9/0x230 [tipc]
[ 13.398630] ? __sys_connect+0x9f/0xd0
[ 13.398630] __sys_connect+0x9f/0xd0
[ 13.398630] ? preempt_count_add+0x4d/0xa0
[ 13.398630] ? fpregs_assert_state_consistent+0x22/0x50
[ 13.398630] __x64_sys_connect+0x16/0x20
[ 13.398630] do_syscall_64+0x42/0x90
[ 13.398630] entry_SYSCALL_64_after_hwframe+0x63/0xcd
It is because commit a41dad905e ("iov_iter: saner checks for attempt
to copy to/from iterator") has introduced a sanity check for copying
from/to an iov iterator. The lack of a copy direction from the iterator
viewpoint leads to a kernel stack trace like the one above.
This commit fixes the issue by initializing the iov iterator with the
correct copy direction when sending SYN or ACK without data.
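The fix boils down to giving the empty iterator an explicit direction,
along these lines (sketch; see net/tipc/socket.c for the actual change):

	/* dataless SYN/ACK: no buffers, but the copy direction of the
	 * iterator must still be valid */
	iov_iter_kvec(&m.msg_iter, ITER_SOURCE, NULL, 0, 0);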
Fixes: f25dcc7687 ("tipc: tipc ->sendmsg() conversion")
Reported-by: syzbot+d43608d061e8847ec9f3@syzkaller.appspotmail.com
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Link: https://lore.kernel.org/r/20230214012606.5804-1-tung.q.nguyen@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The convention for find_first_bit() is 0-based, while ffs()
is 1-based, so this is now off-by-one. I cannot reproduce the
gcc-9 problem, but since the -1 is now removed, I'm hoping it
will still avoid the original issue.
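To illustrate the difference (hypothetical value):

	unsigned long links = BIT(3);

	ffs(links);		    /* 4: 1-based, hence the old "- 1" */
	find_first_bit(&links, 16); /* 3: already 0-based */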
Reported-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Fixes: 1d8d4af434 ("wifi: mac80211: avoid u32_encode_bits() warning")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ieee80211_accept_frame() function performs a number of early checks
to decide whether or not further processing needs to be done on a frame.
One of those checks is the ieee80211_is_robust_mgmt_frame() function.
It requires peeking into the frame payload, but because defragmentation
does not occur until later on in the receive path, this peek is invalid
for any fragment other than the first one. Also, in this scenario there
is no STA and so the fragmented frame will be dropped later on in the
process and will not reach the upper stack. This can happen with large
action frames at low rates, for example, we see issues with DPP on S1G.
With this change, the frame is only checked for robustness if it is the
first fragment. Invalid fragmented packets will be discarded later after
defragmentation is completed.
Signed-off-by: Gilad Itzkovitch <gilad.itzkovitch@morsemicro.com>
Link: https://lore.kernel.org/r/20221124005336.1618411-1-gilad.itzkovitch@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
gcc-9 triggers a false-positive warning in ieee80211_mlo_multicast_tx()
for u32_encode_bits(ffs(links) - 1, ...), since ffs() can return zero
on an empty bitmask, and the negative argument to u32_encode_bits()
is then out of range:
In file included from include/linux/ieee80211.h:21,
from include/net/cfg80211.h:23,
from net/mac80211/tx.c:23:
In function 'u32_encode_bits',
inlined from 'ieee80211_mlo_multicast_tx' at net/mac80211/tx.c:4437:17,
inlined from 'ieee80211_subif_start_xmit' at net/mac80211/tx.c:4485:3:
include/linux/bitfield.h:177:3: error: call to '__field_overflow' declared with attribute error: value doesn't fit into mask
177 | __field_overflow(); \
| ^~~~~~~~~~~~~~~~~~
include/linux/bitfield.h:197:2: note: in expansion of macro '____MAKE_OP'
197 | ____MAKE_OP(u##size,u##size,,)
| ^~~~~~~~~~~
include/linux/bitfield.h:200:1: note: in expansion of macro '__MAKE_OP'
200 | __MAKE_OP(32)
| ^~~~~~~~~
Newer compiler versions do not cause problems with the zero argument
because they do not consider this a __builtin_constant_p().
It's also harmless since the hweight16() check already guarantees
that this cannot be 0.
Replace the ffs() with an equivalent find_first_bit() check that
matches the later for_each_set_bit() style and avoids the warning.
Fixes: 963d0e8d08 ("wifi: mac80211: optionally implement MLO multicast TX")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230214132025.1532147-1-arnd@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A network namespace change only makes sense during the re-init reload
action; it is not applicable to FW activation. So check if the user
passed an attr indicating a network namespace change request and forbid
it in that case.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230213115836.3404039-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MLD address translation should be done only for individually addressed
frames. Otherwise, AAD calculation would be wrong and the decryption
would fail.
Fixes: e66b7920aa ("wifi: mac80211: fix initialization of rx->link and rx->link_sta")
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Link: https://lore.kernel.org/r/20230214101048.792414-1-andrei.otcheretianski@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the regulatory driver does not call the regulatory callback
reg_notifier for self-managed wiphys. Sometimes a driver needs cfg80211
to calculate the info of an ieee80211_channel, such as flags and power,
and needs to get that info after its own hint, but the driver does not
know when the calculation of the channel info has finished. So add a
notification to the driver in reg_process_self_managed_hint() from
cfg80211; the driver can then get the correct info in its reg_notifier
callback.
Signed-off-by: Wen Gong <quic_wgong@quicinc.com>
Link: https://lore.kernel.org/r/20230201065313.27203-1-quic_wgong@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Wi-Fi Aware R4 specification defines NAN Pairing, which uses the
PASN handshake to authenticate the peer and generate keys. Hence allow
registering for and transmitting PASN authentication frames on a NAN
interface, and setting the keys to the driver or underlying modules on
the NAN interface.
The driver needs to configure the feature flag NL80211_EXT_FEATURE_SECURE_NAN,
which also helps userspace modules to know if the driver supports secure NAN.
Signed-off-by: Vinay Gannevaram <quic_vganneva@quicinc.com>
Link: https://lore.kernel.org/r/1675519179-24174-1-git-send-email-quic_vganneva@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Non-MLO station frames are dropped in the Rx path due to the condition
check in ieee80211_rx_is_valid_sta_link_id(). In a multi-link AP
scenario, non-MLO stations try to connect on any of the valid links of
the ML AP, where the station's valid_links and link_id params are valid
in the ieee80211_sta object. But ieee80211_rx_is_valid_sta_link_id()
always returns false for non-MLO stations, on the assumption that
valid_links and link_id are not valid in a non-MLO station's object
(ieee80211_sta); this assumption is wrong. Due to this assumption,
non-MLO station frames are dropped, which leads to association failure.
Fix it by removing the condition check and allowing the link validation
check for non-MLO stations.
Fixes: e66b7920aa ("wifi: mac80211: fix initialization of rx->link and rx->link_sta")
Signed-off-by: Karthikeyan Periyasamy <quic_periyasa@quicinc.com>
Link: https://lore.kernel.org/r/20230206160330.1613-1-quic_periyasa@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stations can update a bandwidth/NSS change in a VHT action frame
with action type Operating Mode Notification
(IEEE Std 802.11-2020 - 9.4.1.53 Operating Mode field).
For an Operating Mode Notification, an RX NSS change to a value
greater than the AP's maximum NSS should not be allowed.
During fuzz testing, by forcefully sending VHT Op. mode notif. frames
from a STA with random rx_nss values, it was found that the AP accepts
rx_nss values greater than the AP's maximum NSS instead of discarding
such NSS changes.
Hence allow an NSS change only up to the maximum NSS that is negotiated
and capped to the AP's capability during association.
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://lore.kernel.org/r/20230207114146.10567-1-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At least hardware supported by ath10k and ath11k (maybe more) does not
implement mesh A-MSDU aggregation in a standard-compliant way.
802.11-2020 9.3.2.2.2 declares that the Mesh Control field is part of the
A-MSDU header (and little-endian).
As such, its length must not be included in the subframe length field.
Hardware affected by this bug treats the mesh control field as part of the
MSDU data and sets the length accordingly.
In order to avoid packet loss, keep track of which stations are affected
by this and take it into account when converting A-MSDU to 802.3 + mesh control
packets.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230213100855.34315-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The current mac80211 mesh A-MSDU receive path fails to parse A-MSDU packets
on mesh interfaces, because it assumes that the Mesh Control field is always
directly after the 802.11 header.
802.11-2020 9.3.2.2.2 Figure 9-70 shows that the Mesh Control field is
actually part of the A-MSDU subframe header.
This makes more sense, since it allows packets for multiple different
destinations to be included in the same A-MSDU, as long as RA and TID are
still the same.
Another issue is the fact that the A-MSDU subframe length field was apparently
accidentally defined as little-endian in the standard.
In order to fix this, the mesh forwarding path needs to happen at a
different point in the receive path.
ieee80211_data_to_8023_exthdr is changed to ignore the mesh control field
and leave it in after the ethernet header. This also affects the source/dest
MAC address fields, which now in the case of mesh point to the mesh SA/DA.
ieee80211_amsdu_to_8023s is changed to deal with the endian difference and
to add the Mesh Control length to the subframe length, since it's not covered
by the MSDU length field.
With these changes, the mac80211 will get the same packet structure for
converted regular data packets and unpacked A-MSDU subframes.
The mesh forwarding checks are now only performed after the A-MSDU decap.
For locally received packets, the Mesh Control header is stripped away.
For forwarded packets, a new 802.11 header gets added.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230213100855.34315-4-nbd@nbd.name
[fix fortify build error]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now that all drivers use iTXQ, it does not make sense to drop tx
forwarding packets when the driver has stopped the queues.
fq_codel will take care of dropping packets when the queues fill up.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230213100855.34315-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When parsing the outer A-MSDU header, don't check for inner bridge tunnel
or RFC1042 headers. This is handled by ieee80211_amsdu_to_8023s already.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230213100855.34315-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The value of last_rate in ieee80211_sta_rx_stats is truncated from u32
to u16 when it is assigned to the rate variable, which causes
information loss in STA_STATS_FIELD_TYPE and later bitfields.
Signed-off-by: Shayne Chen <shayne.chen@mediatek.com>
Link: https://lore.kernel.org/r/20230209110659.25447-1-shayne.chen@mediatek.com
Fixes: 41cbb0f5a2 ("mac80211: add support for HE")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Follow the advice of Documentation/filesystems/sysfs.rst: show() should
only use sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
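A typical conversion looks like this (hypothetical attribute; the patch
applies the same pattern throughout):

	static ssize_t queues_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
	{
		/* sysfs_emit() knows buf is PAGE_SIZE and cannot overrun it */
		return sysfs_emit(buf, "%u\n", to_example(dev)->queues);
	}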
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230206081641.3193-1-liubo03@inspur.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, action frame TX in an ML AP is only allowed with the ML
address as A3 (BSSID), but TX for a non-ML station can happen on any
link of an ML BSS with the link BSS address as A3.
In the case of an MLD, if user space has provided a valid link_id in an
action frame TX request, allow transmission of the frame on that link.
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://lore.kernel.org/r/20230201061602.3918-1-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
- Configure the bitmap in link_conf and notify the driver.
- Modify 'change' in ieee80211_start_ap() from u32 to u64 to support
BSS_CHANGED_EHT_PUNCTURING.
- Propagate the bitmap in channel switch events to userspace.
Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Signed-off-by: Muna Sinada <quic_msinada@quicinc.com>
Link: https://lore.kernel.org/r/20230131001227.25014-5-quic_alokad@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
- New feature flag, NL80211_EXT_FEATURE_PUNCT, to advertise
driver support for preamble puncturing in AP mode.
- New attribute, NL80211_ATTR_PUNCT_BITMAP, to receive a puncturing
bitmap from the userspace during AP bring up (NL80211_CMD_START_AP)
and channel switch (NL80211_CMD_CHANNEL_SWITCH) operations. Each bit
corresponds to a 20 MHz channel in the operating bandwidth, lowest
bit for the lowest channel. Bit set to 1 indicates that the channel
is punctured. Higher 16 bits are reserved.
- New members added to structures cfg80211_ap_settings and
cfg80211_csa_settings to propagate the bitmap to the driver after
validation.
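To illustrate the bitmap semantics described above (hypothetical
example, not from the patch):

	/* 80 MHz operating bandwidth = four 20 MHz subchannels;
	 * puncture the second-lowest one (bit 0 = lowest channel,
	 * bit set = punctured, higher 16 bits reserved) */
	u32 punct_bitmap = BIT(1);	/* 0b0010 */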
Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Signed-off-by: Muna Sinada <quic_msinada@quicinc.com>
Link: https://lore.kernel.org/r/20230131001227.25014-3-quic_alokad@quicinc.com
[move validation against 0xffff into policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
- Move ieee80211_valid_disable_subchannel_bitmap() from mlme.c to
chan.c, rename it as cfg80211_valid_disable_subchannel_bitmap()
and export it.
- Modify the prototype to include struct cfg80211_chan_def instead
of only bandwidth to support a check which returns false if the
primary channel is punctured.
Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Link: https://lore.kernel.org/r/20230131001227.25014-2-quic_alokad@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an error message to the missing-frequency case so that all -EINVAL
returns in nl80211_parse_chandef() come with a better error.
Signed-off-by: Jaewan Kim <jaewan@google.com>
Link: https://lore.kernel.org/r/20230130074514.1560021-1-jaewan@google.com
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
nl80211_send_ap_stopped() can be called multiple times on the same
netdev for each link when using Multi-Link Operation. Add the
MLO_LINK_ID attribute to the event to allow userspace to distinguish
which link the event is for.
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Link: https://lore.kernel.org/r/20230128125844.2407135-2-alvin@pqrs.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Userspace processes such as network daemons may wish to be informed when
any AP interface is brought up on the system, for example to initiate a
(re)configuration of IP settings or to start a DHCP server.
Currently nl80211 does not broadcast any such event on its multicast
groups, leaving userspace only two options:
1. the process must be the one that actually issued the
NL80211_CMD_START_AP request, so that it can react on the response to
that request;
2. the process must react to RTM_NEWLINK events indicating a change in
carrier state, and may query for further information about the AP and
react accordingly.
Option (1) is robust, but it does not cover all scenarios. It is easy to
imagine a situation where this is not the case (e.g. hostapd +
systemd-networkd).
Option (2) is not robust, because RTM_NEWLINK events may be silently
discarded by the linkwatch logic (cf. linkwatch_fire_event()).
Concretely, consider a scenario in which the carrier state flip-flops in
the following way:
^ carrier state (high/low = carrier/no carrier)
|
|        _______      _______ ...
|       |       |    |
| ______| "foo" |____| "bar"  (SSID in "quotes")
|
+-------A-------B----C---------> time
If the time interval between (A) and (C) is less than 1 second, then
linkwatch may emit only a single RTM_NEWLINK event indicating carrier
gain.
This is problematic because it is possible that the network
configuration that should be applied is a function of the AP's
properties such as SSID (cf. SSID= in systemd.network(5)). As
illustrated in the above diagram, it may be that the AP with SSID "bar"
ends up being configured as though it had SSID "foo".
Address the above issue by having nl80211 emit an NL80211_CMD_START_AP
message on the MLME nl80211 multicast group. This allows for arbitrary
processes to be reliably informed.
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Link: https://lore.kernel.org/r/20230128125844.2407135-1-alvin@pqrs.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Handle the Puncturing info received from the AP in the
EHT Operation element in beacons.
If the info is invalid:
- during association: disable EHT connection for the AP
- after association: disconnect
This commit includes many (internal) bugfixes and spec updates from
various people.
Co-developed-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://lore.kernel.org/r/20230127123930.4fbc74582331.I3547481d49f958389f59dfeba3fcc75e72b0aa6e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support to offload OWE processing to user space for an MLD AP when
the driver's SME is in use.
Add new parameters in struct cfg80211_update_owe_info to provide below
information in cfg80211_update_owe_info_event() call:
- MLO link ID of the AP, with which station requested (re)association.
This is applicable for both MLO and non-MLO station connections when
the AP affiliated with an MLD.
- Station's MLD address if the connection is MLO capable.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20230126143256.960563-3-quic_vjakkam@quicinc.com
[reformat the trace event macro]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for drivers to indicate a STA connection (MLO/non-MLO) when
user space SME (e.g., hostapd) is not used for an MLD AP.
Add new parameters in struct station_info to provide below information
in cfg80211_new_sta() call:
- MLO link ID of the AP, with which station completed (re)association.
This is applicable for both MLO and non-MLO station connections when
the AP affiliated with an MLD.
- Station's MLD address if the connection is MLO capable.
- (Re)Association Response IEs sent to the station. User space needs
this to determine rejected and accepted affiliated links information
of the connected station if the connection is MLO capable.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20230126143256.960563-2-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the color collision report to a dedicated delayed work and do not
run it in interrupt context, in order to rate-limit the number of events
reported to userspace. Moreover, grab the wdev mutex in the
ieee80211_color_collision_detection_work routine since it is required
by cfg80211_obss_color_collision_notify().
Tested-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Fixes: 5f9404abdf ("mac80211: add support for BSS color change")
Link: https://lore.kernel.org/r/3f6cf60c892ad40c1cca4a55d62b1224ef1c6ce9.1674644379.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Key information in wext.connect is not reset on (re)connect and can hold
data from a previous connection.
Reset the key data to avoid drivers or mac80211 incorrectly detecting a
WEP connection request and accessing the freed or already reused memory.
Additionally, optimize cfg80211_sme_connect() and avoid a useless
schedule of conn_work.
Fixes: fffd0934b9 ("cfg80211: rework key operation")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230124141856.356646-1-alexander@wetzel-home.de
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the authentication request event interface doesn't have
support to indicate to user space whether it should enable MLO or not
during authentication with the specified AP. But the driver needs such
a capability, since whether the connection is MLO or not is decided by
the driver in case of SME offload to the driver.
Add support for driver to indicate MLD address of the AP in
authentication offload request to inform user space to enable MLO during
authentication process. Driver shall look at NL80211_ATTR_MLO_SUPPORT
flag capability in NL80211_CMD_CONNECT to know whether the user space
supports enabling MLO during the authentication offload.
User space should enable MLO during the authentication only when it
receives the AP MLD address in authentication offload request. User
space shouldn't enable MLO if the authentication offload request doesn't
indicate the AP MLD address even if the AP is MLO capable.
When MLO is enabled, user space should use the MAC address of the
interface (on which the driver sent the request) as the self MLD
address. User space and the driver then use MLD addresses in the RA, TA
and BSSID fields of the frames between them, and the driver translates
the MLD addresses to/from link addresses based on the link chosen for
the authentication.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20230116125058.1604843-1-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are currently two mechanisms for populating hardware stats:
1. Using flow_offload api to query the flow's statistics.
The api assumes that the same stats values apply to all
the flow's actions.
This assumption breaks when action drops or jumps over following
actions.
2. Using hw_action api to query specific action stats via a driver
callback method. This api assures the correct action stats for
the offloaded action, however, it does not apply to the rest of the
actions in the flow's actions array.
Extend the flow_offload stats callback to indicate that a per action
stats update is required.
Use the existing flow_offload_action api to query the action's hw stats.
In addition, currently the tc action stats utility only updates hw actions.
Reuse the existing action stats cb infrastructure to query any action
stats.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently a hardware action is uniquely identified by the <id, hw_index>
tuple. However, the id is set by the flow_act_setup callback and tc core
cannot enforce this, and it is possible that a future change could break
this. In addition, <id, hw_index> are not unique across network namespaces.
Uniquely identify the action by having the tc core set an action cookie.
Use the unique action cookie to query the action's hardware stats.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of passing 6 stats related args, pass the flow_stats.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A single tc pedit action may be translated to multiple flow_offload
actions.
Offload only actions that translate to a single pedit command value.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently the hw action stats update is called from tcf_exts_hw_stats_update,
when a tc filter is dumped, and from tcf_action_copy_stats, when a hw
action is dumped.
However, the tcf_action_copy_stats is also called from tcf_action_dump.
As such, the hw action stats update cb is called 3 times for every
tc flower filter dump.
Move the tc action hw stats update from tcf_action_copy_stats to
tcf_dump_walker to update the hw action stats when tc action is dumped.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The tc action act_ctinfo was using shared stats; fix it to use percpu
stats, since bstats_update() must be called with locks held or with a
percpu pointer argument.
tdc results:
1..12
ok 1 c826 - Add ctinfo action with default setting
ok 2 0286 - Add ctinfo action with dscp
ok 3 4938 - Add ctinfo action with valid cpmark and zone
ok 4 7593 - Add ctinfo action with drop control
ok 5 2961 - Replace ctinfo action zone and action control
ok 6 e567 - Delete ctinfo action with valid index
ok 7 6a91 - Delete ctinfo action with invalid index
ok 8 5232 - List ctinfo actions
ok 9 7702 - Flush ctinfo actions
ok 10 3201 - Add ctinfo action with duplicate index
ok 11 8295 - Add ctinfo action with invalid index
ok 12 3964 - Replace ctinfo action with invalid goto_chain control
Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://lore.kernel.org/r/20230210200824.444856-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Creates three new drop reasons:
SKB_DROP_REASON_IPV6_NDISC_FRAG: invalid frag (suppress_frag_ndisc).
SKB_DROP_REASON_IPV6_NDISC_HOP_LIMIT: invalid hop limit.
SKB_DROP_REASON_IPV6_NDISC_BAD_CODE: invalid NDISC icmp6 code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Accurately reports what happened in icmpv6_notify() when handling
a packet.
This makes use of the new IPV6_BAD_EXTHDR drop reason.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to what was done for TX_PUSH, add an RX_PUSH concept
to the ethtool interfaces.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tls_encrypt_done only uses aead_req to get ahold of
the tls_rec object. So we could pass that in instead of aead_req
to simplify the code.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch removes the temporary scaffolding now that the completion
function signature has been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch removes the temporary scaffolding now that the completion
function signature has been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch removes the temporary scaffolding now that the completion
function signature has been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch removes the temporary scaffolding now that the completion
function signature has been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The crypto_async_request passed to the completion is not guaranteed
to be the original request object. Only the data field can be relied
upon.
Fix this by storing the socket pointer with the AEAD request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds temporary scaffolding so that the Crypto API
completion function can take a void * instead of crypto_async_request.
Once affected users have been converted this can be removed.
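In sketch form, the signature change being scaffolded:

	/* old: gets the request object, which may not be the original */
	void (*complete)(struct crypto_async_request *req, int err);

	/* new: gets only the caller-provided context pointer */
	void (*complete)(void *data, int err);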
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds temporary scaffolding so that the Crypto API
completion function can take a void * instead of crypto_async_request.
Once affected users have been converted this can be removed.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds temporary scaffolding so that the Crypto API
completion function can take a void * instead of crypto_async_request.
Once affected users have been converted this can be removed.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When converting net_device_stats to rtnl_link_stats64, sign extension
is triggered on ILP32 machines, as commit 6c1c509778 changed the
previous "ulong -> u64" conversion to "long -> u64" by accessing the
net_device_stats fields through a (signed) atomic_long_t.
This causes for example the received bytes counter to jump to 16EiB
after having received 2^31 bytes. Casting the atomic value to
"unsigned long" before converting it into u64 avoids this.
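A minimal illustration of the problem on ILP32, where long is 32-bit
(values hypothetical):

	atomic_long_t rx_bytes = ATOMIC_LONG_INIT(0x80000000); /* 2^31 */

	u64 bad  = (u64)atomic_long_read(&rx_bytes);
			/* sign-extends to 0xffffffff80000000 (~16 EiB) */
	u64 good = (u64)(unsigned long)atomic_long_read(&rx_bytes);
			/* 0x0000000080000000, as intended */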
Fixes: 6c1c509778 ("net: add atomic_long_t to net_device_stats fields")
Signed-off-by: Felix Riemann <felix.riemann@sma.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If TCA_STAB attribute is malformed, qdisc_get_stab() returns
an error, and we end up calling ops->destroy() while ops->init()
has not been called yet.
While we are at it, call qdisc_put_stab() after ops->destroy().
Fixes: 1f62879e36 ("net/sched: make stab available before ops->init() call")
Reported-by: syzbot+d44d88f1d11e6ca8576b@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A driver calling devl_param_driverinit_value_set() has to hold the
devlink instance lock while doing that. Put an assertion there.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the driver maintains the following basic sane behavior, the
devl_param_driverinit_value_get() function can be called without
holding the instance lock:
1) Driver ensures a call to devl_param_driverinit_value_get() cannot
race with registering/unregistering the parameter with
the same parameter ID.
2) Driver ensures a call to devl_param_driverinit_value_get() cannot
race with devl_param_driverinit_value_set() call with
the same parameter ID.
3) Driver ensures a call to devl_param_driverinit_value_get() cannot
race with reload operation.
By the nature of params usage, these requirements should be
trivially achievable. If a driver for some odd reason
is not able to comply, it has to take the devlink->lock while
calling devl_param_driverinit_value_get().
Remove the lock assertion and add comment describing
the locking requirements.
This fixes a splat in mlx5 driver introduced by the commit
referenced in the "Fixes" tag.
Lore: https://lore.kernel.org/netdev/719de4f0-76ac-e8b9-38a9-167ae239efc7@amd.com/
Reported-by: Kim Phillips <kim.phillips@amd.com>
Fixes: 075935f0ae ("devlink: protect devlink param list by instance lock")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lose the linked list for params and use an xarray instead.
Note that this is required so that it is eventually possible to call
devl_param_driverinit_value_get() without holding the instance lock.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As xarray has an iterator helper that allows starting from a specified
index, use it directly and avoid repeated iteration from 0.
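In sketch form (loop body elided):

	struct devlink_param_item *param_item;
	unsigned long param_id;

	/* resume directly at @start instead of iterating from 0 and
	 * skipping all entries below @start */
	xa_for_each_start(&devlink->params, param_id, param_item, start) {
		/* ... fill and send one message per param ... */
	}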
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Probably due to a copy-paste error, the name of the arg is "init_val",
which is misleading, as the pointer is used to point to the struct where
the current value is to be stored. Rename it to "val" and change the arg
comment a bit on the way.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driverinit param purpose is to serve the driver during init/reload
time to provide a value, either default or set by user.
Make sure that the driver does not read a value updated by the user
before the reload is performed. Hold the new value in a separate struct
and switch it during reload.
Note that this is required so that it is eventually possible to call
devl_param_driverinit_value_get() without holding the instance lock.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to treat string params any differently compared to other types.
Rely on the struct assignment to copy the whole struct, including the
string.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
old_meter needs to be freed after it is detached, regardless of whether
the new meter is successfully attached.
Fixes: c7c4c44c9a ("net: openvswitch: expand the meters supported number")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NL_SET_ERR_MSG_MOD inserts the KBUILD_MODNAME and a ':' before the actual
extended error message. The devlink feature hasn't been able to be compiled
as a module since commit f4b6bcc700 ("net: devlink: turn devlink into a
built-in").
Stop using NL_SET_ERR_MSG_MOD, and just use the base NL_SET_ERR_MSG. This
aligns the extended error messages better with the NL_SET_ERR_MSG_ATTR
messages as well.
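The conversion is mechanical, e.g. (hypothetical message string):

	/* before: the message is prefixed with KBUILD_MODNAME ": " */
	NL_SET_ERR_MSG_MOD(extack, "Resize not supported");
	/* after: */
	NL_SET_ERR_MSG(extack, "Resize not supported");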
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_rm_zerocopy_callback() uses list_add_tail() with swapped
arguments. This links the list head with the new entry, losing
the references to the remaining part of the list.
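For reference, the list_add_tail() semantics (field names hypothetical):

	/* list_add_tail(new, head) inserts @new before @head, i.e. at
	 * the tail of @head's list.  With the arguments swapped, the
	 * head itself gets linked onto the new entry and the rest of
	 * the list is lost. */
	list_add_tail(&q->zcookie_head, &info->zcookie_entry);	/* buggy */
	list_add_tail(&info->zcookie_entry, &q->zcookie_head);	/* fixed */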
Fixes: 9426bbc6de ("rds: use list structure to track information for zerocopy completion notification")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since x->encap, allocated in pfkey_msg2xfrm_state(), is not initialized
to 0, kernel heap data can be leaked. Allocate it with kzalloc() to
prevent this.
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet pointed out [0] that when we call skb_set_owner_r()
for ipv6_pinfo.pktoptions, sk_rmem_schedule() has not been called,
resulting in a negative sk_forward_alloc.
We add a new helper which clones a skb and sets its owner only
when sk_rmem_schedule() succeeds.
Note that we move skb_set_owner_r() forward in (dccp|tcp)_v6_do_rcv()
because tcp_send_synack() can make sk_forward_alloc negative before
ipv6_opt_accepted() in the crossed SYN-ACK or self-connect() cases.
[0]: https://lore.kernel.org/netdev/CANn89iK9oc20Jdi_41jb9URdF210r7d1Y-+uypbMSbOfY6jqrg@mail.gmail.com/
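A sketch of such a helper (hypothetical name; the body is an
approximation of the logic described above):

	static struct sk_buff *skb_clone_and_charge_r(struct sk_buff *skb,
						      struct sock *sk)
	{
		skb = skb_clone(skb, sk_gfp_mask(sk, GFP_ATOMIC));
		if (skb) {
			if (sk_rmem_schedule(sk, skb, skb->truesize)) {
				skb_set_owner_r(skb, sk);
				return skb;
			}
			__kfree_skb(skb);
		}
		return NULL;
	}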
Fixes: 323fbd0edf ("net: dccp: Add handling of IPV6_PKTOPTIONS to dccp_v6_do_rcv()")
Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The imperfect hash area can be updated while packets are traversing,
which will cause a use-after-free when 'tcf_exts_exec()' is called
with the destroyed tcf_ext.
CPU 0:                  CPU 1:
tcindex_set_parms       tcindex_classify
tcindex_lookup
                        tcindex_lookup
tcf_exts_change
                        tcf_exts_exec [UAF]
Stop operating on the shared area directly, by using a local copy,
and update the filter with 'rcu_replace_pointer()'. Delete the old
filter version only after an RCU grace period has elapsed.
Fixes: 9b0d4446b5 ("net: sched: avoid atomic swap in tcf_exts_change")
Reported-by: valis <sec@valis.email>
Suggested-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://lore.kernel.org/r/20230209143739.279867-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use list_is_first() to check whether tsp->asoc matches the first
element of ep->asocs, as the list is not guaranteed to have an entry.
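In sketch form (per the description; the actual code is in
net/sctp/diag.c):

	/* the old check dereferenced the first entry of ep->asocs even
	 * when the list was empty; list_is_first() is safe either way */
	if (!list_is_first(&tsp->asoc->asocs, &ep->asocs))
		return 0;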
Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20230208-sctp-filter-v2-1-6e1f4017f326@diag.uniroma1.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Future patches are going to move parts of the bridge MDB code to the
common rtnetlink code in preparation for VXLAN MDB support. To
facilitate code sharing between both drivers, move the validation of the
top level attributes in RTM_{NEW,DEL}MDB messages to a policy that will
eventually be moved to the rtnetlink code.
Use 'NLA_NESTED' for 'MDBA_SET_ENTRY_ATTRS' instead of
NLA_POLICY_NESTED() as this attribute is going to be validated using
different policies in the underlying drivers.
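A sketch of such a top-level policy (entry names per the bridge uAPI;
the validation function is hypothetical here):

	static const struct nla_policy mdba_policy[MDBA_SET_ENTRY_MAX + 1] = {
		[MDBA_SET_ENTRY] =
			NLA_POLICY_VALIDATE_FN(NLA_BINARY, validate_mdb_entry,
					       sizeof(struct br_mdb_entry)),
		[MDBA_SET_ENTRY_ATTRS] = { .type = NLA_NESTED },
	};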
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of the sequence generation counter in the netlink callback
is to identify if a multipart dump is consistent or not by calling
nl_dump_check_consistent() whenever a message is generated.
The function is not invoked by the MDB code, rendering the sequence
generation counter assignment pointless. Remove it.
Note that even if the function was invoked, we still could not
accurately determine if the dump is consistent or not, as there is no
sequence generation counter for MDB entries, unlike nexthop objects, for
example.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'MDB_PG_FLAGS_PERMANENT' and 'MDB_PERMANENT' happen to have the same
value, but the latter is uAPI and cannot change, so use it when dumping
an MDB entry.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-net-next-2023-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
pull-request: bluetooth-next
- Add new PID/VID 0489:e0f2 for MT7921
- Add VID:PID 13d3:3529 for Realtek RTL8821CE
- Add CIS feature bits to controller information
- Set Per Platform Antenna Gain(PPAG) for Intel controllers
* tag 'for-net-next-2023-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next:
Bluetooth: btintel: Set Per Platform Antenna Gain(PPAG)
Bluetooth: Make sure LE create conn cancel is sent when timeout
Bluetooth: Free potentially unfreed SCO connection
Bluetooth: hci_qca: get wakeup status from serdev device handle
Bluetooth: L2CAP: Fix potential user-after-free
Bluetooth: MGMT: add CIS feature bits to controller information
Bluetooth: hci_conn: Refactor hci_bind_bis() since it always succeeds
Bluetooth: HCI: Replace zero-length arrays with flexible-array members
Bluetooth: qca: Fix sparse warnings
Bluetooth: btusb: Add VID:PID 13d3:3529 for Realtek RTL8821CE
Bluetooth: btusb: Add new PID/VID 0489:e0f2 for MT7921
Bluetooth: Fix issue with Actions Semi ATS2851 based devices
====================
Link: https://lore.kernel.org/r/20230209234922.3756173-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
	[...]
	ice_set_netdev_features(netdev);
	netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
			       NETDEV_XDP_ACT_XSK_ZEROCOPY;
	ice_set_ops(netdev);
	[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add a XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now handle_fragments() in OVS and TC have similar code, and this patch
removes the duplicated code by moving the function to
nf_conntrack_ovs.
Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be
done only when defrag returns 0, as is done in other places in
the kernel.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch has no functional changes and just moves frag check and
tc_skb_cb update out of handle_fragments, to make it easier to move
the duplicate code from handle_fragments() into nf_conntrack_ovs later.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch has no functional changes and just moves key and ovs_cb update
out of handle_fragments, and skb_clear_hash() and skb->ignore_df change
into handle_fragments(), to make it easier to move the duplicate code
from handle_fragments() into nf_conntrack_ovs later.
Note that it changes to pass info->family to handle_fragments() instead
of key for the packet type check, as info->family is set according to
key->eth.type in ovs_ct_copy_action() when creating the action.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ovs_skb_network_trim() and tcf_ct_skb_network_trim() contain almost the
same code; this patch extracts it into a function
nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to nf_nat_ovs created by Commit ebddb14049 ("net: move the
nat function to nf_nat_ovs for ovs and tc"), this patch is to create
nf_conntrack_ovs to get these functions shared by OVS and TC only.
This patch brings in nf_ct_helper() and nf_ct_add_helper() from
nf_conntrack_helper, and more will follow in subsequent patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'rxrpc-next-20230208' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc development
Here are some miscellaneous changes for rxrpc:
(1) Use consume_skb() rather than kfree_skb_reason().
(2) Fix unnecessary waking when poking an already-poked call.
(3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
many incoming DATA packets we're telling the peer that we are
currently willing to accept on this call.
(4) Reduce duplicate ACK transmission. We send ACKs to let the peer know
that we're increasing the receive window (ack.rwind) as we consume
packets locally. Normal ACK transmission is triggered in three places
and that leads to duplicates being sent.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot was able to trigger a warning [1] from net_free()
calling ref_tracker_dir_exit(&net->notrefcnt_tracker)
while the corresponding ref_tracker_dir_init() has not been
done yet.
copy_net_ns() can indeed bypass the call to setup_net()
in some error conditions.
Note:
We might factor out or move more code into preinit_net() in the future.
[1]
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 5817 Comm: syz-executor.3 Not tainted 6.2.0-rc7-next-20230208-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
assign_lock_key kernel/locking/lockdep.c:982 [inline]
register_lock_class+0xdb6/0x1120 kernel/locking/lockdep.c:1295
__lock_acquire+0x10a/0x5df0 kernel/locking/lockdep.c:4951
lock_acquire.part.0+0x11c/0x370 kernel/locking/lockdep.c:5691
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
ref_tracker_dir_exit+0x52/0x600 lib/ref_tracker.c:24
net_free net/core/net_namespace.c:442 [inline]
net_free+0x98/0xd0 net/core/net_namespace.c:436
copy_net_ns+0x4f3/0x6b0 net/core/net_namespace.c:493
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:228
ksys_unshare+0x449/0x920 kernel/fork.c:3205
__do_sys_unshare kernel/fork.c:3276 [inline]
__se_sys_unshare kernel/fork.c:3274 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3274
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
Fixes: 0cafd77dcd ("net: add a refcount tracker for kernel sockets")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230208182123.3821604-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take into account the IPV6_TCLASS socket option (DSCP) in
tcp_v6_connect(). Otherwise fib6_rule_match() can't properly
match the DSCP value, resulting in invalid route lookup.
For example:
ip route add unreachable table main 2001:db8::10/124
ip route add table 100 2001:db8::10/124 dev eth0
ip -6 rule add dsfield 0x04 table 100
echo test | socat - TCP6:[2001:db8::11]:54321,ipv6-tclass=0x04
Without this patch, socat fails at connect() time ("No route to host")
because the fib-rule doesn't jump to table 100 and the lookup ends up
being done in the main table.
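A minimal sketch of the idea, assuming the generic ip6_make_flowinfo()
helper (the exact surrounding code in tcp_v6_connect() may differ):
/* Fold the socket's traffic class (DSCP bits) into the flow info used
 * for the route lookup, so fib6_rule_match() can see the DSCP value.
 */
fl6.flowlabel = ip6_make_flowinfo(np->tclass, np->flow_label);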
Fixes: 2cc67cc731 ("[IPV6] ROUTE: Routing by Traffic Class.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take into account the IPV6_TCLASS socket option (DSCP) in
ip6_datagram_flow_key_init(). Otherwise fib6_rule_match() can't
properly match the DSCP value, resulting in invalid route lookup.
For example:
ip route add unreachable table main 2001:db8::10/124
ip route add table 100 2001:db8::10/124 dev eth0
ip -6 rule add dsfield 0x04 table 100
echo test | socat - UDP6:[2001:db8::11]:54321,ipv6-tclass=0x04
Without this patch, socat fails at connect() time ("No route to host")
because the fib-rule doesn't jump to table 100 and the lookup ends up
being done in the main table.
Fixes: 2cc67cc731 ("[IPV6] ROUTE: Routing by Traffic Class.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the string_is_terminated() helper instead of an equivalent memchr()
call. This better conveys the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the string_is_terminated() helper instead of an equivalent memchr()
call. This better conveys the intention of the call.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move string_is_valid() to the header for wider use.
While at it, rename to string_is_terminated() to be
precise about its semantics.
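For reference, such a helper can be as small as the sketch below
(illustrative of the patch, not necessarily a verbatim copy):
/* True iff a NUL terminator exists within the first @len bytes of @s. */
static inline bool string_is_terminated(const char *s, int len)
{
	return memchr(s, '\0', len) ? true : false;
}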
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230208133153.22528-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If RPS is enabled, this allows configuring a default rps
mask, which takes effect from receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will simplify the following patch. No functional change
intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will be used by the following patch to avoid code
duplication. No functional changes intended.
The only difference is that now flow_limit_cpu_sysctl() will
always compute the flow limit mask on each read operation,
even when read() will not return any byte to user-space.
Note that the new helper is placed under a new #ifdef at
the start of the file to better fit the usage in the later patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When sending LE create conn command, we set a timer with a duration of
HCI_LE_CONN_TIMEOUT before timing out and calling
create_le_conn_complete. Additionally, when receiving the command
complete, we also set a timer with the same duration to call
le_conn_timeout.
Usually the latter will be triggered first, which then sends a LE
create conn cancel command. However, due to the nature of racing, it
is possible for the former to be called first, thereby calling the
chain hci_conn_failed -> hci_conn_del -> cancel_delayed_work, thereby
preventing LE create conn cancel to be sent. In this situation, the
controller will be stuck in trying the LE connection.
This patch flushes le_conn_timeout on create_le_conn_complete to make
sure we always send LE create connection cancel, if necessary.
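A minimal sketch of the described flush, assuming the delayed work is
the le_conn_timeout member of struct hci_conn:
/* Run any queued LE connection timeout handler now, so the LE create
 * connection cancel goes out before the connection can be deleted.
 */
flush_delayed_work(&conn->le_conn_timeout);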
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
It is possible to initiate a SCO connection while deleting the
corresponding ACL connection, e.g. in the scenario below:
(1) < hci setup sync connect command
(2) > hci disconn complete event (for the acl connection)
(3) > hci command complete event (for(1), failure)
When it happens, hci_cs_setup_sync_conn won't be able to obtain the
reference to the SCO connection, so it will be stuck and potentially
hinder subsequent connections to the same device.
This patch prevents that by also deleting the SCO connection if it is
still not established when the corresponding ACL connection is deleted.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes all instances that need to allocate a buffer by calling
alloc_skb, which may release the chan lock and reacquire it later,
making it possible for the chan to be disconnected in the meantime.
Fixes: a6a5568c03 ("Bluetooth: Lock the L2CAP channel when sending")
Reported-by: Alexander Coffin <alex.coffin@matician.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Userspace needs to know whether the adapter has feature support for
Connected Isochronous Stream - Central/Peripheral, so it can set up
LE Audio features accordingly.
Expose these feature bits as settings in MGMT controller info.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The compiler thinks "conn" might be NULL after a call to hci_bind_bis(),
which cannot happen. Avoid any confusion by just making it not return a
value since it cannot fail. Fixes the warnings seen with GCC 13:
In function 'arch_atomic_dec_and_test',
inlined from 'atomic_dec_and_test' at ../include/linux/atomic/atomic-instrumented.h:576:9,
inlined from 'hci_conn_drop' at ../include/net/bluetooth/hci_core.h:1391:6,
inlined from 'hci_connect_bis' at ../net/bluetooth/hci_conn.c:2124:3:
../arch/x86/include/asm/rmwcc.h:37:9: warning: array subscript 0 is outside array bounds of 'atomic_t[0]' [-Warray-bounds=]
37 | asm volatile (fullop CC_SET(cc) \
| ^~~
...
In function 'hci_connect_bis':
cc1: note: source object is likely at address zero
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
NFT_MSG_GETSETELEM returns -EPERM when fetching set elements that belong
to table that has an owner. This results in empty set/map listing from
userspace.
Fixes: 6001a930ce ("netfilter: nftables: introduce table ownership")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
rds_rm_zerocopy_callback() uses list_entry() on the head of a list
causing a type confusion.
Use list_first_entry() to actually access the first element of the
rs_zcookie_queue list.
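An illustrative sketch of the difference (the struct and member names
here follow the description above and should be treated as assumptions):
/* Wrong: list_entry() on the head reinterprets the list head itself
 * as a struct rds_msg_zcopy_info, a type confusion.
 */
info = list_entry(&q->zcookie_head, struct rds_msg_zcopy_info,
		  rs_zcookie_next);
/* Right: dereference the first real element on the list. */
info = list_first_entry(&q->zcookie_head, struct rds_msg_zcopy_info,
			rs_zcookie_next);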
Fixes: 9426bbc6de ("rds: use list structure to track information for zerocopy completion notification")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Link: https://lore.kernel.org/r/20230202-rds-zerocopy-v3-1-83b0df974f9a@diag.uniroma1.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
ipsec 2023-02-08
1) Fix policy checks for nested IPsec tunnels when using
xfrm interfaces. From Benedict Wong.
2) Fix the netlink message expression in the 32=>64-bit
message translator. From Anastasia Belova.
3) Prevent potential spectre v1 gadget in xfrm_xlate32_attr.
From Eric Dumazet.
4) Always consistently use time64_t in xfrm_timer_handler.
From Eric Dumazet.
5) Fix KCSAN reported bug: Multiple cpus can update use_time
at the same time. From Eric Dumazet.
6) Fix DSCP copy from IPv4 to IPv6 on interfamily tunnel.
From Christian Hopps.
* tag 'ipsec-2023-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
xfrm: annotate data-race around use_time
xfrm: consistently use time64_t in xfrm_timer_handler()
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
xfrm: compat: change expression for switch in xfrm_xlate64
Fix XFRM-I support for nested ESP tunnels
====================
Link: https://lore.kernel.org/r/20230208114322.266510-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce a temporary variable to check for an invalid configuration
attempt from user space. Before this patch the value was copied to
the real config variable and rolled back in the case of an error.
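A generic sketch of the pattern, with hypothetical names
(struct foo_options, foo_options_valid) standing in for the actual CAN
configuration:
struct foo_options tmp;	/* scratch copy, hypothetical type */

if (copy_from_sockptr(&tmp, optval, sizeof(tmp)))
	return -EFAULT;
if (!foo_options_valid(&tmp))	/* hypothetical validator */
	return -EINVAL;
*opts = tmp;	/* commit only on success: no rollback path needed */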
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230203090807.97100-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Improve commit 497cc00224 ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.
In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.
What I don't know how to handle is making sure that the
segmented skbs themselves are smaller than the maximum frame size given
by the current queueMaxSDU[tc]. Nonetheless, we still need to drop
those, otherwise the Qdisc will hang.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The majority of the taprio_enqueue()'s function is spent doing TCP
segmentation, which doesn't look right to me. Compilers shouldn't have a
problem in inlining code no matter how we write it, so move the
segmentation logic to a separate function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
taprio today has a huge problem with small TC gate durations, because it
might accept packets in taprio_enqueue() which will never be sent by
taprio_dequeue().
Since not much infrastructure was available, a kludge was added in
commit 497cc00224 ("taprio: Handle short intervals and large
packets"), which segmented large TCP segments, but the fact of the
matter is that the issue isn't specific to large TCP segments (and even
worse, the performance penalty in segmenting those is absolutely huge).
In commit a54fc09e4c ("net/sched: taprio: allow user input of per-tc
max SDU"), taprio gained support for queueMaxSDU, which is precisely the
mechanism through which packets should be dropped at qdisc_enqueue() if
they cannot be sent.
After that patch, it was necessary for the user to manually limit the
maximum MTU per TC. This change adds the necessary logic for taprio to
further limit the values specified (or not specified) by the user to
some minimum values which never allow oversized packets to be sent.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
I have one practical reason for doing this and one concerning correctness.
The practical reason has to do with a follow-up patch, which aims to mix
2 sources of max_sdu (one coming from the user and the other automatically
calculated based on TC gate durations @current link speed). Among those
2 sources of input, we must always select the smaller max_sdu value, but
this can change at various link speeds. So the max_sdu coming from the
user must be kept separated from the value that is operationally used
(the minimum of the 2), because otherwise we overwrite it and forget
what the user asked us to do.
To solve that, this patch proposes that struct sched_gate_list contains
the operationally active max_frm_len, and q->max_sdu contains just what
was requested by the user.
The reason having to do with correctness is based on the following
observation: the admin sched_gate_list becomes operational at a given
base_time in the future. Until then, it is inactive and applies no
shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
apply either (this is a mechanism to ensure that only packets which fit
within the largest gate duration for that TC are accepted, so that they
cannot hang the port; clearly it makes little sense if the gates are
always open).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vinicius intended taprio to take the L1 overhead into account when
estimating packet transmission time through user input, specifically
through the qdisc size table (man tc-stab).
Something like this:
tc qdisc replace dev $eth root stab overhead 24 taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 \
sched-entry S 0x7e 9000000 \
sched-entry S 0x82 1000000 \
max-sdu 0 0 0 0 0 0 0 200 \
flags 0x0 clockid CLOCK_TAI
Without the overhead being specified, transmission times will be
underestimated and will cause late transmissions. For an offloading
driver, it might even cause TX hangs if there is no open gate large
enough to send the maximum sized packets for that TC (including L1
overhead). Properly knowing the L1 overhead will ensure that we are able
to auto-calculate the queueMaxSDU per traffic class just right, and
avoid these hangs due to head-of-line blocking.
We can't make the stab mandatory due to existing setups, but we can
alert the user that it's important via a netlink extack warning.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some qdiscs like taprio turn out to be actually pretty reliant on a well
configured stab, to not underestimate the skb transmission time (by
properly accounting for L1 overhead).
In a future change, taprio will need the stab, if configured by the
user, to be available at ops->init() time. It will become even more
important in upcoming work, when the overhead will be used for the
queueMaxSDU calculation that is passed to an offloading driver.
However, rcu_assign_pointer(sch->stab, stab) is called right after
ops->init(), making it unavailable, and I don't really see a good reason
for that.
Move it earlier, which nicely seems to simplify the error handling path
as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
taprio_dequeue_from_txq() looks at the entry->end_time to determine
whether the skb will overrun its traffic class gate, as if at the end of
the schedule entry there surely is a "gate close" event for it. Hint:
maybe there isn't.
For each schedule entry, introduce an array of kernel times which
actually tracks when in the future will there be an *actual* gate close
event for that traffic class, and use that in the guard band overrun
calculation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently taprio assumes that the budget for a traffic class expires at
the end of the current interval as if the next interval contains a "gate
close" event for this traffic class.
This is, however, an unfounded assumption. Allow schedule entry
intervals to be fused together for a particular traffic class by
calculating the budget until the gate *actually* closes.
This means we need to keep budgets per traffic class, and we also need
to update the budget consumption procedure.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a confusion of terms in taprio: what is called
"close_time" is actually used for 2 things:
1. determining when an entry "closes" such that transmitted skbs are
never allowed to overrun that time (?!)
2. an aid for determining when to advance and/or restart the schedule
using the hrtimer
It makes more sense to call this so-called "close_time" "end_time",
because it's not clear at all to me what "closes". Future patches will
hopefully make better use of the term "to close".
This is an absolutely mechanical change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current taprio code operates on a very simplistic (and incorrect)
assumption: that egress scheduling for a traffic class can only take
place for the duration of the current interval, or i.o.w., it assumes
that at the end of each schedule entry, there is a "gate close" event
for all traffic classes.
As an example, traffic sent with the schedule below will be jumpy, even
though all 8 TC gates are open, so there is absolutely no "gate close"
event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
consecutive schedule entries):
tc qdisc replace dev veth0 parent root taprio \
num_tc 2 \
map 0 1 \
queues 1@0 1@1 \
base-time 0 \
sched-entry S 0xff 4000000000 \
clockid CLOCK_TAI \
flags 0x0
This qdisc simply does not have what it takes in terms of logic to
*actually* compute the durations of traffic classes. Also, it does not
recognize the need to use this information on a per-traffic-class basis:
it always looks at entry->interval and entry->close_time.
This change proposes that each schedule entry has an array called
tc_gate_duration[tc]. This holds the information: "for how long will
this traffic class gate remain open, starting from *this* schedule
entry". If the traffic class gate is always open, that value is equal to
the cycle time of the schedule.
We'll also need to keep track, for the purpose of the queueMaxSDU[tc]
calculation, of the maximum time duration for which a traffic class has
an open gate. This directly gives us the maximum sized packet that this
traffic class will have to accept. Everything else it has to
qdisc_drop() in qdisc_enqueue().
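A sketch of the bookkeeping this implies, with the field name taken
from the commit text (the final structure layout may differ):
struct sched_entry {
	/* ... existing fields (command, gate_mask, interval, ...) ... */

	/* How long each traffic class gate stays open, counting from
	 * this entry; equals the cycle time if it never closes.
	 */
	u64 tc_gate_duration[TC_MAX_QUEUE];
};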
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current taprio software implementation is haunted by the shadow of the
igb/igc hardware model. It iterates over child qdiscs in increasing
order of TXQ index, therefore giving higher xmit priority to TXQ 0 and
lower to TXQ N. According to discussions with Vinicius, that is the
default (perhaps even unchangeable) prioritization scheme used for the
NICs that taprio was first written for (igb, igc), and we have a case of
two bugs canceling out, resulting in a functional setup on igb/igc, but
a less sane one on other NICs.
To the best of my understanding, taprio should prioritize based on the
traffic class, so it should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.
TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. We can achieve that by implementing
something very similar to q->curband from multiq_dequeue().
Since igb/igc really do have TXQ 0 of higher hardware priority than
TXQ 1 etc, we need to preserve the behavior for them as well. We really
have no choice, because in txtime-assist mode, taprio is essentially a
software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
says Kurt Kanzenbach).
To preserve the behavior, we need a capability bit so that taprio can
determine if it's running on igb/igc, or on something else. Because igb
doesn't offload taprio at all, we can't piggyback on the
qdisc_offload_query_caps() call from taprio_enable_offload(), but
instead we need a separate call which is also made for software
scheduling.
Introduce two static keys to minimize the performance penalty on systems
which only have igb/igc NICs, and on systems which only have other NICs.
For mixed systems, taprio will have to dynamically check whether to
dequeue using one prioritization algorithm or using the other.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
earlier than the previous skb_found label. This is possible because
we've unified the treatment of the child->ops->dequeue(child) return
call, we always try other TXQs now, instead of abandoning the root
dequeue completely if we failed in the peek() case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Future changes will refactor the TXQ selection procedure, and a lot of
stuff will become messy, the indentation of the bulk of the dequeue
procedure would increase, etc.
Break out the bulk of the function into a new one, which knows the TXQ
(child qdisc) we should perform a dequeue from.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the handling of an unlikely condition to not stop dequeuing
if taprio failed to dequeue the peeked skb in taprio_dequeue().
I've no idea when this can happen, but the only side effect seems to be
that the atomic_sub_return() call right above will have consumed some
budget. This isn't a big deal, since either that made us remain without
any budget (and therefore, we'd exit on the next peeked skb anyway), or
we could send some packets from other TXQs.
I'm making this change because in a future patch I'll be refactoring the
dequeue procedure to simplify it, and this corner case will have to go
away.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There isn't any code in the network stack which calls taprio_peek().
We only see qdisc->ops->peek() being called on child qdiscs of other
classful qdiscs, never from the generic qdisc code. Whereas taprio is
never a child qdisc, it is always root.
This snippet of a comment from qdisc_peek_dequeued() seems to confirm:
/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
Since I've been known to be wrong many times though, I'm not completely
removing it, but leaving a stub function in place which emits a warning.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the subflow error report callback unconditionally
propagates the fallback subflow status to the owning msk.
If the msk is already orphaned, the above prevents the code
from correctly tracking the msk moving to the TCP_CLOSE state
and doing the appropriate cleanup.
All the above causes increasing memory usage over time and
sporadic self-tests failures.
There is a great deal of infrastructure trying to propagate
correctly the fallback subflow status to the owning mptcp socket,
e.g. via mptcp_subflow_eof() and subflow_sched_work_if_closed():
in the error propagation path we need only to cope with unorphaned
sockets.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/339
Fixes: 15cc104533 ("mptcp: deliver ssk errors to msk")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency, in mptcp_pm_nl_create_listen_socket(), we need to
call the __mptcp_nmpc_socket() under the msk socket lock.
Note that as a side effect, mptcp_subflow_create_socket() needs a
'nested' lockdep annotation, as it will acquire the subflow (kernel)
socket lock under the in-kernel listener msk socket lock.
The current lack of locking is almost harmless, because the relevant
socket is not exposed to user space, but in the future we will add
more complexity to the mentioned helper, so let's play safe.
Fixes: 1729cf186d ("mptcp: create the listening socket for new port")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to call __mptcp_nmpc_socket(), and later access the subflow
socket, under the msk socket lock, or e.g. a racing connect() could
change the socket status under the hood, with unexpected results.
Fixes: 54635bd047 ("mptcp: add TCP_FASTOPEN_CONNECT socket option")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the peer closes all the existing subflows for a given
mptcp socket and later the application closes it, the current
implementation lets it survive until the timewait timeout expires.
While the above is allowed by the protocol specification it
consumes resources for almost no reason and additionally
causes sporadic self-tests failures.
Let's move the mptcp socket to the TCP_CLOSE state when there are
no alive subflows at close time, so that the allocated resources
will be freed immediately.
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug where sk->sk_txrehash gets its default enable
value from sysctl_txrehash only when the socket is a TCP listener.
sysctl_txrehash should set the default sk->sk_txrehash for every
socket, regardless of protocol and of listener/connector role.
Tested by following packetdrill:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
// SO_TXREHASH == 74, default to sysctl_txrehash == 1
+0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
+0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0
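Sketched, the direction of the fix is to set the default at socket
creation for every socket type, not only when a TCP socket enters
listen(); the exact placement in the creation path is an assumption
here:
sk->sk_txrehash = READ_ONCE(sock_net(sk)->core.sysctl_txrehash);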
Fixes: 26859240e4 ("txhash: Add socket option to control TX hash rethink behavior")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The > needs to be >= to prevent an out of bounds access.
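The pattern, sketched with an illustrative array (not the exact code):
/* For an array of N entries, index N is already out of bounds, so the
 * guard must reject prio == N as well: use >=, not >.
 */
if (WARN_ON_ONCE(prio >= ARRAY_SIZE(table)))
	return;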
Fixes: de5ca4c385 ("net: sched: sch: Bounds check priority")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/Y+D+KN18FQI2DKLq@kili
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'linux-can-fixes-for-6.2-20230207' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
can 2023-02-07
The patch is from Devid Antonio Filoni and fixes an address claiming
problem in the J1939 CAN protocol.
* tag 'linux-can-fixes-for-6.2-20230207' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: j1939: do not wait 250 ms if the same addr was already claimed
====================
Link: https://lore.kernel.org/r/20230207140514.2885065-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The callback devlink_nl_cmd_health_reporter_diagnose_doit() misses
devlink_fmsg_free(), which leads to a memory leak.
Fix it by adding devlink_fmsg_free().
Fixes: e994a75fb7 ("devlink: remove reporter reference counting")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/1675698976-45993-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxrpc_recvmsg_data() schedules an ACK to be transmitted every time at least
two packets have been consumed and any time it runs out of data and would
return -EAGAIN to the caller. Both events may occur within a single loop,
however, and if the I/O thread is quick enough it may send duplicate ACKs.
The ACKs are sent to inform the peer that more space has been made in the
local Rx window, but the I/O thread is going to send an ACK every couple of
DATA packets anyway, so we end up sending a lot more ACKs than we really
need to.
So reduce the rate at which recvmsg() schedules ACKs, such that if the I/O
thread sends ACKs at its normal faster rate, recvmsg() won't actually
schedule ACKs until the Rx flow stops (call->rx_consumed is cleared any
time we transmit an ACK for that call, resetting the counter used by
recvmsg).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Log ack.rwind in the rxrpc_tx_ack tracepoint. This value is useful to see
as it represents flow-control information to the peer.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If an rxrpc call is given a poke, it will get woken up unconditionally,
even if there's already a poke pending (for which there will have been a
wake) or if the call refcount has gone to 0.
Fix this by only waking the call if it is still referenced and if it
doesn't already have a poke pending.
Fixes: 15f661dc95 ("rxrpc: Implement a mechanism to send an event notification to a call")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Use consume_skb() rather than kfree_skb_reason().
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Recent removal of ksize() in alloc_skb() increased
performance because we no longer read
the associated struct page.
We have an equivalent cost at kfree_skb() time.
kfree(skb->head) has to access a struct page,
often cold in cpu caches to get the owning
struct kmem_cache.
Considering that many allocations are small (at least for TCP ones)
we can have our own kmem_cache to avoid the cache line miss.
This also saves memory because these small heads
are no longer padded to 1024 bytes.
CONFIG_SLUB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 2907 2907 640 51 8 : tunables 0 0 0 : slabdata 57 57 0
CONFIG_SLAB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 607 624 640 6 1 : tunables 54 27 8 : slabdata 104 104 5
Notes:
- After Kees Cook patches and this one, we might
be able to revert commit
dbae2b0628 ("net: skb: introduce and use a single page frag cache")
because GRO_MAX_HEAD is also small.
- This patch is a NOP for CONFIG_SLOB=y builds.
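A simplified sketch of the dedicated cache (the real patch also handles
usercopy whitelisting; the size constant here is illustrative):
/* One fixed-size cache for small heads: no struct page lookup on free,
 * and no padding up to the next power-of-two kmalloc size.
 */
skb_small_head_cache = kmem_cache_create("skbuff_small_head",
					 SKB_SMALL_HEAD_CACHE_SIZE,
					 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
					 NULL);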
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All kmalloc_reserve() callers have to make the same computation;
we can factor it out to prepare for the following patch in the series.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a cleanup patch, to prepare for the following change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have many places using this expression:
SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
Use of SKB_HEAD_ALIGN() will allow us to clean them up.
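The helper presumably reduces to something like the sketch below,
aligning both the head data and the shared info:
/* Minimal allocation for X bytes of head data, keeping
 * struct skb_shared_info aligned after it.
 */
#define SKB_HEAD_ALIGN(X) (SKB_DATA_ALIGN(X) + \
			   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))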
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2023-02-06
this is a pull request of 47 patches for net-next/master.
The first two patches are by Oliver Hartkopp. One adds missing error
checking to the CAN_GW protocol, the other adds a missing CAN address
family check to the CAN ISO TP protocol.
Thomas Kopp contributes a performance optimization to the mcp251xfd
driver.
The next 11 patches are by Geert Uytterhoeven and add support for
R-Car V4H systems to the rcar_canfd driver.
Stephane Grosjean and Lukas Magel contribute 8 patches to the peak_usb
driver, which add support for configurable CAN channel ID.
The last 17 patches are by me and target the CAN bit timing
configuration. The bit timing is cleaned up, error messages are
improved and forwarded to user space via NL_SET_ERR_MSG_FMT() instead
of netdev_err(), and the SJW handling is updated, including the
definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
* tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (47 commits)
can: bittiming: can_validate_bitrate(): report error via netlink
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: bittiming: can_calc_bittiming(): clean up SJW handling
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment
can: bittiming: can_sjw_check(): report error via netlink and harmonize error value
can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value
can: bittiming: factor out can_sjw_set_default() and can_sjw_check()
can: bittiming: can_changelink() pass extack down callstack
can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: netlink: can_validate(): validate sample point for CAN and CAN-FD
can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided
can: dev: register_candev(): ensure that bittiming const are valid
can: bittiming: can_get_bittiming(): use direct return and remove unneeded else
can: bittiming: can_fixup_bittiming(): set effective tq
can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
can: bittiming(): replace open coded variants of can_bit_time()
can: peak_usb: Reorder include directives alphabetically
can: peak_usb: align CAN channel ID format in log with sysfs attribute
can: peak_usb: export PCAN CAN channel ID as sysfs device attribute
...
====================
Link: https://lore.kernel.org/r/20230206131620.2758724-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If ops->get_mm() returns a non-zero error code, we goto out_complete,
but there, we return 0. Fix that to propagate the "ret" variable to the
caller. If ops->get_mm() succeeds, it will always return 0.
Fixes: 2b30f8291a ("net: ethtool: add support for MAC Merge layer")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230206094932.446379-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ISO 11783-5 standard, in "4.5.2 - Address claim requirements", states:
d) No CF shall begin, or resume, transmission on the network until 250
ms after it has successfully claimed an address except when
responding to a request for address-claimed.
But "Figure 6" and "Figure 7" in "4.5.4.2 - Address-claim
prioritization" show that the CF begins the transmission after 250 ms
from the first AC (address-claimed) message even if it sends another AC
message during that time window to resolve the address contention with
another CF.
As stated in "4.4.2.3 - Address-claimed message":
In order to successfully claim an address, the CF sending an address
claimed message shall not receive a contending claim from another CF
for at least 250 ms.
As stated in "4.4.3.2 - NAME management (NM) message":
1) A commanding CF can
d) request that a CF with a specified NAME transmit the address-
claimed message with its current NAME.
2) A target CF shall
d) send an address-claimed message in response to a request for a
matching NAME
Taking the above arguments into account, the 250 ms wait is requested
only during network initialization.
Do not restart the timer on AC message if both the NAME and the address
match and so if the address has already been claimed (timer has expired)
or the AC message has been sent to resolve the contention with another
CF (timer is still running).
Signed-off-by: Devid Antonio Filoni <devid.filoni@egluetechnologies.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20221125170418.34575-1-devid.filoni@egluetechnologies.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Currently only the network namespace of the devlink instance is
monitored for port events. If a netdev is moved to a different
namespace and then unregistered, NETDEV_PRE_UNINIT is missed, which
triggers the following WARN_ON in devl_port_unregister().
WARN_ON(devlink_port->type != DEVLINK_PORT_TYPE_NOTSET);
Fix this by changing the netdev notifier from per-net to global so no
event is missed.
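Sketched, the change moves from the per-net registration API to the
global one (notifier block name illustrative):
/* Before: events seen only from the devlink instance's namespace. */
err = register_netdevice_notifier_net(devlink_net(devlink), &nb);
/* After: a global notifier sees NETDEV_PRE_UNINIT no matter which
 * namespace the netdev was moved to before unregistering.
 */
err = register_netdevice_notifier(&nb);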
Fixes: 02a68a47ea ("net: devlink: track netdev with devlink_port assigned")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230206094151.2557264-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use the actual CPU count instead of a hardcoded value to decide the
size of 'cpu_used_mask' in 'struct sw_flow'. Below is the reason.
'struct cpumask cpu_used_mask' is embedded in struct sw_flow.
Its size is hardcoded to CONFIG_NR_CPUS bits, which can be
8192 by default; this costs memory and slows down ovs_flow_alloc.
To address this:
Redefine cpu_used_mask as a pointer.
Append cpumask_size() bytes after 'stats' to hold the cpumask.
Initialize cpu_used_mask right after stats_last_writer.
APIs like cpumask_next and cpumask_set_cpu never access bits
beyond the cpu count, so cpumask_size() bytes of memory are enough.
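A sketch of the resulting layout and initialization, based on the
description above (exact expressions may differ):
/* The cpumask lives in cpumask_size() bytes appended after the
 * per-CPU stats pointer array, instead of a CONFIG_NR_CPUS-bit mask
 * embedded in struct sw_flow.
 */
flow->cpu_used_mask = (struct cpumask *)&flow->stats[nr_cpu_ids];
cpumask_set_cpu(0, flow->cpu_used_mask);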
Signed-off-by: Eddy Tao <taoyuan_eddy@hotmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/OS3P286MB229570CCED618B20355D227AF5D59@OS3P286MB2295.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add sock_init_data_uid() to explicitly initialize the socket uid.
To initialise the socket uid, sock_init_data() assumes the struct
socket *sock is always embedded in a struct socket_alloc, used to
access the corresponding inode uid. This may not be true.
Examples are sockets created in tun_chr_open() and tap_open().
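The new entry point presumably looks like the sketch below (signature
assumed from the description); sock_init_data() then derives the uid
from the inode only when one exists:
/* Initialize @sk and set its uid explicitly, for callers whose
 * struct socket is not backed by an inode (e.g. tun/tap).
 */
void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid)
{
	/* ... body of the former sock_init_data() ... */
	sk->sk_uid = uid;
}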
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 2 classes of in-tree drivers currently:
- those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
holds a bit mask of TXQs
- those who act upon the gate_mask as if it holds a bit mask of TCs
When it comes to the standard, IEEE 802.1Q-2018 does say this in the
second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:
| A gate control list associated with each Port contains an ordered list
| of gate operations. Each gate operation changes the transmission gate
| state for the gate associated with each of the Port's traffic class
| queues and allows associated control operations to be scheduled.
In typically obtuse language, it refers to a "traffic class queue"
rather than a "traffic class" or a "queue". But careful reading of
802.1Q clarifies that "traffic class" and "queue" are in fact
synonymous (see 8.6.6 Queuing frames):
| A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified.
i.o.w. their definition of "queue" isn't the Linux TX queue.
The gate_mask really is input into taprio via its UAPI as a mask of
traffic classes, but taprio_sched_to_offload() converts it into a TXQ
mask.
The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:
- hellcreek, felix, sja1105: these are DSA switches, it's not even very
clear what TXQs correspond to, other than purely software constructs.
Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
So it's fine to convert these to a gate mask per TC.
- enetc: I have the hardware and can confirm that the gate mask is per
TC, and affects all TXQs (BD rings) configured for that priority.
- igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
to be per-TXQ.
- tsnep: Gerhard Engleder clarifies that even though this hardware
supports at most 1 TXQ per TC, the TXQ indices may be different from
the TC values themselves, and it is the TXQ indices that matter to
this hardware. So keep it per-TXQ as well.
- stmmac: I have a GMAC datasheet, and in the EST section it does
specify that the gate events are per TXQ rather than per TC.
- lan966x: again, this is a switch, and while not a DSA one, the way in
which it implements lan966x_mqprio_add() - by only allowing num_tc ==
NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
software construct here as well. They seem to map 1:1 with TCs.
- am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
impression that the fetch_allow variable is treated like a prio_mask.
This definitely sounds closer to a per-TC gate mask rather than a
per-TXQ one, and TI documentation does seem to recommend an identity
mapping between TCs and TXQs. However, Roger Quadros would like to do
some testing before making changes, so I'm leaving this driver to
operate as it did before, for now. Link with more details at the end.
Based on this breakdown, we have 5 drivers with a gate mask per TC and
4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
opt-in and the gate mask per TC the default.
Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
query the device driver before calling the proper ndo_setup_tc(), and
figure out if it expects one or the other format.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc does not currently pass the mqprio queue configuration
down to the offloading device driver. So the driver cannot act upon the
TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
assumed that the driver only wants to offload num_tc (see
TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
but there's clearly more to the mqprio configuration than that.
I've considered 2 mechanisms to remedy that. First is to pass a struct
tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
TC_SETUP_QDISC_TAPRIO.
The difference is that in the first case, existing drivers (offloading
or not) all ignore taprio's mqprio portion currently, whereas in the
second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
based on a new capability. The question is which approach would be
better.
I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
on a taprio capability bit) would risk introducing regressions. For
example, taprio doesn't populate (or validate) qopt->hw, as well as
mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
In comparison, adding a capability is functionally equivalent to just
passing the mqprio in a way that drivers can ignore it, except it's
slightly more complicated to use it (need to set the capability).
Ultimately, what made me go for the "mqprio in taprio" variant was that
it's easier for offloading drivers to interpret the mqprio qopt slightly
differently when it comes from taprio vs when it comes from mqprio,
should that ever become necessary.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.
Refactor the code to a helper in the common mqprio library.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.
Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.
In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.
The taprio code didn't reject overlapping TXQ ranges in txtime-assist mode
and that looks intentional, even if I've no idea why that might be. Preserve
that, but add a comment.
There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong about the queue counts. This uses the
new support for printf-formatted extack messages.
Example:
$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
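Presumably emitted with the new formatted-extack helper, along these
lines (variable names assumed):

NL_SET_ERR_MSG_FMT_MOD(extack,
		       "TC %d queues %d@%d overlap with TC %d queues %d@%d",
		       i, qopt->count[i], qopt->offset[i],
		       j, qopt->count[j], qopt->offset[j]);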
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_parse_opt() proudly has a comment:
/* If hardware offload is requested we will leave it to the device
* to either populate the queue counts itself or to validate the
* provided queue counts.
*/
Unfortunately some device drivers did not get this memo, and don't
validate the queue counts, or populate them.
In case drivers don't want to populate the queue counts themselves, but
just act upon the requested configuration, it makes sense to introduce a
tc capability and make mqprio query it, so they don't have to do the
validation themselves. A sketch follows.
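A sketch of the qdisc side, assuming the tc_mqprio_caps structure added
here; mqprio_validate_queue_counts() is a hypothetical helper name:

struct tc_mqprio_caps caps;

qdisc_offload_query_caps(dev, TC_SETUP_QDISC_MQPRIO,
			 &caps, sizeof(caps));
if (!caps.validate_queue_counts) {
	/* the driver won't validate, so mqprio has to */
	err = mqprio_validate_queue_counts(dev, qopt, extack);
	if (err)
		return err;
}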
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By imposing that the last TXQ of TC i is smaller than the first TXQ of
any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
the TXQ indices (they must increase as TCs increase).
Claudiu points out that the complexity of the TXQ count validation is
too high for this logic, i.e. instead of iterating over j, it is
sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
will eventually ensure global ordering.
This is true; however, it doesn't appear to me that this is what the code
really intended to do. Instead, based on the comments, it just wanted to
check for overlaps (and this isn't how one does that).
So the following mqprio configuration, which I had recommended to
Vinicius more than once for igb/igc (to account for the fact that on
this hardware, lower numbered TXQs have higher dequeue priority than
higher ones):
num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
is in fact denied today by mqprio.
The full story is that in fact, it's only denied with "hw 0"; if
hardware offloading is requested, mqprio defers TXQ range overlap
validation to the device driver (a strange decision in itself).
This is most certainly a bug, but it's not one that has any merit for
being fixed on "stable" as far as I can tell. This is because mqprio
always rejected a configuration which was in fact valid, and this has
shaped the way in which mqprio configuration scripts got built for
various hardware (see igb/igc in the link below). Therefore, one could
consider it to be merely an improvement for mqprio to allow reverse
TC:TXQ mappings.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first user of skb_poison_list is kfree_skb_list_reason, to catch bugs
earlier, like the one introduced in commit eedade12f4 ("net: kfree_skb_list
use kmem_cache_free_bulk"). For completeness, the mentioned bug has been
fixed in commit f72ff8b81e ("net: fix kfree_skb_list use of
skb_mark_not_on_list").
With a bug like the one in the mentioned commit, we would have seen an
OOPS with:
general protection fault, probably for non-canonical address 0xdead000000000870
and the content of one of the registers, e.g. R13: dead000000000800.
In this case skb->len is at offset 112 bytes (0x70), which is why the
fault happens at 0x800 + 0x70 = 0x870.
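A sketch of the poison helper consistent with the arithmetic above
(guarded by a debug option in the sketch, since unconditional poisoning
would cost cycles on hot paths):

#define SKB_LIST_POISON_NEXT ((void *)(0x800 + POISON_POINTER_DELTA))

#ifdef CONFIG_DEBUG_NET
static inline void skb_poison_list(struct sk_buff *skb)
{
	/* 0x800 + POISON_POINTER_DELTA is 0xdead000000000800 on x86-64;
	 * dereferencing skb->len (offset 0x70) then faults at
	 * 0xdead000000000870, matching the oops above.
	 */
	skb->next = SKB_LIST_POISON_NEXT;
}
#endif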
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use BH context only for synchronization, so we don't care if it's
actually serving softirq or not.
As a side note, in case of threaded NAPI, in_serving_softirq() will
return false because it's in process context with BH off, making
page_pool_recycle_in_cache() unreachable.
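In effect the recycling condition relaxes to BH-disabled context; a
sketch of the check in the put path:

/* was: allow_direct && in_serving_softirq() */
if (allow_direct && in_softirq() &&
    page_pool_recycle_in_cache(page, pool))
	return NULL;	/* recycled into the lockless per-pool cache */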
Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
Tested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch added accounting for number of MDB entries per port and
per port-VLAN, and the logic to verify that these values stay within
configured bounds. However it didn't provide means to actually configure
those bounds or read the occupancy. This patch does that.
Two new netlink attributes are added for the MDB occupancy:
IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
And another two for the maximum number of MDB entries:
IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
Note that the two new IFLA_BRPORT_ attributes prompt bumping of
RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
The new attributes are used like this:
# ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
mcast_vlan_snooping 1 mcast_querier 1
# ip link set dev v1 master br
# bridge vlan add dev v1 vid 2
# bridge vlan set dev v1 vid 1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
# bridge link set dev v1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.
# bridge -d link show
5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
[...] mcast_n_groups 1 mcast_max_groups 1
# bridge -d vlan show
port vlan-id
br 1 PVID Egress Untagged
state forwarding mcast_router 1
v1 1 PVID Egress Untagged
[...] mcast_n_groups 1 mcast_max_groups 1
2
[...] mcast_n_groups 0 mcast_max_groups 0
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MDB maintained by the bridge is limited. When the bridge is configured
for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
capacity. In SW datapath, the capacity is configurable through the
IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
similar limit exists in the HW datapath for purposes of offloading.
In order to prevent the issue of unilateral exhaustion of MDB resources,
introduce two parameters in each of two contexts:
- Per-port and per-port-VLAN number of MDB entries that the port
is member in.
- Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
per-port-VLAN maximum permitted number of MDB entries, or 0 for
no limit.
The per-port multicast context is used for tracking of MDB entries for the
port as a whole. This is available for all bridges.
The per-port-VLAN multicast context is then only available on
VLAN-filtering bridges on VLANs that have multicast snooping on.
With these changes in place, it will be possible to configure MDB limit for
bridge as a whole, or any one port as a whole, or any single port-VLAN.
Note that unlike the global limit, exhaustion of the per-port and
per-port-VLAN maximums does not cause disablement of multicast snooping.
It is also permitted to configure the local limit larger than hash_max,
even though that is not useful.
In this patch, introduce only the accounting for the number of entries and
the max field itself, but not the means to toggle the max. The next patch
introduces the netlink APIs to toggle and read the values.
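A hedged sketch of what such an accounting check could look like (struct,
field and helper names assumed, not taken from the patch):

static bool br_mcast_port_ngroups_inc(struct net_bridge_mcast_port *pmctx,
				      struct netlink_ext_ack *extack)
{
	/* max == 0 means no limit */
	if (pmctx->mdb_max_entries &&
	    pmctx->mdb_n_entries >= pmctx->mdb_max_entries) {
		NL_SET_ERR_MSG_MOD(extack, "Port/Port-VLAN MDB limit reached");
		return false;
	}
	pmctx->mdb_n_entries++;
	return true;
}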
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch will add two more maximum MDB allowances to the global
one, mcast_hash_max, that exists today. In all these cases, attempts to add
MDB entries above the configured maximums through netlink fail noisily and
obviously. Such visibility is missing when adding entries through the
control plane traffic, by IGMP or MLD packets.
To improve visibility in those cases, add a trace point that reports the
violation, including the relevant netdevice (be it a slave or the bridge
itself), and the MDB entry parameters:
# perf record -e bridge:br_mdb_full &
# [...]
# perf script | cut -d: -f4-
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-trace-kernel@vger.kernel.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will have more to clean up in the following patches.
Structuring the cleanups in one labeled block will allow reusing the same
cleanup from several places.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cleaning up the effects of br_multicast_new_port_group() just
consists of delisting and freeing the memory, the function
br_mdb_add_group_star_g() inlines the corresponding code. In the following
patches, the number of per-port and per-port-VLAN MDB entries is going to be
maintained, and that counter will have to be updated. Because that logic
is going to be hidden in the br_multicast module, introduce a new hook
intended to again remove a newly-created group.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that br_multicast_new_port_group() takes an extack argument, move
setting the extack there. The downside is that the error messages end
up being less specific (the function cannot distinguish between (S,G)
and (*,G) groups). However, the alternative is to check in the caller
whether the callee set the extack, and if it didn't, set it. But that
is only done when the callee is not exactly known. (E.g. in case of a
notifier invocation.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it possible to set an extack in br_multicast_new_port_group().
Eventually, this function will check for per-port and per-port-vlan
MDB maximums, and will use the extack to communicate the reason for
the bounce.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make any attributes newly-added to br_port_policy or vlan_tunnel_policy
parsed strictly, to prevent userspace from passing garbage. Note that this
patchset only touches the former policy. The latter was adjusted for
completeness' sake. There do not appear to be other _deprecated calls
with non-NULL policies.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Entries can linger in cache without timer for days, thanks to
the gc_thresh1 limit. As a result, without traffic, the confirmed
time can be outdated and appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As a result,
the timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ value.
Why is this a problem? While we expect unused entries to expire,
such entries stay in the REACHABLE state for too long, locked in
the cache. They are not expired normally, only when the cache is full.
Problem and the wrong state change reported by Zhang Changzhong:
172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.
Fix it by ensuring the timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too far into the past while we are in the
DELAY/PROBE state. There are also places that need the
used/updated times to be validated while the timer is not running.
Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.
During connection establishment, smc_buf_get_slot() will always
be invoked, and it only performs read accesses on the rmbs and
sndbufs lists.
Based on the above considerations, we replace the mutex with a
rw_semaphore. Only smc_buf_get_slot() uses down_read(), to allow it
to run concurrently; the other parts use down_write() to keep
exclusive semantics.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc was not registered, although it can
be executed in parallel when the assigned rmb_desc was registered already
and only performs read accesses on it. Hence, we can not simply replace
it with a read semaphore.
The idea here is that if the assigned rmb_desc was registered already,
use the read semaphore to protect the critical section; if the assigned
rmb_desc was not registered, keep using the write semaphore to retain
its exclusivity.
Since rmb_desc is usually reusable, this allows us to execute in
parallel in most cases.
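A simplified sketch of that conditional locking (member names assumed):

/* Fast path: rmb_desc already registered on this link, readers may
 * run in parallel. Slow path: registration mutates link state and
 * must stay exclusive.
 */
if (rmb_desc->is_reg_mr[link->link_idx]) {
	down_read(&lgr->llc_conf_mutex);
	/* ... confirm rkey, read-only on links ... */
	up_read(&lgr->llc_conf_mutex);
} else {
	down_write(&lgr->llc_conf_mutex);
	/* ... register memory regions, exclusive ... */
	up_write(&lgr->llc_conf_mutex);
}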
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following is part of Off-CPU graph during frequent SMC-R short-lived
processing:
process_one_work (51.19%)
smc_close_passive_work (28.36%)
smcr_buf_unuse (28.34%)
rwsem_down_write_slowpath (28.22%)
smc_listen_work (22.83%)
smc_clc_wait_msg (1.84%)
smc_buf_create (20.45%)
smcr_buf_map_usable_links
rwsem_down_write_slowpath (20.43%)
smcr_lgr_reg_rmbs (0.53%)
rwsem_down_write_slowpath (0.43%)
smc_llc_do_confirm_rkey (0.08%)
We can clearly see that during connection establishment, the waiting
time of connections is not on IO, but on llc_conf_mutex. More
importantly, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read accesses on links, so we can
easily replace the lock with a read semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llc_conf_mutex was used to protect links and link related configurations
in the same link group, for example, add or delete links. However,
in most cases, the protected critical area has only read semantics and
no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch simply does code refactoring: replace the mutex with a
rw_semaphore, mutex_lock with down_write and mutex_unlock with
up_write.
Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications seem to rely on RAW sockets.
If they use private netns, we can avoid piling all RAW
sockets bound to a given protocol into a single bucket.
Also place (struct raw_hashinfo).lock into its own
cache line to limit false sharing.
Alternative would be to have per-netns hashtables,
but this seems too expensive for most netns
where RAW sockets are not used.
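The hash presumably mixes in the netns along these lines:

static u32 raw_hashfunc(const struct net *net, u32 proto)
{
	/* net_hash_mix() spreads sockets bound to the same protocol
	 * across buckets when they live in different netns.
	 */
	return hash_32(net_hash_mix(net) ^ proto, RAW_HTABLE_LOG);
}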
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev selftest callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As all users of the struct devlink_info_req are already in dev.c, move
this struct from devl_internal.c to be local in dev.c.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev flash callbacks, helpers and other related code from
leftover.c to dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev info callbacks, related drivers helpers functions and
other related code from leftover.c to dev.c. No functional change in
this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev eswitch callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev reload callback and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev get and dump callbacks and related dev code to new file
dev.c. This file shall include all callbacks that are specific on
devlink dev object.
No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that commit 028fb19c6b ("netlink: provide an ability to set
default extack message") provides a weak function that doesn't override
an existing extack message provided by the driver, it makes sense to use
it also for LAG and HSR offloading, not just for bridge offloading.
Also consistently put the message string on a separate line, to reduce
line length from 92 to 84 characters.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230202140354.3158129-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the bvec_set_page helper to initialize bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both synchronous early drop algorithm and asynchronous gc worker completely
ignore connections with the IPS_OFFLOAD_BIT status bit set. With the new
functionality that enabled UDP NEW connection offload in action CT, a
malicious user can flood the conntrack table with offloaded UDP connections
by sending just a single packet per 5-tuple, because such connections can
no longer be deleted by the early drop algorithm.
To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify the offload algorithm of UDP connections to the following:
- Offload NEW connection as unidirectional.
- When the connection state changes to ESTABLISHED, also update the
hardware flow. However, in order to prevent act_ct from spamming the
offload add workqueue for every packet coming in the reply direction in
this state, verify whether the connection has already been updated to
ESTABLISHED in the drivers. If that is the case, then skip the flow_table
and let conntrack handle such packets, which will also allow conntrack to
potentially promote the connection to ASSURED.
- When the connection state changes to ASSURED, set the flow_table flow
NF_FLOW_HW_BIDIRECTIONAL flag, which will cause the refresh mechanism to
offload the reply direction.
All other protocols have their offload algorithm preserved and are always
offloaded as bidirectional.
Note that this change tries to minimize the load on the flow_table add
workqueue. First, it tracks the last ctinfo that was offloaded by using the
new flow 'NF_FLOW_HW_ESTABLISHED' flag and doesn't schedule the refresh for
reply direction packets when the offloads have already been updated with
the current ctinfo. Second, when the 'add' task executes on the workqueue
it always updates the offload with the current flow state (by checking the
'bidirectional' flow flag and obtaining the actual ctinfo/cookie through
the meta action instead of caching any of these from the moment of
scheduling the 'add' work), preventing the need to schedule more updates
if the state changed concurrently while the 'add' work was pending on the
workqueue.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tcf_ct_flow_table_fill_actions() function assumes that only
established connections can be offloaded and always sets ctinfo to either
IP_CT_ESTABLISHED or IP_CT_ESTABLISHED_REPLY strictly based on direction
without checking actual connection state. To enable UDP NEW connection
offload set the ctinfo, metadata cookie and NF_FLOW_HW_ESTABLISHED
flow_offload flags bit based on ct->status value.
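A sketch of the resulting selection, simplified from the description
above:

/* NEW connections are now offloaded before a reply was seen, so
 * derive ctinfo from the conntrack status, not from direction alone.
 */
if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status))
	ctinfo = dir == FLOW_OFFLOAD_DIR_ORIGINAL ?
		 IP_CT_ESTABLISHED : IP_CT_ESTABLISHED_REPLY;
else
	ctinfo = IP_CT_NEW;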
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify flow table offload to cache the last ct info status that was passed
to the driver offload callbacks by extending enum nf_flow_flags with new
"NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established'
during last act_ct meta actions fill call. This infrastructure change is
necessary to optimize promoting of UDP connections from 'new' to
'established' in following patches in this series.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify flow table offload to support unidirectional connections by
extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only
offload reply direction when the flag is set. This infrastructure change is
necessary to support offloading UDP NEW connections in original direction
in following patches in series.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently flow_offload_fixup_ct() function assumes that only replied UDP
connections can be offloaded and hardcodes UDP_CT_REPLIED timeout value. To
enable UDP NEW connection offload in following patches extract the actual
connections state from ct->status and set the timeout according to it.
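For UDP the timeout selection then becomes state-dependent, roughly:

enum udp_conntrack state =
	test_bit(IPS_SEEN_REPLY_BIT, &ct->status) ?
	UDP_CT_REPLIED : UDP_CT_UNREPLIED;

timeout = nf_udp_pernet(net)->timeouts[state];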
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A summary of the flags being set for various drivers is given below.
Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features
that can be turned off and on at runtime. This means that these flags
may be set and unset under RTNL lock protection by the driver. Hence,
READ_ONCE must be used by code loading the flag value.
Also, these flags are not used for synchronization against the availability
of XDP resources on a device. It is merely a hint, and hence the read
may race with the actual teardown of XDP resources on the device. This
may change in the future, e.g. operations taking a reference on the XDP
resources of the driver, and in turn inhibiting turning off this flag.
However, for now, it can only be used as a hint to check whether device
supports becoming a redirection target.
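A reader of the hint would therefore look like this (flag name as used in
this series; the consumer function is hypothetical):

/* The flag may flip under RTNL; READ_ONCE() pairs with the writer
 * and the result is only a hint, not a guarantee that resources
 * still exist by the time they are used.
 */
if (READ_ONCE(dev->xdp_features) & XDP_F_REDIRECT_TARGET)
	consider_as_redirect_target(dev);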
Turn 'hw-offload' feature flag on for:
- netronome (nfp)
- netdevsim.
Turn 'native' and 'zerocopy' feature flags on for:
- intel (i40e, ice, ixgbe, igc)
- mellanox (mlx5).
- stmmac
- netronome (nfp)
Turn 'native' feature flag on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2, enetc)
- funeth
- intel (igb)
- marvell (mvneta, mvpp2, octeontx2)
- mellanox (mlx4)
- mtk_eth_soc
- qlogic (qede)
- sfc
- socionext (netsec)
- ti (cpsw)
- tap
- tsnep
- veth
- xen
- virtio_net.
Turn 'basic' (tx, pass, aborted and drop) feature flags on for:
- netronome (nfp)
- cavium (thunder)
- hyperv.
Turn 'redirect_target' feature flag on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2)
- intel (i40e, ice, igb, ixgbe)
- ti (cpsw)
- marvell (mvneta, mvpp2)
- sfc
- socionext (netsec)
- qlogic (qede)
- mellanox (mlx5)
- tap
- veth
- virtio_net
- xen
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a Netlink spec-compatible family for netdevs.
This is a very simple implementation without much
thought going into it.
It allows us to reap all the benefits of Netlink specs,
one can use the generic client to issue the commands:
$ ./cli.py --spec netdev.yaml --dump dev_get
[{'ifindex': 1, 'xdp-features': set()},
{'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
{'ifindex': 3, 'xdp-features': {'rx-sg'}}]
The generic Python library does not have flags-by-name
support yet, but we also don't have to carry strings
in the messages, as user space can get the names from
the spec.
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'linux-can-fixes-for-6.2-20230202' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
can 2023-02-02
The first patch is by Ziyang Xuan and removes an errant WARN_ON_ONCE()
in the CAN J1939 protocol.
The next 3 patches are by Oliver Hartkopp. The first 2 target the CAN
ISO-TP protocol and fix the state machine with respect to signals and
a regression found by the syzbot.
The last patch is by me and adds a missing assignment in the ethtool ring
configuration callback.
* tag 'linux-can-fixes-for-6.2-20230202' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: mcp251xfd: mcp251xfd_ring_set_ringparam(): assign missing tx_obj_num_coalesce_irq
can: isotp: split tx timer into transmission and timeout
can: isotp: handle wait_event_interruptible() return values
can: raw: fix CAN FD frame transmissions over CAN XL devices
can: j1939: fix errant WARN_ON_ONCE in j1939_session_deactivate
====================
Link: https://lore.kernel.org/r/20230202094135.2293939-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To send CAN traffic back to the incoming interface a special flag has to
be set. When creating a routing job for identical interfaces without this
flag the rule is created but has no effect.
This patch adds an error return value in the case that the CAN interfaces
are identical but the CGW_FLAGS_CAN_IIF_TX_OK flag was not set.
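The added check presumably boils down to:

/* Routing back to the incoming interface only works with the
 * explicit opt-in flag; reject the job otherwise.
 */
if (gwj->src.dev == gwj->dst.dev &&
    !(gwj->flags & CGW_FLAGS_CAN_IIF_TX_OK)) {
	err = -EINVAL;
	goto out;
}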
Reported-by: Jannik Hartung <jannik.hartung@tu-bs.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230125055407.2053-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Follow the advice of Documentation/filesystems/sysfs.rst: show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
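I.e. conversions of this shape (a generic sketch, not the driver's actual
attribute; foo_get_value() is a placeholder):

static ssize_t foo_show(struct device *dev,
			struct device_attribute *attr, char *buf)
{
	/* sysfs_emit() knows buf is a PAGE_SIZE sysfs buffer and
	 * checks offsets, unlike a bare sprintf().
	 */
	return sysfs_emit(buf, "%d\n", foo_get_value(dev));
}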
Signed-off-by: Bo Liu <liubo03@inspur.com>
Link: https://lore.kernel.org/r/20230201081438.3151-1-liubo03@inspur.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Avoid possible synchronize_rcu() as part from the
kfree_rcu() call when 2nd arg is not provided.
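That is, prefer the two-argument form, which queues the object by its
rcu_head member and never blocks:

/* kfree_rcu(ptr) without a member name may fall back to
 * synchronize_rcu() when it cannot allocate; the two-argument
 * form cannot ('rcu' being the rcu_head member here).
 */
kfree_rcu(entry, rcu);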
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove the check for a negative number of keys as
this can never happen.
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The software pedit action didn't get the same love as some of the
other actions and it's still using spinlocks and shared stats in the
datapath.
Transition the action to rcu and percpu stats as this improves the
action's performance dramatically on multiple cpu deployments.
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
Here's the fifth part of patches in the process of moving rxrpc from doing
a lot of its stuff in softirq context to doing it in an I/O thread in
process context and thereby making it easier to support a larger SACK
table.
The full description is in the description for the first part[1] which is
now upstream. The second and third parts are also upstream[2]. A subset
of the original fourth part[3] got applied as a fix for a race[4].
The fifth part includes some cleanups:
(1) Miscellaneous trace header cleanups: fix a trace string, display the
security index in rx_packet rather than displaying the type twice,
remove some whitespace to make checkpatch happier and remove some
excess tabulation.
(2) Convert ->recvmsg_lock to a spinlock as it's only ever locked
exclusively.
(3) Make ->ackr_window and ->ackr_nr_unacked non-atomic as they're only
used in the I/O thread.
(4) Don't use call->tx_lock to access ->tx_buffer as that is only accessed
inside the I/O thread. sendmsg() loads onto ->tx_sendmsg and the I/O
thread decants from that to the buffer.
(5) Remove local->defrag_sem as DATA packets are transmitted serially by
the I/O thread.
(6) Remove the service connection bundle as it was only used for its
channel_lock - which has now gone.
And some more significant changes:
(7) Add a debugging option to allow a delay to be injected into packet
reception to help investigate the behaviour over longer links than
just a few cm.
(8) Generate occasional PING ACKs to probe for RTT information during a
receive heavy call.
(9) Simplify the SACK table maintenance and ACK generation. Now that both
parts are done in the same thread, there's no possibility of a race
and no need to try and be cunning to avoid taking a BH spinlock whilst
copying the SACK table (which in the future will be up to 2K) and no
need to rotate the copy to fit the ACK packet table.
(10) Use SKB_CONSUMED when freeing received DATA packets (stop dropwatch
complaining).
* tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
rxrpc: Kill service bundle
rxrpc: Change rx_packet tracepoint to display securityIndex not type twice
rxrpc: Show consumed and freed packets as non-dropped in dropwatch
rxrpc: Remove local->defrag_sem
rxrpc: Don't lock call->tx_lock to access call->tx_buffer
rxrpc: Simplify ACK handling
rxrpc: De-atomic call->ackr_window and call->ackr_nr_unacked
rxrpc: Generate extra pings for RTT during heavy-receive call
rxrpc: Allow a delay to be injected into packet reception
rxrpc: Convert call->recvmsg_lock to a spinlock
rxrpc: Shrink the tabulation in the rxrpc trace header a bit
rxrpc: Remove whitespace before ')' in trace header
rxrpc: Fix trace string
====================
Link: https://lore.kernel.org/all/20230131171227.3912130-1-dhowells@redhat.com/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The timer for the transmission of isotp PDUs formerly had two functions:
1. send two consecutive frames with a given time gap
2. monitor the timeouts for flow control frames and the echo frames
This led to larger txstate checks and potentially to a problem discovered
by syzbot which enabled the panic_on_warn feature while testing.
The former 'txtimer' function is split into 'txfrtimer' and 'txtimer'
to handle the two above functionalities with separate timer callbacks.
The two simplified timers now run in one-shot mode and make the state
transitions (especially with isotp_rcv_echo) better understandable.
Fixes: 866337865f ("can: isotp: fix tx state handling for echo tx processing")
Reported-by: syzbot+5aed6c3aaba661f5b917@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # >= v6.0
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230104145701.2422-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When wait_event_interruptible() has been interrupted by a signal the
tx.state value might not be ISOTP_IDLE. Force the state machines
into idle state so that the timer handlers do not continue working.
Fixes: 866337865f ("can: isotp: fix tx state handling for echo tx processing")
Cc: stable@vger.kernel.org
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230112192347.1944-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
A CAN XL device is always capable of processing CAN FD frames. The former
check when sending CAN FD frames relied on the existence of a CAN FD
device and did not check for a CAN XL device, which would be correct
too.
With this patch the CAN FD feature is enabled automatically when CAN
XL is switched on - and CAN FD cannot be switched off while CAN XL is
enabled.
This precondition also leads to a clean up and reduction of checks in
the hot path in raw_rcv() and raw_sendmsg(). Some conditions are
reordered to handle simple checks first.
changes since v1: https://lore.kernel.org/all/20230131091012.50553-1-socketcan@hartkopp.net
- fixed typo: devive -> device
changes since v2: https://lore.kernel.org/all/20230131091824.51026-1-socketcan@hartkopp.net/
- reorder checks in if statements to handle simple checks first
Fixes: 626332696d ("can: raw: add CAN XL support")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230131105613.55228-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The conclusion "j1939_session_deactivate() should be called with a
session ref-count of at least 2" is incorrect. In some concurrent
scenarios, j1939_session_deactivate can be called with the session
ref-count less than 2. But this is not a problem because the
session's active state is checked before putting the session in
j1939_session_deactivate_locked().
Here is the concurrent scenario of the problem reported by syzbot
and my reproduction log.
cpu0 cpu1
j1939_xtp_rx_eoma
j1939_xtp_rx_abort_one
j1939_session_get_by_addr [kref == 2]
j1939_session_get_by_addr [kref == 3]
j1939_session_deactivate [kref == 2]
j1939_session_put [kref == 1]
j1939_session_completed
j1939_session_deactivate
WARN_ON_ONCE(kref < 2)
=====================================================
WARNING: CPU: 1 PID: 21 at net/can/j1939/transport.c:1088 j1939_session_deactivate+0x5f/0x70
CPU: 1 PID: 21 Comm: ksoftirqd/1 Not tainted 5.14.0-rc7+ #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:j1939_session_deactivate+0x5f/0x70
Call Trace:
j1939_session_deactivate_activate_next+0x11/0x28
j1939_xtp_rx_eoma+0x12a/0x180
j1939_tp_recv+0x4a2/0x510
j1939_can_recv+0x226/0x380
can_rcv_filter+0xf8/0x220
can_receive+0x102/0x220
? process_backlog+0xf0/0x2c0
can_rcv+0x53/0xf0
__netif_receive_skb_one_core+0x67/0x90
? process_backlog+0x97/0x2c0
__netif_receive_skb+0x22/0x80
Fixes: 0c71437dd5 ("can: j1939: j1939_session_deactivate(): clarify lifetime of session object")
Reported-by: syzbot+9981a614060dcee6eeca@syzkaller.appspotmail.com
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20210906094200.95868-1-william.xuanziyang@huawei.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
In a common netdev pattern, the extack pointer is forwarded to the drivers
to be filled with an error message. However, the caller can easily
overwrite the filled message.
Instead of adding multiple "if (!extack->_msg)" checks before any
NL_SET_ERR_MSG() call that appears after a call to the driver, let's
add a new macro to common code.
[1] https://lore.kernel.org/all/Y9Irgrgf3uxOjwUm@unreal
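Callers then use it like this (a sketch; macro name per the weak-extack
concept this adds):

/* Sets a default message only if the driver left extack empty. */
NL_SET_ERR_MSG_WEAK_MOD(extack,
			"Offloading not supported");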
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/6993fac557a40a1973dfa0095107c3d03d40bec1.1675171790.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When set to zero, the neighbor sysctl proxy_delay value
does not cause an immediate reply for ARP/ND requests
as expected; it instead causes a random delay between
[0, U32_MAX). This comment from
__get_random_u32_below() explains the reason:
/*
* This function is technically undefined for ceil == 0, and in fact
* for the non-underscored constant version in the header, we build bug
* on that. But for the non-constant case, it's convenient to have that
* evaluate to being a straight call to get_random_u32(), so that
* get_random_u32_inclusive() can work over its whole range without
* undefined behavior.
*/
Add a helper function that does not call get_random_u32_below()
if proxy_delay is zero and just uses the current value of
jiffies instead, causing pneigh_enqueue() to respond
immediately.
Also add a definition of proxy_delay to ip-sysctl.txt since
it was missing.
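A sketch of such a helper (the name is hypothetical; the sysctl is read
via NEIGH_VAR() as elsewhere in neighbour.c):

static unsigned long pneigh_sched_time(struct neigh_parms *p)
{
	int delay = NEIGH_VAR(p, PROXY_DELAY);

	/* delay == 0: respond immediately and avoid the undefined
	 * get_random_u32_below(0) case described above.
	 */
	return delay ? jiffies + get_random_u32_below(delay) : jiffies;
}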
Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.
Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
in __ip_local_out() to allow sending BIG TCP packets, and this implies
that skb->len is the length of an IPv4 packet; on the RX path, use skb->len
as the length of the IPv4 packet when the IP header tot_len is 0 and
skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
need to update these APIs.
Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
GRO complete, set the IP header tot_len to 0 when the merged packet size is
greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
on the RX path.
Note that by checking skb_is_gso_tcp() in the iph_totlen() API, this
implementation makes it safe to use iph->tot_len == 0 to indicate IPv4
BIG TCP packets.
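The helpers presumably look along these lines (per the description; exact
types may differ):

static inline u32 iph_totlen(const struct sk_buff *skb,
			     const struct iphdr *iph)
{
	u32 len = ntohs(iph->tot_len);

	/* tot_len == 0 stands for "use skb->len" only on GSO TCP
	 * packets, which keeps non-GSO traffic unambiguous.
	 */
	return (len || !skb_is_gso(skb) || !skb_is_gso_tcp(skb)) ?
	       len : skb->len - skb_network_offset(skb);
}

static inline u32 skb_ip_totlen(const struct sk_buff *skb)
{
	return iph_totlen(skb, ip_hdr(skb));
}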
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
per device and adds netlink attributes for them, so that IPV4
BIG TCP can be guarded by a separate tunable in the next patch.
To avoid breaking old applications using "gso/gro_max_size" for
IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
in netif_set_gso/gro_max_size() if the new size isn't greater
than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
userspace isn't aware of the new netlink attributes.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user
that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in
tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this
flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
the iph->tot_len update should use iph_set_totlen().
Note that for these non-GSO packets, the new iph tot_len with the extra
iph option len added may become greater than 65535; the old code would
cast it and set iph->tot_len to the truncated value, which is a bug. In
theory, iph options shouldn't be added to such big packets here; a fix
may be needed in the future. For now this patch only sets
iph->tot_len to 0 when it happens.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are also quite a few places in netfilter that may process IPv4 TCP
GSO packets; we need to replace them too.
In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen()
return value, otherwise it may overflow and mismatch. This change will
also help us add selftest for IPv4 BIG TCP in the following patch.
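For length_mt() the comparison would then read, roughly:

const struct xt_length_info *info = par->matchinfo;
u_int32_t pktlen = skb_ip_totlen(skb);	/* may exceed 65535 now */

return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;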
Note that we don't need to replace the one in tcpmss_tg4(), as it will
return if there is data after tcphdr in tcpmss_mangle_packet(). The
same in mangle_contents() in nf_nat_helper.c, it returns false when
skb->len + extra > 65535 in enlarge_skb().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is 1 action and 1 qdisc that may process IPv4 TCP GSO packets
and access iph->tot_len; replace the accesses with skb_ip_totlen() and
iph_totlen() accordingly.
Note that we don't need to replace the one in tcf_csum_ipv4(), as it
will return for TCP GSO packets in tcf_csum_ipv4_tcp().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 GSO packets may get processed in ovs_skb_network_trim(),
and we need to use skb_ip_totlen() to get iph totlen.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These 3 places in bridge netfilter are called on RX path after GRO
and IPv4 TCP GSO packets may come through, so replace iph tot_len
accessing with skb_ip_totlen() in there.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
swap() is a helper that exchanges two values. To avoid
code duplication, we can use it.
./net/ipv6/icmp.c:344:25-26: WARNING opportunity for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3896
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230131063456.76302-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We recently found that our non-point-to-point tunnels were not
generating any IPv6 link local address and instead generating an
IPv6 compat address, breaking IPv6 communication on the tunnel.
Previously, addrconf_gre_config would always call addrconf_addr_gen
and generate an EUI64 link local address for the tunnel.
Then commit e5dd729460 changed the code path so that add_v4_addrs
is called but this only generates a compat IPv6 address for
non-point-to-point tunnels.
I assume the compat address is specifically for SIT tunnels, so I have
kept that only for SIT - GRE tunnels now always generate link
local addresses.
Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For our point-to-point GRE tunnels, they have IN6_ADDR_GEN_MODE_NONE
when they are created then we set IN6_ADDR_GEN_MODE_EUI64 when they
come up to generate the IPv6 link local address for the interface.
Recently we found that they were no longer generating IPv6 addresses.
This issue would also have affected SIT tunnels.
Commit e5dd729460 changed the code path so that GRE tunnels
generate an IPv6 address based on the tunnel source address.
It also changed the code path so GRE tunnels don't call addrconf_addr_gen
in addrconf_dev_config which is called by addrconf_sysctl_addr_gen_mode
when the IN6_ADDR_GEN_MODE is changed.
This patch aims to fix this issue by moving the code in addrconf_notify
which calls the addr gen for GRE and SIT into a separate function
and calling it in the places that expect the IPv6 address to be
generated.
The previous addrconf_dev_config is renamed to addrconf_eth_config
since it only expected eth type interfaces and follows the
addrconf_gre/sit_config format.
A part of these changes means that the loopback address will be
attempted to be configured when changing addr_gen_mode for lo.
This should not be a problem because the address should exist anyway,
and if it does already exist then no error is produced.
Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kfuncs are allowed to be static, or not use one or more of their
arguments. For example, bpf_xdp_metadata_rx_hash() in net/core/xdp.c is
meant to be implemented by drivers, with the default implementation just
returning -EOPNOTSUPP. As described in [0], such kfuncs can have their
arguments elided, which can cause BTF encoding to be skipped. The new
__bpf_kfunc macro should address this, and this patch adds a selftest
which verifies that a static kfunc with at least one unused argument can
still be encoded and invoked by a BPF program.
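The default stub then looks roughly like this:

/* __bpf_kfunc keeps the symbol and its BTF even when an argument
 * is unused and the function could otherwise be elided by LTO.
 */
__bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash)
{
	return -EOPNOTSUPP;
}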
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230201173016.342758-5-void@manifault.com
Now that we have the __bpf_kfunc tag, we should add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
In order to maintain naming consistency, rename and reorder all usages
of struct struct devlink_cmd in the following way:
1) Remove "gen" and replace it with "cmd" to match the struct name
2) Order devl_cmds[] and the header file to match the order
of enum devlink_command
3) Move devl_cmd_rate_get among the peers
4) Remove "inst" for DEVLINK_CMD_GET
5) Add "_get" suffix to all to match DEVLINK_CMD_*_GET (only rate had it
done correctly)
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No need to have "gen" inside name of the structure for devlink commands.
Remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To have the name of the function consistent with the struct cb name,
rename devlink_nl_instance_iter_dump() to
devlink_nl_instance_iter_dumpit().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The static 'seq_print_acct' function always returns 0.
Change the return value to 'void' and remove unnecessary checks.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes: 1ca9e41770 ("netfilter: Remove uses of seq_<foo> return values")
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPS_SEEN_REPLY_BIT is only useful for test_bit() api.
Fixes: 4883ec512c ("netfilter: conntrack: avoid reload of ct->status")
Reported-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It should be 'chain' passed to PTR_ERR() in the error path
after calling nft_chain_lookup() in nf_tables_delrule().
Fixes: f80a612dd7 ("netfilter: nf_tables: add support to destroy operation")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A static analyzer detected a possible NULL pointer dereference for 'type':
__nft_obj_type_get() can return NULL, which requires handling; if type is
a NULL pointer, return -ENOENT.
This is a theoretical issue, since an existing object has a type, but
better to add this failsafe check.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Release bridge info once packet escapes the br_netfilter path,
from Florian Westphal.
2) Revert incorrect fix for the SCTP connection tracking chunk
iterator, also from Florian.
First path fixes a long standing issue, the second path addresses
a mistake in the previous pull request for net.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
Revert "netfilter: conntrack: fix bug in for_each_sctp_chunk"
netfilter: br_netfilter: disable sabotage_in hook after first suppression
====================
Link: https://lore.kernel.org/r/20230131133158.4052-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit ba6f5e33bd ("sctp: avoid refreshing heartbeat timer too often")
tries to avoid refreshing the hb_timer too often and only allows
mod_timer when the new expiry is after hb_timer.expires. This means that
even if a much shorter interval for the hb timer is applied, it will have
to wait until the current hb timer times out.
In sctp_do_8_2_transport_strike(), when a transport enters the PF state,
it expects to update the hb timer to resend a heartbeat every rto after
calling sctp_transport_reset_hb_timer(), which will not work due to the
change mentioned above.
The frequent hb_timer refresh was caused by sctp_transport_reset_timers()
called in sctp_outq_flush(), and that call was already removed in the
commit above. So we don't have to check hb_timer.expires when resetting
hb_timer, as it is now not called very often.
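So the reset can be unconditional now; a sketch of the resulting helper:

void sctp_transport_reset_hb_timer(struct sctp_transport *transport)
{
	unsigned long expires = jiffies + sctp_transport_timeout(transport);

	/* No time_before(hb_timer.expires, ...) check anymore: a
	 * shorter interval (e.g. rto in PF state) must take effect
	 * immediately.
	 */
	if (!mod_timer(&transport->hb_timer,
		       expires + get_random_u32_below(transport->rto)))
		sctp_transport_hold(transport);
}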
Fixes: ba6f5e33bd ("sctp: avoid refreshing heartbeat timer too often")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/d958c06985713ec84049a2d5664879802710179a.1675095933.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>