WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Eric Dumazet	fb420d5d91	tcp/fq: move back to CLOCK_MONOTONIC In the recent TCP/EDT patch series, I switched TCP and sch_fq clocks from MONOTONIC to TAI, in order to meet the choice done earlier for sch_etf packet scheduler. But sure enough, this broke some setups were the TAI clock jumps forward (by almost 50 year...), as reported by Leonard Crestez. If we want to converge later, we'll probably need to add an skb field to differentiate the clock bases, or a socket option. In the meantime, an UDP application will need to use CLOCK_MONOTONIC base for its SCM_TXTIME timestamps if using fq packet scheduler. Fixes: `72b0094f91` ("tcp: switch tcp_clock_ns() to CLOCK_TAI base") Fixes: `142537e419` ("net_sched: sch_fq: switch to CLOCK_TAI") Fixes: `fd2bca2aa7` ("tcp: switch internal pacing timer to CLOCK_TAI") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Leonard Crestez <leonard.crestez@nxp.com> Tested-by: Leonard Crestez <leonard.crestez@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:18:51 -07:00
Cong Wang	460b360104	net_sched: fix a crash in tc_new_tfilter() When tcf_block_find() fails, it already rollbacks the qdisc refcnt, so its caller doesn't need to clean up this again. Avoid calling qdisc_put() again by resetting qdisc to NULL for callers. Reported-by: syzbot+37b8770e6d5a8220a039@syzkaller.appspotmail.com Fixes: `e368fdb61d` ("net: sched: use Qdisc rcu API instead of relying on rtnl lock") Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:09:41 -07:00
Dan Carpenter	aeadd93f2b	net: sched: act_ipt: check for underflow in __tcf_ipt_init() If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we return -EINVAL. But we don't check whether it's smaller than sizeof(struct xt_entry_target) and that could lead to an out of bounds read. Fixes: `7ba699c604` ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:34:14 -07:00
Wei Yongjun	5362700c94	net: sched: make function qdisc_free_cb() static Fixes the following sparse warning: net/sched/sch_generic.c:944:6: warning: symbol 'qdisc_free_cb' was not declared. Should it be static? Fixes: `3a7d0d07a3` ("net: sched: extend Qdisc with rcu") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-28 11:04:58 -07:00
Vlad Buslov	787ce6d02d	net: sched: use reference counting for tcf blocks on rules update In order to remove dependency on rtnl lock on rules update path, always take reference to block while using it on rules update path. Change tcf_block_get() error handling to properly release block with reference counting, instead of just destroying it, in order to accommodate potential concurrent users. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	0607e43994	net: sched: implement tcf_block_refcnt_{get\|put}() Implement get/put function for blocks that only take/release the reference and perform deallocation. These functions are intended to be used by unlocked rules update path to always hold reference to block while working with it. They use on new fine-grained locking mechanisms introduced in previous patches in this set, instead of relying on global protection provided by rtnl lock. Extract code that is common with tcf_block_detach_ext() into common function __tcf_block_put(). Extend tcf_block with rcu to allow safe deallocation when it is accessed concurrently. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	ab2816295f	net: sched: protect block idr with spinlock Protect block idr access with spinlock, instead of relying on rtnl lock. Take tn->idr_lock spinlock during block insertion and removal. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	f00234367b	net: sched: implement functions to put and flush all chains Extract code that flushes and puts all chains on tcf block to two standalone function to be shared with functions that locklessly get/put reference to block. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	cfebd7e242	net: sched: change tcf block reference counter type to refcount_t As a preparation for removing rtnl lock dependency from rules update path, change tcf block reference counter type to refcount_t to allow modification by concurrent users. In block put function perform decrement and check reference counter once to accommodate concurrent modification by unlocked users. After this change tcf_chain_put at the end of block put function is called with block->refcnt==0 and will deallocate block after the last chain is released, so there is no need to manually deallocate block in this case. However, if block reference counter reached 0 and there are no chains to release, block must still be deallocated manually. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	e368fdb61d	net: sched: use Qdisc rcu API instead of relying on rtnl lock As a preparation from removing rtnl lock dependency from rules update path, use Qdisc rcu and reference counting capabilities instead of relying on rtnl lock while working with Qdiscs. Create new tcf_block_release() function, and use it to free resources taken by tcf_block_find(). Currently, this function only releases Qdisc and it is extended in next patches in this series. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:36 -07:00
Vlad Buslov	3a7d0d07a3	net: sched: extend Qdisc with rcu Currently, Qdisc API functions assume that users have rtnl lock taken. To implement rtnl unlocked classifiers update interface, Qdisc API must be extended with functions that do not require rtnl lock. Extend Qdisc structure with rcu. Implement special version of put function qdisc_put_unlocked() that is called without rtnl lock taken. This function only takes rtnl lock if Qdisc reference counter reached zero and is intended to be used as optimization. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:35 -07:00
Vlad Buslov	86bd446b5c	net: sched: rename qdisc_destroy() to qdisc_put() Current implementation of qdisc_destroy() decrements Qdisc reference counter and only actually destroy Qdisc if reference counter value reached zero. Rename qdisc_destroy() to qdisc_put() in order for it to better describe the way in which this function currently implemented and used. Extract code that deallocates Qdisc into new private qdisc_destroy() function. It is intended to be shared between regular qdisc_put() and its unlocked version that is introduced in next patch in this series. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-25 20:17:35 -07:00
Eelco Chaudron	28169abadb	net/sched: Add hardware specific counters to TC actions Add additional counters that will store the bytes/packets processed by hardware. These will be exported through the netlink interface for displaying by the iproute2 tc tool Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-24 12:18:42 -07:00
Eric Dumazet	90caf67b01	net_sched: sch_fq: remove dead code dealing with retransmits With the earliest departure time model, we no longer plan special casing TCP retransmits. We therefore remove dead code (since most compilers understood skb_is_retransmit() was false) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-21 19:38:00 -07:00
Eric Dumazet	ab408b6dc7	tcp: switch tcp and sch_fq to new earliest departure time model TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer has to do it. Thanks to this model, TCP can get more accurate RTT samples, since pacing no longer inflates them. This has the nice effect of removing some delays caused by FQ quantum mechanism, causing inflated max/P99 latencies. Also we might relax TCP Small Queue tight limits in the future, since this new model allow TCP to build bigger batches, since sch_fq (or a device with earliest departure time offload) ensure these packets will be delivered on time. Note that other protocols are not converted (they will probably never be) so sch_fq has still support for SO_MAX_PACING_RATE Tested: Test showing FQ pacing quantum artifact for low-rate flows, adding unexpected throttles for RPC flows, inflating max and P99 latencies. The parameters chosen here are to show what happens typically when a TCP flow has a reduced pacing rate (this can be caused by a reduced cwin after few losses, or/and rtt above few ms) MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY" Before : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 19,82.78,5279,3825,482.02 After : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 20,49.94,128,63,3.18 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-21 19:38:00 -07:00
Eric Dumazet	142537e419	net_sched: sch_fq: switch to CLOCK_TAI TCP will soon provide per skb->tstamp with earliest departure time, so that sch_fq does not have to determine departure time by looking at socket sk_pacing_rate. We chose in linux-4.19 CLOCK_TAI as the clock base for transports, qdiscs, and NIC offloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-21 19:37:59 -07:00
Vlad Buslov	ec3ed293e7	net_sched: change tcf_del_walker() to take idrinfo->lock Action API was changed to work with actions and action_idr in concurrency safe manner, however tcf_del_walker() still uses actions without taking a reference or idrinfo->lock first, and deletes them directly, disregarding possible concurrent delete. Change tcf_del_walker() to take idrinfo->lock while iterating over actions and use new tcf_idr_release_unsafe() to release them while holding the lock. And the blocking function fl_hw_destroy_tmplt() could be called when we put a filter chain, so defer it to a work queue. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> [xiyou.wangcong@gmail.com: heavily modify the code and changelog] Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-21 08:55:05 -07:00
zhong jiang	cb205a8174	net: sched: Use FIELD_SIZEOF directly instead of reimplementing its function FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore, We prefer to use the macro rather than calculating its value. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-19 20:58:04 -07:00
David S. Miller	e366fa4350	Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net Two new tls tests added in parallel in both net and net-next. Used Stephen Rothwell's linux-next resolution. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-18 09:33:27 -07:00
Davide Caratti	2d550dbad8	net/sched: act_police: don't use spinlock in the data path use RCU instead of spinlocks, to protect concurrent read/write on act_police configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-16 15:30:22 -07:00
Davide Caratti	93be42f917	net/sched: act_police: use per-cpu counters use per-CPU counters, instead of sharing a single set of stats with all cores. This removes the need of using spinlock when statistics are read or updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-16 15:30:22 -07:00
Davide Caratti	34043d250f	net/sched: act_sample: fix NULL dereference in the data path Matteo reported the following splat, testing the datapath of TC 'sample': BUG: KASAN: null-ptr-deref in tcf_sample_act+0xc4/0x310 Read of size 8 at addr 0000000000000000 by task nc/433 CPU: 0 PID: 433 Comm: nc Not tainted 4.19.0-rc3-kvm #17 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014 Call Trace: kasan_report.cold.6+0x6c/0x2fa tcf_sample_act+0xc4/0x310 ? dev_hard_start_xmit+0x117/0x180 tcf_action_exec+0xa3/0x160 tcf_classify+0xdd/0x1d0 htb_enqueue+0x18e/0x6b0 ? deref_stack_reg+0x7a/0xb0 ? htb_delete+0x4b0/0x4b0 ? unwind_next_frame+0x819/0x8f0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 __dev_queue_xmit+0x722/0xca0 ? unwind_get_return_address_ptr+0x50/0x50 ? netdev_pick_tx+0xe0/0xe0 ? save_stack+0x8c/0xb0 ? kasan_kmalloc+0xbe/0xd0 ? __kmalloc_track_caller+0xe4/0x1c0 ? __kmalloc_reserve.isra.45+0x24/0x70 ? __alloc_skb+0xdd/0x2e0 ? sk_stream_alloc_skb+0x91/0x3b0 ? tcp_sendmsg_locked+0x71b/0x15a0 ? tcp_sendmsg+0x22/0x40 ? __sys_sendto+0x1b0/0x250 ? __x64_sys_sendto+0x6f/0x80 ? do_syscall_64+0x5d/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ? __sys_sendto+0x1b0/0x250 ? __x64_sys_sendto+0x6f/0x80 ? do_syscall_64+0x5d/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 ip_finish_output2+0x495/0x590 ? ip_copy_metadata+0x2e0/0x2e0 ? skb_gso_validate_network_len+0x6f/0x110 ? ip_finish_output+0x174/0x280 __tcp_transmit_skb+0xb17/0x12b0 ? __tcp_select_window+0x380/0x380 tcp_write_xmit+0x913/0x1de0 ? __sk_mem_schedule+0x50/0x80 tcp_sendmsg_locked+0x49d/0x15a0 ? tcp_rcv_established+0x8da/0xa30 ? tcp_set_state+0x220/0x220 ? clear_user+0x1f/0x50 ? iov_iter_zero+0x1ae/0x590 ? __fget_light+0xa0/0xe0 tcp_sendmsg+0x22/0x40 __sys_sendto+0x1b0/0x250 ? __ia32_sys_getpeername+0x40/0x40 ? _copy_to_user+0x58/0x70 ? poll_select_copy_remaining+0x176/0x200 ? __pollwait+0x1c0/0x1c0 ? ktime_get_ts64+0x11f/0x140 ? kern_select+0x108/0x150 ? core_sys_select+0x360/0x360 ? vfs_read+0x127/0x150 ? kernel_write+0x90/0x90 __x64_sys_sendto+0x6f/0x80 do_syscall_64+0x5d/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fefef2b129d Code: ff ff ff ff eb b6 0f 1f 80 00 00 00 00 48 8d 05 51 37 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41 RSP: 002b:00007fff2f5350c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000056118d60c120 RCX: 00007fefef2b129d RDX: 0000000000002000 RSI: 000056118d629320 RDI: 0000000000000003 RBP: 000056118d530370 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000 R13: 000056118d5c2a10 R14: 000056118d5c2a10 R15: 000056118d5303b8 tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init() forgot to allocate them, because tcf_idr_create() was called with a wrong value of 'cpustats'. Setting it to true proved to fix the reported crash. Reported-by: Matteo Croce <mcroce@redhat.com> Fixes: `65a206c01e` ("net/sched: Change act_api and act_xxx modules to use IDR") Fixes: `5c5670fae4` ("net/sched: Introduce sample tc action") Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-14 08:46:28 -07:00
Cong Wang	f5b9bac745	net_sched: notify filter deletion when deleting a chain When we delete a chain of filters, we need to notify user-space we are deleting each filters in this chain too. Fixes: `32a4f5ecd7` ("net: sched: introduce chain object to uapi") Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-13 09:07:40 -07:00
David S. Miller	aaf9253025	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-09-12 22:22:42 -07:00
Cong Wang	11957be20f	htb: use anonymous union for simplicity cl->leaf.q is slightly more readable than cl->un.leaf.q. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:39:57 -07:00
Cong Wang	8ecc7c8a1c	net_sched: remove redundant qdisc lock classes We no longer take any spinlock on RX path for ingress qdisc, so this lockdep annotation is no longer needed. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:39:38 -07:00
Vlad Buslov	86c55361e5	net: sched: cls_flower: dump offload count value Change flower in_hw_count type to fixed-size u32 and dump it as TCA_FLOWER_IN_HW_COUNT. This change is necessary to properly test shared blocks and re-offload functionality. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:35:15 -07:00
David S. Miller	a8305bff68	net: Add and use skb_mark_not_on_list(). An SKB is not on a list if skb->next is NULL. Codify this convention into a helper function and use it where we are dequeueing an SKB and need to mark it as such. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:06:54 -07:00
David S. Miller	596977300a	sch_netem: Move private queue handler to generic location. By hand copies of SKB list handlers do not belong in individual packet schedulers. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:06:53 -07:00
David S. Miller	aea890b8b2	sch_htb: Remove local SKB queue handling code. Instead, adjust __qdisc_enqueue_tail() such that HTB can use it instead. The only other caller of __qdisc_enqueue_tail() is qdisc_enqueue_tail() so we can move the backlog and return value handling (which HTB doesn't need/want) to the latter. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-10 10:06:52 -07:00
Vlad Buslov	f20a4d0117	net: sched: act_nat: remove dependency on rtnl lock According to the new locking rule, we have to take tcf_lock for both ->init() and ->dump(), as RTNL will be removed. Use tcf spinlock to protect private nat action data from concurrent modification during dump. (nat init already uses tcf spinlock when changing action state) Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-08 10:18:25 -07:00
Vlad Buslov	6d7a8df6df	net: sched: act_skbedit: remove dependency on rtnl lock According to the new locking rule, we have to take tcf_lock for both ->init() and ->dump(), as RTNL will be removed. Use tcf lock to protect skbedit action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl lock assertion that is no longer required. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-08 10:17:35 -07:00
Cong Wang	a162c35114	net_sched: properly cancel netlink dump on failure When nla_put*() fails after nla_nest_start(), we need to call nla_nest_cancel() to cancel the message, otherwise we end up calling nla_nest_end() like a success. Fixes: `0ed5269f9e` ("net/sched: add tunnel option support to act_tunnel_key") Cc: Davide Caratti <dcaratti@redhat.com> Cc: Simon Horman <simon.horman@netronome.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-07 23:05:07 -07:00
Davide Caratti	ee28bb56ac	net/sched: fix memory leak in act_tunnel_key_init() If users try to install act_tunnel_key 'set' rules with duplicate values of 'index', the tunnel metadata are allocated, but never released. Then, kmemleak complains as follows: # tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111 # echo clear > /sys/kernel/debug/kmemleak # tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111 Error: TC IDR already exists. We have an error talking to the kernel # echo scan > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800574e6c80 (size 256): comm "tc", pid 5617, jiffies 4298118009 (age 57.990s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 1c e8 b0 ff ff ff ff ................ 81 24 c2 ad ff ff ff ff 00 00 00 00 00 00 00 00 .$.............. backtrace: [<00000000b7afbf4e>] tunnel_key_init+0x8a5/0x1800 [act_tunnel_key] [<000000007d98fccd>] tcf_action_init_1+0x698/0xac0 [<0000000099b8f7cc>] tcf_action_init+0x15c/0x590 [<00000000dc60eebe>] tc_ctl_action+0x336/0x5c2 [<000000002f5a2f7d>] rtnetlink_rcv_msg+0x357/0x8e0 [<000000000bfe7575>] netlink_rcv_skb+0x124/0x350 [<00000000edab656f>] netlink_unicast+0x40f/0x5d0 [<00000000b322cdcb>] netlink_sendmsg+0x6e8/0xba0 [<0000000063d9d490>] sock_sendmsg+0xb3/0xf0 [<00000000f0d3315a>] ___sys_sendmsg+0x654/0x960 [<00000000c06cbd42>] __sys_sendmsg+0xd3/0x170 [<00000000ce72e4b0>] do_syscall_64+0xa5/0x470 [<000000005caa2d97>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<00000000fac1b476>] 0xffffffffffffffff This problem theoretically happens also in case users attempt to setup a geneve rule having wrong configuration data, or when the kernel fails to allocate 'params_new'. Ensure that tunnel_key_init() releases the tunnel metadata also in the above conditions. Addresses-Coverity-ID: 1373974 ("Resource leak") Fixes: `d0f6dd8a91` ("net/sched: Introduce act_tunnel_key") Fixes: `0ed5269f9e` ("net/sched: add tunnel option support to act_tunnel_key") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-05 22:18:54 -07:00
David S. Miller	36302685f5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-09-04 21:33:03 -07:00
Vlad Buslov	84cb8eb26c	net: sched: action_ife: take reference to meta module Recent refactoring of add_metainfo() caused use_all_metadata() to add metainfo to ife action metalist without taking reference to module. This causes warning in module_put called from ife action cleanup function. Implement add_metainfo_and_get_ops() function that returns with reference to module taken if metainfo was added successfully, and call it from use_all_metadata(), instead of calling __add_metainfo() directly. Example warning: [ 646.344393] WARNING: CPU: 1 PID: 2278 at kernel/module.c:1139 module_put+0x1cb/0x230 [ 646.352437] Modules linked in: act_meta_skbtcindex act_meta_mark act_meta_skbprio act_ife ife veth nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c tun ebtable_filter ebtables ip6table_filter ip6_tables bridge stp llc mlx5_ib ib_uverbs ib_core intel_rapl sb_edac x86_pkg_temp_thermal mlx5_core coretemp kvm_intel kvm nfsd igb irqbypass crct10dif_pclmul devlink crc32_pclmul mei_me joydev ses crc32c_intel enclosure auth_rpcgss i2c_algo_bit ioatdma ptp mei pps_core ghash_clmulni_intel iTCO_wdt iTCO_vendor_support pcspkr dca ipmi_ssif lpc_ich target_core_mod i2c_i801 ipmi_si ipmi_devintf pcc_cpufreq wmi ipmi_msghandler nfs_acl lockd acpi_pad acpi_power_meter grace sunrpc mpt3sas raid_class scsi_transport_sas [ 646.425631] CPU: 1 PID: 2278 Comm: tc Not tainted 4.19.0-rc1+ #799 [ 646.432187] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 646.440595] RIP: 0010:module_put+0x1cb/0x230 [ 646.445238] Code: f3 66 94 02 e8 26 ff fa ff 85 c0 74 11 0f b6 1d 51 30 94 02 80 fb 01 77 60 83 e3 01 74 13 65 ff 0d 3a 83 db 73 e9 2b ff ff ff <0f> 0b e9 00 ff ff ff e8 59 01 fb ff 85 c0 75 e4 48 c7 c2 20 62 6b [ 646.464997] RSP: 0018:ffff880354d37068 EFLAGS: 00010286 [ 646.470599] RAX: 0000000000000000 RBX: ffffffffc0a52518 RCX: ffffffff8c2668db [ 646.478118] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffc0a52518 [ 646.485641] RBP: ffffffffc0a52180 R08: fffffbfff814a4a4 R09: fffffbfff814a4a3 [ 646.493164] R10: ffffffffc0a5251b R11: fffffbfff814a4a4 R12: 1ffff1006a9a6e0d [ 646.500687] R13: 00000000ffffffff R14: ffff880362bab890 R15: dead000000000100 [ 646.508213] FS: 00007f4164c99800(0000) GS:ffff88036fe40000(0000) knlGS:0000000000000000 [ 646.516961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 646.523080] CR2: 00007f41638b8420 CR3: 0000000351df0004 CR4: 00000000001606e0 [ 646.530595] Call Trace: [ 646.533408] ? find_symbol_in_section+0x260/0x260 [ 646.538509] tcf_ife_cleanup+0x11b/0x200 [act_ife] [ 646.543695] tcf_action_cleanup+0x29/0xa0 [ 646.548078] __tcf_action_put+0x5a/0xb0 [ 646.552289] ? nla_put+0x65/0xe0 [ 646.555889] __tcf_idr_release+0x48/0x60 [ 646.560187] tcf_generic_walker+0x448/0x6b0 [ 646.564764] ? tcf_action_dump_1+0x450/0x450 [ 646.569411] ? __lock_is_held+0x84/0x110 [ 646.573720] ? tcf_ife_walker+0x10c/0x20f [act_ife] [ 646.578982] tca_action_gd+0x972/0xc40 [ 646.583129] ? tca_get_fill.constprop.17+0x250/0x250 [ 646.588471] ? mark_lock+0xcf/0x980 [ 646.592324] ? check_chain_key+0x140/0x1f0 [ 646.596832] ? debug_show_all_locks+0x240/0x240 [ 646.601839] ? memset+0x1f/0x40 [ 646.605350] ? nla_parse+0xca/0x1a0 [ 646.609217] tc_ctl_action+0x215/0x230 [ 646.613339] ? tcf_action_add+0x220/0x220 [ 646.617748] rtnetlink_rcv_msg+0x56a/0x6d0 [ 646.622227] ? rtnl_fdb_del+0x3f0/0x3f0 [ 646.626466] netlink_rcv_skb+0x18d/0x200 [ 646.630752] ? rtnl_fdb_del+0x3f0/0x3f0 [ 646.634959] ? netlink_ack+0x500/0x500 [ 646.639106] netlink_unicast+0x2d0/0x370 [ 646.643409] ? netlink_attachskb+0x340/0x340 [ 646.648050] ? _copy_from_iter_full+0xe9/0x3e0 [ 646.652870] ? import_iovec+0x11e/0x1c0 [ 646.657083] netlink_sendmsg+0x3b9/0x6a0 [ 646.661388] ? netlink_unicast+0x370/0x370 [ 646.665877] ? netlink_unicast+0x370/0x370 [ 646.670351] sock_sendmsg+0x6b/0x80 [ 646.674212] ___sys_sendmsg+0x4a1/0x520 [ 646.678443] ? copy_msghdr_from_user+0x210/0x210 [ 646.683463] ? lock_downgrade+0x320/0x320 [ 646.687849] ? debug_show_all_locks+0x240/0x240 [ 646.692760] ? do_raw_spin_unlock+0xa2/0x130 [ 646.697418] ? _raw_spin_unlock+0x24/0x30 [ 646.701798] ? __handle_mm_fault+0x1819/0x1c10 [ 646.706619] ? __pmd_alloc+0x320/0x320 [ 646.710738] ? debug_show_all_locks+0x240/0x240 [ 646.715649] ? restore_nameidata+0x7b/0xa0 [ 646.720117] ? check_chain_key+0x140/0x1f0 [ 646.724590] ? check_chain_key+0x140/0x1f0 [ 646.729070] ? __fget_light+0xbc/0xd0 [ 646.733121] ? __sys_sendmsg+0xd7/0x150 [ 646.737329] __sys_sendmsg+0xd7/0x150 [ 646.741359] ? __ia32_sys_shutdown+0x30/0x30 [ 646.746003] ? up_read+0x53/0x90 [ 646.749601] ? __do_page_fault+0x484/0x780 [ 646.754105] ? do_syscall_64+0x1e/0x2c0 [ 646.758320] do_syscall_64+0x72/0x2c0 [ 646.762353] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 646.767776] RIP: 0033:0x7f4163872150 [ 646.771713] Code: 8b 15 3c 7d 2b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 83 3d b9 d5 2b 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be cd 00 00 48 89 04 24 [ 646.791474] RSP: 002b:00007ffdef7d6b58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 646.799721] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 00007f4163872150 [ 646.807240] RDX: 0000000000000000 RSI: 00007ffdef7d6bd0 RDI: 0000000000000003 [ 646.814760] RBP: 000000005b8b9482 R08: 0000000000000001 R09: 0000000000000000 [ 646.822286] R10: 00000000000005e7 R11: 0000000000000246 R12: 00007ffdef7dad20 [ 646.829807] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000679bc0 [ 646.837360] irq event stamp: 6083 [ 646.841043] hardirqs last enabled at (6081): [<ffffffff8c220a7d>] __call_rcu+0x17d/0x500 [ 646.849882] hardirqs last disabled at (6083): [<ffffffff8c004f06>] trace_hardirqs_off_thunk+0x1a/0x1c [ 646.859775] softirqs last enabled at (5968): [<ffffffff8d4004a1>] __do_softirq+0x4a1/0x6ee [ 646.868784] softirqs last disabled at (6082): [<ffffffffc0a78759>] tcf_ife_cleanup+0x39/0x200 [act_ife] [ 646.878845] ---[ end trace b1b8c12ffe51e657 ]--- Fixes: `5ffe57da29` ("act_ife: fix a potential deadlock") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-04 12:20:21 -07:00
Cong Wang	6d784f1625	act_ife: fix a potential use-after-free Immediately after module_put(), user could delete this module, so e->ops could be already freed before we call e->ops->release(). Fix this by moving module_put() after ops->release(). Fixes: `ef6980b6be` ("introduce IFE action") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-04 12:18:25 -07:00
Vlad Buslov	c10bbfae3a	net: sched: null actions array pointer before releasing action Currently, tcf_action_delete() nulls actions array pointer after putting and deleting it. However, if tcf_idr_delete_index() returns an error, pointer to action is not set to null. That results it being released second time in error handling code of tca_action_gd(). Kasan error: [ 807.367755] ================================================================== [ 807.375844] BUG: KASAN: use-after-free in tc_setup_cb_call+0x14e/0x250 [ 807.382763] Read of size 8 at addr ffff88033e636000 by task tc/2732 [ 807.391289] CPU: 0 PID: 2732 Comm: tc Tainted: G W 4.19.0-rc1+ #799 [ 807.399542] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 807.407948] Call Trace: [ 807.410763] dump_stack+0x92/0xeb [ 807.414456] print_address_description+0x70/0x360 [ 807.419549] kasan_report+0x14d/0x300 [ 807.423582] ? tc_setup_cb_call+0x14e/0x250 [ 807.428150] tc_setup_cb_call+0x14e/0x250 [ 807.432539] ? nla_put+0x65/0xe0 [ 807.436146] fl_dump+0x394/0x3f0 [cls_flower] [ 807.440890] ? fl_tmplt_dump+0x140/0x140 [cls_flower] [ 807.446327] ? lock_downgrade+0x320/0x320 [ 807.450702] ? lock_acquire+0xe2/0x220 [ 807.454819] ? is_bpf_text_address+0x5/0x140 [ 807.459475] ? memcpy+0x34/0x50 [ 807.462980] ? nla_put+0x65/0xe0 [ 807.466582] tcf_fill_node+0x341/0x430 [ 807.470717] ? tcf_block_put+0xe0/0xe0 [ 807.474859] tcf_node_dump+0xdb/0xf0 [ 807.478821] fl_walk+0x8e/0x170 [cls_flower] [ 807.483474] tcf_chain_dump+0x35a/0x4d0 [ 807.487703] ? tfilter_notify+0x170/0x170 [ 807.492091] ? tcf_fill_node+0x430/0x430 [ 807.496411] tc_dump_tfilter+0x362/0x3f0 [ 807.500712] ? tc_del_tfilter+0x850/0x850 [ 807.505104] ? kasan_unpoison_shadow+0x30/0x40 [ 807.509940] ? __mutex_unlock_slowpath+0xcf/0x410 [ 807.515031] netlink_dump+0x263/0x4f0 [ 807.519077] __netlink_dump_start+0x2a0/0x300 [ 807.523817] ? tc_del_tfilter+0x850/0x850 [ 807.528198] rtnetlink_rcv_msg+0x46a/0x6d0 [ 807.532671] ? rtnl_fdb_del+0x3f0/0x3f0 [ 807.536878] ? tc_del_tfilter+0x850/0x850 [ 807.541280] netlink_rcv_skb+0x18d/0x200 [ 807.545570] ? rtnl_fdb_del+0x3f0/0x3f0 [ 807.549773] ? netlink_ack+0x500/0x500 [ 807.553913] netlink_unicast+0x2d0/0x370 [ 807.558212] ? netlink_attachskb+0x340/0x340 [ 807.562855] ? _copy_from_iter_full+0xe9/0x3e0 [ 807.567677] ? import_iovec+0x11e/0x1c0 [ 807.571890] netlink_sendmsg+0x3b9/0x6a0 [ 807.576192] ? netlink_unicast+0x370/0x370 [ 807.580684] ? netlink_unicast+0x370/0x370 [ 807.585154] sock_sendmsg+0x6b/0x80 [ 807.589015] ___sys_sendmsg+0x4a1/0x520 [ 807.593230] ? copy_msghdr_from_user+0x210/0x210 [ 807.598232] ? do_wp_page+0x174/0x880 [ 807.602276] ? __handle_mm_fault+0x749/0x1c10 [ 807.607021] ? __handle_mm_fault+0x1046/0x1c10 [ 807.611849] ? __pmd_alloc+0x320/0x320 [ 807.615973] ? check_chain_key+0x140/0x1f0 [ 807.620450] ? check_chain_key+0x140/0x1f0 [ 807.624929] ? __fget_light+0xbc/0xd0 [ 807.628970] ? __sys_sendmsg+0xd7/0x150 [ 807.633172] __sys_sendmsg+0xd7/0x150 [ 807.637201] ? __ia32_sys_shutdown+0x30/0x30 [ 807.641846] ? up_read+0x53/0x90 [ 807.645442] ? __do_page_fault+0x484/0x780 [ 807.649949] ? do_syscall_64+0x1e/0x2c0 [ 807.654164] do_syscall_64+0x72/0x2c0 [ 807.658198] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 807.663625] RIP: 0033:0x7f42e9870150 [ 807.667568] Code: 8b 15 3c 7d 2b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 83 3d b9 d5 2b 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be cd 00 00 48 89 04 24 [ 807.687328] RSP: 002b:00007ffdbf595b58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 807.695564] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f42e9870150 [ 807.703083] RDX: 0000000000000000 RSI: 00007ffdbf595b80 RDI: 0000000000000003 [ 807.710605] RBP: 00007ffdbf599d90 R08: 0000000000679bc0 R09: 000000000000000f [ 807.718127] R10: 00000000000005e7 R11: 0000000000000246 R12: 00007ffdbf599d88 [ 807.725651] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 807.735048] Allocated by task 2687: [ 807.738902] kasan_kmalloc+0xa0/0xd0 [ 807.742852] __kmalloc+0x118/0x2d0 [ 807.746615] tcf_idr_create+0x44/0x320 [ 807.750738] tcf_nat_init+0x41e/0x530 [act_nat] [ 807.755638] tcf_action_init_1+0x4e0/0x650 [ 807.760104] tcf_action_init+0x1ce/0x2d0 [ 807.764395] tcf_exts_validate+0x1d8/0x200 [ 807.768861] fl_change+0x55a/0x26b4 [cls_flower] [ 807.773845] tc_new_tfilter+0x748/0xa20 [ 807.778051] rtnetlink_rcv_msg+0x56a/0x6d0 [ 807.782517] netlink_rcv_skb+0x18d/0x200 [ 807.786804] netlink_unicast+0x2d0/0x370 [ 807.791095] netlink_sendmsg+0x3b9/0x6a0 [ 807.795387] sock_sendmsg+0x6b/0x80 [ 807.799240] ___sys_sendmsg+0x4a1/0x520 [ 807.803445] __sys_sendmsg+0xd7/0x150 [ 807.807473] do_syscall_64+0x72/0x2c0 [ 807.811506] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 807.818776] Freed by task 2728: [ 807.822283] __kasan_slab_free+0x122/0x180 [ 807.826752] kfree+0xf4/0x2f0 [ 807.830080] __tcf_action_put+0x5a/0xb0 [ 807.834281] tcf_action_put_many+0x46/0x70 [ 807.838747] tca_action_gd+0x232/0xc40 [ 807.842862] tc_ctl_action+0x215/0x230 [ 807.846977] rtnetlink_rcv_msg+0x56a/0x6d0 [ 807.851444] netlink_rcv_skb+0x18d/0x200 [ 807.855731] netlink_unicast+0x2d0/0x370 [ 807.860021] netlink_sendmsg+0x3b9/0x6a0 [ 807.864312] sock_sendmsg+0x6b/0x80 [ 807.868166] ___sys_sendmsg+0x4a1/0x520 [ 807.872372] __sys_sendmsg+0xd7/0x150 [ 807.876401] do_syscall_64+0x72/0x2c0 [ 807.880431] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 807.887704] The buggy address belongs to the object at ffff88033e636000 which belongs to the cache kmalloc-256 of size 256 [ 807.900909] The buggy address is located 0 bytes inside of 256-byte region [ffff88033e636000, ffff88033e636100) [ 807.913155] The buggy address belongs to the page: [ 807.918322] page:ffffea000cf98d80 count:1 mapcount:0 mapping:ffff88036f80ee00 index:0x0 compound_mapcount: 0 [ 807.928831] flags: 0x5fff8000008100(slab\|head) [ 807.933647] raw: 005fff8000008100 ffffea000db44f00 0000000400000004 ffff88036f80ee00 [ 807.942050] raw: 0000000000000000 0000000080190019 00000001ffffffff 0000000000000000 [ 807.950456] page dumped because: kasan: bad access detected [ 807.958240] Memory state around the buggy address: [ 807.963405] ffff88033e635f00: fc fc fc fc fb fb fb fb fb fb fb fc fc fc fc fb [ 807.971288] ffff88033e635f80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc [ 807.979166] >ffff88033e636000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 807.994882] ^ [ 807.998477] ffff88033e636080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 808.006352] ffff88033e636100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 808.014230] ================================================================== [ 808.022108] Disabling lock debugging due to kernel taint Fixes: `edfaf94fa7` ("net_sched: improve and refactor tcf_action_put_many()") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-09-03 21:47:33 -07:00
Cong Wang	506a03aa04	net_sched: add missing tcf_lock for act_connmark According to the new locking rule, we have to take tcf_lock for both ->init() and ->dump(), as RTNL will be removed. However, it is missing for act_connmark. Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-31 22:57:43 -07:00
Cong Wang	f061b48c17	Revert "net: sched: act: add extack for lookup callback" This reverts commit `331a9295de` ("net: sched: act: add extack for lookup callback"). This extack is never used after 6 months... In fact, it can be just set in the caller, right after ->lookup(). Cc: Alexander Aring <aring@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-31 22:50:15 -07:00
Paolo Abeni	97763dc0f4	net_sched: reject unknown tcfa_action values After the commit `802bfb1915` ("net/sched: user-space can't set unknown tcfa_action values"), unknown tcfa_action values are converted to TC_ACT_UNSPEC, but the common agreement is instead rejecting such configurations. This change also introduces a helper to simplify the destruction of a single action, avoiding code duplication. v1 -> v2: - helper is now static and renamed according to act_* convention - updated extack message, according to the new behavior Fixes: `802bfb1915` ("net/sched: user-space can't set unknown tcfa_action values") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-29 22:10:59 -07:00
Davide Caratti	85eb9af182	net/sched: act_pedit: fix dump of extended layered op in the (rare) case of failure in nla_nest_start(), missing NULL checks in tcf_pedit_key_ex_dump() can make the following command # tc action add action pedit ex munge ip ttl set 64 dereference a NULL pointer: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 PGD 800000007d1cd067 P4D 800000007d1cd067 PUD 7acd3067 PMD 0 Oops: 0002 [#1] SMP PTI CPU: 0 PID: 3336 Comm: tc Tainted: G E 4.18.0.pedit+ #425 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:tcf_pedit_dump+0x19d/0x358 [act_pedit] Code: be 02 00 00 00 48 89 df 66 89 44 24 20 e8 9b b1 fd e0 85 c0 75 46 8b 83 c8 00 00 00 49 83 c5 08 48 03 83 d0 00 00 00 4d 39 f5 <66> 89 04 25 00 00 00 00 0f 84 81 01 00 00 41 8b 45 00 48 8d 4c 24 RSP: 0018:ffffb5d4004478a8 EFLAGS: 00010246 RAX: ffff8880fcda2070 RBX: ffff8880fadd2900 RCX: 0000000000000000 RDX: 0000000000000002 RSI: ffffb5d4004478ca RDI: ffff8880fcda206e RBP: ffff8880fb9cb900 R08: 0000000000000008 R09: ffff8880fcda206e R10: ffff8880fadd2900 R11: 0000000000000000 R12: ffff8880fd26cf40 R13: ffff8880fc957430 R14: ffff8880fc957430 R15: ffff8880fb9cb988 FS: 00007f75a537a740(0000) GS:ffff8880fda00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000007a2fa005 CR4: 00000000001606f0 Call Trace: ? __nla_reserve+0x38/0x50 tcf_action_dump_1+0xd2/0x130 tcf_action_dump+0x6a/0xf0 tca_get_fill.constprop.31+0xa3/0x120 tcf_action_add+0xd1/0x170 tc_ctl_action+0x137/0x150 rtnetlink_rcv_msg+0x263/0x2d0 ? _cond_resched+0x15/0x40 ? rtnl_calcit.isra.30+0x110/0x110 netlink_rcv_skb+0x4d/0x130 netlink_unicast+0x1a3/0x250 netlink_sendmsg+0x2ae/0x3a0 sock_sendmsg+0x36/0x40 ___sys_sendmsg+0x26f/0x2d0 ? do_wp_page+0x8e/0x5f0 ? handle_pte_fault+0x6c3/0xf50 ? __handle_mm_fault+0x38e/0x520 ? __sys_sendmsg+0x5e/0xa0 __sys_sendmsg+0x5e/0xa0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f75a4583ba0 Code: c3 48 8b 05 f2 62 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d fd c3 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24 RSP: 002b:00007fff60ee7418 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fff60ee7540 RCX: 00007f75a4583ba0 RDX: 0000000000000000 RSI: 00007fff60ee7490 RDI: 0000000000000003 RBP: 000000005b842d3e R08: 0000000000000002 R09: 0000000000000000 R10: 00007fff60ee6ea0 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fff60ee7554 R14: 0000000000000001 R15: 000000000066c100 Modules linked in: act_pedit(E) ip6table_filter ip6_tables iptable_filter binfmt_misc crct10dif_pclmul ext4 crc32_pclmul mbcache ghash_clmulni_intel jbd2 pcbc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer cryptd glue_helper snd joydev pcspkr soundcore virtio_balloon i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk virtio_console failover qxl crc32c_intel drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_pedit] CR2: 0000000000000000 Like it's done for other TC actions, give up dumping pedit rules and return an error if nla_nest_start() returns NULL. Fixes: `71d0ed7079` ("net/act_pedit: Support using offset relative to the conventional network headers") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-29 18:11:05 -07:00
Jiri Pirko	b7b4247d55	net: sched: return -ENOENT when trying to remove filter from non-existent chain When chain 0 was implicitly created, removal of non-existent filter from chain 0 gave -ENOENT. Once chain 0 became non-implicit, the same call is giving -EINVAL. Fix this by returning -ENOENT in that case. Reported-by: Roman Mashak <mrv@mojatatu.com> Fixes: `f71e0ca4db` ("net: sched: Avoid implicit chain 0 creation") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-27 15:16:16 -07:00
Jiri Pirko	d5ed72a55b	net: sched: fix extack error message when chain is failed to be created Instead "Cannot find" say "Cannot create". Fixes: `c35a4acc29` ("net: sched: cls_api: handle generic cls errors") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-27 15:16:16 -07:00
Kees Cook	98c8f125fd	net: sched: Fix memory exposure from short TCA_U32_SEL Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink policy, so max length isn't enforced, only minimum. This means nkeys (from userspace) was being trusted without checking the actual size of nla_len(), which could lead to a memory over-read, and ultimately an exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within a namespace. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-26 14:21:50 -07:00
Toke Høiland-Jørgensen	93cfb6c176	sch_cake: Fix TC filter flow override and expand it to hosts as well The TC filter flow mapping override completely skipped the call to cake_hash(); however that meant that the internal state was not being updated, which ultimately leads to deadlocks in some configurations. Fix that by passing the overridden flow ID into cake_hash() instead so it can react appropriately. In addition, the major number of the class ID can now be set to override the host mapping in host isolation mode. If both host and flow are overridden (or if the respective modes are disabled), flow dissection and hashing will be skipped entirely; otherwise, the hashing will be kept for the portions that are not set by the filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-22 21:39:45 -07:00
Cong Wang	5ffe57da29	act_ife: fix a potential deadlock use_all_metadata() acquires read_lock(&ife_mod_lock), then calls add_metainfo() which calls find_ife_oplist() which acquires the same lock again. Deadlock! Introduce __add_metainfo() which accepts struct tcf_meta_ops *ops as an additional parameter and let its callers to decide how to find it. For use_all_metadata(), it already has ops, no need to find it again, just call __add_metainfo() directly. And, as ife_mod_lock is only needed for find_ife_oplist(), this means we can make non-atomic allocation for populate_metalist() now. Fixes: `817e9f2c5c` ("act_ife: acquire ife_mod_lock before reading ifeoplist") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:45 -07:00
Cong Wang	4e407ff5cd	act_ife: move tcfa_lock down to where necessary The only time we need to take tcfa_lock is when adding a new metainfo to an existing ife->metalist. We don't need to take tcfa_lock so early and so broadly in tcf_ife_init(). This means we can always take ife_mod_lock first, avoid the reverse locking ordering warning as reported by Vlad. Reported-by: Vlad Buslov <vladbu@mellanox.com> Tested-by: Vlad Buslov <vladbu@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:45 -07:00
Cong Wang	8ce5be1c89	Revert "net: sched: act_ife: disable bh when taking ife_mod_lock" This reverts commit `42c625a486` ("net: sched: act_ife: disable bh when taking ife_mod_lock"), because what ife_mod_lock protects is absolutely not touched in rate est timer BH context, they have no race. A better fix is following up. Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:45 -07:00
Cong Wang	244cd96adb	net_sched: remove list_head from tc_action After commit `90b73b77d0`, list_head is no longer needed. Now we just need to convert the list iteration to array iteration for drivers. Fixes: `90b73b77d0` ("net: sched: change action API to use array of pointers to actions") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:44 -07:00
Cong Wang	7d485c451f	net_sched: remove unused tcf_idr_check() tcf_idr_check() is replaced by tcf_idr_check_alloc(), and __tcf_idr_check() now can be folded into tcf_idr_search(). Fixes: `0190c1d452` ("net: sched: atomically check-allocate action") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:44 -07:00
Cong Wang	b144e7ec51	net_sched: remove unused parameter for tcf_action_delete() Fixes: `16af606739` ("net: sched: implement reference counted action release") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:44 -07:00
Cong Wang	97a3f84f2c	net_sched: remove unnecessary ops->delete() All ops->delete() wants is getting the tn->idrinfo, but we already have tc_action before calling ops->delete(), and tc_action has a pointer ->idrinfo. More importantly, each type of action does the same thing, that is, just calling tcf_idr_delete_index(). So it can be just removed. Fixes: `b409074e66` ("net: sched: add 'delete' function to action ops") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:44 -07:00
Cong Wang	edfaf94fa7	net_sched: improve and refactor tcf_action_put_many() tcf_action_put_many() is mostly called to clean up actions on failure path, but tcf_action_put_many(&actions[acts_deleted]) is used in the ugliest way: it passes a slice of the array and uses an additional NULL at the end to avoid out-of-bound access. acts_deleted is completely unnecessary since we can teach tcf_action_put_many() scan the whole array and checks against NULL pointer. Which also means tcf_action_delete() should set deleted action pointers to NULL to avoid double free. Fixes: `90b73b77d0` ("net: sched: change action API to use array of pointers to actions") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 12:45:44 -07:00
Yue Haibing	093dee661d	sch_cake: Remove unused including <linux/version.h> Remove including <linux/version.h> that don't need it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-21 09:50:53 -07:00
Vlad Buslov	653cd284a8	net: sched: always disable bh when taking tcf_lock Recently, ops->init() and ops->dump() of all actions were modified to always obtain tcf_lock when accessing private action state. Actions that don't depend on tcf_lock for synchronization with their data path use non-bh locking API. However, tcf_lock is also used to protect rate estimator stats in softirq context by timer callback. Change ops->init() and ops->dump() of all actions to disable bh when using tcf_lock to prevent deadlock reported by following lockdep warning: [ 105.470398] ================================ [ 105.475014] WARNING: inconsistent lock state [ 105.479628] 4.18.0-rc8+ #664 Not tainted [ 105.483897] -------------------------------- [ 105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0 [ 105.509696] {SOFTIRQ-ON-W} state was registered at: [ 105.514925] _raw_spin_lock+0x2c/0x40 [ 105.519022] tcf_bpf_init+0x579/0x820 [act_bpf] [ 105.523990] tcf_action_init_1+0x4e4/0x660 [ 105.528518] tcf_action_init+0x1ce/0x2d0 [ 105.532880] tcf_exts_validate+0x1d8/0x200 [ 105.537416] fl_change+0x55a/0x268b [cls_flower] [ 105.542469] tc_new_tfilter+0x748/0xa20 [ 105.546738] rtnetlink_rcv_msg+0x56a/0x6d0 [ 105.551268] netlink_rcv_skb+0x18d/0x200 [ 105.555628] netlink_unicast+0x2d0/0x370 [ 105.559990] netlink_sendmsg+0x3b9/0x6a0 [ 105.564349] sock_sendmsg+0x6b/0x80 [ 105.568271] ___sys_sendmsg+0x4a1/0x520 [ 105.572547] __sys_sendmsg+0xd7/0x150 [ 105.576655] do_syscall_64+0x72/0x2c0 [ 105.580757] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 105.586243] irq event stamp: 489296 [ 105.590084] hardirqs last enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40 [ 105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50 [ 105.609277] softirqs last enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0 [ 105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190 [ 105.626813] other info that might help us debug this: [ 105.633976] Possible unsafe locking scenario: [ 105.640526] CPU0 [ 105.643325] ---- [ 105.646125] lock(&(&p->tcfa_lock)->rlock); [ 105.650747] <Interrupt> [ 105.653717] lock(&(&p->tcfa_lock)->rlock); [ 105.658514] * DEADLOCK * [ 105.665349] 1 lock held by swapper/16/0: [ 105.669629] #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550 [ 105.678200] stack backtrace: [ 105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664 [ 105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 105.698626] Call Trace: [ 105.701421] <IRQ> [ 105.703791] dump_stack+0x92/0xeb [ 105.707461] print_usage_bug+0x336/0x34c [ 105.711744] mark_lock+0x7c9/0x980 [ 105.715500] ? print_shortest_lock_dependencies+0x2e0/0x2e0 [ 105.721424] ? check_usage_forwards+0x230/0x230 [ 105.726315] __lock_acquire+0x923/0x26f0 [ 105.730597] ? debug_show_all_locks+0x240/0x240 [ 105.735478] ? mark_lock+0x493/0x980 [ 105.739412] ? check_chain_key+0x140/0x1f0 [ 105.743861] ? __lock_acquire+0x836/0x26f0 [ 105.748323] ? lock_acquire+0x12e/0x290 [ 105.752516] lock_acquire+0x12e/0x290 [ 105.756539] ? est_fetch_counters+0x3c/0xa0 [ 105.761084] _raw_spin_lock+0x2c/0x40 [ 105.765099] ? est_fetch_counters+0x3c/0xa0 [ 105.769633] est_fetch_counters+0x3c/0xa0 [ 105.773995] est_timer+0x87/0x390 [ 105.777670] ? est_fetch_counters+0xa0/0xa0 [ 105.782210] ? lock_acquire+0x12e/0x290 [ 105.786410] call_timer_fn+0x161/0x550 [ 105.790512] ? est_fetch_counters+0xa0/0xa0 [ 105.795055] ? del_timer_sync+0xd0/0xd0 [ 105.799249] ? __lock_is_held+0x93/0x110 [ 105.803531] ? mark_held_locks+0x20/0xe0 [ 105.807813] ? _raw_spin_unlock_irq+0x29/0x40 [ 105.812525] ? est_fetch_counters+0xa0/0xa0 [ 105.817069] ? est_fetch_counters+0xa0/0xa0 [ 105.821610] run_timer_softirq+0x3c4/0x9f0 [ 105.826064] ? lock_acquire+0x12e/0x290 [ 105.830257] ? __bpf_trace_timer_class+0x10/0x10 [ 105.835237] ? __lock_is_held+0x25/0x110 [ 105.839517] __do_softirq+0x11d/0x7bf [ 105.843542] irq_exit+0x140/0x190 [ 105.847208] smp_apic_timer_interrupt+0xac/0x3b0 [ 105.852182] apic_timer_interrupt+0xf/0x20 [ 105.856628] </IRQ> [ 105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0 [ 105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53 [ 105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1 [ 105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac [ 105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000 [ 105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [ 105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b [ 105.929988] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.935232] do_idle+0x28e/0x320 [ 105.938817] ? arch_cpu_idle_exit+0x40/0x40 [ 105.943361] ? mark_lock+0x8c1/0x980 [ 105.947295] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.952619] cpu_startup_entry+0xc2/0xd0 [ 105.956900] ? cpu_in_idle+0x20/0x20 [ 105.960830] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.966146] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.971391] start_secondary+0x2b5/0x360 [ 105.975669] ? set_cpu_sibling_map+0x1330/0x1330 [ 105.980654] secondary_startup_64+0xa5/0xb0 Taking tcf_lock in sample action with bh disabled causes lockdep to issue a warning regarding possible irq lock inversion dependency between tcf_lock, and psample_groups_lock that is taken when holding tcf_lock in sample init: [ 162.108959] Possible interrupt unsafe locking scenario: [ 162.116386] CPU0 CPU1 [ 162.121277] ---- ---- [ 162.126162] lock(psample_groups_lock); [ 162.130447] local_irq_disable(); [ 162.136772] lock(&(&p->tcfa_lock)->rlock); [ 162.143957] lock(psample_groups_lock); [ 162.150813] <Interrupt> [ 162.153808] lock(&(&p->tcfa_lock)->rlock); [ 162.158608] * DEADLOCK * In order to prevent potential lock inversion dependency between tcf_lock and psample_groups_lock, extract call to psample_group_get() from tcf_lock protected section in sample action init function. Fixes: `4e232818bd` ("net: sched: act_mirred: remove dependency on rtnl lock") Fixes: `764e9a2448` ("net: sched: act_vlan: remove dependency on rtnl lock") Fixes: `729e012609` ("net: sched: act_tunnel_key: remove dependency on rtnl lock") Fixes: `d772849566` ("net: sched: act_sample: remove dependency on rtnl lock") Fixes: `e8917f4370` ("net: sched: act_gact: remove dependency on rtnl lock") Fixes: `b6a2b971c0` ("net: sched: act_csum: remove dependency on rtnl lock") Fixes: `2142236b45` ("net: sched: act_bpf: remove dependency on rtnl lock") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-19 10:46:21 -07:00
Vlad Buslov	32039eac4c	net: sched: act_ife: always release ife action on init error Action init API was changed to always take reference to action, even when overwriting existing action. Substitute conditional action release, which was executed only if action is newly created, with unconditional release in tcf_ife_init() error handling code to prevent double free or memory leak in case of overwrite. Fixes: `4e8ddd7f17` ("net: sched: don't release reference on action overwrite") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-16 12:12:12 -07:00
Hangbin Liu	a51c76b4df	cls_matchall: fix tcf_unbind_filter missing Fix tcf_unbind_filter missing in cls_matchall as this will trigger WARN_ON() in cbq_destroy_class(). Fixes: `fd62d9f5c5` ("net/sched: matchall: Fix configuration race") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-16 12:08:26 -07:00
Hangbin Liu	008369dcc5	net_sched: Fix missing res info when create new tc_index filter Li Shuang reported the following warn: [ 733.484610] WARNING: CPU: 6 PID: 21123 at net/sched/sch_cbq.c:1418 cbq_destroy_class+0x5d/0x70 [sch_cbq] [ 733.495190] Modules linked in: sch_cbq cls_tcindex sch_dsmark rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat l [ 733.574155] syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ixgbe ahci libahci i2c_algo_bit libata i40e i2c_core dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [ 733.592500] CPU: 6 PID: 21123 Comm: tc Not tainted 4.18.0-rc8.latest+ #131 [ 733.600169] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016 [ 733.608518] RIP: 0010:cbq_destroy_class+0x5d/0x70 [sch_cbq] [ 733.614734] Code: e7 d9 d2 48 8b 7b 48 e8 61 05 da d2 48 8d bb f8 00 00 00 e8 75 ae d5 d2 48 39 eb 74 0a 48 89 df 5b 5d e9 16 6c 94 d2 5b 5d c3 <0f> 0b eb b6 0f 1f 44 00 00 66 2e 0f 1f 84 [ 733.635798] RSP: 0018:ffffbfbb066bb9d8 EFLAGS: 00010202 [ 733.641627] RAX: 0000000000000001 RBX: ffff9cdd17392800 RCX: 000000008010000f [ 733.649588] RDX: ffff9cdd1df547e0 RSI: ffff9cdd17392800 RDI: ffff9cdd0f84c800 [ 733.657547] RBP: ffff9cdd0f84c800 R08: 0000000000000001 R09: 0000000000000000 [ 733.665508] R10: ffff9cdd0f84d000 R11: 0000000000000001 R12: 0000000000000001 [ 733.673469] R13: 0000000000000000 R14: 0000000000000001 R15: ffff9cdd17392200 [ 733.681430] FS: 00007f911890a740(0000) GS:ffff9cdd1f8c0000(0000) knlGS:0000000000000000 [ 733.690456] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 733.696864] CR2: 0000000000b5544c CR3: 0000000859374002 CR4: 00000000001606e0 [ 733.704826] Call Trace: [ 733.707554] cbq_destroy+0xa1/0xd0 [sch_cbq] [ 733.712318] qdisc_destroy+0x62/0x130 [ 733.716401] dsmark_destroy+0x2a/0x70 [sch_dsmark] [ 733.721745] qdisc_destroy+0x62/0x130 [ 733.725829] qdisc_graft+0x3ba/0x470 [ 733.729817] tc_get_qdisc+0x2a6/0x2c0 [ 733.733901] ? cred_has_capability+0x7d/0x130 [ 733.738761] rtnetlink_rcv_msg+0x263/0x2d0 [ 733.743330] ? rtnl_calcit.isra.30+0x110/0x110 [ 733.748287] netlink_rcv_skb+0x4d/0x130 [ 733.752576] netlink_unicast+0x1a3/0x250 [ 733.756949] netlink_sendmsg+0x2ae/0x3a0 [ 733.761324] sock_sendmsg+0x36/0x40 [ 733.765213] ___sys_sendmsg+0x26f/0x2d0 [ 733.769493] ? handle_pte_fault+0x586/0xdf0 [ 733.774158] ? __handle_mm_fault+0x389/0x500 [ 733.778919] ? __sys_sendmsg+0x5e/0xa0 [ 733.783099] __sys_sendmsg+0x5e/0xa0 [ 733.787087] do_syscall_64+0x5b/0x180 [ 733.791171] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 733.796805] RIP: 0033:0x7f9117f23f10 [ 733.800791] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 [ 733.821873] RSP: 002b:00007ffe96818398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 733.830319] RAX: ffffffffffffffda RBX: 000000005b71244c RCX: 00007f9117f23f10 [ 733.838280] RDX: 0000000000000000 RSI: 00007ffe968183e0 RDI: 0000000000000003 [ 733.846241] RBP: 00007ffe968183e0 R08: 000000000000ffff R09: 0000000000000003 [ 733.854202] R10: 00007ffe96817e20 R11: 0000000000000246 R12: 0000000000000000 [ 733.862161] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000 [ 733.870121] ---[ end trace 28edd4aad712ddca ]--- This is because we didn't update f->result.res when create new filter. Then in tcindex_delete() -> tcf_unbind_filter(), we will failed to find out the res and unbind filter, which will trigger the WARN_ON() in cbq_destroy_class(). Fix it by updating f->result.res when create new filter. Fixes: `6e0565697a` ("net_sched: fix another crash in cls_tcindex") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 19:37:42 -07:00
Hangbin Liu	2df8bee565	net_sched: fix NULL pointer dereference when delete tcindex filter Li Shuang reported the following crash: [ 71.267724] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 [ 71.276456] PGD 800000085d9bd067 P4D 800000085d9bd067 PUD 859a0b067 PMD 0 [ 71.284127] Oops: 0000 [#1] SMP PTI [ 71.288015] CPU: 12 PID: 2386 Comm: tc Not tainted 4.18.0-rc8.latest+ #131 [ 71.295686] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016 [ 71.304037] RIP: 0010:tcindex_delete+0x72/0x280 [cls_tcindex] [ 71.310446] Code: 00 31 f6 48 87 75 20 48 85 f6 74 11 48 8b 47 18 48 8b 40 08 48 8b 40 50 e8 fb a6 f8 fc 48 85 db 0f 84 dc 00 00 00 48 8b 73 18 <8b> 56 04 48 8d 7e 04 85 d2 0f 84 7b 01 00 [ 71.331517] RSP: 0018:ffffb45207b3f898 EFLAGS: 00010282 [ 71.337345] RAX: ffff8ad3d72d6360 RBX: ffff8acc84393680 RCX: 000000000000002e [ 71.345306] RDX: ffff8ad3d72c8570 RSI: 0000000000000000 RDI: ffff8ad847a45800 [ 71.353277] RBP: ffff8acc84393688 R08: ffff8ad3d72c8400 R09: 0000000000000000 [ 71.361238] R10: ffff8ad3de786e00 R11: 0000000000000000 R12: ffffb45207b3f8c7 [ 71.369199] R13: ffff8ad3d93bd2a0 R14: 000000000000002e R15: ffff8ad3d72c9600 [ 71.377161] FS: 00007f9d3ec3e740(0000) GS:ffff8ad3df980000(0000) knlGS:0000000000000000 [ 71.386188] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.392597] CR2: 0000000000000004 CR3: 0000000852f06003 CR4: 00000000001606e0 [ 71.400558] Call Trace: [ 71.403299] tcindex_destroy_element+0x25/0x40 [cls_tcindex] [ 71.409611] tcindex_walk+0xbb/0x110 [cls_tcindex] [ 71.414953] tcindex_destroy+0x44/0x90 [cls_tcindex] [ 71.420492] ? tcindex_delete+0x280/0x280 [cls_tcindex] [ 71.426323] tcf_proto_destroy+0x16/0x40 [ 71.430696] tcf_chain_flush+0x51/0x70 [ 71.434876] tcf_block_put_ext.part.30+0x8f/0x1b0 [ 71.440122] tcf_block_put+0x4d/0x70 [ 71.444108] cbq_destroy+0x4d/0xd0 [sch_cbq] [ 71.448869] qdisc_destroy+0x62/0x130 [ 71.452951] dsmark_destroy+0x2a/0x70 [sch_dsmark] [ 71.458300] qdisc_destroy+0x62/0x130 [ 71.462373] qdisc_graft+0x3ba/0x470 [ 71.466359] tc_get_qdisc+0x2a6/0x2c0 [ 71.470443] ? cred_has_capability+0x7d/0x130 [ 71.475307] rtnetlink_rcv_msg+0x263/0x2d0 [ 71.479875] ? rtnl_calcit.isra.30+0x110/0x110 [ 71.484832] netlink_rcv_skb+0x4d/0x130 [ 71.489109] netlink_unicast+0x1a3/0x250 [ 71.493482] netlink_sendmsg+0x2ae/0x3a0 [ 71.497859] sock_sendmsg+0x36/0x40 [ 71.501748] ___sys_sendmsg+0x26f/0x2d0 [ 71.506029] ? handle_pte_fault+0x586/0xdf0 [ 71.510694] ? __handle_mm_fault+0x389/0x500 [ 71.515457] ? __sys_sendmsg+0x5e/0xa0 [ 71.519636] __sys_sendmsg+0x5e/0xa0 [ 71.523626] do_syscall_64+0x5b/0x180 [ 71.527711] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 71.533345] RIP: 0033:0x7f9d3e257f10 [ 71.537331] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 [ 71.558401] RSP: 002b:00007fff6f893398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 71.566848] RAX: ffffffffffffffda RBX: 000000005b71274d RCX: 00007f9d3e257f10 [ 71.574810] RDX: 0000000000000000 RSI: 00007fff6f8933e0 RDI: 0000000000000003 [ 71.582770] RBP: 00007fff6f8933e0 R08: 000000000000ffff R09: 0000000000000003 [ 71.590729] R10: 00007fff6f892e20 R11: 0000000000000246 R12: 0000000000000000 [ 71.598689] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000 [ 71.606651] Modules linked in: sch_cbq cls_tcindex sch_dsmark xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_coni [ 71.685425] libahci i2c_algo_bit i2c_core i40e libata dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [ 71.697075] CR2: 0000000000000004 [ 71.700792] ---[ end trace f604eb1acacd978b ]--- Reproducer: tc qdisc add dev lo handle 1:0 root dsmark indices 64 set_tc_index tc filter add dev lo parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 tc qdisc add dev lo parent 1:0 handle 2:0 cbq bandwidth 10Mbit cell 8 avpkt 1000 mpu 64 tc class add dev lo parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10 tc filter add dev lo parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on tc qdisc add dev lo parent 2:1 pfifo limit 5 tc qdisc del dev lo root This is because in tcindex_set_parms, when there is no old_r, we set new exts to cr.exts. And we didn't set it to filter when r == &new_filter_result. Then in tcindex_delete() -> tcf_exts_get_net(), we will get NULL pointer dereference as we didn't init exts. Fix it by moving tcf_exts_change() after "if (old_r && old_r != r)" check. Then we don't need "cr" as there is no errout after that. Fixes: `bf63ac73b3` ("net_sched: fix an oops in tcindex filter") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 19:37:42 -07:00
Vlad Buslov	42c625a486	net: sched: act_ife: disable bh when taking ife_mod_lock Lockdep reports deadlock for following locking scenario in ife action: Task one: 1) Executes ife action update. 2) Takes tcfa_lock. 3) Waits on ife_mod_lock which is already taken by task two. Task two: 1) Executes any path that obtains ife_mod_lock without disabling bh (any path that takes ife_mod_lock while holding tcfa_lock has bh disabled) like loading a meta module, or creating new action. 2) Takes ife_mod_lock. 3) Task is preempted by rate estimator timer. 4) Timer callback waits on tcfa_lock which is taken by task one. In described case tasks deadlock because they take same two locks in different order. To prevent potential deadlock reported by lockdep, always disable bh when obtaining ife_mod_lock. Lockdep warning: [ 508.101192] ===================================================== [ 508.107708] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected [ 508.114728] 4.18.0-rc8+ #646 Not tainted [ 508.119050] ----------------------------------------------------- [ 508.125559] tc/5460 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire: [ 508.132025] 000000005a938c68 (ife_mod_lock){++++}, at: find_ife_oplist+0x1e/0xc0 [act_ife] [ 508.140996] and this task is already holding: [ 508.147548] 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife] [ 508.157371] which would create a new lock dependency: [ 508.162828] (&(&p->tcfa_lock)->rlock){+.-.} -> (ife_mod_lock){++++} [ 508.169572] but this new dependency connects a SOFTIRQ-irq-safe lock: [ 508.178197] (&(&p->tcfa_lock)->rlock){+.-.} [ 508.178201] ... which became SOFTIRQ-irq-safe at: [ 508.189771] _raw_spin_lock+0x2c/0x40 [ 508.193906] est_fetch_counters+0x41/0xb0 [ 508.198391] est_timer+0x83/0x3c0 [ 508.202180] call_timer_fn+0x16a/0x5d0 [ 508.206400] run_timer_softirq+0x399/0x920 [ 508.210967] __do_softirq+0x157/0x97d [ 508.215102] irq_exit+0x152/0x1c0 [ 508.218888] smp_apic_timer_interrupt+0xc0/0x4e0 [ 508.223976] apic_timer_interrupt+0xf/0x20 [ 508.228540] cpuidle_enter_state+0xf8/0x5d0 [ 508.233198] do_idle+0x28a/0x350 [ 508.236881] cpu_startup_entry+0xc7/0xe0 [ 508.241296] start_secondary+0x2e8/0x3f0 [ 508.245678] secondary_startup_64+0xa5/0xb0 [ 508.250347] to a SOFTIRQ-irq-unsafe lock: (ife_mod_lock){++++} [ 508.256531] ... which became SOFTIRQ-irq-unsafe at: [ 508.267279] ... [ 508.267283] _raw_write_lock+0x2c/0x40 [ 508.273653] register_ife_op+0x118/0x2c0 [act_ife] [ 508.278926] do_one_initcall+0xf7/0x4d9 [ 508.283214] do_init_module+0x18b/0x44e [ 508.287521] load_module+0x4167/0x5730 [ 508.291739] __do_sys_finit_module+0x16d/0x1a0 [ 508.296654] do_syscall_64+0x7a/0x3f0 [ 508.300788] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 508.306302] other info that might help us debug this: [ 508.315286] Possible interrupt unsafe locking scenario: [ 508.322771] CPU0 CPU1 [ 508.327681] ---- ---- [ 508.332604] lock(ife_mod_lock); [ 508.336300] local_irq_disable(); [ 508.342608] lock(&(&p->tcfa_lock)->rlock); [ 508.349793] lock(ife_mod_lock); [ 508.355990] <Interrupt> [ 508.358974] lock(&(&p->tcfa_lock)->rlock); [ 508.363803] * DEADLOCK * [ 508.370715] 2 locks held by tc/5460: [ 508.374680] #0: 00000000e27e4fa4 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x583/0x7b0 [ 508.383366] #1: 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife] [ 508.393648] the dependencies between SOFTIRQ-irq-safe lock and the holding lock: [ 508.403505] -> (&(&p->tcfa_lock)->rlock){+.-.} ops: 1001553 { [ 508.409646] HARDIRQ-ON-W at: [ 508.413136] _raw_spin_lock_bh+0x34/0x40 [ 508.419059] gnet_stats_start_copy_compat+0xa2/0x230 [ 508.426021] gnet_stats_start_copy+0x16/0x20 [ 508.432333] tcf_action_copy_stats+0x95/0x1d0 [ 508.438735] tcf_action_dump_1+0xb0/0x4e0 [ 508.444795] tcf_action_dump+0xca/0x200 [ 508.450673] tcf_exts_dump+0xd9/0x320 [ 508.456392] fl_dump+0x1b7/0x4a0 [cls_flower] [ 508.462798] tcf_fill_node+0x380/0x530 [ 508.468601] tfilter_notify+0xdf/0x1c0 [ 508.474404] tc_new_tfilter+0x84a/0xc90 [ 508.480270] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 508.486419] netlink_rcv_skb+0x184/0x220 [ 508.492394] netlink_unicast+0x31b/0x460 [ 508.507411] netlink_sendmsg+0x3fb/0x840 [ 508.513390] sock_sendmsg+0x7b/0xd0 [ 508.518907] ___sys_sendmsg+0x4c6/0x610 [ 508.524797] __sys_sendmsg+0xd7/0x150 [ 508.530510] do_syscall_64+0x7a/0x3f0 [ 508.536201] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 508.543301] IN-SOFTIRQ-W at: [ 508.546834] _raw_spin_lock+0x2c/0x40 [ 508.552522] est_fetch_counters+0x41/0xb0 [ 508.558571] est_timer+0x83/0x3c0 [ 508.563912] call_timer_fn+0x16a/0x5d0 [ 508.569699] run_timer_softirq+0x399/0x920 [ 508.575840] __do_softirq+0x157/0x97d [ 508.581538] irq_exit+0x152/0x1c0 [ 508.586882] smp_apic_timer_interrupt+0xc0/0x4e0 [ 508.593533] apic_timer_interrupt+0xf/0x20 [ 508.599686] cpuidle_enter_state+0xf8/0x5d0 [ 508.605895] do_idle+0x28a/0x350 [ 508.611147] cpu_startup_entry+0xc7/0xe0 [ 508.617097] start_secondary+0x2e8/0x3f0 [ 508.623029] secondary_startup_64+0xa5/0xb0 [ 508.629245] INITIAL USE at: [ 508.632686] _raw_spin_lock_bh+0x34/0x40 [ 508.638557] gnet_stats_start_copy_compat+0xa2/0x230 [ 508.645491] gnet_stats_start_copy+0x16/0x20 [ 508.651719] tcf_action_copy_stats+0x95/0x1d0 [ 508.657992] tcf_action_dump_1+0xb0/0x4e0 [ 508.663937] tcf_action_dump+0xca/0x200 [ 508.669716] tcf_exts_dump+0xd9/0x320 [ 508.675337] fl_dump+0x1b7/0x4a0 [cls_flower] [ 508.681650] tcf_fill_node+0x380/0x530 [ 508.687366] tfilter_notify+0xdf/0x1c0 [ 508.693031] tc_new_tfilter+0x84a/0xc90 [ 508.698820] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 508.704869] netlink_rcv_skb+0x184/0x220 [ 508.710758] netlink_unicast+0x31b/0x460 [ 508.716627] netlink_sendmsg+0x3fb/0x840 [ 508.722510] sock_sendmsg+0x7b/0xd0 [ 508.727931] ___sys_sendmsg+0x4c6/0x610 [ 508.733729] __sys_sendmsg+0xd7/0x150 [ 508.739346] do_syscall_64 +0x7a/0x3f0 [ 508.744943] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 508.751930] } [ 508.753964] ... key at: [<ffffffff916b3e20>] __key.61145+0x0/0x40 [ 508.760946] ... acquired at: [ 508.764294] _raw_read_lock+0x2f/0x40 [ 508.768513] find_ife_oplist+0x1e/0xc0 [act_ife] [ 508.773692] tcf_ife_init+0x82f/0xf40 [act_ife] [ 508.778785] tcf_action_init_1+0x510/0x750 [ 508.783468] tcf_action_init+0x1e8/0x340 [ 508.787938] tcf_action_add+0xc5/0x240 [ 508.792241] tc_ctl_action+0x203/0x2a0 [ 508.796550] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 508.801200] netlink_rcv_skb+0x184/0x220 [ 508.805674] netlink_unicast+0x31b/0x460 [ 508.810129] netlink_sendmsg+0x3fb/0x840 [ 508.814611] sock_sendmsg+0x7b/0xd0 [ 508.818665] ___sys_sendmsg+0x4c6/0x610 [ 508.823029] __sys_sendmsg+0xd7/0x150 [ 508.827246] do_syscall_64+0x7a/0x3f0 [ 508.831483] entry_SYSCALL_64_after_hwframe+0x49/0xbe the dependencies between the lock to be acquired [ 508.838945] and SOFTIRQ-irq-unsafe lock: [ 508.851177] -> (ife_mod_lock){++++} ops: 95 { [ 508.855920] HARDIRQ-ON-W at: [ 508.859478] _raw_write_lock+0x2c/0x40 [ 508.865264] register_ife_op+0x118/0x2c0 [act_ife] [ 508.872071] do_one_initcall+0xf7/0x4d9 [ 508.877947] do_init_module+0x18b/0x44e [ 508.883819] load_module+0x4167/0x5730 [ 508.889595] __do_sys_finit_module+0x16d/0x1a0 [ 508.896043] do_syscall_64+0x7a/0x3f0 [ 508.901734] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 508.908827] HARDIRQ-ON-R at: [ 508.912359] _raw_read_lock+0x2f/0x40 [ 508.918043] find_ife_oplist+0x1e/0xc0 [act_ife] [ 508.924692] tcf_ife_init+0x82f/0xf40 [act_ife] [ 508.931252] tcf_action_init_1+0x510/0x750 [ 508.937393] tcf_action_init+0x1e8/0x340 [ 508.943366] tcf_action_add+0xc5/0x240 [ 508.949130] tc_ctl_action+0x203/0x2a0 [ 508.954922] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 508.961024] netlink_rcv_skb+0x184/0x220 [ 508.966970] netlink_unicast+0x31b/0x460 [ 508.972915] netlink_sendmsg+0x3fb/0x840 [ 508.978859] sock_sendmsg+0x7b/0xd0 [ 508.984400] ___sys_sendmsg+0x4c6/0x610 [ 508.990264] __sys_sendmsg+0xd7/0x150 [ 508.995952] do_syscall_64+0x7a/0x3f0 [ 509.001643] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 509.008722] SOFTIRQ-ON-W at:\ [ 509.012242] _raw_write_lock+0x2c/0x40 [ 509.018013] register_ife_op+0x118/0x2c0 [act_ife] [ 509.024841] do_one_initcall+0xf7/0x4d9 [ 509.030720] do_init_module+0x18b/0x44e [ 509.036604] load_module+0x4167/0x5730 [ 509.042397] __do_sys_finit_module+0x16d/0x1a0 [ 509.048865] do_syscall_64+0x7a/0x3f0 [ 509.054551] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 509.061636] SOFTIRQ-ON-R at: [ 509.065145] _raw_read_lock+0x2f/0x40 [ 509.070854] find_ife_oplist+0x1e/0xc0 [act_ife] [ 509.077515] tcf_ife_init+0x82f/0xf40 [act_ife] [ 509.084051] tcf_action_init_1+0x510/0x750 [ 509.090172] tcf_action_init+0x1e8/0x340 [ 509.096124] tcf_action_add+0xc5/0x240 [ 509.101891] tc_ctl_action+0x203/0x2a0 [ 509.107671] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 509.113811] netlink_rcv_skb+0x184/0x220 [ 509.119768] netlink_unicast+0x31b/0x460 [ 509.125716] netlink_sendmsg+0x3fb/0x840 [ 509.131668] sock_sendmsg+0x7b/0xd0 [ 509.137167] ___sys_sendmsg+0x4c6/0x610 [ 509.143010] __sys_sendmsg+0xd7/0x150 [ 509.148718] do_syscall_64+0x7a/0x3f0 [ 509.154443] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 509.161533] INITIAL USE at: [ 509.164956] _raw_read_lock+0x2f/0x40 [ 509.170574] find_ife_oplist+0x1e/0xc0 [act_ife] [ 509.177134] tcf_ife_init+0x82f/0xf40 [act_ife] [ 509.183619] tcf_action_init_1+0x510/0x750 [ 509.189674] tcf_action_init+0x1e8/0x340 [ 509.195534] tcf_action_add+0xc5/0x240 [ 509.201229] tc_ctl_action+0x203/0x2a0 [ 509.206920] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 509.212936] netlink_rcv_skb+0x184/0x220 [ 509.218818] netlink_unicast+0x31b/0x460 [ 509.224699] netlink_sendmsg+0x3fb/0x840 [ 509.230581] sock_sendmsg+0x7b/0xd0 [ 509.235984] ___sys_sendmsg+0x4c6/0x610 [ 509.241791] __sys_sendmsg+0xd7/0x150 [ 509.247425] do_syscall_64+0x7a/0x3f0 [ 509.253007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 509.259975] } [ 509.261998] ... key at: [<ffffffffc1554258>] ife_mod_lock+0x18/0xffffffffffff8dc0 [act_ife] [ 509.271569] ... acquired at: [ 509.274912] _raw_read_lock+0x2f/0x40 [ 509.279134] find_ife_oplist+0x1e/0xc0 [act_ife] [ 509.284324] tcf_ife_init+0x82f/0xf40 [act_ife] [ 509.289425] tcf_action_init_1+0x510/0x750 [ 509.294068] tcf_action_init+0x1e8/0x340 [ 509.298553] tcf_action_add+0xc5/0x240 [ 509.302854] tc_ctl_action+0x203/0x2a0 [ 509.307153] rtnetlink_rcv_msg+0x5bd/0x7b0 [ 509.311805] netlink_rcv_skb+0x184/0x220 [ 509.316282] netlink_unicast+0x31b/0x460 [ 509.320769] netlink_sendmsg+0x3fb/0x840 [ 509.325248] sock_sendmsg+0x7b/0xd0 [ 509.329290] ___sys_sendmsg+0x4c6/0x610 [ 509.333687] __sys_sendmsg+0xd7/0x150 [ 509.337902] do_syscall_64+0x7a/0x3f0 [ 509.342116] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 509.349601] stack backtrace: [ 509.354663] CPU: 6 PID: 5460 Comm: tc Not tainted 4.18.0-rc8+ #646 [ 509.361216] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 Fixes: `ef6980b6be` ("introduce IFE action") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 11:45:06 -07:00
Jamal Hadi Salim	7c5790c4da	net: sched: act_mirred method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	8aa7f22e56	net: sched: act_vlan method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	353d2c253f	net: sched: act_skbmod method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	45da1dac61	net: sched: act_skbedit method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	798de374e5	net: sched: act_simple method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	2ac063474d	net: sched: act_police method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	6a2b401cd1	net: sched: act_pedit method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:17 -07:00
Jamal Hadi Salim	0390514fe1	net: sched: act_nat method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Jamal Hadi Salim	11b9695b3f	net: sched: act_ipt method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Jamal Hadi Salim	1740005e2a	net: sched: act_gact method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Jamal Hadi Salim	c831549c3f	net: sched: act_sum method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Jamal Hadi Salim	2fbec27f81	net: sched: act_bpf method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Jamal Hadi Salim	962ad1f937	net: sched: act_connmark method rename for grep-ability and consistency Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-13 09:06:16 -07:00
Vlad Buslov	e329bc4273	net: sched: act_police: remove dependency on rtnl lock Use tcf spinlock to protect police action private data from concurrent modification during dump. (init already uses tcf spinlock when changing police action state) Pass tcf spinlock as estimator lock argument to gen_replace_estimator() during action init. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	4e232818bd	net: sched: act_mirred: remove dependency on rtnl lock Re-introduce mirred list spinlock, that was removed some time ago, in order to protect it from concurrent modifications, instead of relying on rtnl lock. Use tcf spinlock to protect mirred action private data from concurrent modification in init and dump. Rearrange access to mirred data in order to be performed only while holding the lock. Rearrange net dev access to always hold reference while working with it, instead of relying on rntl lock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	84a75b329b	net: sched: extend action ops with put_dev callback As a preparation for removing dependency on rtnl lock from rules update path, all users of shared objects must take reference while working with them. Extend action ops with put_dev() API to be used on net device returned by get_dev(). Modify mirred action (only action that implements get_dev callback): - Take reference to net device in get_dev. - Implement put_dev API that releases reference to net device. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	764e9a2448	net: sched: act_vlan: remove dependency on rtnl lock Use tcf spinlock to protect vlan action private data from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	729e012609	net: sched: act_tunnel_key: remove dependency on rtnl lock Use tcf lock to protect tunnel key action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl lock assertion that is no longer required. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:10 -07:00
Vlad Buslov	c8814552fe	net: sched: act_skbmod: remove dependency on rtnl lock Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf spinlock to protect private skbmod data from concurrent modification during dump. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	5e48180ed8	net: sched: act_simple: remove dependency on rtnl lock Use tcf spinlock to protect private simple action data from concurrent modification during dump. (simple init already uses tcf spinlock when changing action state) Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	d772849566	net: sched: act_sample: remove dependency on rtnl lock Use tcf spinlock to protect private sample action data from concurrent modification during dump and init. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	67b0c1a3c9	net: sched: act_pedit: remove dependency on rtnl lock Rearrange pedit init code to only access pedit action data while holding tcf spinlock. Change keys allocation type to atomic to allow it to execute while holding tcf spinlock. Take tcf spinlock in dump function when accessing pedit action data. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	ff25276de9	net: sched: act_ipt: remove dependency on rtnl lock Use tcf spinlock to protect ipt action private data from concurrent modification during dump. Ipt init already takes tcf spinlock when modifying ipt state. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	54d0d423a4	net: sched: act_ife: remove dependency on rtnl lock Use tcf spinlock and rcu to protect params pointer from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Ife action has meta-actions that are compiled as standalone modules. Rtnl mutex must be released while loading a kernel module. In order to support execution without rtnl mutex, propagate 'rtnl_held' argument to meta action loading functions. When requesting meta action module, conditionally release rtnl lock depending on 'rtnl_held' argument. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	e8917f4370	net: sched: act_gact: remove dependency on rtnl lock Use tcf spinlock to protect gact action private state from concurrent modification during dump and init. Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	b6a2b971c0	net: sched: act_csum: remove dependency on rtnl lock Use tcf lock to protect csum action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Vlad Buslov	2142236b45	net: sched: act_bpf: remove dependency on rtnl lock Use tcf spinlock to protect bpf action private data from concurrent modification during dump and init. Remove rtnl lock assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-11 12:37:09 -07:00
Jiri Pirko	63cc5bcc9f	net: sched: fix block->refcnt decrement Currently the refcnt is never decremented in case the value is not 1. Fix it by adding decrement in case the refcnt is not 1. Reported-by: Vlad Buslov <vladbu@mellanox.com> Fixes: `f71e0ca4db` ("net: sched: Avoid implicit chain 0 creation") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-09 14:12:04 -07:00
Vlad Buslov	9ca6163005	net: sched: cls_flower: set correct offload data in fl_reoffload fl_reoffload implementation sets following members of struct tc_cls_flower_offload incorrectly: - masked key instead of mask - key instead of masked key Fix fl_reoffload to provide correct data to offload callback. Fixes: `31533cba43` ("net: sched: cls_flower: implement offload tcf_proto_op") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-07 12:35:17 -07:00
Pieter Jansen van Vuuren	0a6e77784f	net/sched: allow flower to match tunnel options Allow matching on options in Geneve tunnel headers. This makes use of existing tunnel metadata support. The options can be described in the form CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit hexadecimal value and DATA as a variable length hexadecimal value. e.g. # ip link add name geneve0 type geneve dstport 0 external # tc qdisc add dev geneve0 ingress # tc filter add dev geneve0 protocol ip parent ffff: \ flower \ enc_src_ip 10.0.99.192 \ enc_dst_ip 10.0.99.193 \ enc_key_id 11 \ geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \ ip_proto udp \ action mirred egress redirect dev eth1 This patch adds support for matching Geneve options in the order supplied by the user. This leads to an efficient implementation in the software datapath (and in our opinion hardware datapaths that offload this feature). It is also compatible with Geneve options matching provided by the Open vSwitch kernel datapath which is relevant here as the Flower classifier may be used as a mechanism to program flows into hardware as a form of Open vSwitch datapath offload (sometimes referred to as OVS-TC). The netlink Kernel/Userspace API may be extended, for example by adding a flag, if other matching options are desired, for example matching given options in any order. This would require an implementation in the TC software datapath. And be done in a way that drivers that facilitate offload of the Flower classifier can reject or accept such flows based on hardware datapath capabilities. This approach was discussed and agreed on at Netconf 2017 in Seoul. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-07 12:22:15 -07:00
Dan Carpenter	1cbc36a53b	net: sched: cls_flower: Fix an error code in fl_tmplt_create() We forgot to set the error code on this path, so we return NULL instead of an error pointer. In the current code kzalloc() won't fail for small allocations so this doesn't really affect runtime. Fixes: `b95ec7eb3b` ("net: sched: cls_flower: implement chain templates") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-05 17:25:46 -07:00
Jiri Pirko	5ca8a25c14	net: sched: fix flush on non-existing chain User was able to perform filter flush on chain 0 even if it didn't have any filters in it. With the patch that avoided implicit chain 0 creation, this changed. So in case user wants filter flush on chain which does not exist, just return success. There's no reason for non-0 chains to behave differently than chain 0, so do the same for them. Reported-by: Ido Schimmel <idosch@mellanox.com> Fixes: `f71e0ca4db` ("net: sched: Avoid implicit chain 0 creation") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-03 09:44:37 -07:00
Jiri Pirko	290b1c8b1a	net: sched: make tcf_chain_{get,put}() static These are no longer used outside of cls_api.c so make them static. Move tcf_chain_flush() to avoid fwd declaration of tcf_chain_put(). Signed-off-by: Jiri Pirko <jiri@mellanox.com> v1->v2: - new patch Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-01 10:06:19 -07:00
Jiri Pirko	5368140730	net: sched: fix notifications for action-held chains Chains that only have action references serve as placeholders. Until a non-action reference is created, user should not be aware of the chain. Also he should not receive any notifications about it. So send notifications for the new chain only in case the chain gets the first non-action reference. Symmetrically to that, when the last non-action reference is dropped, send the notification about deleted chain. Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> v1->v2: - made __tcf_chain_{get,put}() static as suggested by Cong Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-01 10:06:19 -07:00
Jiri Pirko	3d32f4c548	net: sched: change name of zombie chain to "held_by_acts_only" As mentioned by Cong and Jakub during the review process, it is a bit odd to sometimes (act flow) create a new chain which would be immediately a "zombie". So just rename it to "held_by_acts_only". Signed-off-by: Jiri Pirko <jiri@mellanox.com> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-01 10:06:19 -07:00
Paolo Abeni	e5cf1baf92	act_mirred: use TC_ACT_REINSERT when possible When mirred is invoked from the ingress path, and it wants to redirect the processed packet, it can now use the TC_ACT_REINSERT action, filling the tcf_result accordingly, and avoiding a per packet skb_clone(). Overall this gives a ~10% improvement in forwarding performance for the TC S/W data path and TC S/W performances are now comparable to the kernel openvswitch datapath. v1 -> v2: use ACT_MIRRED instead of ACT_REDIRECT v2 -> v3: updated after action rename, fixed typo into the commit message v3 -> v4: updated again after action rename, added more comments to the code (JiriP), skip the optimization if the control action need to touch the tcf_result (Paolo) v4 -> v5: fix sparse warning (kbuild bot) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-30 09:31:14 -07:00
Paolo Abeni	7fd4b288ea	tc/act: remove unneeded RCU lock in action callback Each lockless action currently does its own RCU locking in ->act(). This allows using plain RCU accessor, even if the context is really RCU BH. This change drops the per action RCU lock, replace the accessors with the _bh variant, cleans up a bit the surrounding code and documents the RCU status in the relevant header. No functional nor performance change is intended. The goal of this patch is clarifying that the RCU critical section used by the tc actions extends up to the classifier's caller. v1 -> v2: - preserve rcu lock in act_bpf: it's needed by eBPF helpers, as pointed out by Daniel v3 -> v4: - fixed some typos in the commit message (JiriP) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-30 09:31:13 -07:00
Paolo Abeni	802bfb1915	net/sched: user-space can't set unknown tcfa_action values Currently, when initializing an action, the user-space can specify and use arbitrary values for the tcfa_action field. If the value is unknown by the kernel, is implicitly threaded as TC_ACT_UNSPEC. This change explicitly checks for unknown values at action creation time, and explicitly convert them to TC_ACT_UNSPEC. No functional changes are introduced, but this will allow introducing tcfa_action values not exposed to user-space in a later patch. Note: we can't use the above to hide TC_ACT_REDIRECT from user-space, as the latter is already part of uAPI. v3 -> v4: - use an helper to check for action validity (JiriP) - emit an extack for invalid actions (JiriP) v4 -> v5: - keep messages on a single line, drop net_warn (Marcelo) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-30 09:31:13 -07:00
YueHaibing	3f6bcc5162	act_bpf: Use kmemdup instead of duplicating it in tcf_bpf_init_from_ops Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-29 13:20:16 -07:00
YueHaibing	f9562fa4a5	cls_bpf: Use kmemdup instead of duplicating it in cls_bpf_prog_from_ops Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-29 13:19:49 -07:00
YueHaibing	0a80848ec5	act_pedit: remove unnecessary semicolon net/sched/act_pedit.c:289:2-3: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-29 13:19:20 -07:00
Dave Taht	2db6dc2662	sch_cake: Make gso-splitting configurable from userspace This patch restores cake's deployed behavior at line rate to always split gso, and makes gso splitting configurable from userspace. running cake unlimited (unshaped) at 1gigE, local traffic: no-split-gso bql limit: 131966 split-gso bql limit: ~42392-45420 On this 4 stream test splitting gso apart results in halving the observed interpacket latency at no loss in throughput. Summary of tcp_nup test run 'gso-split' (at 2018-07-26 16:03:51.824728): Ping (ms) ICMP : 0.83 0.81 ms 341 TCP upload avg : 235.43 235.39 Mbits/s 301 TCP upload sum : 941.71 941.56 Mbits/s 301 TCP upload::1 : 235.45 235.43 Mbits/s 271 TCP upload::2 : 235.45 235.41 Mbits/s 289 TCP upload::3 : 235.40 235.40 Mbits/s 288 TCP upload::4 : 235.41 235.40 Mbits/s 291 verses Summary of tcp_nup test run 'no-split-gso' (at 2018-07-26 16:37:23.563960): avg median # data pts Ping (ms) ICMP : 1.67 1.73 ms 348 TCP upload avg : 234.56 235.37 Mbits/s 301 TCP upload sum : 938.24 941.49 Mbits/s 301 TCP upload::1 : 234.55 235.38 Mbits/s 285 TCP upload::2 : 234.57 235.37 Mbits/s 286 TCP upload::3 : 234.58 235.37 Mbits/s 274 TCP upload::4 : 234.54 235.42 Mbits/s 288 Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-27 13:38:20 -07:00
Jiri Pirko	1f3ed383fb	net: sched: don't dump chains only held by actions In case a chain is empty and not explicitly created by a user, such chain should not exist. The only exception is if there is an action "goto chain" pointing to it. In that case, don't show the chain in the dump. Track the chain references held by actions and use them to find out if a chain should or should not be shown in chain dump. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-27 09:38:46 -07:00
Jiri Pirko	c921d7db3d	net: sched: unmark chain as explicitly created on delete Once user manually deletes the chain using "chain del", the chain cannot be marked as explicitly created anymore. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Fixes: `32a4f5ecd7` ("net: sched: introduce chain object to uapi") Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-26 14:12:58 -07:00
Gustavo A. R. Silva	2ed9db3074	net: sched: cls_api: fix dead code in switch Code at line 1850 is unreachable. Fix this by removing the break statement above it, so the code for case RTM_GETCHAIN can be properly executed. Addresses-Coverity-ID: 1472050 ("Structurally dead code") Fixes: `32a4f5ecd7` ("net: sched: introduce chain object to uapi") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-26 14:09:27 -07:00
Vinicius Costa Gomes	990e35ecba	cbs: Add support for the graft function This will allow to install a child qdisc under cbs. The main use case is to install ETF (Earliest TxTime First) qdisc under cbs, so there's another level of control for time-sensitive traffic. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-26 13:58:30 -07:00
Jianbo Liu	158abbf170	net/sched: cls_flower: Use correct inline function for assignment of vlan tpid This fixes the following sparse warning: net/sched/cls_flower.c:1356:36: warning: incorrect type in argument 3 (different base types) net/sched/cls_flower.c:1356:36: expected unsigned short [unsigned] [usertype] value net/sched/cls_flower.c:1356:36: got restricted __be16 [usertype] vlan_tpid Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-25 16:33:02 -07:00
Nishanth Devarajan	aea5f654e6	net/sched: add skbprio scheduler Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets according to their skb->priority field. Under congestion, already-enqueued lower priority packets will be dropped to make space available for higher priority packets. Skbprio was conceived as a solution for denial-of-service defenses that need to route packets with different priorities as a means to overcome DoS attacks. v5 Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set default sch->limit to 64. v4 Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man page for Skbprio, in iproute2. v3 Drop max_limit parameter in struct skbprio_sched_data and instead use sch->limit. Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for qdisc (previously being referenced every time qdisc changes). Move qdisc's detailed description from in-code to Documentation/networking. When qdisc is saturated, enqueue incoming packet first before dequeueing lowest priority packet in queue - improves usage of call stack registers. Introduce and use overlimit stat to keep track of number of dropped packets. v2 Use skb->priority field rather than DS field. Rename queueing discipline as SKB Priority Queue (previously Gatekeeper Priority Queue). *Queueing discipline is made classful to expose Skbprio's internal priority queues. Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com> Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com> Reviewed-by: Cody Doucette <doucette@bu.edu> Reviewed-by: Michel Machado <michel@digirati.com.br> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-24 14:44:00 -07:00
Stephen Hemminger	50f699b1f8	sched: fix trailing whitespace Remove trailing whitespace and blank lines at EOF Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-24 14:10:42 -07:00
Jiri Pirko	3473845273	net: sched: cls_flower: propagate chain teplate creation and destruction to drivers Introduce a couple of flower offload commands in order to propagate template creation/destruction events down to device drivers. Drivers may use this information to prepare HW in an optimal way for future filter insertions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	b95ec7eb3b	net: sched: cls_flower: implement chain templates Use the previously introduced template extension and implement callback to create, destroy and dump chain template. The existing parsing and dumping functions are re-used. Also, check if newly added filters fit the template if it is set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	33fb5cba11	net: sched: cls_flower: change fl_init_dissector to accept mask and dissector This function is going to be used for templates as well, so we need to pass the pointer separately. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	f5749081f0	net: sched: cls_flower: move key/mask dumping into a separate function Push key/mask dumping from fl_dump() into a separate function fl_dump_key(), that will be reused for template dumping. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	9f407f1768	net: sched: introduce chain templates Allow user to set a template for newly created chains. Template lock down the chain for particular classifier type/options combinations. The classifier needs to support templates, otherwise kernel would reply with error. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	32a4f5ecd7	net: sched: introduce chain object to uapi Allow user to create, destroy, get and dump chain objects. Do that by extending rtnl commands by the chain-specific ones. User will now be able to explicitly create or destroy chains (so far this was done only automatically according the filter/act needs and refcounting). Also, the user will receive notification about any chain creation or destuction. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	f71e0ca4db	net: sched: Avoid implicit chain 0 creation Currently, chain 0 is implicitly created during block creation. However that does not align with chain object exposure, creation and destruction api introduced later on. So make the chain 0 behave the same way as any other chain and only create it when it is needed. Since chain 0 is somehow special as the qdiscs need to hold pointer to the first chain tp, this requires to move the chain head change callback infra to the block structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Jiri Pirko	f34e8bff58	net: sched: push ops lookup bits into tcf_proto_lookup_ops() Push all bits that take care of ops lookup, including module loading outside tcf_proto_create() function, into tcf_proto_lookup_ops() Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-23 20:44:12 -07:00
Gustavo A. R. Silva	baa2d2b17e	net: sched: use PTR_ERR_OR_ZERO macro in tcf_block_cb_register This line makes up what macro PTR_ERR_OR_ZERO already does. So, make use of PTR_ERR_OR_ZERO rather than an open-code version. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-21 16:17:08 -07:00
David S. Miller	c4c5551df1	Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-20 21:17:12 -07:00
Or Gerlitz	0e2c17b64d	net/sched: cls_flower: Support matching on ip tos and ttl for tunnels Allow users to set rules matching on ipv4 tos and ttl or ipv6 traffic-class and hoplimit of tunnel headers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-19 23:26:01 -07:00
Or Gerlitz	07a557f47d	net/sched: tunnel_key: Allow to set tos and ttl for tc based ip tunnels Allow user-space to provide tos and ttl to be set for the tunnel headers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-19 23:26:01 -07:00
YueHaibing	5318918390	net: sched: Using NULL instead of plain integer Fixes the following sparse warnings: net/sched/cls_api.c:1101:43: warning: Using plain integer as NULL pointer net/sched/cls_api.c:1492:75: warning: Using plain integer as NULL pointer Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-18 13:44:07 -07:00
Toke Høiland-Jørgensen	301f935be9	sch_cake: Fix tin order when set through skb->priority In diffserv mode, CAKE stores tins in a different order internally than the logical order exposed to userspace. The order remapping was missing in the handling of 'tc filter' priority mappings through skb->priority, resulting in bulk and best effort mappings being reversed relative to how they are displayed. Fix this by adding the missing mapping when reading skb->priority. Fixes: `83f8fd69af` ("sch_cake: Add DiffServ handling") Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 14:47:45 -07:00
Vlad Buslov	01683a1469	net: sched: refactor flower walk to iterate over idr Extend struct tcf_walker with additional 'cookie' field. It is intended to be used by classifier walk implementations to continue iteration directly from particular filter, instead of iterating 'skip' number of times. Change flower walk implementation to save filter handle in 'cookie'. Each time flower walk is called, it looks up filter with saved handle directly with idr, instead of iterating over filter linked list 'skip' number of times. This change improves complexity of dumping flower classifier from quadratic to linearithmic. (assuming idr lookup has logarithmic complexity) Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-13 18:24:27 -07:00
Davide Caratti	c749cdda90	net/sched: act_skbedit: don't use spinlock in the data path use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on act_skbedit configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 14:54:12 -07:00
Davide Caratti	6f3dfb0dc8	net/sched: skbedit: use per-cpu counters use per-CPU counters, instead of sharing a single set of stats with all cores: this removes the need of spinlocks when stats are read/updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 14:54:12 -07:00
Jacob Keller	83fe6b8709	sch_fq_codel: zero q->flows_cnt when fq_codel_init fails When fq_codel_init fails, qdisc_create_dflt will cleanup by using qdisc_destroy. This function calls the ->reset() op prior to calling the ->destroy() op. Unfortunately, during the failure flow for sch_fq_codel, the ->flows parameter is not initialized, so the fq_codel_reset function will null pointer dereference. kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] kernel: PGD 0 P4D 0 kernel: Oops: 0000 [#1] SMP PTI kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci kernel: mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e] kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G OE 4.16.13custom-fq-codel-test+ #3 kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015 kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel] kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246 kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9 kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00 kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9 kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00 kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001 kernel: FS: 00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0 kernel: Call Trace: kernel: qdisc_destroy+0x56/0x140 kernel: qdisc_create_dflt+0x8b/0xb0 kernel: mq_init+0xc1/0xf0 kernel: qdisc_create_dflt+0x5a/0xb0 kernel: dev_activate+0x205/0x230 kernel: __dev_open+0xf5/0x160 kernel: __dev_change_flags+0x1a3/0x210 kernel: dev_change_flags+0x21/0x60 kernel: do_setlink+0x660/0xdf0 kernel: ? down_trylock+0x25/0x30 kernel: ? xfs_buf_trylock+0x1a/0xd0 [xfs] kernel: ? rtnl_newlink+0x816/0x990 kernel: ? _xfs_buf_find+0x327/0x580 [xfs] kernel: ? _cond_resched+0x15/0x30 kernel: ? kmem_cache_alloc+0x20/0x1b0 kernel: ? rtnetlink_rcv_msg+0x200/0x2f0 kernel: ? rtnl_calcit.isra.30+0x100/0x100 kernel: ? netlink_rcv_skb+0x4c/0x120 kernel: ? netlink_unicast+0x19e/0x260 kernel: ? netlink_sendmsg+0x1ff/0x3c0 kernel: ? sock_sendmsg+0x36/0x40 kernel: ? ___sys_sendmsg+0x295/0x2f0 kernel: ? ebitmap_cmp+0x6d/0x90 kernel: ? dev_get_by_name_rcu+0x73/0x90 kernel: ? skb_dequeue+0x52/0x60 kernel: ? __inode_wait_for_writeback+0x7f/0xf0 kernel: ? bit_waitqueue+0x30/0x30 kernel: ? fsnotify_grab_connector+0x3c/0x60 kernel: ? __sys_sendmsg+0x51/0x90 kernel: ? do_syscall_64+0x74/0x180 kernel: ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2 kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00 kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620 kernel: CR2: 0000000000000008 kernel: ---[ end trace e81a62bede66274e ]--- This is caused because flows_cnt is non-zero, but flows hasn't been initialized. fq_codel_init has left the private data in a partially initialized state. To fix this, reset flows_cnt to 0 when we fail to initialize. Additionally, to make the state more consistent, also cleanup the flows pointer when the allocation of backlogs fails. This fixes the NULL pointer dereference, since both the for-loop and memset in fq_codel_reset will be no-ops when flow_cnt is zero. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 12:32:09 -07:00
Vlad Buslov	e0479b670d	net: sched: fix unprotected access to rcu cookie pointer Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: `eec94fdb04` ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:01:02 -07:00
Vlad Buslov	01e866bf07	net: sched: act_ife: fix memory leak in ife init Free params if tcf_idr_check_alloc() returned error. Fixes: `0190c1d452` ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:53:00 -07:00
Jianbo Liu	5e9a0fe492	net/sched: flower: Fix null pointer dereference when run tc vlan command Zahari issued tc vlan command without setting vlan_ethtype, which will crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before use it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: `d64efd0926` ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:48:13 -07:00
Toke Høiland-Jørgensen	0c850344d3	sch_cake: Conditionally split GSO segments At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	a729b7f0bd	sch_cake: Add overhead compensation support to the rate shaper This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	83f8fd69af	sch_cake: Add DiffServ handling This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	ea82511518	sch_cake: Add NAT awareness to packet classifier When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	8b7138814f	sch_cake: Add optional ACK filter The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the upstream direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs fewer bytes than the new packet being enqueued, including any SACK options. This prevents duplicate ACKs from being filtered, to avoid interfering with retransmission logic. In addition, we check TCP header options and only drop those that are known to not interfere with sender state. In particular, packets with unknown option codes are never dropped. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue. This is the rationale for including the ACK filter in CAKE itself rather than as separate module (as the TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In really pathological cases, the effect can be a lot more; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	7298de9cd7	sch_cake: Add ingress mode The ingress mode is meant to be enabled when CAKE runs downlink of the actual bottleneck (such as on an IFB device). The mode changes the shaper to also account dropped packets to the shaped rate, as these have already traversed the bottleneck. Enabling ingress mode will also tune the AQM to always keep at least two packets queued for each flow. This is done by scaling the minimum queue occupancy level that will disable the AQM by the number of active bulk flows. The rationale for this is that retransmits are more expensive in ingress mode, since dropped packets have to traverse the bottleneck again when they are retransmitted; thus, being more lenient and keeping a minimum number of packets queued will improve throughput in cases where the number of active flows are so large that they saturate the bottleneck even at their minimum window size. This commit also adds a separate switch to enable ingress mode rate autoscaling. If enabled, the autoscaling code will observe the actual traffic rate and adjust the shaper rate to match it. This can help avoid latency increases in the case where the actual bottleneck rate decreases below the shaped rate. The scaling filters out spikes by an EWMA filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	046f6fd5da	sched: Add Common Applications Kept Enhanced (cake) qdisc sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * An deficit based shaper, that can also be used in an unlimited mode. * 8 way set associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring, loss, ecn markings, latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions baking have been available as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later. sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list. tc -s qdisc show dev eth2 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12) backlog 0b 0p requeues 12 memory used: 1053008b of 15140Kb capacity estimate: 970Mbit min/max network layer size: 28 / 1500 min/max overhead-adjusted size: 84 / 1538 average network hdr offset: 14 Bulk Best Effort Voice thresh 62500Kbit 1Gbit 250Mbit target 5.0ms 5.0ms 5.0ms interval 100.0ms 100.0ms 100.0ms pk_delay 5us 5us 6us av_delay 3us 2us 2us sp_delay 2us 1us 1us backlog 0b 0b 0b pkts 3164050 25030267 9530280 bytes 3227519915 35396974782 12879808898 way_inds 0 8 0 way_miss 21 366 25 way_cols 0 0 0 drops 5 0 1 marks 0 0 0 ack_drop 0 0 0 sp_flows 1 3 0 bk_flows 0 1 1 un_flows 0 0 0 max_len 68130 68130 68130 Tested-by: Pete Heist <peteheist@gmail.com> Tested-by: Georgios Amanakis <gamanakis@gmail.com> Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
David S. Miller	0dbc81eab4	net: sched: Fix warnings from xchg() on RCU'd cookie pointer. The kbuild test robot reports: >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@ expected struct tc_cookie [noderef] <asn:4>__ret @@ got [noderef] <asn:4>__ret @@ net/sched/act_api.c:71:15: expected struct tc_cookie [noderef] <asn:4>__ret net/sched/act_api.c:71:15: got struct tc_cookie new_cookie >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@ expected struct tc_cookie old @@ got struct tc_cookie [noderef] <struct tc_cookie old @@ net/sched/act_api.c:71:13: expected struct tc_cookie old net/sched/act_api.c:71:13: got struct tc_cookie [noderef] <asn:4>[assigned] __ret >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression Handle this in the usual way by force casting away the __rcu annotation when we are using xchg() on it. Fixes: `eec94fdb04` ("net: sched: use rcu for action cookie update") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:02:59 +09:00
Vlad Buslov	90b73b77d0	net: sched: change action API to use array of pointers to actions Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	0190c1d452	net: sched: atomically check-allocate action Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	cae422f379	net: sched: use reference counting action init Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	4e8ddd7f17	net: sched: don't release reference on action overwrite Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of existing index(and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	16af606739	net: sched: implement reference counted action release Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	b409074e66	net: sched: add 'delete' function to action ops Extend action ops with 'delete' function. Each action type to implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	2a2ea34970	net: sched: implement action API that deletes action by index Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	3f7c72bc42	net: sched: always take reference to action Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it. (it can be destroyed concurrently) Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function Reference taken by idr check is temporal and should not be accounted by userspace clients (both logically and to preserver current API behavior). Subtract temporal reference when dumping action to userspace using existing tca_get_fill function arguments. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	789871bb2a	net: sched: implement unlocked action init API Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire if afterwards. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	036bb44327	net: sched: change type of reference and bind counters Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	eec94fdb04	net: sched: use rcu for action cookie update Implement functions to atomically update and free action cookie using rcu mechanism. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00

1 2 3 4 5 ...

2515 Коммитов