After IPv4 packets are forwarded, the priority of the corresponding SKB
is updated according to the TOS field of IPv4 header. This overrides any
prioritization done earlier by e.g. an skbedit action or ingress-qos-map
defined at a vlan device.
Such overriding may not always be desirable. Even if the packet ends up
being routed, which implies this is an L3 network node, an administrator
may wish to preserve whatever prioritization was done earlier on in the
pipeline.
Therefore introduce a sysctl that controls this behavior. Keep the
default value at 1 to maintain backward-compatible behavior.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The construction "net->ipv4.sysctl_ip_nonlocal_bind || inet->freebind
|| inet->transparent" is present three times and its IPv6 counterpart
is also present three times. We introduce two small helpers to
characterize these tests uniformly.
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_frag_queue() might call pskb_pull() on one skb that
is already in the fragment queue.
We need to take care of possible truesize change, or we
might have an imbalance of the netns frags memory usage.
IPv6 is immune to this bug, because RFC5722, Section 4,
amended by Errata ID 3089 states :
When reassembling an IPv6 datagram, if
one or more its constituent fragments is determined to be an
overlapping fragment, the entire datagram (and any constituent
fragments) MUST be silently discarded.
Fixes: 158f323b98 ("net: adjust skb->truesize in pskb_expand_head()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently check current frags memory usage only when
a new frag queue is created. This allows attackers to first
consume the memory budget (default : 4 MB) creating thousands
of frag queues, then sending tiny skbs to exceed high_thresh
limit by 2 to 3 order of magnitude.
Note that before commit 648700f76b ("inet: frags: use rhashtables
for reassembly units"), work queue could be starved under DOS,
getting no cpu cycles.
After commit 648700f76b, only the per frag queue timer can eventually
remove an incomplete frag queue and its skbs.
Fixes: b13d3cbfb8 ("inet: frag: move eviction of queues to work queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Peter Oskolkov <posk@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The wait_address argument is always directly derived from the filp
argument, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the feature described in rfc1812#section-5.3.5.2
and rfc2644. It allows the router to forward directed broadcast when
sysctl bc_forwarding is enabled.
Note that this feature could be done by iptables -j TEE, but it would
cause some problems:
- target TEE's gateway param has to be set with a specific address,
and it's not flexible especially when the route wants forward all
directed broadcasts.
- this duplicates the directed broadcasts so this may cause side
effects to applications.
Besides, to keep consistent with other os router like BSD, it's also
necessary to implement it in the route rx path.
Note that route cache needs to be flushed when bc_forwarding is
changed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some very small BDPs (with just a few packets) there was a
quantization effect where the target number of packets in flight
during the super-unity-gain (1.25x) phase of gain cycling was
implicitly truncated to a number of packets no larger than the normal
unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow
scenarios some flows could get stuck with a lower bandwidth, because
they did not push enough packets inflight to discover that there was
more bandwidth available. This was really only an issue in multi-flow
LAN scenarios, where RTTs and BDPs are low enough for this to be an
issue.
This fix ensures that gain cycling can raise inflight for small BDPs
by ensuring that in PROBE_BW mode target inflight values with a
super-unity gain are always greater than inflight values with a gain
<= 1. Importantly, this applies whether the inflight value is
calculated for use as a cwnd value, or as a target inflight value for
the end of the super-unity phase in bbr_is_next_cycle_phase() (both
need to be bigger to ensure we can probe with more packets in flight
reliably).
This is a candidate fix for stable releases.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove BUG_ON() from fib_compute_spec_dst routine and check
in_dev pointer during flowi4 data structure initialization.
fib_compute_spec_dst routine can be run concurrently with device removal
where ip_ptr net_device pointer is set to NULL. This can happen
if userspace enables pkt info on UDP rx socket and the device
is removed while traffic is flowing
Fixes: 35ebf65e85 ("ipv4: Create and use fib_compute_spec_dst() helper")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-07-27
1) Extend the output_mark to also support the input direction
and masking the mark values before applying to the skb.
2) Add a new lookup key for the upcomming xfrm interfaces.
3) Extend the xfrm lookups to match xfrm interface IDs.
4) Add virtual xfrm interfaces. The purpose of these interfaces
is to overcome the design limitations that the existing
VTI devices have.
The main limitations that we see with the current VTI are the
following:
VTI interfaces are L3 tunnels with configurable endpoints.
For xfrm, the tunnel endpoint are already determined by the SA.
So the VTI tunnel endpoints must be either the same as on the
SA or wildcards. In case VTI tunnel endpoints are same as on
the SA, we get a one to one correlation between the SA and
the tunnel. So each SA needs its own tunnel interface.
On the other hand, we can have only one VTI tunnel with
wildcard src/dst tunnel endpoints in the system because the
lookup is based on the tunnel endpoints. The existing tunnel
lookup won't work with multiple tunnels with wildcard
tunnel endpoints. Some usecases require more than on
VTI tunnel of this type, for example if somebody has multiple
namespaces and every namespace requires such a VTI.
VTI needs separate interfaces for IPv4 and IPv6 tunnels.
So when routing to a VTI, we have to know to which address
family this traffic class is going to be encapsulated.
This is a lmitation because it makes routing more complex
and it is not always possible to know what happens behind the
VTI, e.g. when the VTI is move to some namespace.
VTI works just with tunnel mode SAs. We need generic interfaces
that ensures transfomation, regardless of the xfrm mode and
the encapsulated address family.
VTI is configured with a combination GRE keys and xfrm marks.
With this we have to deal with some extra cases in the generic
tunnel lookup because the GRE keys on the VTI are actually
not GRE keys, the GRE keys were just reused for something else.
All extensions to the VTI interfaces would require to add
even more complexity to the generic tunnel lookup.
So to overcome this, we developed xfrm interfaces with the
following design goal:
It should be possible to tunnel IPv4 and IPv6 through the same
interface.
No limitation on xfrm mode (tunnel, transport and beet).
Should be a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind the
interface.
Interfaces should be configured with a new key that must match a
new policy/SA lookup key.
The lookup logic should stay in the xfrm codebase, no need to
change or extend generic routing and tunnel lookups.
Should be possible to use IPsec hardware offloads of the underlying
interface.
5) Remove xfrm pcpu policy cache. This was added after the flowcache
removal, but it turned out to make things even worse.
From Florian Westphal.
6) Allow to update the set mark on SA updates.
From Nathan Harold.
7) Convert some timestamps to time64_t.
From Arnd Bergmann.
8) Don't check the offload_handle in xfrm code,
it is an opaque data cookie for the driver.
From Shannon Nelson.
9) Remove xfrmi interface ID from flowi. After this pach
no generic code is touched anymore to do xfrm interface
lookups. From Benedict Wong.
10) Allow to update the xfrm interface ID on SA updates.
From Nathan Harold.
11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle.
From YueHaibing.
12) Return more detailed errors on xfrm interface creation.
From Benedict Wong.
13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR.
From the kbuild test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/ipv4/igmp.c:1391:6: warning:
symbol '__ip_mc_inc_group' was not declared. Should it be static?
Fixes: 6e2059b53f ("ipv4/igmp: init group mode as INCLUDE when join source group")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/ipv4/tcp_timer.c:25:5: warning:
symbol 'tcp_retransmit_stamp' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported a read beyond the end of the skb head when returning
IPV6_ORIGDSTADDR:
BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:113
kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
copy_to_user include/linux/uaccess.h:184 [inline]
put_cmsg+0x5ef/0x860 net/core/scm.c:242
ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
[..]
This logic and its ipv4 counterpart read the destination port from
the packet at skb_transport_offset(skb) + 4.
With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
packet that stores headers exactly up to skb_transport_offset(skb) in
the head and the remainder in a frag.
Call pskb_may_pull before accessing the pointer to ensure that it lies
in skb head.
Link: http://lkml.kernel.org/r/CAF=yD-LEJwZj5a1-bAAj2Oy_hKmGygV6rsJ_WOrAYnv-fnayiQ@mail.gmail.com
Reported-by: syzbot+9adb4b567003cac781f0@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files have extra line at end of file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.
I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fed ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb hash for locally generated ip[v6] fragments belonging
to the same datagram can vary in several circumstances:
* for connected UDP[v6] sockets, the first fragment get its hash
via set_owner_w()/skb_set_hash_from_sk()
* for unconnected IPv6 UDPv6 sockets, the first fragment can get
its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
auto_flowlabel is enabled
For the following frags the hash is usually computed via
skb_get_hash().
The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
scenario the egress tx queue can be selected on a per packet basis
via the skb hash.
It may also fool flow-oriented schedulers to place fragments belonging
to the same datagram in different flows.
Fix the issue by copying the skb hash from the head frag into
the others at fragmentation time.
Before this commit:
perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
perf script
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0
After this commit:
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
Fixes: b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf on xmit")
Fixes: 67800f9b1f ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two scenarios that we will restore deleted records. The first is
when device down and up(or unmap/remap). In this scenario the new filter
mode is same with previous one. Because we get it from in_dev->mc_list and
we do not touch it during device down and up.
The other scenario is when a new socket join a group which was just delete
and not finish sending status reports. In this scenario, we should use the
current filter mode instead of restore old one. Here are 4 cases in total.
old_socket new_socket before_fix after_fix
IN(A) IN(A) ALLOW(A) ALLOW(A)
IN(A) EX( ) TO_IN( ) TO_EX( )
EX( ) IN(A) TO_EX( ) ALLOW(A)
EX( ) EX( ) TO_EX( ) TO_EX( )
Fixes: 24803f38a5 (igmp: do not remove igmp souce list info when set link down)
Fixes: 1666d49e1d (mld: do not remove mld souce list info when set link down)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the mode parameter for igmp/igmp6_group_added as we can get it
from first parameter.
Fixes: 6e2059b53f (ipv4/igmp: init group mode as INCLUDE when join source group)
Fixes: c7ea20c9da (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create the tcp_clamp_rto_to_user_timeout() helper routine. To calculate
the correct rto, so that the TCP_USER_TIMEOUT socket option is more
accurate. Taking suggestions and feedback into account from
Eric Dumazet, Neal Cardwell and David Laight. Due to the 1st commit we
can avoid the msecs_to_jiffies() and jiffies_to_msecs() dance.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a seperate helper routine as per Neal Cardwells suggestion. To
be used by the final commit in this series and retransmits_timed_out().
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preparatory commit. Part of this series that improves the
socket TCP_USER_TIMEOUT option accuracy. Implement Eric Dumazets idea
to convert icsk->icsk_user_timeout from jiffies to msecs. To eliminate
the msecs_to_jiffies() and jiffies_to_msecs() dance in future.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree:
1) No need to set ttl from reject action for the bridge family, from
Taehee Yoo.
2) Use a fixed timeout for flow that are passed up from the flowtable
to conntrack, from Florian Westphal.
3) More preparation patches for tproxy support for nf_tables, from Mate
Eckl.
4) Remove unnecessary indirection in core IPv6 checksum function, from
Florian Westphal.
5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it.
From Florian Westphal.
6) socket match now selects socket infrastructure, instead of depending
on it. From Mate Eckl.
7) Patch series to simplify conntrack tuple building/parsing from packet
path and ctnetlink, from Florian Westphal.
8) Fetch timeout policy from protocol helpers, instead of doing it from
core, from Florian Westphal.
9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from
Florian Westphal.
10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES
respectively, instead of IPV6. Patch from Mate Eckl.
11) Add specific function for garbage collection in conncount,
from Yi-Hung Wei.
12) Catch number of elements in the connlimit list, from Yi-Hung Wei.
13) Move locking to nf_conncount, from Yi-Hung Wei.
14) Series of patches to add lockless tree traversal in nf_conncount,
from Yi-Hung Wei.
15) Resolve clash in matching conntracks when race happens, from
Martynas Pumputis.
16) If connection entry times out, remove template entry from the
ip_vs_conn_tab table to improve behaviour under flood, from
Julian Anastasov.
17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng.
18) Call abort from 2-phase commit protocol before requesting modules,
make sure this is done under the mutex, from Florian Westphal.
19) Grab module reference when starting transaction, also from Florian.
20) Dynamically allocate expression info array for pre-parsing, from
Florian.
21) Add per netns mutex for nf_tables, from Florian Westphal.
22) A couple of patches to simplify and refactor nf_osf code to prepare
for nft_osf support.
23) Break evaluation on missing socket, from Mate Eckl.
24) Allow to match socket mark from nft_socket, from Mate Eckl.
25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is
built-in into nf_conntrack. From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:
""" When receiving packets, the CE codepoint MUST be processed as follows:
1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
true and send an immediate ACK.
2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
to false and send an immediate ACK.
"""
Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.
Tested with this packetdrill script provided by Larry Brakmo
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).
Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.
The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor and create helpers to send the special ACK in DCTCP.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offload_handle should be an opaque data cookie for the driver
to use, much like the data cookie for a timer or alarm callback.
Thus, the XFRM stack should not be checking for non-zero, because
the driver might use that to store an array reference, which could
be zero, or some other zero but meaningful value.
We can remove the checks for non-zero because there are plenty
other attributes also being checked to see if there is an offload
in place for the SA in question.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Attempt to make cryptic TCP seq number error messages clearer by
(1) identifying the source of the message as "TCP", (2) identifying the
errors as "seq # bug", and (3) grouping the field identifiers and values
by separating them with commas.
E.g., the following message is changed from:
recvmsg bug 2: copied 73BCB6CD seq 70F17CBE rcvnxt 73BCB9AA fl 0
WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:1881 tcp_recvmsg+0x649/0xb90
to:
TCP recvmsg seq # bug 2: copied 73BCB6CD, seq 70F17CBE, rcvnxt 73BCB9AA, fl 0
WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:2011 tcp_recvmsg+0x694/0xba0
Suggested-by: 積丹尼 Dan Jacobson <jidanni@jidanni.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto
abstraction.
This gets rid of all l3proto indirect calls and the need to do
a lookup on the function to call for l3 demux.
It increases module size by only a small amount (12kbyte), so this reduces
size because nf_conntrack.ko is useless without either nf_conntrack_ipv4
or nf_conntrack_ipv6 module.
before:
text data bss dec hex filename
7357 1088 0 8445 20fd nf_conntrack_ipv4.ko
7405 1084 4 8493 212d nf_conntrack_ipv6.ko
72614 13689 236 86539 1520b nf_conntrack.ko
19K nf_conntrack_ipv4.ko
19K nf_conntrack_ipv6.ko
179K nf_conntrack.ko
after:
text data bss dec hex filename
79277 13937 236 93450 16d0a nf_conntrack.ko
191K nf_conntrack.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Correct previous bad attempt at allowing sockets to come out of TCP
repair without sending window probes. To avoid changing size of
the repair variable in struct tcp_sock, this lets the decision for
sending probes or not to be made when coming out of repair by
introducing two ways to turn it off.
v2:
* Remove erroneous comment; defines now make behavior clear
Fixes: 70b7ff1302 ("tcp: allow user to create repair socket without window probes")
Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on RFC3376 5.1
If no interface
state existed for that multicast address before the change (i.e., the
change consisted of creating a new per-interface record), or if no
state exists after the change (i.e., the change consisted of deleting
a per-interface record), then the "non-existent" state is considered
to have a filter mode of INCLUDE and an empty source list.
Which means a new multicast group should start with state IN().
Function ip_mc_join_group() works correctly for IGMP ASM(Any-Source Multicast)
mode. It adds a group with state EX() and inits crcount to mc_qrv,
so the kernel will send a TO_EX() report message after adding group.
But for IGMPv3 SSM(Source-specific multicast) JOIN_SOURCE_GROUP mode, we
split the group joining into two steps. First we join the group like ASM,
i.e. via ip_mc_join_group(). So the state changes from IN() to EX().
Then we add the source-specific address with INCLUDE mode. So the state
changes from EX() to IN(A).
Before the first step sends a group change record, we finished the second
step. So we will only send the second change record. i.e. TO_IN(A).
Regarding the RFC stands, we should actually send an ALLOW(A) message for
SSM JOIN_SOURCE_GROUP as the state should mimic the 'IN() to IN(A)'
transition.
The issue was exposed by commit a052517a8f ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we used to send both ALLOW(A) and TO_IN(A). After this change we only send
TO_IN(A).
Fix it by adding a new parameter to init group mode. Also add new wrapper
functions so we don't need to change too much code.
v1 -> v2:
In my first version I only cleared the group change record. But this is not
enough. Because when a new group join, it will init as EXCLUDE and trigger
an filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses' sf_crcount. This will prevent early joined address sending state
change records if multi source addressed joined at the same time.
In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.
Fixes: a052517a8f ("net/multicast: should not send source list records when have filter mode change")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not needed, we can have the l4trackers fetch it themselvs.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Handle it in the core instead.
ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this
doesn't create an ipv6 dependency.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its simpler to just handle it directly in nf_ct_invert_tuple().
Also gets rid of need to pass l3proto pointer to resolve_conntrack().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
handle everything from ctnetlink directly.
After all these years we still only support ipv4 and ipv6, so it
seems reasonable to remove l3 protocol tracker support and instead
handle ipv4/ipv6 from a common, always builtin inet tracker.
Step 1: Get rid of all the l3proto->func() calls.
Start with ctnetlink, then move on to packet-path ones.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
allows to make nf_ip_checksum_partial static, it no longer
has an external caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prevent coalescing of decrypted and encrypted SKBs in GRO
and TCP layer.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-15
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new TCP-BPF callback that is called on listen(2) right after socket
transition to TCP_LISTEN state.
It fills the gap for listening sockets in TCP-BPF. For example BPF
program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening
and track later transition from TCP_LISTEN to TCP_CLOSE with
BPF_SOCK_OPS_STATE_CB callback.
Before there was no way to do it with TCP-BPF and other options were
much harder to work with. E.g. socket state tracking can be done with
tracepoints (either raw or regular) but they can't be attached to cgroup
and their lifetime has to be managed separately.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
tcp_rcv_nxt_update() is already executed in tcp_data_queue().
This line is redundant.
See bellow,
tcp_queue_rcv
tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, when a data segment was sent an ACK was piggybacked
on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
event to notify congestion control modules. So the DCTCP
ca->delayed_ack_reserved flag could incorrectly stay set when
in fact there were no delayed ACKs being reserved. This could result
in sending a special ECN notification ACK that carries an older
ACK sequence, when in fact there was no need for such an ACK.
DCTCP keeps track of the delayed ACK status with its own separate
state ca->delayed_ack_reserved. Previously it may accidentally cancel
the delayed ACK without updating this field upon sending a special
ACK that carries a older ACK sequence. This inconsistency would
lead to DCTCP receiver never acknowledging the latest data until the
sender times out and retry in some cases.
Packetdrill script (provided by Larry Brakmo)
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 2:3(1) ack 2001
0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect01] . 3:3(0) ack 4001
0.210 < [ce] P. 4001:4501(500) ack 3 win 257
+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501
+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
// Previously the ACK sequence below would be 4501, causing a long RTO
+0.040~+0.045 > [ect01] . 4:4(0) ack 5501 // delayed ack
+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257 // More data
+0 > [ect01] . 4:4(0) ack 6501 // now acks everything
+0.500 < F. 9501:9501(0) ack 4 win 257
Reported-by: Larry Brakmo <brakmo@fb.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass
full packet and real vif id when the incoming interface is wrong.
While the RP and FHR are setting up state we need to be sending the
registers encapsulated with all the data inside otherwise we lose it.
The RP then decapsulates it and forwards it to the interested parties.
Currently with WRONGVIF we can only be sending empty register packets
and will lose that data.
This behaviour can be enabled by using MRT_PIM with
val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from
happening, it happens in addition to it, also it is controlled by the same
throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently).
Both messages are generated to keep backwards compatibily and avoid
breaking someone who was enabling MRT_PIM with val == 4, since any
positive val is accepted and treated the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5fa12739a5 ("net: ipv4: listify ip_rcv_finish") calling
dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls
dst_input(skb) in a loop.
The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input(). Further more we need to
clear skb->next as other parts of the network stack use another kind
of SKB lists for xmit_more (see dev_hard_start_xmit).
A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB-list through to ip_forward.
Fixes: 5fa12739a5 ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.
A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.
While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().
To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.
Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under rare conditions where repair code may be used it is possible that
window probes are either unnecessary or undesired. If the user knows that
window probes are not wanted or needed this change allows them to skip
sending them when a socket comes out of repair.
Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug where the sequence numbers of a socket created using
TCP repair functionality are lower than set after connect is called.
This occurs when the repair socket overlaps with a TIME-WAIT socket and
triggers the re-use code. The amount lower is equal to the number of times
that a particular IP/port set is re-used and then put back into TIME-WAIT.
Re-using the first time the sequence number is 1 lower, closing that socket
and then re-opening (with repair) a new socket with the same addresses/ports
puts the sequence number 2 lower than set via setsockopt. The third time is
3 lower, etc. I have not tested what the limit of this acrewal is, if any.
The fix is, if a socket is in repair mode, to respect the already set
sequence number and timestamp when it would have already re-used the
TIME-WAIT socket.
Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Congestion control algorithms, which access the rate sample
through the tcp_cong_control function, only have access to the maximum
of the send and receive interval, for cases where the acknowledgment
rate may be inaccurate due to ACK compression or decimation. Algorithms
may want to use send rates and receive rates as separate signals.
Signed-off-by: Deepti Raghavan <deeptir@mit.edu>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 74d4a8f8d3 ("tcp: remove sk_can_gso() use"), the code
doesn't care whether the interface supports SG.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree:
1) Missing module autoloadfor icmp and icmpv6 x_tables matches,
from Florian Westphal.
2) Possible non-linear access to TCP header from tproxy, from
Mate Eckl.
3) Do not allow rbtree to be used for single elements, this patch
moves all set backend into one single module since such thing
can only happen if hashtable module is explicitly blacklisted,
which should not ever be done.
4) Reject error and standard targets from nft_compat for sanity
reasons, they are never used from there.
5) Don't crash on double hashsize module parameter, from Andrey
Ryabinin.
6) Drop dst on skb before placing it in the fragmentation
reassembly queue, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In both tcp_splice_read() and tcp_recvmsg(), we already test
sock_flag(sk, SOCK_DONE) right before evaluating sk->sk_state,
so "!sock_flag(sk, SOCK_DONE)" is always true.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_zerocopy_receive() relies on tcp_inq() to limit number of bytes
requested by user.
syzbot found that after tcp_disconnect(), tcp_inq() was returning
a stale value (number of bytes in queue before the disconnect).
Note that after this patch, ioctl(fd, SIOCINQ, &val) is also fixed
and returns 0, so this might be a candidate for all known linux kernels.
While we are at this, we probably also should clear urg_data to
avoid other syzkaller reports after it discovers how to deal with
urgent data.
syzkaller repro :
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("224.0.0.1")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
send(3, ..., 4096, 0) = 4096
connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 128) = 0
getsockopt(3, SOL_TCP, TCP_ZEROCOPY_RECEIVE, ..., [16]) = 0 // CRASH
Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting the low threshold to 0 has no effect on frags allocation,
we need to clear high_thresh instead.
The code was pre-existent to commit 648700f76b ("inet: frags:
use rhashtables for reassembly units"), but before the above,
such assignment had a different role: prevent concurrent eviction
from the worker and the netns cleanup helper.
Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
frees it by calling inet_csk_reqsk_queue_drop_and_and_put in
tcp_abort, and then frees it again by calling sock_gen_put.
Since tcp_abort only has one caller, and all the other codepaths
in tcp_abort don't free the socket, just remove the free in that
function.
Cc: David Ahern <dsa@cumulusnetworks.com>
Tested: passes Android sock_diag_test.py, which exercises this codepath
Fixes: d7226c7a4d ("net: diag: Fix refcnt leak in error path destroying socket")
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin reported that icmp replies may not use the address on the device the
echo request is received if the destination address is broadcast. Instead
a route lookup is done without considering VRF context. Fix by setting
oif in flow struct to the master device if it is enslaved. That directs
the lookup to the VRF table. If the device is not enslaved, oif is still
0 so no affect.
Fixes: cd2fbe1b6b ("net: Use VRF device index for lookups on RX")
Reported-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that ipc(6)->gso_size is correctly initialized in all callers of
ip(6)_setup_cork, it is safe to unconditionally pass it to the cork.
Link: http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.
There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.
Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.
After this change the IPv4 and IPv6 logic is the same.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a silent out-of-bound read possibility that was present
because of the misuse of this function.
Mostly it was called with a struct udphdr *hp which had only the udphdr
part linearized by the skb_header_pointer, however
nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for
tcp specific attributes may be invalid.
Fixes: a583636a83 ("inet: refactor inet[6]_lookup functions to take skb")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The low and high values of the net.ipv4.ping_group_range sysctl were
being silently forced to the default disabled state when a write to the
sysctl contained GIDs that didn't map to the associated user namespace.
Confusingly, the sysctl's write operation would return success and then
a subsequent read of the sysctl would indicate that the low and high
values are the overflowgid.
This patch changes the behavior by clearly returning an error when the
sysctl write operation receives a GID range that doesn't map to the
associated user namespace. In such a situation, the previous value of
the sysctl is preserved and that range will be returned in a subsequent
read of the sysctl.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but
we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant
that ip_list_rcv_finish() would keep it on the list. Instead let's
move the l3mdev_ip_rcv() call into the caller, so that our response to
a steal can be different in the single packet path (return
NET_RX_SUCCESS) and the list path (forget this packet and continue).
Fixes: 5fa12739a5 ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nft_compat relies on xt_request_find_match to increment
refcount of the module that provides the match/target.
The (builtin) icmp matches did't set the module owner so it
was possible to rmmod ip(6)tables while icmp extensions were still in use.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
the skb, we can't use the list_cut_before() method; we can't even do a
list_del(&skb->list) in the drop case, because skb might have already been
freed and reused.
So instead, take each skb off the source list before processing, and add it
to the sublist afterwards if it wasn't freed or stolen.
Fixes: 5fa12739a5 net: ipv4: listify ip_rcv_finish
Fixes: 17266ee939 net: ipv4: listified version of ip_rcv
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().
For the raw fast path, just perform the copy at raw_send_hdrinc().
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
demux or route lookup. The last step, calling dst_input(skb), is left to
the caller; in the listified case, we split to form sublists with a common
dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
to struct dst_entry.
Early demux is an indirect call based on iph->protocol; this is another
opportunity for listification which is not taken here (it would require
slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.
It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple overlapping changes in stmmac driver.
Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.
2) Need to bump memory lock rlimit for test_sockmap bpf test, from
Yonghong Song.
3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.
4) Fix uninitialized read in nf_log, from Jann Horn.
5) Fix raw command length parsing in mlx5, from Alex Vesker.
6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
Varadhan.
7) Fix regressions in FIB rule matching during create, from Jason A.
Donenfeld and Roopa Prabhu.
8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.
9) More bpfilter build fixes/adjustments from Masahiro Yamada.
10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
Dangaard Brouer.
11) fib_tests.sh file permissions were broken, from Shuah Khan.
12) Make sure BH/preemption is disabled in data path of mac80211, from
Denis Kenzior.
13) Don't ignore nla_parse_nested() return values in nl80211, from
Johannes berg.
14) Properly account sock objects ot kmemcg, from Shakeel Butt.
15) Adjustments to setting bpf program permissions to read-only, from
Daniel Borkmann.
16) TCP Fast Open key endianness was broken, it always took on the host
endiannness. Whoops. Explicitly make it little endian. From Yuching
Cheng.
17) Fix prefix route setting for link local addresses in ipv6, from
David Ahern.
18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.
19) Various bpf sockmap fixes, from John Fastabend.
20) Use after free for GRO with ESP, from Sabrina Dubroca.
21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
Eric Biggers.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
qede: Adverstise software timestamp caps when PHC is not available.
qed: Fix use of incorrect size in memcpy call.
qed: Fix setting of incorrect eswitch mode.
qed: Limit msix vectors in kdump kernel to the minimum required count.
ipvlan: call dev_change_flags when ipvlan mode is reset
ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
net: fix use-after-free in GRO with ESP
tcp: prevent bogus FRTO undos with non-SACK flows
bpf: sockhash, add release routine
bpf: sockhash fix omitted bucket lock in sock_close
bpf: sockmap, fix smap_list_map_remove when psock is in many maps
bpf: sockmap, fix crash when ipv6 sock is added
net: fib_rules: bring back rule_exists to match rule during add
hv_netvsc: split sub-channel setup into async and sync
net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
atm: zatm: Fix potential Spectre v1
s390/qeth: consistently re-enable device features
s390/qeth: don't clobber buffer on async TX completion
s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
s390/qeth: fix race when setting MAC address
...
Since the addition of GRO for ESP, gro_receive can consume the skb and
return -EINPROGRESS. In that case, the lower layer GRO handler cannot
touch the skb anymore.
Commit 5f114163f2 ("net: Add a skb_gro_flush_final helper.") converted
some of the gro_receive handlers that can lead to ESP's gro_receive so
that they wouldn't access the skb when -EINPROGRESS is returned, but
missed other spots, mainly in tunneling protocols.
This patch finishes the conversion to using skb_gro_flush_final(), and
adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
GUE.
Fixes: 5f114163f2 ("net: Add a skb_gro_flush_final helper.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.
We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.
(FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
to better indicate its purpose but to keep this change minimal, it
will be done in another patch).
Besides burstiness and congestion control violations, this problem
can result in RTO loop: When the loss recovery is prematurely
undoed, only new data will be transmitted (if available) and
the next retransmission can occur only after a new RTO which in case
of multiple losses (that are not for consecutive packets) requires
one RTO per loss to recover.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sk_rmem_alloc is larger than the receive buffer and we can't
schedule more memory for it, the skb will be dropped.
In above situation, if this skb is put into the ofo queue,
LINUX_MIB_TCPOFODROP is incremented to track it.
While if this skb is put into the receive queue, there's no record.
So a new SNMP counter is introduced to track this behavior.
LINUX_MIB_TCPRCVQDROP: Number of packets meant to be queued in rcv queue
but dropped because socket rcvbuf limit hit.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open key could be stored in different endian based on the CPU.
Previously hosts in different endianness in a server farm using
the same key config (sysctl value) would produce different cookies.
This patch fixes it by always storing it as little endian to keep
same API for LE hosts.
Reported-by: Daniele Iamartino <danielei@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check the tunnel option type stored in tunnel flags when creating options
for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.
Make sure all users of the infrastructure set correct flags, for the BPF
helper we have to set all bits to keep backward compatibility.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.
Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.
But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.
[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Netfilter assumes that if the socket is present in the skb, then
it can be used because that reference is cleaned up while the skb
is crossing netns.
We want to change that to preserve the socket reference in a future
patch, so this is a preparation updating netfilter to check if the
socket netns matches before use it.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
about our recent patch removing ~16 quick acks after ECN events.
tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
but in the case the sender cwnd was lowered to 1, we do not want
to have a delayed ack for the next packet we will receive.
Fixes: 522040ea5f ("tcp: do not aggressively quick ack after ECN events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It will be helpful if we could display the drops due to zero window or no
enough window space.
So a new SNMP MIB entry is added to track this behavior.
This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
/proc/net/netstat in TcpExt line as TCPZeroWindowDrop.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manage pending per-NAPI GRO packets via list_head.
Return an SKB pointer from the GRO receive handlers. When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.
Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.
This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving multiple packets with the same ts ecr value, only try
to compute rcv_rtt sample with the earliest received packet.
This is because the rcv_rtt calculated by later received packets
could possibly include long idle time or other types of delay.
For example:
(1) server sends last packet of reply with TS val V1
(2) client ACKs last packet of reply with TS ecr V1
(3) long idle time passes
(4) client sends next request data packet with TS ecr V1 (again!)
At this time, the rcv_rtt computed on server with TS ecr V1 will be
inflated with the idle time and should get ignored.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.
This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code. rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
layer copies this to cork only if sk_type is SOCK_DGRAM. This check
proved too permissive. Ping and l2tp sockets have the same type.
Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
exclude ping sockets.
v1 -> v2
- remove irrelevant whitespace changes
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 69678bcd4d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
need to fail if dev_match is not true. Currently, a packet to a given port
can match a socket bound to device when it should not. In the VRF case,
this causes the lookup to hit a VRF socket and not a global socket
resulting in a response trying to go through the VRF when it should not.
Fixes: 3fa6f616a7 ("net: ipv4: add second dif to inet socket lookups")
Fixes: 4297a0ef08 ("net: ipv6: add second dif to inet6 socket lookups")
Reported-by: Lou Berger <lberger@labn.net>
Diagnosed-by: Renato Westphal <renato@opensourcerouting.org>
Tested-by: Renato Westphal <renato@opensourcerouting.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
commit 079096f103 ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.
But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly. These lower-level processing functions do not do any checksum
verification.
Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not necessary. skb_gro_receive() will never change what
'head' points to.
In it's original implementation (see commit 71d93b39e5 ("net: Add
skb_gro_receive")), it did:
====================
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
====================
This sequence was removed in commit 58025e46ea ("net: gro: remove
obsolete code from skb_gro_receive()")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Reject non-null terminated helper names from xt_CT, from Gao Feng.
2) Fix KASAN splat due to out-of-bound access from commit phase, from
Alexey Kodanev.
3) Missing conntrack hook registration on IPVS FTP helper, from Julian
Anastasov.
4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.
5) Fix inverted check on packet xmit to non-local addresses, also from
Julian.
6) Fix ebtables alignment compat problems, from Alin Nastac.
7) Hook mask checks are not correct in xt_set, from Serhey Popovych.
8) Fix timeout listing of element in ipsets, from Jozsef.
9) Cap maximum timeout value in ipset, also from Jozsef.
10) Don't allow family option for hash:mac sets, from Florent Fourcot.
11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this
Florian.
12) Another bug reported by KASAN in the rbtree set backend, from
Taehee Yoo.
13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
From Gao Feng.
14) Missing initialization of match/target in ebtables, from Florian
Westphal.
15) Remove useless nft_dup.h file in include path, from C. Labbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket which can stall progress
on the socket.
Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].
Fixes: d1361840f8 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 6b229cf77d ("udp: add batching to udp_rmem_release()")
the sk_rmem_alloc field does not measure exactly anymore the
receive queue length, because we batch the rmem release. The issue
is really apparent only after commit 0d4a6608f6 ("udp: do rmem bulk
free even if the rx sk queue is empty"): the user space can easily
check for an empty socket with not-0 queue length reported by the 'ss'
tool or the procfs interface.
We need to use a custom UDP helper to report the correct queue length,
taking into account the forward allocation deficit.
Reported-by: trevor.francis@46labs.com
Fixes: 6b229cf77d ("UDP: add batching to udp_rmem_release()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reports following splat:
BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450
net/bridge/netfilter/ebt_stp.c:162
ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162
xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506
ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline]
ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline]
The uninitialised access is
xt_mtchk_param->nft_compat
... which should be set to 0.
Fix it by zeroing the struct beforehand, same for tgchk.
ip(6)tables targetinfo uses c99-style initialiser, so no change
needed there.
Reported-by: syzbot+da4494182233c23a5fcf@syzkaller.appspotmail.com
Fixes: 55917a21d0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
By passing a limit of 2 bytes to strncat, strncat is limited to writing
fewer bytes than what it's supposed to append to the name here.
Since the bounds are checked on the line above this, just remove the string
bounds checks entirely since they're unneeded.
Signed-off-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-06-05
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
connect: "This allows to override source IP (including the case when it's
set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
TCP and connected UDP (fast path) are not affected. This makes UDP support
complete, that is, connected UDP is handled by connect hooks, unconnected
by sendmsg ones.", from Andrey.
2) Rework of the AF_XDP API to allow extending it in future for type writer
model if necessary. In this mode a memory window is passed to hardware
and multiple frames might be filled into that window instead of just one
that is the case in the current fixed frame-size model. With the new
changes made this can be supported without having to add a new descriptor
format. Also, core bits for the zero-copy support for AF_XDP have been
merged as agreed upon, where i40e bits will be routed via Jeff later on.
Various improvements to documentation and sample programs included as
well, all from Björn and Magnus.
3) Given BPF's flexibility, a new program type has been added to implement
infrared decoders. Quote: "The kernel IR decoders support the most
widely used IR protocols, but there are many protocols which are not
supported. [...] There is a 'long tail' of unsupported IR protocols,
for which lircd is need to decode the IR. IR encoding is done in such
a way that some simple circuit can decode it; therefore, BPF is ideal.
[...] user-space can define a decoder in BPF, attach it to the rc
device through the lirc chardev.", from Sean.
4) Several improvements and fixes to BPF core, among others, dumping map
and prog IDs into fdinfo which is a straight forward way to correlate
BPF objects used by applications, removing an indirect call and therefore
retpoline in all map lookup/update/delete calls by invoking the callback
directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
for tc BPF programs to have an efficient way of looking up cgroup v2 id
for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
state that hasn't been filled, to allow context access wrt pt_regs in
32 bit archs for tracing, and last but not least various test cases
for fixes that landed in bpf earlier, from Daniel.
5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.
6) Add a new bpf_get_current_cgroup_id() helper that can be used in
tracing to retrieve the cgroup id from the current process in order
to allow for e.g. aggregation of container-level events, from Yonghong.
7) Two follow-up fixes for BTF to reject invalid input values and
related to that also two test cases for BPF kselftests, from Martin.
8) Various API improvements to the bpf_fib_lookup() helper, that is,
dropping MPLS bits which are not fully hashed out yet, rejecting
invalid helper flags, returning error for unsupported address
families as well as renaming flowlabel to flowinfo, from David.
9) Various fixes and improvements to sockmap BPF kselftests in particular
in proper error detection and data verification, from Prashant.
10) Two arm32 BPF JIT improvements. One is to fix imm range check with
regards to whether immediate fits into 24 bits, and a naming cleanup
to get functions related to rsh handling consistent to those handling
lsh, from Wang.
11) Two compile warning fixes in BPF, one for BTF and a false positive
to silent gcc in stack_map_get_build_id_offset(), from Arnd.
12) Add missing seg6.h header into tools include infrastructure in order
to fix compilation of BPF kselftests, from Mathieu.
13) Several formatting cleanups in the BPF UAPI helper description that
also fix an error during rst2man compilation, from Quentin.
14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
not built into the kernel, from Yue.
15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0bbbf0e7d0 ("ipmr, ip6mr: Unite creation of new mr_table")
refactored ipmr_new_table, so that it now returns NULL when
mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
expect an ERR_PTR.
This can result in NULL deref, for example when ipmr_rules_exit calls
ipmr_free_table with NULL net->ipv4.mrt in the
!CONFIG_IP_MROUTE_MULTIPLE_TABLES version.
This patch makes mr_table_alloc return errors, and changes
ip6mr_new_table and its callers to return/expect error pointers as
well. It also removes the version of mr_table_alloc defined under
!CONFIG_IP_MROUTE_COMMON, since it is never used.
Fixes: 0bbbf0e7d0 ("ipmr, ip6mr: Unite creation of new mr_table")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is now possible to enable the libified nf_tproxy modules without
also enabling NETFILTER_XT_TARGET_TPROXY, which throws off the
ifdef logic in the udp core code:
net/ipv6/netfilter/nf_tproxy_ipv6.o: In function `nf_tproxy_get_sock_v6':
nf_tproxy_ipv6.c:(.text+0x1a8): undefined reference to `udp6_lib_lookup'
net/ipv4/netfilter/nf_tproxy_ipv4.o: In function `nf_tproxy_get_sock_v4':
nf_tproxy_ipv4.c:(.text+0x3d0): undefined reference to `udp4_lib_lookup'
We can actually simplify the conditions now to provide the two functions
exactly when they are needed.
Fixes: 45ca4e0cf2 ("netfilter: Libify xt_TPROXY")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested: 'git grep tw_timeout' comes up empty and it builds :-)
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor tcp_ecn_check_ce and __tcp_ecn_check_ce to accept struct sock*
instead of tcp_sock* to clean up type casts. This is a pure refactor
patch.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
to an integer.
It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
while 2 enables timewait socket reuse only for sockets that we can
prove are loopback connections:
ie. bound to 'lo' interface or where one of source or destination
IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.
This enables quicker reuse of ephemeral ports for loopback connections
- where tcp_tw_reuse is 100% safe from a protocol perspective
(this assumes no artificially induced packet loss on 'lo').
This also makes estblishing many loopback connections *much* faster
(allocating ports out of the first half of the ephemeral port range
is significantly faster, then allocating from the second half)
Without this change in a 32K ephemeral port space my sample program
(it just establishes and closes [::1]:ephemeral -> [::1]:server_port
connections in a tight loop) fails after 32765 connections in 24 seconds.
With it enabled 50000 connections only take 4.7 seconds.
This is particularly problematic for IPv6 where we only have one local
address and cannot play tricks with varying source IP from 127.0.0.0/8
pool.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Wei Wang <weiwan@google.com>
Change-Id: I0377961749979d0301b7b62871a32a4b34b654e1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull aio updates from Al Viro:
"Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.
The only thing I'm holding back for a day or so is Adam's aio ioprio -
his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
but let it sit in -next for decency sake..."
* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
aio: sanitize the limit checking in io_submit(2)
aio: fold do_io_submit() into callers
aio: shift copyin of iocb into io_submit_one()
aio_read_events_ring(): make a bit more readable
aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
aio: take list removal to (some) callers of aio_complete()
aio: add missing break for the IOCB_CMD_FDSYNC case
random: convert to ->poll_mask
timerfd: convert to ->poll_mask
eventfd: switch to ->poll_mask
pipe: convert to ->poll_mask
crypto: af_alg: convert to ->poll_mask
net/rxrpc: convert to ->poll_mask
net/iucv: convert to ->poll_mask
net/phonet: convert to ->poll_mask
net/nfc: convert to ->poll_mask
net/caif: convert to ->poll_mask
net/bluetooth: convert to ->poll_mask
net/sctp: convert to ->poll_mask
net/tipc: convert to ->poll_mask
...
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The extracted functions will likely be usefull to implement tproxy
support in nf_tables.
Extrancted functions:
- nf_tproxy_sk_is_transparent
- nf_tproxy_laddr4
- nf_tproxy_handle_time_wait4
- nf_tproxy_get_sock_v4
- nf_tproxy_laddr6
- nf_tproxy_handle_time_wait6
- nf_tproxy_get_sock_v6
(nf_)tproxy_handle_time_wait6 also needed some refactor as its current
implementation was xtables-specific.
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, the most relevant things in this batch are:
1) Compile masquerade infrastructure into NAT module, from Florian Westphal.
Same thing with the redirection support.
2) Abort transaction if early initialization of the commit phase fails.
Also from Florian.
3) Get rid of synchronize_rcu() by using rule array in nf_tables, from
Florian.
4) Abort nf_tables batch if fatal signal is pending, from Florian.
5) Use .call_rcu nfnetlink from nf_tables to make dumps fully lockless.
From Florian Westphal.
6) Support to match transparent sockets from nf_tables, from Máté Eckl.
7) Audit support for nf_tables, from Phil Sutter.
8) Validate chain dependencies from commit phase, fall back to fine grain
validation only in case of errors.
9) Attach dst to skbuff from netfilter flowtable packet path, from
Jason A. Donenfeld.
10) Use artificial maximum attribute cap to remove VLA from nfnetlink.
Patch from Kees Cook.
11) Add extension to allow to forward packets through neighbour layer.
12) Add IPv6 conntrack helper support to IPVS, from Julian Anastasov.
13) Add IPv6 FTP conntrack support to IPVS, from Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit f6cc9c054e, the following conf is broken (note that the
default loopback mtu is 65536, ie IP_MAX_MTU + 1):
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev lo
add tunnel "gre0" failed: Invalid argument
$ ip l a type dummy
$ ip l s dummy1 up
$ ip l s dummy1 mtu 65535
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev dummy1
add tunnel "gre0" failed: Invalid argument
dev_set_mtu() doesn't allow to set a mtu which is too large.
First, let's cap the mtu returned by ip_tunnel_bind_dev(). Second, remove
the magic value 0xFFF8 and use IP_MAX_MTU instead.
0xFFF8 seems to be there for ages, I don't know why this value was used.
With a recent kernel, it's also possible to set a mtu > IP_MAX_MTU:
$ ip l s dummy1 mtu 66000
After that patch, it's also possible to bind an ip tunnel on that kind of
interface.
CC: Petr Machata <petrm@mellanox.com>
CC: Ido Schimmel <idosch@mellanox.com>
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
Fixes: f6cc9c054e ("ip_tunnel: Emit events for post-register MTU changes")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is additional to the
commit ea1627c20c ("tcp: minor optimizations around tcp_hdr() usage").
At this point, skb->data is same with tcp_hdr() as tcp header has not
been pulled yet. So use the less expensive one to get the tcp header.
Remove the third parameter of tcp_rcv_established() and put it into
the function body.
Furthermore, the local variables are listed as a reverse christmas tree :)
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for IFA_RT_PRIORITY to ipv4 addresses.
If the metric is changed on an existing address then the new route
is inserted before removing the old one. Since the metric is one
of the route keys, the prefix route can not be replaced.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/ipv4/bpfilter/sockopt.c:13:5: warning:
symbol 'bpfilter_mbox_request' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using extra modules for these, turn the config options into
an implicit dependency that adds masq feature to the protocol specific nf_nat module.
before:
text data bss dec hex filename
2001 860 4 2865 b31 net/ipv4/netfilter/nf_nat_masquerade_ipv4.ko
5579 780 2 6361 18d9 net/ipv4/netfilter/nf_nat_ipv4.ko
2860 836 8 3704 e78 net/ipv6/netfilter/nf_nat_masquerade_ipv6.ko
6648 780 2 7430 1d06 net/ipv6/netfilter/nf_nat_ipv6.ko
after:
text data bss dec hex filename
7245 872 8 8125 1fbd net/ipv4/netfilter/nf_nat_ipv4.ko
9165 848 12 10025 2729 net/ipv6/netfilter/nf_nat_ipv6.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In addition to already existing BPF hooks for sys_bind and sys_connect,
the patch provides new hooks for sys_sendmsg.
It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
that provides access to socket itlself (properties like family, type,
protocol) and user-passed `struct sockaddr *` so that BPF program can
override destination IP and port for system calls such as sendto(2) or
sendmsg(2) and/or assign source IP to the socket.
The hooks are implemented as two new attach types:
`BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
UDPv6 correspondingly.
UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
sys_connect hooks, i.e. to prevent reading from / writing to e.g.
user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
The difference with already existing hooks is sys_sendmsg are
implemented only for unconnected UDP.
For TCP it doesn't make sense to change user-provided `struct sockaddr *`
at sendto(2)/sendmsg(2) time since socket either was already connected
and has source/destination set or wasn't connected and call to
sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
Connected UDP is already handled by sys_connect hooks that can override
source/destination at connect time and use fast-path later, i.e. these
hooks don't affect UDP fast-path.
Rewriting source IP is implemented differently than that in sys_connect
hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
just bind socket to desired local IP address since source IP can be set
on per-packet basis by using ancillary data (cmsg(3)). So no matter if
socket is bound or not, source IP has to be rewritten on every call to
sys_sendmsg.
To do so two new fields are added to UAPI `struct bpf_sock_addr`;
* `msg_src_ip4` to set source IPv4 for UDPv4;
* `msg_src_ip6` to set source IPv6 for UDPv6.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tracepoint does not add value and the call to fib_lookup follows
it which shows the same information and the fib lookup result.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 4a2d73a4fb ("ipv4: fib_rules: support match on sport, dport
and ip proto") added support for protocol and ports to FIB rules.
Update the FIB lookup tracepoint to dump the parameters.
In addition, make the IPv4 tracepoint similar to the IPv6 one where
the lookup parameters and result are dumped in 1 event. It is much
easier to use and understand the outcome of the lookup.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-05-24
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).
2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.
3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.
4) Jiong Wang adds support for indirect and arithmetic shifts to NFP
5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.
6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.
7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.
8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.
9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A precondition check in ip_recv_error triggered on an otherwise benign
race. Remove the warning.
The warning triggers when passing an ipv6 socket to this ipv4 error
handling function. RaceFuzzer was able to trigger it due to a race
in setsockopt IPV6_ADDRFORM.
---
CPU0
do_ipv6_setsockopt
sk->sk_socket->ops = &inet_dgram_ops;
---
CPU1
sk->sk_prot->recvmsg
udp_recvmsg
ip_recv_error
WARN_ON_ONCE(sk->sk_family == AF_INET6);
---
CPU0
do_ipv6_setsockopt
sk->sk_family = PF_INET;
This socket option converts a v6 socket that is connected to a v4 peer
to an v4 socket. It updates the socket on the fly, changing fields in
sk as well as other structs. This is inherently non-atomic. It races
with the lockless udp_recvmsg path.
No other code makes an assumption that these fields are updated
atomically. It is benign here, too, as ip_recv_error cares only about
the protocol of the skbs enqueued on the error queue, for which
sk_family is not a precise predictor (thanks to another isue with
IPV6_ADDRFORM).
Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr
Fixes: 7ce875e5ec ("ipv4: warn once on passing AF_INET6 socket to ip_recv_error")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, they are:
1) Remove obsolete nf_log tracing from nf_tables, from Florian Westphal.
2) Add support for map lookups to numgen, random and hash expressions,
from Laura Garcia.
3) Allow to register nat hooks for iptables and nftables at the same
time. Patchset from Florian Westpha.
4) Timeout support for rbtree sets.
5) ip6_rpfilter works needs interface for link-local addresses, from
Vincent Bernat.
6) Add nf_ct_hook and nf_nat_hook structures and use them.
7) Do not drop packets on packets raceing to insert conntrack entries
into hashes, this is particularly a problem in nfqueue setups.
8) Address fallout from xt_osf separation to nf_osf, patches
from Florian Westphal and Fernando Mancera.
9) Remove reference to struct nft_af_info, which doesn't exist anymore.
From Taehee Yoo.
This batch comes with is a conflict between 25fd386e0b ("netfilter:
core: add missing __rcu annotation") in your tree and 2c205dd398
("netfilter: add struct nf_nat_hook and use it") coming in this batch.
This conflict can be solved by leaving the __rcu tag on
__netfilter_net_init() - added by 25fd386e0b - and remove all code
related to nf_nat_decode_session_hook - which is gone after
2c205dd398, as described by:
diff --cc net/netfilter/core.c
index e0ae4aae96f5,206fb2c4c319..168af54db975
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@@ -611,7 -580,13 +611,8 @@@ const struct nf_conntrack_zone nf_ct_zo
EXPORT_SYMBOL_GPL(nf_ct_zone_dflt);
#endif /* CONFIG_NF_CONNTRACK */
- static void __net_init __netfilter_net_init(struct nf_hook_entries **e, int max)
-#ifdef CONFIG_NF_NAT_NEEDED
-void (*nf_nat_decode_session_hook)(struct sk_buff *, struct flowi *);
-EXPORT_SYMBOL(nf_nat_decode_session_hook);
-#endif
-
+ static void __net_init
+ __netfilter_net_init(struct nf_hook_entries __rcu **e, int max)
{
int h;
I can also merge your net-next tree into nf-next, solve the conflict and
resend the pull request if you prefer so.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup to fib rules sport, dport and ipproto
match support. Only supports tcp, udp and icmp for ipproto.
Used by fib rule self tests.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP GSO delays final datagram construction to the GSO layer. This
conflicts with protocol transformations.
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
CC: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
and user mode helper code that is embedded into bpfilter.ko
The steps to build bpfilter.ko are the following:
- main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
- with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
is converted into bpfilter_umh.o object file
with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
Example:
$ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
- bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
bpfilter_kern.c is a normal kernel module code that calls
the fork_usermode_blob() helper to execute part of its own data
as a user mode process.
Notice that _binary_net_bpfilter_bpfilter_umh_start - end
is placed into .init.rodata section, so it's freed as soon as __init
function of bpfilter.ko is finished.
As part of __init the bpfilter.ko does first request/reply action
via two unix pipe provided by fork_usermode_blob() helper to
make sure that umh is healthy. If not it will kill it via pid.
Later bpfilter_process_sockopt() will be called from bpfilter hooks
in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
kill umh as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the packet rewrite and instantiation of nat NULL bindings
happens from the protocol specific nat backend.
Invocation occurs either via ip(6)table_nat or the nf_tables nat chain type.
Invocation looks like this (simplified):
NF_HOOK()
|
`---iptable_nat
|
`---> nf_nat_l3proto_ipv4 -> nf_nat_packet
|
new packet? pass skb though iptables nat chain
|
`---> iptable_nat: ipt_do_table
In nft case, this looks the same (nft_chain_nat_ipv4 instead of
iptable_nat).
This is a problem for two reasons:
1. Can't use iptables nat and nf_tables nat at the same time,
as the first user adds a nat binding (nf_nat_l3proto_ipv4 adds a
NULL binding if do_table() did not find a matching nat rule so we
can detect post-nat tuple collisions).
2. If you use e.g. nft_masq, snat, redir, etc. uses must also register
an empty base chain so that the nat core gets called fro NF_HOOK()
to do the reverse translation, which is neither obvious nor user
friendly.
After this change, the base hook gets registered not from iptable_nat or
nftables nat hooks, but from the l3 nat core.
iptables/nft nat base hooks get registered with the nat core instead:
NF_HOOK()
|
`---> nf_nat_l3proto_ipv4 -> nf_nat_packet
|
new packet? pass skb through iptables/nftables nat chains
|
+-> iptables_nat: ipt_do_table
+-> nft nat chain x
`-> nft nat chain y
The nat core deals with null bindings and reverse translation.
When no mapping exists, it calls the registered nat lookup hooks until
one creates a new mapping.
If both iptables and nftables nat hooks exist, the first matching
one is used (i.e., higher priority wins).
Also, nft users do not need to create empty nat hooks anymore,
nat core always registers the base hooks that take care of reverse/reply
translation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Will be used in followup patch when nat types no longer
use nf_register_net_hook() but will instead register with the nat core.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ip(6)tables nat table is currently receiving skbs from the netfilter
core, after a followup patch skbs will be coming from the netfilter nat
core instead, so the table is no longer backed by normal hook_ops.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Copy-pasted, both l3 helpers almost use same code here.
Split out the common part into an 'inet' helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.
This should reduce the extra load noticed in DCTCP environments,
after congestion events.
This is part 2 of our effort to reduce pure ACK packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to add finer control of the number of ACK packets sent after
ECN events.
This patch is not changing current behavior, it only enables following
change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Determine path MTU from a FIB lookup result. Logic is a distillation of
ip_dst_mtu_maybe_forward.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 20b654dfe1 ("tcp: support DUPACK threshold in RACK")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This per netns sysctl allows for TCP SACK compression fine-tuning.
This limits number of SACK that can be compressed.
Using 0 disables SACK compression.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This per netns sysctl allows for TCP SACK compression fine-tuning.
Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.
Sample output :
$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives 123250 0.0
IpOutRequests 3684 0.0
TcpInSegs 123251 0.0
TcpOutSegs 3684 0.0
TcpExtTCPAckCompressed 119252 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :
delay = min ( 5 % of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
A new SNMP counter is added in the following patch.
Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in commit 9f9843a751 ("tcp: properly handle stretch
acks in slow start"), TCP stacks have to consider how many packets
are acknowledged in one single ACK, because of GRO, but also
because of ACK compression or losses.
We plan to add SACK compression in the following patch, we
must therefore not call tcp_enter_quickack_mode()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device features may change during transmission. In particular with
corking, a device may toggle scatter-gather in between allocating
and writing to an skb.
Do not unconditionally assume that !NETIF_F_SG at write time implies
that the same held at alloc time and thus the skb has sufficient
tailroom.
This issue predates git history.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ERSPAN only support version 1 and 2. When packets send to an
erspan device which does not have proper version number set,
drop the packet. In real case, we observe multicast packets
sent to the erspan pernet device, erspan0, which does not have
erspan version configured.
Reported-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they have been only sent recently (for example
due to application limit). This patch would prohibit marking packets
sent within an RTT to be lost on RTO event, using similar logic in
TCP RACK detection.
Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.
Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).
This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
to prepare the final RTO change.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.
But the inflight may not always be 0 on timeout if TCP changes to
mark packets lost based on packet sent time. Therefore we must
first mark the packet lost, then set the cwnd based on the
(updated) inflight.
This is not a pure refactor. Congestion control may potentially
break if it uses (not yet updated) inflight to compute ssthresh.
Fortunately all existing congestion control modules does not do that.
Also it changes the inflight when CA_LOSS_EVENT is called, and only
westwood processes such an event but does not use inflight.
This change has two other minor side benefits:
1) consistent with Fast Recovery s.t. the inflight is updated
first before tcp_enter_recovery flips state to CA_Recovery.
2) avoid intertwining loss marking with state update, making the
code more readable.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor using a new helper, tcp_timeout_mark_loss(), that marks packets
lost upon RTO.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).
The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.
This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a rewrite of NewReno loss recovery implementation that is
simpler and standalone for readability and better performance by
using less states.
Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not to be confused with the Reno
(AIMD) congestion control.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch disables RFC6675 loss detection and make sysctl
net.ipv4.tcp_recovery = 1 controls a binary choice between RACK
(1) or RFC6675 (0).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.
When the number of packets SACKed is greater or equal to the
threshold, RACK sets the reordering window to zero which would
immediately mark all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.
The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.
Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.
With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.
There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).
Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updating the FIB tracepoint for the recent change to allow rules using
the protocol and ports exposed a few places where the entries in the flow
struct are not initialized.
For __fib_validate_source add the call to fib4_rules_early_flow_dissect
since it is invoked for the input path. For netfilter, add the memset on
the flow struct to avoid future problems like this. In ip_route_input_slow
need to set the fields if the skb dissection does not happen.
Fixes: bfff486265 ("net: fib_rules: support for match on ip_proto, sport and dport")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variant of proc_create_data that directly take a seq_file show
callback and deals with network namespaces in ->open and ->release.
All callers of proc_create + single_open_net converted over, and
single_{open,release}_net are removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release. All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix handling of simultaneous open TCP connection in conntrack,
from Jozsef Kadlecsik.
2) Insufficient sanitify check of xtables extension names, from
Florian Westphal.
3) Skip unnecessary synchronize_rcu() call when transaction log
is already empty, from Florian Westphal.
4) Incorrect destination mac validation in ebt_stp, from Stephen
Hemminger.
5) xtables module reference counter leak in nft_compat, from
Florian Westphal.
6) Incorrect connection reference counting logic in IPVS
one-packet scheduler, from Julian Anastasov.
7) Wrong stats for 32-bits CPU in IPVS, also from Julian.
8) Calm down sparse error in netfilter core, also from Florian.
9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct
and nfnetlink_cthelper, again from Florian.
10) Missing module alias in icmp and icmp6 xtables extensions,
from Florian Westphal.
11) Base chain statistics in nf_tables may be unset/null, from Florian.
12) Fix handling of large matchinfo size in nft_compat, this includes
one preparation for before this fix. From Florian.
13) Fix bogus EBUSY error when deleting chains due to incorrect reference
counting from the preparation phase of the two-phase commit protocol.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_PROC_FS isn't set, variable ipconfig_dir isn't used.
net/ipv4/ipconfig.c:167:31: warning: ‘ipconfig_dir’ defined but not used [-Wunused-variable]
static struct proc_dir_entry *ipconfig_dir;
^~~~~~~~~~~~
Move the declaration of ipconfig_dir inside the CONFIG_PROC_FS ifdef to
fix the warning.
Fixes: c04d2cb200 ("ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf syscall and selftests conflicts were trivial
overlapping changes.
The r8169 change involved moving the added mdelay from 'net' into a
different function.
A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts. I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the truncated bit is set only when 1) the mirrored packet
is larger than mtu and 2) the ipv4 packet tot_len is larger than
the actual skb->len. This patch adds another case for detecting
whether ipv6 packet is truncated or not, by checking the ipv6 header
payload_len and the skb->len.
Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
linux-4.16 got support for softirq based hrtimers.
TCP can switch its pacing hrtimer to this variant, since this
avoids going through a tasklet and some atomic operations.
pacing timer logic looks like other (jiffies based) tcp timers.
v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers()
to correctly release reference on socket if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix more memory leaks in ip_cmsg_send() callers. Part of them were fixed
earlier in 919483096b.
* udp_sendmsg one was there since the beginning when linux sources were
first added to git;
* ping_v4_sendmsg one was copy/pasted in c319b4d76b.
Whenever return happens in udp_sendmsg() or ping_v4_sendmsg() IP options
have to be freed if they were allocated previously.
Add label so that future callers (if any) can use it instead of kfree()
before return that is easy to forget.
Fixes: c319b4d76b (net: ipv4: add IPPROTO_ICMP socket kind)
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This version has some suggestions by Eric Dumazet:
- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement.
- Factorize code as sk_fullsock() check is not necessary.
Aidan McGurn from Openwave Mobility systems reported the following bug:
"Marked routing is broken on customer deployment. Its effects are large
increase in Uplink retransmissions caused by the client never receiving
the final ACK to their FINACK - this ACK misses the mark and routes out
of the incorrect route."
Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).
Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then progate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.
EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
pr_debug calls and is also unnecessary as the exported string can
be used directly by these calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Damir reported a breakage of SO_BINDTODEVICE for UDP sockets.
In absence of VRF devices, after commit fb74c27735 ("net:
ipv4: add second dif to udp socket lookups") the dif mismatch
isn't fatal anymore for UDP socket lookup with non null
sk_bound_dev_if, breaking SO_BINDTODEVICE semantics.
This changeset addresses the issue making the dif match mandatory
again in the above scenario.
Reported-by: Damir Mansurov <dnman@oktetlabs.ru>
Fixes: fb74c27735 ("net: ipv4: add second dif to udp socket lookups")
Fixes: 1801b570dd ("net: ipv6: add second dif to udp socket lookups")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After route cache is flushed via ipv4_sysctl_rtcache_flush(), we forget
to reset fnhe_mtu_locked in rt_bind_exception(). When pmtu is updated
in __ip_rt_update_pmtu(), it will return directly since the pmtu is
still locked. e.g.
+ ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
>From 10.10.0.254 icmp_seq=1 Frag needed and DF set (mtu = 0)
Signed-off-by: David S. Miller <davem@davemloft.net>
No changes in refcount semantics -- key init is false; replace
static_key_enable with static_branch_enable
static_key_slow_inc|dec with static_branch_inc|dec
static_key_false with static_branch_unlikely
Added a '_key' suffix to udp and udpv6 encap_needed, for better
self documentation.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
No changes in refcount semantics -- key init is false; replace
static_key_slow_inc|dec with static_branch_inc|dec
static_key_false with static_branch_unlikely
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes it so that if a destructor is not present we avoid trying
to update the skb socket or any reference counting that would be associated
with the NULL socket and/or descriptor. By doing this we can support
traffic coming from another namespace without any issues.
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for a software provided checksum and GSO_PARTIAL
segmentation support. With this we can offload UDP segmentation on devices
that only have partial support for tunnels.
Since we are no longer needing the hardware checksum we can drop the checks
in the segmentation code that were verifying if it was present.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows us to take care of unrolling the first segment and the
last segment of the loop for processing the segmented skb. Part of the
motivation for this is that it makes it easier to process the fact that the
first fame and all of the frames in between should be mostly identical
in terms of header data, and the last frame has differences in the length
and partial checksum.
In addition I am dropping the header length calculation since we don't
really need it for anything but the last frame and it can be easily
obtained by just pulling the data_len and offset of tail from the transport
header.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to allow us to avoid having to recompute the checksum
from scratch and have it passed as a parameter.
Instead of taking that approach we can take advantage of the fact that the
length that was used to compute the existing checksum is included in the
UDP header.
Finally to avoid the need to invert the result we can just call csum16_add
and csum16_sub directly. By doing this we can avoid a number of
instructions in the loop that is handling segmentation.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in passing MSS as a parameter for for the GSO
segmentation call as it is already available via the shared info for the
skb itself.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to record the number of segments that will be generated when this
frame is segmented. The expectation is that if gso_size is set then
gso_segs is set as well. Without this some drivers such as ixgbe get
confused if they attempt to offload this as they record 0 segments for the
entire packet instead of the correct value.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The icmp matches are implemented in ip_tables and ip6_tables,
respectively, so for normal iptables they are always available:
those modules are loaded once iptables calls getsockopt() to fetch
available module revisions.
In iptables-over-nftables case probing occurs via nfnetlink, so
these modules might not be loaded. Add aliases so modprobe can load
these when icmp/icmp6 is requested.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:
1) Add Maglev support to IPVS. Moreover, store lastest server weight in
IPVS since this is needed by maglev, patches from from Inju Song.
2) Preparation works to add iptables flowtable support, patches
from Felix Fietkau.
3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
packet is seen via new teardown state, also from Felix.
4) Add support for extended netlink error reporting for nf_tables.
5) Support for larger timeouts that 23 days in nf_tables, patch from
Florian Westphal.
6) Always set an upper limit to dynamic sets, also from Florian.
7) Allow number generator to make map lookups, from Laura Garcia.
8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat.
9) Extend ip6tables SRH match to support previous, next and last SID,
from Ahmed Abdelsalam.
10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.
11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.
12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
from Taehee Yoo.
13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
Make rt and exthdr built-in too, again from Florian.
14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow some non-cached routes to use non-expired fnhe:
1. ip_del_fnhe: moved above and now called by find_exception.
The 4.5+ commit deed49df73 expires fnhe only when caching
routes. Change that to:
1.1. use fnhe for non-cached local output routes, with the help
from (2)
1.2. allow __mkroute_input to detect expired fnhe (outdated
fnhe_gw, for example) when do_cache is false, eg. when itag!=0
for unicast destinations.
2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
to use fnhe info even when the new route will not be cached into fnhe.
After commit 839da4d989 ("net: ipv4: set orig_oif based on fib
result for local traffic") it means all local routes will be affected
because they are not cached. This change is used to solve a PMTU
problem with IPVS (and probably Netfilter DNAT) setups that redirect
local clients from target local IP (local route to Virtual IP)
to new remote IP target, eg. IPVS TUN real server. Loopback has
64K MTU and we need to create fnhe on the local route that will
keep the reduced PMTU for the Virtual IP. Without this change
fnhe_pmtu is updated from ICMP but never exposed to non-cached
local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
with flowi4_oif=any for 4.14+).
3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
new entries
Fixes: 839da4d989 ("net: ipv4: set orig_oif based on fib result for local traffic")
Fixes: d6d5e999e5 ("route: do not cache fib route info on local routes with oif")
Fixes: deed49df73 ("route: check and remove route cache when we get route")
Cc: David Ahern <dsahern@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).
The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).
This commit is a stable candidate for kernels back as far as 4.9.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the udp_v4_check() function to calculate the pseudo header
for the newly segmented UDP packets results in assigning the complement
of the value to the UDP header checksum field.
Always undo the complement the partial checksum value in order to
match the case where GSO is not used on the UDP transmit path.
Fixes: ee80d1ebe5 ("udp: add udp gso")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.
Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.
With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
V3 change-log:
As suggested by David Miller, added loads with barrier
to check whether we have multiple threads calling recvmsg
in parallel. When that happens we lock the socket to
calculate inq.
V4 change-log:
Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.
RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
skb_gso_segment include/linux/netdevice.h:4038 [inline]
validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
__dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.
After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>