If the sender switches the congestion control during ECN-triggered
cwnd-reduction state (CA_CWR), upon exiting recovery cwnd is set to
the ssthresh value calculated by the previous congestion control. If
the previous congestion control is BBR that always keep ssthresh
to TCP_INIFINITE_SSTHRESH, cwnd ends up being infinite. The safe
step is to avoid assigning invalid ssthresh value when recovery ends.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c13ee2a4f0 ("tcp: reindent two spots after prequeue removal")
removed code in tcp_data_queue().
We can go a little farther, removing an always true test,
and removing initializers for fragstolen and eaten variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_loginfo structures are only passed as the seventh argument to
nf_log_trace, which is declared as const or stored in a local const
variable. Thus the nf_loginfo structures themselves can be const.
Done with the help of Coccinelle.
// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct nf_loginfo i@p = { ... };
@ok1@
identifier r.i;
expression list[6] es;
position p;
@@
nf_log_trace(es,&i@p,...)
@ok2@
identifier r.i;
const struct nf_loginfo *e;
position p;
@@
e = &i@p
@bad@
position p != {r.p,ok1.p,ok2.p};
identifier r.i;
struct nf_loginfo e;
@@
e@i@p
@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
struct nf_loginfo i = { ... };
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The target variable is not used in the compat_copy_entry_from_user().
So It can be removed.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPSec crypto offload depends on the protocol-specific
offload module (such as esp_offload.ko).
When the user installs an SA with crypto-offload, load
the offload module automatically, in the same way
that the protocol module is loaded (such as esp.ko)
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Keep the device's reported ip_summed indication in case crypto
was offloaded by the device. Subtract the csum values of the
stripped parts (esp header+iv, esp trailer+auth_data) to keep
value correct.
Note: CHECKSUM_COMPLETE should be indicated only if skb->csum
has the post-decryption offload csum value.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In the case that GRO is turned on and the original received packet is
CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last
csum-unnecessary point, which for instance could occur if the packet
comes from another Linux guest on the same Linux host, we have to do
either remcsum_adjust or set up CHECKSUM_PARTIAL again with its
csum_start properly reset considering RCO.
However, since b7fe10e5eb ("gro: Fix remcsum offload to deal with frags
in GRO") that barrier in such case could be skipped if GRO turned on,
hence we pass over it and the inner L4 validation mistakenly reckons
it as a bad csum.
This patch makes remcsum_offload being reset at the same time of GRO
remcsum cleanup, so as to make it work in such case as before.
Fixes: b7fe10e5eb ("gro: Fix remcsum offload to deal with frags in GRO")
Signed-off-by: Koichiro Den <den@klaipeden.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in for(),if((optlen > 0) && (optptr[1] == 0)), enter infinite loop.
Test: receive a packet which the ip length > 20 and the first byte of ip option is 0, produce this issue
Signed-off-by: yujuan.qi <yujuan.qi@mediatek.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.
These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>
Michał reported a NULL pointer deref during fib_sync_down_dev() when
unregistering a netdevice. The problem is that we don't check for
'in_dev' being NULL, which can happen in very specific cases.
Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
the inetaddr notification chains. However, if an interface isn't
configured with any IP address, then it's possible for host routes to be
flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
inetdev_destroy().
To reproduce:
$ ip link add type dummy
$ ip route add local 1.1.1.0/24 dev dummy0
$ ip link del dev dummy0
Fix this by checking for the presence of 'in_dev' before referencing it.
Fixes: 982acb9756 ("ipv4: fib: Notify about nexthop status changes")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
TCP_NLA_PACING_RATE
TCP_NLA_DELIVERY_RATE
TCP_NLA_SND_CWND
TCP_NLA_REORDERING
TCP_NLA_MIN_RTT
TCP_NLA_RECUR_RETRANS
TCP_NLA_DELIVERY_RATE_APP_LMT
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code to extract the function to compute delivery rate.
This function will be used in later commit.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
was used by tcp prequeue and header prediction.
TCPFORWARDRETRANS use was removed in january.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
re-indent tcp_ack, and remove CA_ACK_SLOWPATH; it is always set now.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like prequeue, I am not sure this is overly useful nowadays.
If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two branches are now always true, remove the conditional.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.
This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.
In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.
Even normal client applications (web browsers) commonly use many tcp
connections in parallel.
This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discussion during NFWS 2017 in Faro has shown that the current
conntrack behaviour is unreasonable.
Even if conntrack module is loaded on behalf of a single net namespace,
its turned on for all namespaces, which is expensive. Commit
481fa37347 ("netfilter: conntrack: add nf_conntrack_default_on sysctl")
attempted to provide an alternative to the 'default on' behaviour by
adding a sysctl to change it.
However, as Eric points out, the sysctl only becomes available
once the module is loaded, and then its too late.
So we either have to move the sysctl to the core, or, alternatively,
change conntrack to become active only once the rule set requires this.
This does the latter, conntrack is only enabled when a rule needs it.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If verdict is NF_STOLEN in the SYNPROXY target,
the skb is consumed.
However, ipt_do_table() always tries to get ip header from the skb.
So that, KASAN triggers the use-after-free message.
We can reproduce this message using below command.
# iptables -I INPUT -p tcp -j SYNPROXY --mss 1460
[ 193.542265] BUG: KASAN: use-after-free in ipt_do_table+0x1405/0x1c10
[ ... ]
[ 193.578603] Call Trace:
[ 193.581590] <IRQ>
[ 193.584107] dump_stack+0x68/0xa0
[ 193.588168] print_address_description+0x78/0x290
[ 193.593828] ? ipt_do_table+0x1405/0x1c10
[ 193.598690] kasan_report+0x230/0x340
[ 193.603194] __asan_report_load2_noabort+0x19/0x20
[ 193.608950] ipt_do_table+0x1405/0x1c10
[ 193.613591] ? rcu_read_lock_held+0xae/0xd0
[ 193.618631] ? ip_route_input_rcu+0x27d7/0x4270
[ 193.624348] ? ipt_do_table+0xb68/0x1c10
[ 193.629124] ? do_add_counters+0x620/0x620
[ 193.634234] ? iptable_filter_net_init+0x60/0x60
[ ... ]
After this patch, only when verdict is XT_CONTINUE,
ipt_do_table() tries to get ip header.
Also arpt_do_table() is modified because it has same bug.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We no longer place these on a list so they can be const.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a preparatory patch for adding fib support to the netdev family.
The netdev family receives the packets from ingress hook. At this point
we have no guarantee that the ip header is linear. So this patch
replaces ip_hdr with skb_header_pointer in order to address that
possible situation.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
false-positive warning:
net/ipv4/tcp_output.c: In function 'tcp_connect':
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
^~
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
I have opened a gcc bug for this, but distros have already shipped
compilers with this problem, and it's not clear yet whether there is
a way for gcc to avoid the warning. As the problem is related to the
bitfield access, this introduces a temporary variable to store the old
enum value.
I did not notice this warning earlier, since UBSAN is disabled when
building with COMPILE_TEST, and that was always turned on in both
allmodconfig and randconfig tests.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
sk reference is retrieved and used, but the relevant reference
count is leaked and the socket destructor is never called.
Beyond leaking the sk memory, if there are pending UDP packets
in the receive queue, even the related accounted memory is leaked.
In the long run, this will cause persistent forward allocation errors
and no UDP skbs (both ipv4 and ipv6) will be able to reach the
user-space.
Fix this by explicitly accessing the early demux reference before
the lookup, and properly decreasing the socket reference count
after usage.
Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
the now obsoleted comment about "socket cache".
The newly added code is derived from the current ipv4 code for the
similar path.
v1 -> v2:
fixed the __udp6_lib_rcv() return code for resubmission,
as suggested by Eric
Reported-by: Sam Edwards <CFSworks@gmail.com>
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Fixes: 5425077d73 ("net: ipv6: Add early demux handler for UDP unicast")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must use pre-processor conditional block or suitable accessors to
manipulate skb->sp elsewhere builds lacking the CONFIG_XFRM will break.
Fixes: dce4551cb2 ("udp: preserve head state for IP_CMSG_PASSSEC")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Moore reported a SELinux/IP_PASSSEC regression
caused by missing skb->sp at recvmsg() time. We need to
preserve the skb head state to process the IP_CMSG_PASSSEC
cmsg.
With this commit we avoid releasing the skb head state in the
BH even if a secpath is attached to the current skb, and stores
the skb status (with/without head states) in the scratch area,
so that we can access it at skb deallocation time, without
incurring in cache-miss penalties.
This also avoids misusing the skb CB for ipv6 packets,
as introduced by the commit 0ddf3fb2c4 ("udp: preserve
skb->dst if required for IP options processing").
Clean a bit the scratch area helpers implementation, to
reduce the code differences between 32 and 64 bits build.
Reported-by: Paul Moore <paul@paul-moore.com>
Fixes: 0a463c78d2 ("udp: avoid a cache miss on dequeue")
Fixes: 0ddf3fb2c4 ("udp: preserve skb->dst if required for IP options processing")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last (4th) argument of tcp_rcv_established() is redundant as it
always equals to skb->len and the skb itself is always passed as 2th
agrument. There is no reason to have it.
Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new NETDEV_UDP_TUNNEL_DROP_INFO event, similar to
NETDEV_UDP_TUNNEL_PUSH_INFO, to signal to un-offload ports.
This also adds udp_tunnel_drop_rx_port(), which calls
ndo_udp_tunnel_del.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If NETIF_F_RX_UDP_TUNNEL_PORT was disabled on a given netdevice, skip
the tunnel offload ndo call during tunnel port creation and deletion.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) BPF verifier signed/unsigned value tracking fix, from Daniel
Borkmann, Edward Cree, and Josef Bacik.
2) Fix memory allocation length when setting up calls to
->ndo_set_mac_address, from Cong Wang.
3) Add a new cxgb4 device ID, from Ganesh Goudar.
4) Fix FIB refcount handling, we have to set it's initial value before
the configure callback (which can bump it). From David Ahern.
5) Fix double-free in qcom/emac driver, from Timur Tabi.
6) A bunch of gcc-7 string format overflow warning fixes from Arnd
Bergmann.
7) Fix link level headroom tests in ip_do_fragment(), from Vasily
Averin.
8) Fix chunk walking in SCTP when iterating over error and parameter
headers. From Alexander Potapenko.
9) TCP BBR congestion control fixes from Neal Cardwell.
10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.
11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
Wang.
12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
from Gao Feng.
13) Cannot release skb->dst in UDP if IP options processing needs it.
From Paolo Abeni.
14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
Levin and myself.
15) Revert some rtnetlink notification changes that are causing
regressions, from David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
net: bonding: Fix transmit load balancing in balance-alb mode
rds: Make sure updates to cp_send_gen can be observed
net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
ipv4: initialize fib_trie prior to register_netdev_notifier call.
rtnetlink: allocate more memory for dev_set_mac_address()
net: dsa: b53: Add missing ARL entries for BCM53125
bpf: more tests for mixed signed and unsigned bounds checks
bpf: add test for mixed signed and unsigned bounds checks
bpf: fix up test cases with mixed signed/unsigned bounds
bpf: allow to specify log level and reduce it for test_verifier
bpf: fix mixed signed/unsigned derived min/max value bounds
ipv6: avoid overflow of offset in ip6_find_1stfragopt
net: tehuti: don't process data if it has not been copied from userspace
Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
NET: dwmac: Make dwmac reset unconditional
net: Zero terminate ifr_name in dev_ifname().
wireless: wext: terminate ifr name coming from userspace
netfilter: fix netfilter_net_init() return
...
Net stack initialization currently initializes fib-trie after the
first call to netdevice_notifier() call. In fact fib_trie initialization
needs to happen before first rtnl_register(). It does not cause any problem
since there are no devices UP at this moment, but trying to bring 'lo'
UP at initialization would make this assumption wrong and exposes the issue.
Fixes following crash
Call Trace:
? alternate_node_alloc+0x76/0xa0
fib_table_insert+0x1b7/0x4b0
fib_magic.isra.17+0xea/0x120
fib_add_ifaddr+0x7b/0x190
fib_netdev_event+0xc0/0x130
register_netdevice_notifier+0x1c1/0x1d0
ip_fib_init+0x72/0x85
ip_rt_init+0x187/0x1e9
ip_init+0xe/0x1a
inet_init+0x171/0x26c
? ipv4_offload_init+0x66/0x66
do_one_initcall+0x43/0x160
kernel_init_freeable+0x191/0x219
? rest_init+0x80/0x80
kernel_init+0xe/0x150
ret_from_fork+0x22/0x30
Code: f6 46 23 04 74 86 4c 89 f7 e8 ae 45 01 00 49 89 c7 4d 85 ff 0f 85 7b ff ff ff 31 db eb 08 4c 89 ff e8 16 47 01 00 48 8b 44 24 38 <45> 8b 6e 14 4d 63 76 74 48 89 04 24 0f 1f 44 00 00 48 83 c4 08
RIP: kmem_cache_alloc+0xcf/0x1c0 RSP: ffff9b1500017c28
CR2: 0000000000000014
Fixes: 7b1a74fdbb ("[NETNS]: Refactor fib initialization so it can handle multiple namespaces.")
Fixes: 7f9b80529b ("[IPV4]: fib hash|trie initialization")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adjusts the timeout formula to schedule the TCP loss probe
(TLP). The previous formula uses 2*SRTT or 1.5*RTT + DelayACKMax if
only one packet is in flight. It keeps a lower bound of 10 msec which
is too large for short RTT connections (e.g. within a data-center).
The new formula = 2*RTT + (inflight == 1 ? 200ms : 2ticks) which
performs better for short and fast connections.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we delete a netns with a CLUSTERIP rule, clusterip_net_exit() is
called first, removing /proc/net/ipt_CLUSTERIP.
Then clusterip_config_entry_put() is called from clusterip_tg_destroy(),
and tries to remove its entry under /proc/net/ipt_CLUSTERIP/.
Fix this by checking that the parent directory of the entry to remove
hasn't already been deleted.
The following triggers a KASAN splat (stealing the reproducer from
202f59afd4, thanks to Jianlin Shi and Xin Long):
ip netns add test
ip link add veth0_in type veth peer name veth0_out
ip link set veth0_in netns test
ip netns exec test ip link set lo up
ip netns exec test ip link set veth0_in up
ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \
CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
--local-node 1 --hashmode sourceip-sourceport
ip netns del test
Fixes: ce4ff76c15 ("netfilter: ipt_CLUSTERIP: make proc directory per net namespace")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Missing netlink message sanity check in nfnetlink, patch from
Mateusz Jurczyk.
2) We now have netfilter per-netns hooks, so let's kill global hook
infrastructure, this infrastructure is known to be racy with netns.
We don't care about out of tree modules. Patch from Florian Westphal.
3) find_appropriate_src() is buggy when colissions happens after the
conversion of the nat bysource to rhashtable. Also from Florian.
4) Remove forward chain in nf_tables arp family, it's useless and it is
causing quite a bit of confusion, from Florian Westphal.
5) nf_ct_remove_expect() is called with the wrong parameter, causing
kernel oops, patch from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric noticed that in udp_recvmsg() we still need to access
skb->dst while processing the IP options.
Since commit 0a463c78d2 ("udp: avoid a cache miss on dequeue")
skb->dst is no more available at recvmsg() time and bad things
will happen if we enter the relevant code path.
This commit address the issue, avoid clearing skb->dst if
any IP options are present into the relevant skb.
Since the IP CB is contained in the first skb cacheline, we can
test it to decide to leverage the consume_stateless_skb()
optimization, without measurable additional cost in the faster
path.
v1 -> v2: updated commit message tags
Fixes: 0a463c78d2 ("udp: avoid a cache miss on dequeue")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After rcu conversions performance degradation in forward tests isn't that
noticeable anymore.
See next patch for some numbers.
A followup patcg could then also remove genid from the policies
as we do not cache bundles anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
revert c386578f1c ("xfrm: Let the flowcache handle its size by default.").
Once we remove flow cache, we don't have a flow cache limit anymore.
We must not allow (virtually) unlimited allocations of xfrm dst entries.
Revert back to the old xfrm dst gc limits.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed in Faro during Netfilter Workshop 2017, RB trees can be
used with RCU, using a seqlock.
Note that net/rxrpc/conn_service.c is already using this.
This patch converts inetpeer from AVL tree to RB tree, since it allows
to remove private AVL implementation in favor of shared RB code.
$ size net/ipv4/inetpeer.before net/ipv4/inetpeer.after
text data bss dec hex filename
3195 40 128 3363 d23 net/ipv4/inetpeer.before
1562 24 0 1586 632 net/ipv4/inetpeer.after
The same technique can be used to speed up
net/netfilter/nft_set_rbtree.c (removing rwlock contention in fast path)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
arp packets cannot be forwarded.
They can be bridged, but then they can be filtered using
either ebtables or nftables bridge family.
The bridge netfilter exposes a "call-arptables" switch which
pushes packets into arptables, but lets not expose this for nftables, so better
close this asap.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes the following behavior: for connections that had no RTT sample
at the time of initializing congestion control, BBR was initializing
the pacing rate to a high nominal rate (based an a guess of RTT=1ms,
in case this is LAN traffic). Then BBR never adjusted the pacing rate
downward upon obtaining an actual RTT sample, if the connection never
filled the pipe (e.g. all sends were small app-limited writes()).
This fix adjusts the pacing rate upon obtaining the first RTT sample.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a corner case noticed by Eric Dumazet, where BBR's setting
sk->sk_pacing_rate to 0 during initialization could theoretically
cause packets in the sending host to hang if there were packets "in
flight" in the pacing infrastructure at the time the BBR congestion
control state is initialized. This could occur if the pacing
infrastructure happened to race with bbr_init() in a way such that the
pacer read the 0 rather than the immediately following non-zero pacing
rate.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to initialize the BBR pacing rate unconditionally,
based on the current cwnd and RTT estimate. This is a pure refactor,
but is needed for two following fixes.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to convert a BBR bandwidth and gain factor to a
pacing rate in bytes per second. This is a pure refactor, but is
needed for two following fixes.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bbr_set_pacing_rate(), which decides whether to cut the pacing
rate, there was some code that considered exiting STARTUP to be
equivalent to the notion of filling the pipe (i.e.,
bbr_full_bw_reached()). Specifically, as the code was structured,
exiting STARTUP and going into PROBE_RTT could cause us to cut the
pacing rate down to something silly and low, based on whatever
bandwidth samples we've had so far, when it's possible that all of
them have been small app-limited bandwidth samples that are not
representative of the bandwidth available in the path. (The code was
correct at the time it was written, but the state machine changed
without this spot being adjusted correspondingly.)
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some time ago David Woodhouse reported skb_under_panic
when we try to push ethernet header to fragmented ipv6 skbs.
It was fixed for ipv6 by Florian Westphal in
commit 1d325d217c ("ipv6: ip6_fragment: fix headroom tests and skb leak")
However similar problem still exist in ipv4.
It does not trigger skb_under_panic due paranoid check
in ip_finish_output2, however according to Alexey Kuznetsov
current state is abnormal and ip_fragment should be fixed too.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
callers can more safely get random bytes if they can block until the
CRNG is initialized.
Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllqXNUACgkQ8vlZVpUN
gaPtAgf/aUbXZuWYsDQzslHsbzEWi+qz4QgL885/w4L00pEImTTp91Q06SDxWhtB
KPvGnZHS3IofxBh2DC+6AwN6dPMoWDCfYhhO6po3FSz0DiPRIQCTuvOb8fhKY1X7
rTdDq2xtDxPGxJ25bMJtlrgzH2XlXPpVyPUeoc9uh87zUK5aesXpUn9kBniRexoz
ume+M/cDzPKkwNQpbLq8vzhNjoWMVv0FeW2akVvrjkkWko8nZLZ0R/kIyKQlRPdG
LZDXcz0oTHpDS6+ufEo292ZuWm2IGer2YtwHsKyCAsyEWsUqBz2yurtkSj3mAVyC
hHafyS+5WNaGdgBmg0zJxxwn5qxxLg==
=ua7p
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random updates from Ted Ts'o:
"Add wait_for_random_bytes() and get_random_*_wait() functions so that
callers can more safely get random bytes if they can block until the
CRNG is initialized.
Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: reorder READ_ONCE() in get_random_uXX
random: suppress spammy warnings about unseeded randomness
random: warn when kernel uses unseeded randomness
net/route: use get_random_int for random counter
net/neighbor: use get_random_u32 for 32-bit hash random
rhashtable: use get_random_u32 for hash_rnd
ceph: ensure RNG is seeded before using
iscsi: ensure RNG is seeded before use
cifs: use get_random_u32 for 32-bit lock random
random: add get_random_{bytes,u32,u64,int,long,once}_wait family
random: add wait_for_random_bytes() API
The ipmr_get_table() function doesn't return error pointers it returns
NULL on error.
Fixes: 4f75ba6982 ("net: ipmr: Add ipmr_rtm_getroute")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 6797318e62 ("tcp: md5: add an address prefix for key lookup")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix rtm policy name typo in mpls_getroute and also remove
export of rtm_ipv4_policy
Fixes: 397fc9e5ce ("mpls: route get support")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SYN-ACK responses on a server in response to a SYN from a client
did not get the injected skb mark that was tagged on the SYN packet.
Fixes: 84f39b08d7 ("net: support marking accepting TCP sockets")
Reviewed-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.
The following patch demostrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
This conversion requires overall +1 on the whole
refcounting scheme.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:
X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
dying bit to let the CPU release them.
X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
conntrack from all namespace.
X) Restart iteration on hashtable resizing, since both may occur at
the same time.
X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
mapping on module removal.
X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
module removal, from Liping Zhang.
X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
if user requests this, also from Liping.
X) Add net_ns_barrier() and use it from FTP helper, so make sure
no concurrent namespace removal happens at the same time while
the helper module is being removed.
X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
module size. Same thing in nf_tables.
Updates for the nf_tables infrastructure:
X) Prepare usage of the extended ACK reporting infrastructure for
nf_tables.
X) Remove unnecessary forward declaration in nf_tables hash set.
X) Skip set size estimation if number of element is not specified.
X) Changes to accomodate a (faster) unresizable hash set implementation,
for anonymous sets and dynamic size fixed sets with no timeouts.
X) Faster lookup function for unresizable hash table for 2 and 4
bytes key.
And, finally, a bunch of asorted small updates and cleanups:
X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
to device events and look up for index from the packet path, this
is fixing an issue that is present since the very beginning, patch
from Xin Long.
X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.
X) Use ebt_invalid_target() whenever possible in the ebtables tree,
from Gao Feng.
X) Calm down compilation warning in nf_dup infrastructure, patch from
stephen hemminger.
X) Statify functions in nftables rt expression, also from stephen.
X) Update Makefile to use canonical method to specify nf_tables-objs.
From Jike Song.
X) Use nf_conntrack_helpers_register() in amanda and H323.
X) Space cleanup for ctnetlink, from linzhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability
to retrieve one S,G mroute from a specified table.
*,G will return mroute information for just that
particular mroute if it exists. This is because
it is entirely possible to have more S's then
can fit in one skb to return to the requesting
process.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that they can be later used by the IPv6 code, too.
Also lift the comments a bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If icsk_ulp_ops is unset, it dereferences a null ptr.
Add a null ptr check.
BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
Read of size 4 at addr 0000000000000020 by task syz-executor1/15452
Signed-off-by: Dave Watson <davejwatson@fb.com>
Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to reset the sk->sk_rx_dst when we disconnect a TCP
connection, because otherwise when we re-connect it this
dst reference is simply overridden in tcp_finish_connect().
This fixes a dst leak which leads to a loopback dev refcnt
leak. It is a long-standing bug, Kevin reported a very similar
(if not same) bug before. Thanks to Andrei for providing such
a reliable reproducer which greatly narrows down the problem.
Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Kevin Xu <kaiwen.xu@hulu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.
Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.
This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.
Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.
Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.
This is mostly for SR-IOV devices since switches don't have lower netdevs
today.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-06-23
1) Use memdup_user to spmlify xfrm_user_policy.
From Geliang Tang.
2) Make xfrm_dev_register static to silence a sparse warning.
From Wei Yongjun.
3) Use crypto_memneq to check the ICV in the AH protocol.
From Sabrina Dubroca.
4) Remove some unused variables in esp6.
From Stephen Hemminger.
5) Extend XFRM MIGRATE to allow to change the UDP encapsulation port.
From Antony Antony.
6) Include the UDP encapsulation port to km_migrate announcements.
From Antony Antony.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
KASAN reports out-of-bound access in proc_dostring() coming from
proc_tcp_available_ulp() because in case TCP ULP list is empty
the buffer allocated for the response will not have anything
printed into it. Set the first byte to zero to avoid strlen()
going out-of-bounds.
Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our customer encountered stuck NFS writes for blocks starting at specific
offsets w.r.t. page boundary caused by networking stack sending packets via
UFO enabled device with wrong checksum. The problem can be reproduced by
composing a long UDP datagram from multiple parts using MSG_MORE flag:
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 3000, 0, ...);
Assume this packet is to be routed via a device with MTU 1500 and
NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
this condition is tested (among others) to decide whether to call
ip_ufo_append_data():
((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
At the moment, we already have skb with 1028 bytes of data which is not
marked for GSO so that the test is false (fragheaderlen is usually 20).
Thus we append second 1000 bytes to this skb without invoking UFO. Third
sendto(), however, has sufficient length to trigger the UFO path so that we
end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
uses udp_csum() to calculate the checksum but that assumes all fragments
have correct checksum in skb->csum which is not true for UFO fragments.
When checking against MTU, we need to add skb->len to length of new segment
if we already have a partially filled skb and fragheaderlen only if there
isn't one.
In the IPv6 case, skb can only be null if this is the first segment so that
we have to use headersize (length of the first IPv6 header) rather than
fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Fixes: e4c5e13aa4 ("ipv6: Should use consistent conditional judgement for
ip6 fragment between __ip6_append_data and ip6_finish_output")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael reported an UDP breakage caused by the commit b65ac44674
("udp: try to avoid 2 cache miss on dequeue").
The function __first_packet_length() can update the checksum bits
of the pending skb, making the scratched area out-of-sync, and
setting skb->csum, if the skb was previously in need of checksum
validation.
On later recvmsg() for such skb, checksum validation will be
invoked again - due to the wrong udp_skb_csum_unnecessary()
value - and will fail, causing the valid skb to be dropped.
This change addresses the issue refreshing the scratch area in
__first_packet_length() after the possible checksum update.
Fixes: b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in both ipv4 and ipv6 code path, the ack packet received when
sk at TCP_NEW_SYN_RECV state is not filtered by socket filter or cgroup
filter since it is handled from tcp_child_process and never reaches the
tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hooks
here can make sure all the ingress tcp packet can be correctly filtered.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
On UDP packets processing, if the BH is the bottle-neck, it
always sees a cache miss while updating rmem_alloc; try to
avoid it prefetching the value as soon as we have the socket
available.
Performances under flood with multiple NIC rx queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~10% performance improvement.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Netlink notifications on cache reports in ipmr, in addition to the
existing igmpmsg sent to mroute_sk.
Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.
MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
same data as their equivalent fields in the igmpmsg header.
PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
header.
Suggested-by: Ryan Halbrook <halbrook@arista.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing from a memcpy to per-member comparison left the
size variable unused:
net/ipv4/tcp_ipv4.c: In function 'tcp_md5_do_lookup':
net/ipv4/tcp_ipv4.c:910:15: error: unused variable 'size' [-Werror=unused-variable]
This does not show up when CONFIG_IPV6 is enabled, but the
variable can be removed either way, along with the now unused
assignment.
Fixes: 6797318e62 ("tcp: md5: add an address prefix for key lookup")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey reported a lockdep warning on non-initialized
spinlock:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 4099 Comm: a.out Not tainted 4.12.0-rc6+ #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16
dump_stack+0x292/0x395 lib/dump_stack.c:52
register_lock_class+0x717/0x1aa0 kernel/locking/lockdep.c:755
? 0xffffffffa0000000
__lock_acquire+0x269/0x3690 kernel/locking/lockdep.c:3255
lock_acquire+0x22d/0x560 kernel/locking/lockdep.c:3855
__raw_spin_lock_bh ./include/linux/spinlock_api_smp.h:135
_raw_spin_lock_bh+0x36/0x50 kernel/locking/spinlock.c:175
spin_lock_bh ./include/linux/spinlock.h:304
ip_mc_clear_src+0x27/0x1e0 net/ipv4/igmp.c:2076
igmpv3_clear_delrec+0xee/0x4f0 net/ipv4/igmp.c:1194
ip_mc_destroy_dev+0x4e/0x190 net/ipv4/igmp.c:1736
We miss a spin_lock_init() in igmpv3_add_delrec(), probably
because previously we never use it on this code path. Since
we already unlink it from the global mc_tomb list, it is
probably safe not to acquire this spinlock here. It does not
harm to have it although, to avoid conditional locking.
Fixes: c38b7d327a ("igmp: acquire pmc lock for ip_mc_clear_src()")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using get_random_int here is faster, more fitting of the use case, and
just as cryptographically secure. It also has the benefit of providing
better randomness at early boot, which is when many of these structures
are assigned.
Also, semantically, it's not really proper to have been assigning an
atomic_t in this way before, even if in practice it works fine.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Replace first padding in the tcp_md5sig structure with a new flag field
and address prefix length so it can be specified when configuring a new
key for TCP MD5 signature. The tcpm_flags field will only be used if the
socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows the keys used for TCP MD5 signature to be used for whole
range of addresses, specified with a prefix length, instead of only one
address as it currently is.
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's a terrible thing to hold dev in iptables target. When the dev is
being removed, unregister_netdevice has to wait for the dev to become
free. dmesg will keep logging the err:
kernel:unregister_netdevice: waiting for veth0_in to become free. \
Usage count = 1
until iptables rules with this target are removed manually.
The worse thing is when deleting a netns, a virtual nic will be deleted
instead of reset to init_net in default_device_ops exit/exit_batch. As
it is earlier than to flush the iptables rules in iptable_filter_net_ops
exit, unregister_netdevice will block to wait for the nic to become free.
As unregister_netdevice is actually waiting for iptables rules flushing
while iptables rules have to be flushed after unregister_netdevice. This
'dead lock' will cause unregister_netdevice to block there forever. As
the netns is not available to operate at that moment, iptables rules can
not even be flushed manually either.
The reproducer can be:
# ip netns add test
# ip link add veth0_in type veth peer name veth0_out
# ip link set veth0_in netns test
# ip netns exec test ip link set lo up
# ip netns exec test ip link set veth0_in up
# ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \
CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
--local-node 1 --hashmode sourceip-sourceport
# ip netns del test
This issue can be triggered by all virtual nics with ipt_CLUSTERIP.
This patch is to fix it by not holding dev in ipt_CLUSTERIP, but saving
the dev->ifindex instead of the dev.
As Pablo Neira Ayuso's suggestion, it will refresh c->ifindex and dev's
mc by registering a netdevice notifier, just as what xt_TEE does. So it
removes the old codes updating dev's mc, and also no need to initialize
c->ifindex with dev->ifindex.
But as one config can be shared by more than one targets, and the netdev
notifier is per config, not per target. It couldn't get e->ip.iniface
in the notifier handler. So e->ip.iniface has to be saved into config.
Note that for backwards compatibility, this patch doesn't remove the
codes checking if the dev exists before creating a config.
v1->v2:
- As Pablo Neira Ayuso's suggestion, register a netdevice notifier to
manage c->ifindex and dev's mc.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
DST_NOCACHE flag check has been removed from dst_release() and
dst_hold_safe() in a previous patch because all the dst are now ref
counted properly and can be released based on refcnt only.
Looking at the rest of the DST_NOCACHE use, all of them can now be
removed or replaced with other checks.
So this patch gets rid of all the DST_NOCACHE usage and remove this flag
completely.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all the components have been changed to release dst based on
refcnt only and not depend on dst gc anymore, we can remove the
temporary flag DST_NOGC.
Note that we also need to remove the DST_NOCACHE check in dst_release()
and dst_hold_safe() because now all the dst are released based on refcnt
and behaves as DST_NOCACHE.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the previous preparation patches, we are ready to get rid of the
dst gc operation in ipv4 code and release dst based on refcnt only.
So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
to dst_free().
At this point, all dst created in ipv4 code do not use the dst gc
anymore and will be destroyed at the point when refcnt drops to 0.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch checks all the calls to
dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
dst_hold_safe() is needed to avoid double free issue if dst
gc is removed and dst_release() directly destroys dst when
dst->__refcnt drops to 0.
In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
UDP and other similar protocols always hold refcount for
skb->_skb_refdst. So both paths seem to be safe.
In rx path, as it is lockless and skb_dst_set_noref() is likely to be
used, dst_hold_safe() should always be used when trying to hold dst.
In the routing code, if dst is held during an rcu protected session, it
is necessary to call dst_hold_safe() as the current dst might be in its
rcu grace period.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the intend of this patch series is to completely remove dst gc,
we need to call dst_dev_put() to release the reference to dst->dev
when removing routes from fib because we won't keep the gc list anymore
and will lose the dst pointer right after removing the routes.
Without the gc list, there is no way to find all the dst's that have
dst->dev pointing to the going-down dev.
Hence, we are doing dst_dev_put() immediately before we lose the last
reference of the dst from the routing code. The next dst_check() will
trigger a route re-lookup to find another route (if there is any).
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
to struct rtable but they never increment dst->__refcnt.
This leads to the need of the dst garbage collector because when user
is done with this dst and calls dst_release(), it can only decrement
dst->__refcnt and can not free the dst even it sees dst->__refcnt
drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing
code might still hold reference to it.
And when the routing code tries to delete a route, it has to put the
dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
running periodically to check on dst->__refcnt and finally to free dst
when refcnt becomes 0.
This patch increments dst->__refcnt when
fib_nh/fib_nh_exception holds reference to this dst and properly release
the dst when fib_nh/fib_nh_exception has been updated with a new dst.
This patch is a preparation in order to fully get rid of dst gc later.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Existing ipv4/6_blackhole_route() code generates a blackhole route
with dst->dev pointing to the passed in dst->dev.
It is not necessary to hold reference to the passed in dst->dev
because the packets going through this route are dropped anyway.
A loopback interface is good enough so that we don't need to worry about
releasing this dst->dev when this dev is going down.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set()
function, we will try to cache this dst in sk for connected case.
However, a better way to achieve this is to not try to hold dst in
early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This
approach is also more consistant with how tcp is handling it. And it
will make later changes simpler.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ip_tunnel_rcv fails, the tun_dst won't be freed, so call
dst_release to free it in error code path.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Tested-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)
Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = {
skb_pull,
__skb_pull,
skb_pull_inline,
__pskb_pull_tail,
__pskb_pull,
pskb_pull
};
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = {
skb_pull,
__skb_pull,
skb_pull_inline,
__pskb_pull_tail,
__pskb_pull,
pskb_pull
};
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
which actually doesn't cover pskb_put since there are only three
users overall.
A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.
The following spatch found many more and also removes the
now unnecessary casts:
@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_zero(skb, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)
@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)
@@
expression skb, len;
@@
-memset(skb_put(skb, len), 0, len);
+skb_put_zero(skb, len);
Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
sendpages while the socket is already locked.
tcp_sendpage is exported, but requires the socket lock to not be held already.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the infrustructure for attaching Upper Layer Protocols (ULPs) over TCP
sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
ULP can add its own logic by changing the TCP proto_ops structure to its own
methods.
Example usage:
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
modules will call:
tcp_register_ulp(&tcp_tls_ulp_ops);
to register/unregister their ulp, with an init function and name.
A list of registered ulps will be returned by tcp_get_available_ulp, which is
hooked up to /proc. Example:
$ cat /proc/sys/net/ipv4/tcp_available_ulp
tls
There is currently no functionality to remove or chain ULPs, but
it should be possible to add these in the future if needed.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Weimer seems to have a glibc test-case which requires that
loopback interfaces does not get ICMP ratelimited. This was broken by
commit c0303efeab ("net: reduce cycles spend on ICMP replies that
gets rate limited").
An ICMP response will usually be routed back-out the same incoming
interface. Thus, take advantage of this and skip global ICMP
ratelimit when the incoming device is loopback. In the unlikely event
that the outgoing it not loopback, due to strange routing policy
rules, ICMP rate limiting still works via peer ratelimiting via
icmpv4_xrlim_allow(). Thus, we should still comply with RFC1812
(section 4.3.2.8 "Rate Limiting").
This seems to fix the reproducer given by Florian. While still
avoiding to perform expensive and unneeded outgoing route lookup for
rate limited packets (in the non-loopback case).
Fixes: c0303efeab ("net: reduce cycles spend on ICMP replies that gets rate limited")
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey reported a use-after-free in add_grec():
for (psf = *psf_list; psf; psf = psf_next) {
...
psf_next = psf->sf_next;
where the struct ip_sf_list's were already freed by:
kfree+0xe8/0x2b0 mm/slub.c:3882
ip_mc_clear_src+0x69/0x1c0 net/ipv4/igmp.c:2078
ip_mc_dec_group+0x19a/0x470 net/ipv4/igmp.c:1618
ip_mc_drop_socket+0x145/0x230 net/ipv4/igmp.c:2609
inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411
sock_release+0x8d/0x1e0 net/socket.c:597
sock_close+0x16/0x20 net/socket.c:1072
This happens because we don't hold pmc->lock in ip_mc_clear_src()
and a parallel mr_ifc_timer timer could jump in and access them.
The RCU lock is there but it is merely for pmc itself, this
spinlock could actually ensure we don't access them in parallel.
Thanks to Eric and Long for discussion on this bug.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when udp_recvmsg() is executed, on x86_64 and other archs, most skb
fields are on cold cachelines.
If the skb are linear and the kernel don't need to compute the udp
csum, only a handful of skb fields are required by udp_recvmsg().
Since we already use skb->dev_scratch to cache hot data, and
there are 32 bits unused on 64 bit archs, use such field to cache
as much data as we can, and try to prefetch on dequeue the relevant
fields that are left out.
This can save up to 2 cache miss per packet.
v1 -> v2:
- changed udp_dev_scratch fields types to u{32,16} variant,
replaced bitfiled with bool
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since UDP no more uses sk->destructor, we can clear completely
the skb head state before enqueuing. Amend and use
skb_release_head_state() for that.
All head states share a single cacheline, which is not
normally used/accesses on dequeue. We can avoid entirely accessing
such cacheline implementing and using in the UDP code a specialized
skb free helper which ignores the skb head state.
This saves a cacheline miss at skb deallocation time.
v1 -> v2:
replaced secpath_reset() with skb_release_head_state()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes two issues:
1) When forwarding on *,G mroutes that are in a vrf, the
kernel was dropping information about the actual incoming
interface when calling ip_mr_forward from ip_mr_input.
This caused ip_mr_forward to send the multicast packet
back out the incoming interface. Fix this by
modifying ip_mr_forward to be handed the correctly
resolved dev.
2) When a unresolved cache entry is created we store
the incoming skb on the unresolved cache entry and
upon mroute resolution from the user space daemon,
we attempt to forward the packet. Again we were
not resolving to the correct incoming device for
a vrf scenario, before calling ip_mr_forward.
Fix this by resolving to the correct interface
and calling ip_mr_forward with the result.
Fixes: e58e415968 ("net: Enable support for VRF with ipv4 multicast")
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device. However, that failure is not
propogated outward and leads to a silent failure.
Introduce a validation step at ip address creation time and allow device
drivers to register to validate the incoming ip addresses. The ipvlan
code is the first consumer. If it detects an address in use, we can
return an error to the user before beginning to commit the new ifa in
the networking code.
This can be especially useful if it is necessary to provision many
ipvlans in containers. The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently there's no way to dump the VIF table for an ipmr table other
than the default (via proc). This is a major issue when debugging ipmr
issues and in general it is good to know which interfaces are
configured. This patch adds support for RTM_GETLINK for the ipmr family
so we can dump the VIF table and the ipmr table's current config for
each table. We're protected by rtnl so no need to acquire RCU or
mrt_lock.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DRAM supply shortage and poor memory pressure tracking in TCP
stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
limits) and tcp_mem[] quite hazardous.
TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
limits being hit, but only tracking number of transitions.
If TCP stack behavior under stress was perfect :
1) It would maintain memory usage close to the limit.
2) Memory pressure state would be entered for short times.
We certainly prefer 100 events lasting 10ms compared to one event
lasting 200 seconds.
This patch adds a new SNMP counter tracking cumulative duration of
memory pressure events, given in ms units.
$ cat /proc/sys/net/ipv4/tcp_mem
3088 4117 6176
$ grep TCP /proc/net/sockstat
TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
$ nstat -n ; sleep 10 ; nstat |grep Pressure
TcpExtTCPMemoryPressures 1700
TcpExtTCPMemoryPressuresChrono 5209
v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
instructed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to move some TCP sysctls to net namespaces in the future.
tcp_window_scaling, tcp_sack and tcp_timestamps being fetched
from tcp_parse_options(), we need to pass an extra parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander reported various KASAN messages triggered in recent kernels
The problem is that ping sockets should not use udp_poll() in the first
place, and recent changes in UDP stack finally exposed this old bug.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Tested-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The command
# arp -s 62.2.0.1 a🅱️c:d:e:f dev eth2
adds an entry like the following (listed by "arp -an")
? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2
but the symmetric deletion command
# arp -i eth2 -d 62.2.0.1
does not remove the PERM entry from the table, and instead leaves behind
? (62.2.0.1) at <incomplete> on eth2
The reason is that there is a refcnt of 1 for the arp_tbl itself
(neigh_alloc starts off the entry with a refcnt of 1), thus
the neigh_release() call from arp_invalidate() will (at best) just
decrement the ref to 1, but will never actually free it from the
table.
To fix this, we need to do something like neigh_forced_gc: if
the refcnt is 1 (i.e., on the table's ref), remove the entry from
the table and free it. This patch refactors and shares common code
between neigh_forced_gc and the newly added neigh_remove_one.
A similar issue exists for IPv6 Neighbor Cache entries, and is fixed
in a similar manner by this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
DCCP uses dccp_write_space() for sk->sk_write_space method.
Unfortunately a passive connection (as provided by accept())
is using the generic sk_stream_write_space() function.
Lets simply inherit sk->sk_write_space from the parent
instead of forcing the generic one.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
__pskb_trim_head() does not need to reset skb tail pointer.
Also change the comments, __pskb_pull_head() does not exist.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a data packet is retransmitted, we do not compute an
RTT sample for congestion control due to Kern's check. Therefore the
congestion control that uses RTT signals may not receive any update
during loss recovery which could last many round trips. For example,
BBR and Vegas may not be able to update its min RTT estimation if the
network path has shortened until it recovers from losses. This patch
mitigates that by using TCP timestamp options for RTT measurement
for congestion control. Note that we already use timestamps for
RTT estimation.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the sender switches its congestion control during loss
recovery, if the recovery is spurious then it may incorrectly
revert cwnd and ssthresh to the older values set by a previous
congestion control. Consider a congestion control (like BBR)
that does not use ssthresh and keeps it infinite: the connection
may incorrectly revert cwnd to an infinite value when switching
from BBR to another congestion control.
This patch fixes it by disallowing such cwnd undo operation
upon switching congestion control. Note that undo_marker
is not reset s.t. the packets that were incorrectly marked
lost would be corrected. We only avoid undoing the cwnd in
tcp_undo_cwnd_reduction().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
recent updates to inet_rtm_getroute dropped skb_dst_set in
inet_rtm_getroute. This patch restores it because it is
needed to release the dst correctly.
Fixes: 3765d35ed8 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Reported-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MTU probing initialization occurred only at connect() and at SYN or
SYN-ACK reception, but the former sets MSS to either the default or the
user set value (through TCP_MAXSEG sockopt) and the latter never happens
with repaired sockets.
The result was that, with MTU probing enabled and unless TCP_MAXSEG
sockopt was used before connect(), probing would be stuck at
tcp_base_mss value until tcp_probe_interval seconds have passed.
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack arg down to lwtunnel_build_state and the build_state callbacks.
Add messages for failures in lwtunnel_build_state, and add the extarg to
nla_parse where possible in the build_state callbacks.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack down to lwtunnel_valid_encap_type and
lwtunnel_valid_encap_type_attr. Add messages for unknown
or unsupported encap types.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack error message for invalid prefix length and invalid prefix.
Example of the latter is a route spec containing 172.16.100.1/24, where
the /24 mask means the lower 8-bits should be 0. Amazing how easy that
one is to overlook when an EINVAL is returned.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_table_insert and fib_table_delete have the same checks on the prefix
and length. Refactor into a helper. Avoids duplicate extack messages in
the next patch.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several places where we needlesly call nf_ct_iterate_cleanup,
we should instead iterate the full table at module unload time.
This is a leftover from back when the conntrack table got duplicated
per net namespace.
So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net.
A later patch will then add a non-net variant.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
one of the last remaining users of the old api, hopefully followup commit
can remove it soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net'
restricting a HW workaround alongside cleanups in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Konovalov reported crashes in ipv4_mtu()
I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :
1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
2) At the same time run following loop :
while :
do
ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done
Cong Wang attempted to add back rt->fi in commit
82486aa6f1 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.
Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)
I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.
Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.
Fixes: 2860583fe8 ("ipv4: Kill rt->fi")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to return matched fib result when RTM_F_FIB_MATCH
flag is specified in RTM_GETROUTE request. This is useful for user-space
applications/controllers wanting to query a matching route.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefix is needed for returning matching route spec on get route request.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert inet_rtm_getroute to use ip_route_input_rcu and
ip_route_output_key_hash_rcu passing the fib_result arg to both.
The rcu lock is held through the creation of the response, so the
rtable/dst does not need to be attached to the skb and is passed
to rt_fill_info directly.
In converting from ip_route_output_key to ip_route_output_key_hash_rcu
the xfrm_lookup_route in ip_route_output_flow is dropped since
flowi4_proto is not set for a route get request.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that
remove the arg and use RTM_NEWROUTE directly in rt_fill_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>