Pull networking fixes from David Miller:
1) BPF verifier signed/unsigned value tracking fix, from Daniel
Borkmann, Edward Cree, and Josef Bacik.
2) Fix memory allocation length when setting up calls to
->ndo_set_mac_address, from Cong Wang.
3) Add a new cxgb4 device ID, from Ganesh Goudar.
4) Fix FIB refcount handling, we have to set it's initial value before
the configure callback (which can bump it). From David Ahern.
5) Fix double-free in qcom/emac driver, from Timur Tabi.
6) A bunch of gcc-7 string format overflow warning fixes from Arnd
Bergmann.
7) Fix link level headroom tests in ip_do_fragment(), from Vasily
Averin.
8) Fix chunk walking in SCTP when iterating over error and parameter
headers. From Alexander Potapenko.
9) TCP BBR congestion control fixes from Neal Cardwell.
10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.
11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
Wang.
12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
from Gao Feng.
13) Cannot release skb->dst in UDP if IP options processing needs it.
From Paolo Abeni.
14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
Levin and myself.
15) Revert some rtnetlink notification changes that are causing
regressions, from David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
net: bonding: Fix transmit load balancing in balance-alb mode
rds: Make sure updates to cp_send_gen can be observed
net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
ipv4: initialize fib_trie prior to register_netdev_notifier call.
rtnetlink: allocate more memory for dev_set_mac_address()
net: dsa: b53: Add missing ARL entries for BCM53125
bpf: more tests for mixed signed and unsigned bounds checks
bpf: add test for mixed signed and unsigned bounds checks
bpf: fix up test cases with mixed signed/unsigned bounds
bpf: allow to specify log level and reduce it for test_verifier
bpf: fix mixed signed/unsigned derived min/max value bounds
ipv6: avoid overflow of offset in ip6_find_1stfragopt
net: tehuti: don't process data if it has not been copied from userspace
Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
NET: dwmac: Make dwmac reset unconditional
net: Zero terminate ifr_name in dev_ifname().
wireless: wext: terminate ifr name coming from userspace
netfilter: fix netfilter_net_init() return
...
Net stack initialization currently initializes fib-trie after the
first call to netdevice_notifier() call. In fact fib_trie initialization
needs to happen before first rtnl_register(). It does not cause any problem
since there are no devices UP at this moment, but trying to bring 'lo'
UP at initialization would make this assumption wrong and exposes the issue.
Fixes following crash
Call Trace:
? alternate_node_alloc+0x76/0xa0
fib_table_insert+0x1b7/0x4b0
fib_magic.isra.17+0xea/0x120
fib_add_ifaddr+0x7b/0x190
fib_netdev_event+0xc0/0x130
register_netdevice_notifier+0x1c1/0x1d0
ip_fib_init+0x72/0x85
ip_rt_init+0x187/0x1e9
ip_init+0xe/0x1a
inet_init+0x171/0x26c
? ipv4_offload_init+0x66/0x66
do_one_initcall+0x43/0x160
kernel_init_freeable+0x191/0x219
? rest_init+0x80/0x80
kernel_init+0xe/0x150
ret_from_fork+0x22/0x30
Code: f6 46 23 04 74 86 4c 89 f7 e8 ae 45 01 00 49 89 c7 4d 85 ff 0f 85 7b ff ff ff 31 db eb 08 4c 89 ff e8 16 47 01 00 48 8b 44 24 38 <45> 8b 6e 14 4d 63 76 74 48 89 04 24 0f 1f 44 00 00 48 83 c4 08
RIP: kmem_cache_alloc+0xcf/0x1c0 RSP: ffff9b1500017c28
CR2: 0000000000000014
Fixes: 7b1a74fdbb ("[NETNS]: Refactor fib initialization so it can handle multiple namespaces.")
Fixes: 7f9b80529b ("[IPV4]: fib hash|trie initialization")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Missing netlink message sanity check in nfnetlink, patch from
Mateusz Jurczyk.
2) We now have netfilter per-netns hooks, so let's kill global hook
infrastructure, this infrastructure is known to be racy with netns.
We don't care about out of tree modules. Patch from Florian Westphal.
3) find_appropriate_src() is buggy when colissions happens after the
conversion of the nat bysource to rhashtable. Also from Florian.
4) Remove forward chain in nf_tables arp family, it's useless and it is
causing quite a bit of confusion, from Florian Westphal.
5) nf_ct_remove_expect() is called with the wrong parameter, causing
kernel oops, patch from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric noticed that in udp_recvmsg() we still need to access
skb->dst while processing the IP options.
Since commit 0a463c78d2 ("udp: avoid a cache miss on dequeue")
skb->dst is no more available at recvmsg() time and bad things
will happen if we enter the relevant code path.
This commit address the issue, avoid clearing skb->dst if
any IP options are present into the relevant skb.
Since the IP CB is contained in the first skb cacheline, we can
test it to decide to leverage the consume_stateless_skb()
optimization, without measurable additional cost in the faster
path.
v1 -> v2: updated commit message tags
Fixes: 0a463c78d2 ("udp: avoid a cache miss on dequeue")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
arp packets cannot be forwarded.
They can be bridged, but then they can be filtered using
either ebtables or nftables bridge family.
The bridge netfilter exposes a "call-arptables" switch which
pushes packets into arptables, but lets not expose this for nftables, so better
close this asap.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes the following behavior: for connections that had no RTT sample
at the time of initializing congestion control, BBR was initializing
the pacing rate to a high nominal rate (based an a guess of RTT=1ms,
in case this is LAN traffic). Then BBR never adjusted the pacing rate
downward upon obtaining an actual RTT sample, if the connection never
filled the pipe (e.g. all sends were small app-limited writes()).
This fix adjusts the pacing rate upon obtaining the first RTT sample.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a corner case noticed by Eric Dumazet, where BBR's setting
sk->sk_pacing_rate to 0 during initialization could theoretically
cause packets in the sending host to hang if there were packets "in
flight" in the pacing infrastructure at the time the BBR congestion
control state is initialized. This could occur if the pacing
infrastructure happened to race with bbr_init() in a way such that the
pacer read the 0 rather than the immediately following non-zero pacing
rate.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to initialize the BBR pacing rate unconditionally,
based on the current cwnd and RTT estimate. This is a pure refactor,
but is needed for two following fixes.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper to convert a BBR bandwidth and gain factor to a
pacing rate in bytes per second. This is a pure refactor, but is
needed for two following fixes.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bbr_set_pacing_rate(), which decides whether to cut the pacing
rate, there was some code that considered exiting STARTUP to be
equivalent to the notion of filling the pipe (i.e.,
bbr_full_bw_reached()). Specifically, as the code was structured,
exiting STARTUP and going into PROBE_RTT could cause us to cut the
pacing rate down to something silly and low, based on whatever
bandwidth samples we've had so far, when it's possible that all of
them have been small app-limited bandwidth samples that are not
representative of the bandwidth available in the path. (The code was
correct at the time it was written, but the state machine changed
without this spot being adjusted correspondingly.)
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some time ago David Woodhouse reported skb_under_panic
when we try to push ethernet header to fragmented ipv6 skbs.
It was fixed for ipv6 by Florian Westphal in
commit 1d325d217c ("ipv6: ip6_fragment: fix headroom tests and skb leak")
However similar problem still exist in ipv4.
It does not trigger skb_under_panic due paranoid check
in ip_finish_output2, however according to Alexey Kuznetsov
current state is abnormal and ip_fragment should be fixed too.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
callers can more safely get random bytes if they can block until the
CRNG is initialized.
Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllqXNUACgkQ8vlZVpUN
gaPtAgf/aUbXZuWYsDQzslHsbzEWi+qz4QgL885/w4L00pEImTTp91Q06SDxWhtB
KPvGnZHS3IofxBh2DC+6AwN6dPMoWDCfYhhO6po3FSz0DiPRIQCTuvOb8fhKY1X7
rTdDq2xtDxPGxJ25bMJtlrgzH2XlXPpVyPUeoc9uh87zUK5aesXpUn9kBniRexoz
ume+M/cDzPKkwNQpbLq8vzhNjoWMVv0FeW2akVvrjkkWko8nZLZ0R/kIyKQlRPdG
LZDXcz0oTHpDS6+ufEo292ZuWm2IGer2YtwHsKyCAsyEWsUqBz2yurtkSj3mAVyC
hHafyS+5WNaGdgBmg0zJxxwn5qxxLg==
=ua7p
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random updates from Ted Ts'o:
"Add wait_for_random_bytes() and get_random_*_wait() functions so that
callers can more safely get random bytes if they can block until the
CRNG is initialized.
Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: reorder READ_ONCE() in get_random_uXX
random: suppress spammy warnings about unseeded randomness
random: warn when kernel uses unseeded randomness
net/route: use get_random_int for random counter
net/neighbor: use get_random_u32 for 32-bit hash random
rhashtable: use get_random_u32 for hash_rnd
ceph: ensure RNG is seeded before using
iscsi: ensure RNG is seeded before use
cifs: use get_random_u32 for 32-bit lock random
random: add get_random_{bytes,u32,u64,int,long,once}_wait family
random: add wait_for_random_bytes() API
The ipmr_get_table() function doesn't return error pointers it returns
NULL on error.
Fixes: 4f75ba6982 ("net: ipmr: Add ipmr_rtm_getroute")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 6797318e62 ("tcp: md5: add an address prefix for key lookup")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix rtm policy name typo in mpls_getroute and also remove
export of rtm_ipv4_policy
Fixes: 397fc9e5ce ("mpls: route get support")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SYN-ACK responses on a server in response to a SYN from a client
did not get the injected skb mark that was tagged on the SYN packet.
Fixes: 84f39b08d7 ("net: support marking accepting TCP sockets")
Reviewed-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.
The following patch demostrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
This conversion requires overall +1 on the whole
refcounting scheme.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:
X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
dying bit to let the CPU release them.
X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
conntrack from all namespace.
X) Restart iteration on hashtable resizing, since both may occur at
the same time.
X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
mapping on module removal.
X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
module removal, from Liping Zhang.
X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
if user requests this, also from Liping.
X) Add net_ns_barrier() and use it from FTP helper, so make sure
no concurrent namespace removal happens at the same time while
the helper module is being removed.
X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
module size. Same thing in nf_tables.
Updates for the nf_tables infrastructure:
X) Prepare usage of the extended ACK reporting infrastructure for
nf_tables.
X) Remove unnecessary forward declaration in nf_tables hash set.
X) Skip set size estimation if number of element is not specified.
X) Changes to accomodate a (faster) unresizable hash set implementation,
for anonymous sets and dynamic size fixed sets with no timeouts.
X) Faster lookup function for unresizable hash table for 2 and 4
bytes key.
And, finally, a bunch of asorted small updates and cleanups:
X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
to device events and look up for index from the packet path, this
is fixing an issue that is present since the very beginning, patch
from Xin Long.
X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.
X) Use ebt_invalid_target() whenever possible in the ebtables tree,
from Gao Feng.
X) Calm down compilation warning in nf_dup infrastructure, patch from
stephen hemminger.
X) Statify functions in nftables rt expression, also from stephen.
X) Update Makefile to use canonical method to specify nf_tables-objs.
From Jike Song.
X) Use nf_conntrack_helpers_register() in amanda and H323.
X) Space cleanup for ctnetlink, from linzhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability
to retrieve one S,G mroute from a specified table.
*,G will return mroute information for just that
particular mroute if it exists. This is because
it is entirely possible to have more S's then
can fit in one skb to return to the requesting
process.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that they can be later used by the IPv6 code, too.
Also lift the comments a bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If icsk_ulp_ops is unset, it dereferences a null ptr.
Add a null ptr check.
BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
Read of size 4 at addr 0000000000000020 by task syz-executor1/15452
Signed-off-by: Dave Watson <davejwatson@fb.com>
Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to reset the sk->sk_rx_dst when we disconnect a TCP
connection, because otherwise when we re-connect it this
dst reference is simply overridden in tcp_finish_connect().
This fixes a dst leak which leads to a loopback dev refcnt
leak. It is a long-standing bug, Kevin reported a very similar
(if not same) bug before. Thanks to Andrei for providing such
a reliable reproducer which greatly narrows down the problem.
Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Kevin Xu <kaiwen.xu@hulu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.
Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.
This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.
Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.
Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.
This is mostly for SR-IOV devices since switches don't have lower netdevs
today.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-06-23
1) Use memdup_user to spmlify xfrm_user_policy.
From Geliang Tang.
2) Make xfrm_dev_register static to silence a sparse warning.
From Wei Yongjun.
3) Use crypto_memneq to check the ICV in the AH protocol.
From Sabrina Dubroca.
4) Remove some unused variables in esp6.
From Stephen Hemminger.
5) Extend XFRM MIGRATE to allow to change the UDP encapsulation port.
From Antony Antony.
6) Include the UDP encapsulation port to km_migrate announcements.
From Antony Antony.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
KASAN reports out-of-bound access in proc_dostring() coming from
proc_tcp_available_ulp() because in case TCP ULP list is empty
the buffer allocated for the response will not have anything
printed into it. Set the first byte to zero to avoid strlen()
going out-of-bounds.
Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our customer encountered stuck NFS writes for blocks starting at specific
offsets w.r.t. page boundary caused by networking stack sending packets via
UFO enabled device with wrong checksum. The problem can be reproduced by
composing a long UDP datagram from multiple parts using MSG_MORE flag:
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 3000, 0, ...);
Assume this packet is to be routed via a device with MTU 1500 and
NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
this condition is tested (among others) to decide whether to call
ip_ufo_append_data():
((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
At the moment, we already have skb with 1028 bytes of data which is not
marked for GSO so that the test is false (fragheaderlen is usually 20).
Thus we append second 1000 bytes to this skb without invoking UFO. Third
sendto(), however, has sufficient length to trigger the UFO path so that we
end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
uses udp_csum() to calculate the checksum but that assumes all fragments
have correct checksum in skb->csum which is not true for UFO fragments.
When checking against MTU, we need to add skb->len to length of new segment
if we already have a partially filled skb and fragheaderlen only if there
isn't one.
In the IPv6 case, skb can only be null if this is the first segment so that
we have to use headersize (length of the first IPv6 header) rather than
fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Fixes: e4c5e13aa4 ("ipv6: Should use consistent conditional judgement for
ip6 fragment between __ip6_append_data and ip6_finish_output")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael reported an UDP breakage caused by the commit b65ac44674
("udp: try to avoid 2 cache miss on dequeue").
The function __first_packet_length() can update the checksum bits
of the pending skb, making the scratched area out-of-sync, and
setting skb->csum, if the skb was previously in need of checksum
validation.
On later recvmsg() for such skb, checksum validation will be
invoked again - due to the wrong udp_skb_csum_unnecessary()
value - and will fail, causing the valid skb to be dropped.
This change addresses the issue refreshing the scratch area in
__first_packet_length() after the possible checksum update.
Fixes: b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in both ipv4 and ipv6 code path, the ack packet received when
sk at TCP_NEW_SYN_RECV state is not filtered by socket filter or cgroup
filter since it is handled from tcp_child_process and never reaches the
tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hooks
here can make sure all the ingress tcp packet can be correctly filtered.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>