Now sctp doesn't check sock's state before listening on it. It could
even cause changing a sock with any state to become a listening sock
when doing sctp_listen.
This patch is to fix it by checking sock's state in sctp_listen, so
that it will listen on the sock with right state.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Existing L2TP kernel code does not derive the optimal MTU for Ethernet
pseudowires and instead leaves this to a userspace L2TP daemon or
operator. If an MTU is not specified, the existing kernel code chooses
an MTU that does not take account of all tunnel header overheads, which
can lead to unwanted IP fragmentation. When L2TP is used without a
control plane (userspace daemon), we would prefer that the kernel does a
better job of choosing a default pseudowire MTU, taking account of all
tunnel header overheads, including IP header options, if any. This patch
addresses this.
Change-set here uses the new kernel function, kernel_sock_ip_overhead(),
to factor the outer IP overhead on the L2TP tunnel socket (including
IP Options, if any) when calculating the default MTU for an Ethernet
pseudowire, along with consideration of the inner Ethernet header.
Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new function, kernel_sock_ip_overhead(), is provided
to calculate the cumulative overhead imposed by the IP
Header and IP options, if any, on a socket's payload.
The new function returns an overhead of zero for sockets
that do not belong to the IPv4 or IPv6 address families.
This is used in the L2TP code path to compute the
total outer IP overhead on the L2TP tunnel socket when
calculating the default MTU for Ethernet pseudowires.
Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, and the initializer fixes
were extracted from grsecurity. In this case, NULL initialize with { }
instead of undesignated NULLs.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported a crash when injecting faults in
attach_one_default_qdisc() and dev->qdisc is still
a noop_disc, the check before qdisc_hash_add() fails
to catch it because it tests NULL. We should test
against noop_qdisc since it is the default qdisc
at this point.
Fixes: 59cc1f61f0 ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, ip_multipath_icmp_hash ends up being called,
which uses the source/destination addresses within the skb to calculate
a hash. However, those are not set in the synthetic skb, causing it to
return an arbitrary and incorrect result.
Instead, use UDP, which gets no such special treatment.
Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint (rxrpc_connect_call) to log the combination of rxrpc_call
pointer, afs_call pointer/user data and wire call parameters to make it
easier to match the tracebuffer contents to captured network packets.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint (rxrpc_rx_rwind_change) to log changes in a call's receive
window size as imposed by the peer through an ACK packet.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint (rxrpc_rx_proto) to record protocol errors in received
packets. The following changes are made:
(1) Add a function, __rxrpc_abort_eproto(), to note a protocol error on a
call and mark the call aborted. This is wrapped by
rxrpc_abort_eproto() that makes the why string usable in trace.
(2) Add trace_rxrpc_rx_proto() or rxrpc_abort_eproto() to protocol error
generation points, replacing rxrpc_abort_call() with the latter.
(3) Only send an abort packet in rxkad_verify_packet*() if we actually
managed to abort the call.
Note that a trace event is also emitted if a kernel user (e.g. afs) tries
to send data through a call when it's not in the transmission phase, though
it's not technically a receive event.
Signed-off-by: David Howells <dhowells@redhat.com>
In the rxkad security module, when we encounter a temporary error (such as
ENOMEM) from which we could conceivably recover, don't abort the
connection, but rather permit retransmission of the relevant packets to
induce a retry.
Note that I'm leaving some places that could be merged together to insert
tracing in the next patch.
Signed-off-by; David Howells <dhowells@redhat.com>
Make rxrpc_kernel_abort_call() return an indication as to whether it
actually aborted the operation or not so that kafs can trace the failure of
the operation. Note that 'success' in this context means changing the
state of the call, not necessarily successfully transmitting an ABORT
packet.
Signed-off-by: David Howells <dhowells@redhat.com>
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent. We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull networking fixes from David Miller:
1) Reject invalid updates to netfilter expectation policies, from Pablo
Neira Ayuso.
2) Fix memory leak in nfnl_cthelper, from Jeffy Chen.
3) Don't do stupid things if we get a neigh_probe() on a neigh entry
whose ops lack a solicit method. From Eric Dumazet.
4) Don't transmit packets in r8152 driver when the carrier is off, from
Hayes Wang.
5) Fix ipv6 packet type detection in aquantia driver, from Pavel
Belous.
6) Don't write uninitialized data into hw registers in bna driver, from
Arnd Bergmann.
7) Fix locking in ping_unhash(), from Eric Dumazet.
8) Make BPF verifier range checks able to understand certain sequences
emitted by LLVM, from Alexei Starovoitov.
9) Fix use after free in ipconfig, from Mark Rutland.
10) Fix refcount leak on force commit in openvswitch, from Jarno
Rajahalme.
11) Fix various overflow checks in AF_PACKET, from Andrey Konovalov.
12) Fix endianness bug in be2net driver, from Suresh Reddy.
13) Don't forget to wake TX queues when processing a timeout, from
Grygorii Strashko.
14) ARP header on-stack storage is wrong in flow dissector, from Simon
Horman.
15) Lost retransmit and reordering SNMP stats in TCP can be
underreported. From Yuchung Cheng.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
nfp: fix potential use after free on xdp prog
tcp: fix reordering SNMP under-counting
tcp: fix lost retransmit SNMP under-counting
sctp: get sock from transport in sctp_transport_update_pmtu
net: ethernet: ti: cpsw: fix race condition during open()
l2tp: fix PPP pseudo-wire auto-loading
bnx2x: fix spelling mistake in macros HW_INTERRUT_ASSERT_SET_*
l2tp: take reference on sessions being dumped
tcp: minimize false-positives on TCP/GRO check
sctp: check for dst and pathmtu update in sctp_packet_config
flow dissector: correct size of storage for ARP
net: ethernet: ti: cpsw: wake tx queues on ndo_tx_timeout
l2tp: take a reference on sessions used in genetlink handlers
l2tp: hold session while sending creation notifications
l2tp: fix duplicate session creation
l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
l2tp: fix race in l2tp_recv_common()
sctp: use right in and out stream cnt
bpf: add various verifier test cases for self-tests
bpf, verifier: fix rejection of unaligned access checks for map_value_adj
...
Currently the reordering SNMP counters only increase if a connection
sees a higher degree then it has previously seen. It ignores if the
reordering degree is not greater than the default system threshold.
This significantly under-counts the number of reordering events
and falsely convey that reordering is rare on the network.
This patch properly and faithfully records the number of reordering
events detected by the TCP stack, just like the comment says "this
exciting event is worth to be remembered". Note that even so TCP
still under-estimate the actual reordering events because TCP
requires TS options or certain packet sequences to detect reordering
(i.e. ACKing never-retransmitted sequence in recovery or disordered
state).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lost retransmit SNMP stat is under-counting retransmission
that uses segment offloading. This patch fixes that so all
retransmission related SNMP counters are consistent.
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAljjveoTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRAe/sog7Dgc9tSZB/9DPsUOEbLzcJZ6HVRPJ3mJkCYf9jdD
VE7vW3w79EcmcbkE7ULkyrEr/+GJs7GvA1vS+j8jphraVGOjP4DjeOm1OLJXylqa
m2vaBpOqTOx3MdMdd/FVLlap7MKTX3f9J1WAOsGi5kJK3RR9EHRj5tKhoeG40OUk
rMDg9juskT+XVqgawrcHyM/eVHZ4ny+BlN2LN0zajHT6l3xKiqQ+0iQltXnstymW
djTfqlrbL+Ix6mlYKsK+joVWwbNUxuguto2m2CKVQukWFu2Q5wxe5YASjtMThQcv
vmq2LD45cFIfNhzgfTu2msoz2Cv613LFiMtHtrfjlbXtSZmFc2FqIbKc
=NjgL
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.13-20170404' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2017-03-03
this is a pull request of 5 patches for net-next/master.
There are two patches by Yegor Yefremov which convert the ti_hecc
driver into a DT only driver, as there is no in-tree user of the old
platform driver interface anymore. The next patch by Mario Kicherer
adds network namespace support to the can subsystem. The last two
patches by Akshay Bhat add support for the holt_hi311x SPI CAN driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When netdev events happen, a rtnetlink_event() handler will send
messages for every event in it's white list. These messages contain
current information about a particular device, but they do not include
the iformation about which event just happened. The consumer of
the message has to try to infer this information. In some cases
(ex: NETDEV_NOTIFY_PEERS), that is not possible.
This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
that would have an encoding of the which event triggered this
message. This would allow the the message consumer to easily determine
if it is interested in a particular event or not.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rtnetlink_event currently functions as a blacklist where
we block cerntain netdev events from being sent to user space.
As a result, events have been added to the system that userspace
probably doesn't care about.
This patch converts the implementation to the white list so that
newly events would have to be specifically added to the list to
be sent to userspace. This would force new event implementers to
consider whether a given event is usefull to user space or if it's
just a kernel event.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define one new macro TCP_MAX_WSCALE instead of literal number '14',
and use U16_MAX instead of 65535 as the max value of TCP window.
There is another minor change, use rounddown(space, mss) instead of
(space / mss) * mss;
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is almost to revert commit 02f3d4ce9e ("sctp: Adjust PMTU
updates to accomodate route invalidation."). As t->asoc can't be NULL
in sctp_transport_update_pmtu, it could get sk from asoc, and no need
to pass sk into that function.
It is also to remove some duplicated codes from that function.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb_running is reported in /proc/self/net/netlink and it is reported by
the ss tool, when it gets information from the proc files.
sock_diag is a new interface which is used instead of proc files, so it
looks reasonable that this interface has to report no less information
about sockets than proc files.
We use these flags to dump and restore netlink sockets.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We accidentally left this dead code behind after commit 5952fde10c
("net: sched: choke: remove dead filter classify code").
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a private copy of struct net_device_stats in struct
batadv_priv, use stats from struct net_device.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It looks like a typo to assign a return code to a variable which is not
used. Found due to a compiler warning:
net/nfc/netlink.c: In function ‘nfc_genl_activate_target’:
net/nfc/netlink.c:903:6: warning: variable ‘rc’ set but not used [-Wunused-but-set-variable]
int rc;
^~
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
PPP pseudo-wire type is 7 (11 is L2TP_PWTYPE_IP).
Fixes: f1f39f9110 ("l2tp: auto load type modules")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take a reference on the sessions returned by l2tp_session_find_nth()
(and rename it l2tp_session_get_nth() to reflect this change), so that
caller is assured that the session isn't going to disappear while
processing it.
For procfs and debugfs handlers, the session is held in the .start()
callback and dropped in .show(). Given that pppol2tp_seq_session_show()
dereferences the associated PPPoL2TP socket and that
l2tp_dfs_seq_session_show() might call pppol2tp_show(), we also need to
call the session's .ref() callback to prevent the socket from going
away from under us.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Fixes: 0ad6614048 ("l2tp: Add debugfs files for dumping l2tp debug info")
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds initial support for network namespaces. The changes only
enable support in the CAN raw, proc and af_can code. GW and BCM still
have their checks that ensure that they are used only from the main
namespace.
The patch boils down to moving the global structures, i.e. the global
filter list and their /proc stats, into a per-namespace structure and passing
around the corresponding "struct net" in a lot of different places.
Changes since v1:
- rebased on current HEAD (2bfe01e)
- fixed overlong line
Signed-off-by: Mario Kicherer <dev@kicherer.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Number of sockets is limited by 16-bit, so 64-bit allocation will never
happen.
16-bit ops are the worst code density-wise on x86_64 because of
additional prefix (66).
Space savings:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
function old new delta
reuseport_add_sock 539 536 -3
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make ->hash_count, ->low_watermark and ->high_watermark unsigned int
and propagate unsignedness to other variables.
This change doesn't change code generation because these fields aren't
used in 64-bit contexts but make it anyway: these fields can't be
negative numbers.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flow keys aren't 4GB+ numbers so 64-bit arithmetic is excessive.
Space savings (I'm not sure what CSWTCH is):
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-48 (-48)
function old new delta
flow_cache_lookup 1163 1159 -4
CSWTCH 75997 75953 -44
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Trippelsdorf reported that after commit dcb17d22e1 ("tcp: warn
on bogus MSS and try to amend it") the kernel started logging the
warning for a NIC driver that doesn't even support GRO.
It was diagnosed that it was possibly caused on connections that were
using TCP Timestamps but some packets lacked the Timestamps option. As
we reduce rcv_mss when timestamps are used, the lack of them would cause
the packets to be bigger than expected, although this is a valid case.
As this warning is more as a hint, getting a clean-cut on the
threshold is probably not worth the execution time spent on it. This
patch thus alleviates the false-positives with 2 quick checks: by
accounting for the entire TCP option space and also checking against the
interface MTU if it's available.
These changes, specially the MTU one, might mask some real positives,
though if they are really happening, it's possible that sooner or later
it will be triggered anyway.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to move sctp_transport_dst_check into sctp_packet_config
from sctp_packet_transmit and add pathmtu check in sctp_packet_config.
With this fix, sctp can update dst or pathmtu before appending chunks,
which can void dropping packets in sctp_packet_transmit when dst is
obsolete or dst's mtu is changed.
This patch is also to improve some other codes in sctp_packet_config.
It updates packet max_size with gso_max_size, checks for dst and
pathmtu, and appends ecne chunk only when packet is empty and asoc
is not NULL.
It makes sctp flush work better, as we only need to set up them once
for one flush schedule. It's also safe, since asoc is NULL only when
the packet is created by sctp_ootb_pkt_new in which it just gets the
new dst, no need to do more things for it other than set packet with
transport's pathmtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't
added, as it needs to save abandoned_(un)sent for every stream.
After sctp stream reconf is added in sctp, assoc has structure
sctp_stream_out to save per stream info.
This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp
per stream statistics into sctp_stream_out.
v1->v2:
fix an indent issue.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last argument to __skb_header_pointer() should be a buffer large
enough to store struct arphdr. This can be a pointer to a struct arphdr
structure. The code was previously using a pointer to a pointer to
struct arphdr.
By my counting the storage available both before and after is 8 bytes on
x86_64.
Fixes: 55733350e5 ("flow disector: ARP support")
Reported-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool code was spread in soft-interface.c. This makes reading the
code and working on it unnecessary complicated. Having everything in a
common place next to the other code which references it, makes it slightly
easier.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The .get_settings function pointer and the related API was deprecated.
Fortunately, batman-adv is a virtual interface and never provided any
useful information via .get_settings. The stub can therefore be
removed.
This also avoids that incorrect information is shown in ethtool about the
batadv interface.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
batadv devices don't support msglevel. The ethtool stubs therefore returned
that it isn't supported. But instead, the complete function can be dropped
to avoid that bogus values are shown in ethtool.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ethtool_ops of batman-adv never contained more than a stub for the
get_link function pointer. It was always returning that a link exists even
when the devices was not yet up and therefore nothing resampling a link
could have been available.
Instead use the ethtool helper which returns the current carrier state.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
A dump may come in the middle of another dump, modifying its dump
structure members. This race condition will result in NULL pointer
dereference in kernel. So add a lock to prevent that race.
Fixes: 83321d6b98 ("[AF_KEY]: Dump SA/SP entries non-atomically")
Signed-off-by: Yuejie Shi <syjcnss@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The rds_connect_worker() has a bug in the check that enforces the
canonical connection order described in the comments of
rds_tcp_state_change(). The intention is to make sure that all
the multipath connections are always initiated by the smaller IP
address via rds_start_mprds. To achieve this, rds_connection_worker
should check that cp_index > 0.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_shutdown() runs in workq context, and marks the rds_connection
as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O
has quiesced, we may still find the rds_connection state to be
RDS_CONN_ERROR if an intervening FIN was processed in softirq context.
This is not a fatal error: rds_conn_shutdown() should continue the
shutdown, and there is no need to log noisy messages about this event.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the mess observed in e.g. rsync over a noisy link we'd been
seeing since last Summer. What happens is that we copy part of
a datagram before noticing a checksum mismatch. Datagram will be
resent, all right, but we want the next try go into the same place,
not after it...
All this family of primitives (copy/checksum and copy a datagram
into destination) is "all or nothing" sort of interface - either
we get 0 (meaning that copy had been successful) or we get an
error (and no way to tell how much had been copied before we ran
into whatever error it had been). Make all of them leave iterator
unadvanced in case of errors - all callers must be able to cope
with that (an error might've been caught before the iterator had
been advanced), it costs very little to arrange, it's safer for
callers and actually fixes at least one bug in said callers.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Alow users to push down more labels per MPLS encap. Similar to LSR case,
move label array to the end of mpls_iptunnel_encap and allocate based on
the number of labels for the route.
For consistency with the LSR case, re-use the same maximum number of
labels.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users to push down more labels per MPLS route. With the previous
patches, no memory allocations are based on MAX_NEW_LABELS; the limit
is only used to keep userspace in check.
At this point MAX_NEW_LABELS is only used for mpls_route_config (copying
route data from userspace) and processing nexthops looking for the max
number of labels across the route spec.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Limit memory allocation size for mpls_route to 4096.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move labels to the end of mpls_nh as a 0-sized array and within mpls_route
move the via for a nexthop after the mpls_nh. The new layout becomes:
+----------------------+
| mpls_route |
+----------------------+
| mpls_nh 0 |
+----------------------+
| alignment padding | 4 bytes for odd number of labels; 0 for even
+----------------------+
| via[rt_max_alen] 0 |
+----------------------+
| alignment padding | via's aligned on sizeof(unsigned long)
+----------------------+
| ... |
+----------------------+
| mpls_nh n-1 |
+----------------------+
| via[rt_max_alen] n-1 |
+----------------------+
Memory allocated for nexthop + via is constant across all nexthops and
their via. It is based on the maximum number of labels across all nexthops
and the maximum via length. The size is saved in the mpls_route as
rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size.
The offset of the via address from a nexthop is saved as rt_via_offset
so that given an mpls_nh pointer the via for that hop is simply
nh + rt->rt_via_offset.
With prior code, memory allocated per mpls_route with 1 nexthop:
via is an ethernet address - 64 bytes
via is an ipv4 address - 64
via is an ipv6 address - 72
With this patch set, memory allocated per mpls_route with 1 nexthop and
1 or 2 labels:
via is an ethernet address - 56 bytes
via is an ipv4 address - 56
via is an ipv6 address - 64
The 8-byte reduction is due to the previous patch; the change introduced
by this patch has no impact on the size of allocations for 1 or 2 labels.
Performance impact of this change was examined using network namespaces
with veth pairs connecting namespaces. ns0 inserts the packet to the
label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2
labels depending on test, ns2 (and ns3 for 2-label test) pops the label
and forwards. ns3 (or ns4) for a 2-label is the destination. Similar
series of namespaces used for 2-nexthop test.
Intent is to measure changes to latency (overhead in manipulating the
packet) in the forwarding path. Tests used netperf with UDP_RR.
IPv4: current patches
1 label, 1 nexthop 29908 30115
2 label, 1 nexthop 29071 29612
1 label, 2 nexthop 29582 29776
2 label, 2 nexthop 29086 29149
IPv6: current patches
1 label, 1 nexthop 24502 24960
2 label, 1 nexthop 24041 24407
1 label, 2 nexthop 23795 23899
2 label, 2 nexthop 23074 22959
In short, the change has no effect to a modest increase in performance.
This is expected since this patch does not really have an impact on routes
with 1 or 2 labels (the current limit) and 1 or 2 nexthops.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Number of nexthops and number of alive nexthops are tracked using an
unsigned int. A route should never have more than 255 nexthops so
convert both to u8. Update all references and intermediate variables
to consistently use u8 as well.
Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte
hole before the nexthops.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of alive nexthops for a route (rt->rt_nhn_alive) and the
flags for a next hop (nh->nh_flags) are modified by netdev event
handlers. The event handlers run with rtnl_lock held so updates are
always done with the lock held. The packet path accesses the fields
under the rcu lock. Since those fields can change at any moment in
the packet path, both fields should be accessed using READ_ONCE. Updates
to both fields should use WRITE_ONCE.
Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup
(event handlers) accordingly.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of l2tp_nl_session_find() need to hold a reference on the
returned session since there's no guarantee that it isn't going to
disappear from under them.
Relying on the fact that no l2tp netlink message may be processed
concurrently isn't enough: sessions can be deleted by other means
(e.g. by closing the PPPOL2TP socket of a ppp pseudowire).
l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
function that may require a previous call to session->ref(). In
particular, for ppp pseudowires, the callback is l2tp_session_delete(),
which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
socket. The socket might already be gone at the moment
l2tp_session_delete() calls session->ref(), so we need to take a
reference during the session lookup. So we need to pass the do_ref
variable down to l2tp_session_get() and l2tp_session_get_by_ifname().
Since all callers have to be updated, l2tp_session_find_by_ifname() and
l2tp_nl_session_find() are renamed to reflect their new behaviour.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_session_find() doesn't take any reference on the returned session.
Therefore, the session may disappear while sending the notification.
Use l2tp_session_get() instead and decrement session's refcount once
the notification is sent.
Fixes: 33f72e6f0c ("l2tp : multicast notification to the registered listeners")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_session_create() relies on its caller for checking for duplicate
sessions. This is racy since a session can be concurrently inserted
after the caller's verification.
Fix this by letting l2tp_session_create() verify sessions uniqueness
upon insertion. Callers need to be adapted to check for
l2tp_session_create()'s return code instead of calling
l2tp_session_find().
pppol2tp_connect() is a bit special because it has to work on existing
sessions (if they're not connected) or to create a new session if none
is found. When acting on a preexisting session, a reference must be
held or it could go away on us. So we have to use l2tp_session_get()
instead of l2tp_session_find() and drop the reference before exiting.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Holding a reference on session is required before calling
pppol2tp_session_ioctl(). The session could get freed while processing the
ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
we also need to take a reference on it in l2tp_session_get().
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking a reference on sessions in l2tp_recv_common() is racy; this
has to be done by the callers.
To this end, a new function is required (l2tp_session_get()) to
atomically lookup a session and take a reference on it. Callers then
have to manually drop this reference.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since sctp reconf was added in sctp, the real cnt of in/out stream
have not been c.sinit_max_instreams and c.sinit_num_ostreams any
more.
This patch is to replace them with stream->in/outcnt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
I see the following link error with CONFIG_NET_DSA=y and CONFIG_NET_DEVLINK=m:
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe226b): undefined reference to `devlink_alloc'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe2284): undefined reference to `devlink_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe243e): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24e1): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24fa): undefined reference to `devlink_port_type_eth_set'
net/built-in.o: In function 'dsa_dst_unapply.part.8':
dsa2.c:(.text.unlikely+0x345): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x36c): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x38e): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x3f2): undefined reference to 'devlink_unregister'
dsa2.c:(.text.unlikely+0x3fb): undefined reference to 'devlink_free'
Fix this by adding a dependency on MAY_USE_DEVLINK so that CONFIG_NET_DSA
get switched to be build as module when CONFIG_NET_DEVLINK=m.
Fixes: 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, NFC_EVENT_DEVICE_ADDED doesn't send NFC_ATTR_RF_MODE. But
NFC_CMD_GET_DEVICE send.
To get NFC_ATTR_RF_MODE, we have to call NFC_CMD_GET_DEVICE just for
NFC_ATTR_RF_MODE when get NFC_EVENT_DEVICE_ADDED.
This fixes those inconsistent.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.
Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.
Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.
Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.
To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce crosschip_bridge_{join,leave} operations in the dsa_switch_ops
structure, which can be used by switches supporting interconnection.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.
Fixes: 329f45bc4f ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
backchannel; fix. Also some minor refinements to the nfsd
version-setting interface that we'd like to get fixed before release.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
Za7EtdB84M1ZGgjO3JhY
=oi0+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.
Also some minor refinements to the nfsd version-setting interface that
we'd like to get fixed before release"
* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: set XPT_CONG_CTRL flag for bc xprt
NFSD: fix nfsd_reset_versions for NFSv4.
NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
NFSD: further refinement of content of /proc/fs/nfsd/versions
nfsd: map the ENOKEY to nfserr_perm for avoiding warning
SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
Enhance nl80211 and cfg80211 connect request and response APIs to
support FILS shared key authentication offload. The new nl80211
attributes can be used to provide additional information to the driver
to establish a FILS connection. Also enhance the set/del PMKSA to allow
support for adding and deleting PMKSA based on FILS cache identifier.
Add a new feature flag that drivers can use to advertize support for
FILS shared key authentication and association in station mode when
using their own SME.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the connect event from driver takes all the connection
response parameters as arguments. With support for new features these
response parameters can grow. Use a structure to pass these parameters
rather than passing them as function arguments.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[add to documentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for
every packet, even if the SOCK_TIMESTAMP flag is not set in the
related socket.
If selinux is enabled, this cause a cache miss for every packet
since sk->sk_stamp and sk->sk_security share the same cacheline.
With this change sk_stamp is set only if the SOCK_TIMESTAMP
flag is set, and is cleared for the first packet, so that the user
perceived behavior is unchanged.
This gives up to 5% speed-up under udp-flood with small packets.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Move the "window = tp->rcv_wnd;" into the condition block without
tp->rx_opt.rcv_wscale.
Because it is unnecessary when enable wscale;
2. Use the macro ALIGN instead of two statements.
The two statements are used to make window align to 1<<wscale.
Use the ALIGN is more clearer.
3. Use the rounddown to make codes clearer.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending a msg without asoc established, sctp will send INIT packet
first and then enqueue chunks.
Before receiving INIT_ACK, stream info is not yet alloced. But enqueuing
chunks needs to access stream info, like out stream state and out stream
cnt.
This patch is to fix it by allocing out stream info when initializing an
asoc, allocing in stream and re-allocing out stream when processing init.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than assign the positive errno values to ret and then
checking if it is positive and flip the sign, just return the
errno value.
Detected by CoverityScan, CID#986649 ("Logically Dead Code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These files all use functions declared in interrupt.h, but currently rely
on implicit inclusion of this file (via netns/xfrm.h).
That won't work anymore when the flow cache is removed so include that
header where needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.
Fix by checking that tp_reserve <= INT_MAX on assign.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.
Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
Since frames_per_block <= tp_block_size, the expression would
never overflow.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a rather large update with Netfilter
fixes, specifically targeted to incorrect RCU usage in several spots and
the userspace conntrack helper infrastructure (nfnetlink_cthelper),
more specifically they are:
1) expect_class_max is incorrect set via cthelper, as in kernel semantics
mandate that this represents the array of expectation classes minus 1.
Patch from Liping Zhang.
2) Expectation policy updates via cthelper are currently broken for several
reasons: This code allows illegal changes in the policy such as changing
the number of expeciation classes, it is leaking the updated policy and
such update occurs with no RCU protection at all. Fix this by adding a
new nfnl_cthelper_update_policy() that describes what is really legal on
the update path.
3) Fix several memory leaks in cthelper, from Jeffy Chen.
4) synchronize_rcu() is missing in the removal path of several modules,
this may lead to races since CPU may still be running on code that has
just gone. Also from Liping Zhang.
5) Don't use the helper hashtable from cthelper, it is not safe to walk
over those bits without the helper mutex. Fix this by introducing a
new independent list for userspace helpers. From Liping Zhang.
6) nf_ct_extend_unregister() needs synchronize_rcu() to make sure no
packets are walking on any conntrack extension that is gone after
module removal, again from Liping.
7) nf_nat_snmp may crash if we fail to unregister the helper due to
accidental leftover code, from Gao Feng.
8) Fix leak in nfnetlink_queue with secctx support, from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
for socketpairs using connectionless transport, we cache
the respective node local TIPC portid to use in subsequent
calls to send() in the socket's private data.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sockets A and B are connected back-to-back, similar to what
AF_UNIX does.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge xfrm_user validation fixes from Andy Whitcroft:
"Two patches we are applying to Ubuntu for XFRM_MSG_NEWAE validation
issue reported by ZDI.
The first of these is the primary fix, and the second is for a more
theoretical issue that Kees pointed out when reviewing the first"
* emailed patches from Andy Whitcroft <apw@canonical.com>:
xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
A recent commit skips nexthops in a route if the device has been
deleted. Update lfib_nlmsg_size accordingly.
Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Rx path may grab the socket right before pppol2tp_release(), but
nothing guarantees that it will enqueue packets before
skb_queue_purge(). Therefore, the socket can be destroyed without its
queues fully purged.
Fix this by purging queues in pppol2tp_session_destruct() where we're
guaranteed nothing is still referencing the socket.
Fixes: 9e9cb6221a ("l2tp: fix userspace reception on plain L2TP sockets")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code following l2tp_tunnel_find() expects that a new reference is
held on sk. Either sk_receive_skb() or the discard_put error path will
drop a reference from the tunnel's socket.
This issue exists in both l2tp_ip and l2tp_ip6.
Fixes: a3c18422a4 ("l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
priviledge escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When internal mac80211 TXQs aren't supported, netdev queues must
always started out started even when driver queues are stopped
while the interface is added. This is necessary because with the
internal TXQ support netdev queues are never stopped and packet
scheduling/dropping is done in mac80211.
Cc: stable@vger.kernel.org # 4.9+
Fixes: 80a83cfc43 ("mac80211: skip netdev queue control with software queuing")
Reported-and-tested-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We must call security_release_secctx to free the memory returned by
security_secid_to_secctx, otherwise memory may be leaked forever.
Fixes: ef493bd930 ("netfilter: nfnetlink_queue: add security context information")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On some practical cases, it is useful to drop new node in the distance.
Because mesh metric is calculated with hop count and without RSSI
information, a node far from local peer and near to destination node
could be used as best path.
For example, the nodes are located in linear. Distance of 0 - 1 and
1 - 2 and 2 - 3 is 20meters. 0 to 3 signal is very weak.
0 --- 1 --- 2 --- 3
Though most robust path from 0 to 3 is 0 -> 1 -> 2 -> 3,
unfortunately, node 0 could recognize node 3 as neighbor. Then node 3
could be next of node 0. This patch aims to avoid such a case.
[Johannes:]
Dropping the node entirely isn't ideal, but at least with encryption
there will be a limit on # of keys the hardware can deal with, and
there might also be a limit on the number of stations it supports.
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We got the following use-after-free KASAN report:
BUG: KASAN: use-after-free in wiphy_resume+0x591/0x5a0 [cfg80211]
at addr ffff8803fc244090
Read of size 8 by task kworker/u16:24/2587
CPU: 6 PID: 2587 Comm: kworker/u16:24 Tainted: G B 4.9.13-debug+
Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.19 12/22/2016
Workqueue: events_unbound async_run_entry_fn
ffff880425d4f9d8 ffffffffaeedb541 ffff88042b80ef00 ffff8803fc244088
ffff880425d4fa00 ffffffffae84d7a1 ffff880425d4fa98 ffff8803fc244080
ffff88042b80ef00 ffff880425d4fa88 ffffffffae84da3a ffffffffc141f7d9
Call Trace:
[<ffffffffaeedb541>] dump_stack+0x85/0xc4
[<ffffffffae84d7a1>] kasan_object_err+0x21/0x70
[<ffffffffae84da3a>] kasan_report_error+0x1fa/0x500
[<ffffffffc141f7d9>] ? cfg80211_bss_age+0x39/0xc0 [cfg80211]
[<ffffffffc141f83a>] ? cfg80211_bss_age+0x9a/0xc0 [cfg80211]
[<ffffffffae48d46d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
[<ffffffffae84def1>] __asan_report_load8_noabort+0x61/0x70
[<ffffffffc13fb100>] ? wiphy_suspend+0xbb0/0xc70 [cfg80211]
[<ffffffffc13fb751>] ? wiphy_resume+0x591/0x5a0 [cfg80211]
[<ffffffffc13fb751>] wiphy_resume+0x591/0x5a0 [cfg80211]
[<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
[<ffffffffaf3b206e>] dpm_run_callback+0x6e/0x4f0
[<ffffffffaf3b31b2>] device_resume+0x1c2/0x670
[<ffffffffaf3b367d>] async_resume+0x1d/0x50
[<ffffffffae3ee84e>] async_run_entry_fn+0xfe/0x610
[<ffffffffae3d0666>] process_one_work+0x716/0x1a50
[<ffffffffae3d05c9>] ? process_one_work+0x679/0x1a50
[<ffffffffafdd7b6d>] ? _raw_spin_unlock_irq+0x3d/0x60
[<ffffffffae3cff50>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
[<ffffffffae3d1a80>] worker_thread+0xe0/0x1460
[<ffffffffae3d19a0>] ? process_one_work+0x1a50/0x1a50
[<ffffffffae3e54c2>] kthread+0x222/0x2e0
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffafdd86aa>] ret_from_fork+0x2a/0x40
Object at ffff8803fc244088, in cache kmalloc-1024 size: 1024
Allocated:
PID = 71
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
kasan_kmalloc+0xad/0xe0
kasan_slab_alloc+0x12/0x20
__kmalloc_track_caller+0x134/0x360
kmemdup+0x20/0x50
brcmf_cfg80211_attach+0x10b/0x3a90 [brcmfmac]
brcmf_bus_start+0x19a/0x9a0 [brcmfmac]
brcmf_pcie_setup+0x1f1a/0x3680 [brcmfmac]
brcmf_fw_request_nvram_done+0x44c/0x11b0 [brcmfmac]
request_firmware_work_func+0x135/0x280
process_one_work+0x716/0x1a50
worker_thread+0xe0/0x1460
kthread+0x222/0x2e0
ret_from_fork+0x2a/0x40
Freed:
PID = 2568
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
kasan_slab_free+0x71/0xb0
kfree+0xe8/0x2e0
brcmf_cfg80211_detach+0x62/0xf0 [brcmfmac]
brcmf_detach+0x14a/0x2b0 [brcmfmac]
brcmf_pcie_remove+0x140/0x5d0 [brcmfmac]
brcmf_pcie_pm_leave_D3+0x198/0x2e0 [brcmfmac]
pci_pm_resume+0x186/0x220
dpm_run_callback+0x6e/0x4f0
device_resume+0x1c2/0x670
async_resume+0x1d/0x50
async_run_entry_fn+0xfe/0x610
process_one_work+0x716/0x1a50
worker_thread+0xe0/0x1460
kthread+0x222/0x2e0
ret_from_fork+0x2a/0x40
Memory state around the buggy address:
ffff8803fc243f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8803fc244000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8803fc244080: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8803fc244100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8803fc244180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
What is happening is that brcmf_pcie_resume() detects a device that
is no longer responsive and it decides to unbind resulting in a
wiphy_unregister() and wiphy_free() call. Now the wiphy instance
remains allocated, because PM needs to call wiphy_resume() for it.
However, brcmfmac already does a kfree() for the struct
cfg80211_registered_device::ops field. Change the checks in
wiphy_resume() to only access the struct cfg80211_registered_device::ops
if the wiphy instance is still registered at this time.
Cc: stable@vger.kernel.org # 4.10.x, 4.9.x
Reported-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Register the switch and its ports with devlink.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.
Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.
No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send netconf notifications for MPLS when the device registers and
unregisters.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor mpls_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet6_netconf_fill_devconf to skip devconf
references if it is NULL.
Allows a userspace cache to remove entries as devices are deleted.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor inet6_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet_netconf_fill_devconf to skip devconf
references if it is NULL.
Allows a userspace cache to remove entries as devices are deleted.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor inet_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I do not hold the copyright of the DSA core and drivers source files,
since these changes have been written as an initiative of my day job.
Fix this.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for NETDEV_RESEND_IGMP event similar
to how it works for IPv4.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reference count held for skb needs to be released when the skb's
nfct pointer is cleared regardless of if nf_ct_delete() is called or
not.
Failing to release the skb's reference cound led to deferred conntrack
cleanup spinning forever within nf_conntrack_cleanup_net_list() when
cleaning up a network namespace:
kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
kworker/u16:0-19025 [004] 45981067.173651: kernel_stack: <stack trace>
=> ___preempt_schedule (ffffffffa001ed36)
=> _raw_spin_unlock_bh (ffffffffa0713290)
=> nf_ct_iterate_cleanup (ffffffffc00a4454)
=> nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
=> nf_conntrack_pernet_exit (ffffffffc00a63dd)
=> ops_exit_list.isra.1 (ffffffffa06075f3)
=> cleanup_net (ffffffffa0607df0)
=> process_one_work (ffffffffa0084c31)
=> worker_thread (ffffffffa008592b)
=> kthread (ffffffffa008bee2)
=> ret_from_fork (ffffffffa071b67c)
Fixes: dd41d33f0b ("openvswitch: Add force commit.")
Reported-by: Yang Song <yangsong@vmware.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same change as Kinglong Mee's fix for the TCP backchannel service.
Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When a new subscription object is inserted into name_seq->subscriptions
list, it's under name_seq->lock protection; when a subscription is
deleted from the list, it's also under the same lock protection;
similarly, when accessing a subscription by going through subscriptions
list, the entire process is also protected by the name_seq->lock.
Therefore, if subscription refcount is increased before it's inserted
into subscriptions list, and its refcount is decreased after it's
deleted from the list, it will be unnecessary to hold refcount at all
before accessing subscription object which is obtained by going through
subscriptions list under name_seq->lock protection.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a subscription object is created, it's inserted into its
subscriber subscrp_list list under subscriber lock protection,
similarly, before it's destroyed, it should be first removed from
its subscriber->subscrp_list. Since the subscription list is
accessed with subscriber lock, all the subscriptions are valid
during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
remove subscription get/put and the extra subscriber unlock/lock.
After this change, the subscriptions refcount cleanup is very simple
and does not access any lock.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By moving these client drivers to use RPMSG instead of the direct SMD
API we can reuse them ontop of the newly added GLINK wire-protocol
support found in the 820 and 835 Qualcomm platforms.
As the new (RPMSG-based) and old SMD implementations are mutually
exclusive we have to change all client drivers in one commit, to make
sure we have a working system before and after this transition.
Acked-by: Andy Gross <andy.gross@linaro.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Laight noticed the support for MSG_MORE with datamsg->force_delay
didn't really work as we expected, as the first msg with MSG_MORE set
would always block the following chunks' dequeuing.
This Patch is to rewrite it by saving the MSG_MORE flag into assoc as
David Laight suggested.
asoc->force_delay is used to save MSG_MORE flag before a msg is sent.
All chunks in queue would not be sent out if asoc->force_delay is set
by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag
clears asoc->force_delay.
Note that this change would not affect the flush is generated by other
triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc.
v1->v2:
Not clear asoc->force_delay after sending the msg with MSG_MORE flag.
Fixes: 4ea0c32f5f ("sctp: add support for MSG_MORE")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Laight <david.laight@aculab.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pipeline debug is used to export the pipeline abstractions for the
main objects - tables, headers and entries. The only support for set is
for changing the counter parameter on specific table.
The basic structures:
Header - can represent a real protocol header information or internal
metadata. Generic protocol headers like IPv4 can be shared
between drivers. Each driver can add local headers.
Field - part of a header. Can represent protocol field or specific ASIC
metadata field. Hardware special metadata fields can be mapped
to different resources, for example switch ASIC ports can have
internal number which from the systems point of view is mapped
to netdeivce ifindex.
Match - represent specific match rule. Can describe match on specific
field or header. The header index should be specified as well
in order to support several header instances of the same type
(tunneling).
Action - represents specific action rule. Actions can describe operations
on specific field values for example like set, increment, etc.
And header operation like add and delete.
Value - represents value which can be associated with specific match or
action.
Table - represents a hardware block which can be described with match/
action behavior. The match/action can be done on the packets
data or on the internal metadata that it gathered along the
packets traversal throw the pipeline which is vendor specific
and should be exported in order to provide understanding of
ASICs behavior.
Entry - represents single record in a specific table. The entry is
identified by specific combination of values for match/action.
Prior to accessing the tables/entries the drivers provide the header/
field data base which is used by driver to user-space. The data base
is split between the shared headers and unique headers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our chosen ic_dev may be anywhere in our list of ic_devs, and we may
free it before attempting to close others. When we compare d->dev and
ic_dev->dev, we're potentially dereferencing memory returned to the
allocator. This causes KASAN to scream for each subsequent ic_dev we
check.
As there's a 1-1 mapping between ic_devs and netdevs, we can instead
compare d and ic_dev directly, which implicitly handles the !ic_dev
case, and avoids the use-after-free. The ic_dev pointer may be stale,
but we will not dereference it.
Original splat:
[ 6.487446] ==================================================================
[ 6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr ffff800367efa708
[ 6.503013] Read of size 8 by task swapper/0/1
[ 6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-00002-gda42158 #8
[ 6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 3.05.05-beta_rc Jan 27 2016
[ 6.523138] Call trace:
[ 6.525590] [<ffff200008094778>] dump_backtrace+0x0/0x570
[ 6.530976] [<ffff200008094d08>] show_stack+0x20/0x30
[ 6.536017] [<ffff200008bee928>] dump_stack+0x120/0x188
[ 6.541231] [<ffff20000856d5e4>] kasan_object_err+0x24/0xa0
[ 6.546790] [<ffff20000856d924>] kasan_report_error+0x244/0x738
[ 6.552695] [<ffff20000856dfec>] __asan_report_load8_noabort+0x54/0x80
[ 6.559204] [<ffff20000aae86ac>] ic_close_devs+0xc4/0x154
[ 6.564590] [<ffff20000aaedbac>] ip_auto_config+0x2ed4/0x2f1c
[ 6.570321] [<ffff200008084b04>] do_one_initcall+0xcc/0x370
[ 6.575882] [<ffff20000aa31de8>] kernel_init_freeable+0x5f8/0x6c4
[ 6.581959] [<ffff20000a16df00>] kernel_init+0x18/0x190
[ 6.587171] [<ffff200008084710>] ret_from_fork+0x10/0x40
[ 6.592468] Object at ffff800367efa700, in cache kmalloc-128 size: 128
[ 6.598969] Allocated:
[ 6.601324] PID = 1
[ 6.603427] save_stack_trace_tsk+0x0/0x418
[ 6.607603] save_stack_trace+0x20/0x30
[ 6.611430] kasan_kmalloc+0xd8/0x188
[ 6.615087] ip_auto_config+0x8c4/0x2f1c
[ 6.619002] do_one_initcall+0xcc/0x370
[ 6.622832] kernel_init_freeable+0x5f8/0x6c4
[ 6.627178] kernel_init+0x18/0x190
[ 6.630660] ret_from_fork+0x10/0x40
[ 6.634223] Freed:
[ 6.636233] PID = 1
[ 6.638334] save_stack_trace_tsk+0x0/0x418
[ 6.642510] save_stack_trace+0x20/0x30
[ 6.646337] kasan_slab_free+0x88/0x178
[ 6.650167] kfree+0xb8/0x478
[ 6.653131] ic_close_devs+0x130/0x154
[ 6.656875] ip_auto_config+0x2ed4/0x2f1c
[ 6.660875] do_one_initcall+0xcc/0x370
[ 6.664705] kernel_init_freeable+0x5f8/0x6c4
[ 6.669051] kernel_init+0x18/0x190
[ 6.672534] ret_from_fork+0x10/0x40
[ 6.676098] Memory state around the buggy address:
[ 6.680880] ffff800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 6.688078] ffff800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6.695276] >ffff800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6.702469] ^
[ 6.705952] ffff800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6.713149] ffff800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6.720343] ==================================================================
[ 6.727536] Disabling lock debugging due to kernel taint
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_IPV6_SEG6_LWTUNNEL is selected, automatically select DST_CACHE.
This allows to remove multiple ifdefs.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
When all devices for all nexthops in a route have been deleted, the
route is effectively dead, so remove it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the device for a nexthop in a multipath route is deleted, the nexthop
is effectively removed from the route. Currently, a route dump still
returns the nexhop though without the device set:
$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev br0
nexthopvia inet 10.100.3.1 dev eth3
$ ip li del br0
$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev * dead linkdown
nexthopvia inet 10.100.3.1 dev eth3
Since the nexthop is effectively deleted, drop the hop from the route
dump.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the commit 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp
helper"), the snmp_helper is replaced by nf_nat_snmp_hook. So the
snmp_helper is never registered. But it still tries to unregister the
snmp_helper, it could cause the panic.
Now remove the useless snmp_helper and the unregister call in the
error handler.
Fixes: 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp helper")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If one cpu is doing nf_ct_extend_unregister while another cpu is doing
__nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover,
there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to
NULL, so it's possible that we may access invalid pointer.
But actually, most of the ct extends are built-in, so the problem listed
above will not happen. However, there are two exceptions: NF_CT_EXT_NAT
and NF_CT_EXT_SYNPROXY.
For _EXT_NAT, the panic will not happen, since adding the nat extend and
unregistering the nat extend are located in the same file(nf_nat_core.c),
this means that after the nat module is removed, we cannot add the nat
extend too.
For _EXT_SYNPROXY, synproxy extend may be added by init_conntrack, while
synproxy extend unregister will be done by synproxy_core_exit. So after
nf_synproxy_core.ko is removed, we may still try to add the synproxy
extend, then kernel panic may happen.
I know it's very hard to reproduce this issue, but I can play a tricky
game to make it happen very easily :)
Step 1. Enable SYNPROXY for tcp dport 1234 at FORWARD hook:
# iptables -I FORWARD -p tcp --dport 1234 -j SYNPROXY
Step 2. Queue the syn packet to the userspace at raw table OUTPUT hook.
Also note, in the userspace we only add a 20s' delay, then
reinject the syn packet to the kernel:
# iptables -t raw -I OUTPUT -p tcp --syn -j NFQUEUE --queue-num 1
Step 3. Using "nc 2.2.2.2 1234" to connect the server.
Step 4. Now remove the nf_synproxy_core.ko quickly:
# iptables -F FORWARD
# rmmod ipt_SYNPROXY
# rmmod nf_synproxy_core
Step 5. After 20s' delay, the syn packet is reinjected to the kernel.
Now you will see the panic like this:
kernel BUG at net/netfilter/nf_conntrack_extend.c:91!
Call Trace:
? __nf_ct_ext_add_length+0x53/0x3c0 [nf_conntrack]
init_conntrack+0x12b/0x600 [nf_conntrack]
nf_conntrack_in+0x4cc/0x580 [nf_conntrack]
ipv4_conntrack_local+0x48/0x50 [nf_conntrack_ipv4]
nf_reinject+0x104/0x270
nfqnl_recv_verdict+0x3e1/0x5f9 [nfnetlink_queue]
? nfqnl_recv_verdict+0x5/0x5f9 [nfnetlink_queue]
? nla_parse+0xa0/0x100
nfnetlink_rcv_msg+0x175/0x6a9 [nfnetlink]
[...]
One possible solution is to make NF_CT_EXT_SYNPROXY extend built-in, i.e.
introduce nf_conntrack_synproxy.c and only do ct extend register and
unregister in it, similar to nf_conntrack_timeout.c.
But having such a obscure restriction of nf_ct_extend_unregister is not a
good idea, so we should invoke synchronize_rcu after set nf_ct_ext_types
to NULL, and check the NULL pointer when do __nf_ct_ext_add_length. Then
it will be easier if we add new ct extend in the future.
Last, we use kfree_rcu to free nf_ct_ext, so rcu_barrier() is unnecessary
anymore, remove it too.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nf_ct_helper_hash table is protected by nf_ct_helper_mutex, while
nfct_helper operation is protected by nfnl_lock(NFNL_SUBSYS_CTHELPER).
So it's possible that one CPU is walking the nf_ct_helper_hash for
cthelper add/get/del, another cpu is doing nf_conntrack_helpers_unregister
at the same time. This is dangrous, and may cause use after free error.
Note, delete operation will flush all cthelpers added via nfnetlink, so
using rcu to do protect is not easy.
Now introduce a dummy list to record all the cthelpers added via
nfnetlink, then we can walk the dummy list instead of walking the
nf_ct_helper_hash. Also, keep nfnl_cthelper_dump_table unchanged, it
may be invoked without nfnl_lock(NFNL_SUBSYS_CTHELPER) held.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, another CPU may access the invalid pointer. For example:
CPU0 CPU1
- rcu_read_lock();
- pfunc = _hook_;
_hook_ = NULL; -
mod unload -
- pfunc(); // invalid, panic
- rcu_read_unlock();
So we must call synchronize_rcu() to wait the rcu reader to finish.
Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked
by later nf_conntrack_helper_unregister, but I'm inclined to add a
explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend
on such obscure assumptions is not a good idea.
Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object,
so in cttimeout_exit, invoking rcu_barrier() is not necessary at all,
remove it too.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch refactors the num_packets counter of a forw_packet in the
following three ways:
1) Removed dual-use of forw_packet::num_packets:
-> now for aggregation purposes only
2) Using forw_packet::skb::cb::num_bcasts instead:
-> for easier access in aggregation code later
3) make access to num_bcasts private to batadv_forw_packet_*()
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: Change num_bcasts to unsigned]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
An skb is assigned to a forw_packet only once, shortly after the
forw_packet allocation.
With this patch the assignment is moved into the this allocation
function.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
We got a report of yet another bug in ping
http://www.openwall.com/lists/oss-security/2017/03/24/6
->disconnect() is not called with socket lock held.
Fix this by acquiring ping rwlock earlier.
Thanks to Daniel, Alexander and Andrey for letting us know this problem.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.
If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.
This enables re-using this function in epoll busy loop implementation.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checking the return value of sk_busy_loop. As there are only a few
consumers of that data, and the data being checked for can be replaced
with a check for !skb_queue_empty() we might as well just pull the code
out of sk_busy_loop and place it in the spots that actually need it.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set. In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.
In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.
Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.
One issue I found is that we weren't validating the napi_id as being valid
before we started trying to setup the busy polling. This change corrects
that by using the MIN_NAPI_ID value that is now used in both allocating the
NAPI IDs, as well as validating them.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately too many devices (not under our control) use tcp_tw_recycle=1,
which depends on timestamps being identical of the same saddr.
Although tcp_tw_recycle got removed in net-next we can't make
such end hosts disappear so downgrade to per-host timestamp offsets.
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reported-by: Yvan Vanrossomme <yvan@vanrossomme.net>
Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change basically codifies what I think was already the limitations on
the busy_poll and busy_read sysctl interfaces. We weren't checking the
lower bounds and as such could input negative values. The behavior when
that was used was dependent on the architecture. In order to prevent any
issues with that I am just disabling support for values less than 0 since
this way we don't have to worry about any odd behaviors.
By limiting the sysctl values this way it also makes it consistent with how
we handle the SO_BUSY_POLL socket option since the value appears to be
reported as a signed integer value and negative values are rejected.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already use dst_cache in seg6_output, when handling locally generated
packets. We extend it in seg6_input, to also handle forwarded packets, and avoid
unnecessary fib lookups.
Performances for SRH encapsulation before the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
884006pps 7072Mb/sec (7072048000bps) errors: 0
Performances after the patch:
Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
1047220pps 8377Mb/sec (8377760000bps) errors: 0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
To insert or encapsulate a packet with an SRH, we need a large enough skb
headroom. Currently, we are using pskb_expand_head to inconditionally increase
the size of the headroom by the amount needed by the SRH (and IPv6 header).
If this reallocation is performed by another CPU than the one that initially
allocated the skb, then when the initial CPU kfree the skb, it will enter the
__slab_free slowpath, impacting performances.
This patch replaces pskb_expand_head with skb_cow_head, that will reallocate the
skb head only if the headroom is not large enough.
Performances for SRH encapsulation before the patch:
Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
680427pps 5443Mb/sec (5443416000bps) errors: 0
Performances after the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
884006pps 7072Mb/sec (7072048000bps) errors: 0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reclaim path, tagged for stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJY1QoTAAoJEEp/3jgCEfOLtnoH/0uj3Y2QUVHDo0Je9HkQhzVC
Mp2a/WzN1YO/wdYRdxs6LbVqMIukPJ5kKbKO3WIgpJZR9lxSM9k5Y0YC98YVBHuB
v1uuTPe60fmdJRYnoD4/SfwXiIiAF5/dzPp80SI4xQ9zN24dS1wBREkH2eXUeoKL
FCEoHUv7cJju1dGNbcGpv4MV75b0e+HHAxPuDG4fiUdT79Vp+wP7exx0lMkS9ub8
i5T1GU5gDDBR+SKhpqMIvNgR7s3zdInUs476tk6jz/o049BwybkLIuDhx0+x93/g
1Y1KXHYPCT3yZeL5kdJxM6Dul6wDYOsr2/wIPHzY0ym9RXqCEy/PFowJnL+7eWk=
=cTM6
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a writeback deadlock caused by a GFP_KERNEL allocation on
the reclaim path, tagged for stable"
* tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client:
libceph: force GFP_NOIO for socket allocations
Fix copy and paste error setting rt_ttl_propagate.
Fixes: 5b441ac878 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converting IPv4 address doesn't need 64-bit arithmetic.
Space savings: 10 bytes!
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function old new delta
in_aton 96 86 -10
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.
By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources
The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.
Both sysctls are enabled by default to preserve existing behavior.
v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.
v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.
v3->v4: Store and use read once result instead of querying pointer
again incorrectly.
v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Tom Herbert <tom@herbertland.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to continue after a copy_from_user()
failure.
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sch_choke is classless qdisc so it does not define cl_ops. Therefore
filter_list cannot be ever changed, being NULL all the time.
Reason is this check in tc_ctl_tfilter:
/* Is it classful? */
cops = q->ops->cl_ops;
if (!cops)
return -EINVAL;
So remove this dead code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NTF_EXT_LEARNED flag was added for switchdev and externally learned
entries, but it can also be used for entries learned via a software
in user-space which requires dynamic entries that do not expire.
One such case that we have is with quagga and evpn which need dynamic
entries but also require to age them themselves.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow to take over an entry which was previously learned via HW when it
shows up from a SW port. This is analogous to how HW takes over SW learned
entries already.
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry posted a nice reproducer of a bug triggering in neigh_probe()
when dereferencing a NULL neigh->ops->solicit method.
This can happen for arp_direct_ops/ndisc_direct_ops and similar,
which can be used for NUD_NOARP neighbours (created when dev->header_ops
is NULL). Admin can then force changing nud_state to some other state
that would fire neigh timer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
after act_csum computes the checksum on skbs carrying GSO TCP/UDP packets,
subsequent segmentation fails because skb_needs_check(skb, true) returns
true. Because of that, skb_warn_bad_offload() is invoked and the following
message is displayed:
WARNING: CPU: 3 PID: 28 at net/core/dev.c:2553 skb_warn_bad_offload+0xf0/0xfd
<...>
[<ffffffff8171f486>] skb_warn_bad_offload+0xf0/0xfd
[<ffffffff8161304c>] __skb_gso_segment+0xec/0x110
[<ffffffff8161340d>] validate_xmit_skb+0x12d/0x2b0
[<ffffffff816135d2>] validate_xmit_skb_list+0x42/0x70
[<ffffffff8163c560>] sch_direct_xmit+0xd0/0x1b0
[<ffffffff8163c760>] __qdisc_run+0x120/0x270
[<ffffffff81613b3d>] __dev_queue_xmit+0x23d/0x690
[<ffffffff81613fa0>] dev_queue_xmit+0x10/0x20
Since GSO is able to compute checksum on individual segments of such skbs,
we can simply skip mangling the packet.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket need to be a fullsock otherwise overflowuid is
returned.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set.If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function coud be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Several netfilter fixes from Pablo and the crew:
- Handle fragmented packets properly in netfilter conntrack, from
Florian Westphal.
- Fix SCTP ICMP packet handling, from Ying Xue.
- Fix big-endian bug in nftables, from Liping Zhang.
- Fix alignment of fake conntrack entry, from Steven Rostedt.
2) Fix feature flags setting in fjes driver, from Taku Izumi.
3) Openvswitch ipv6 tunnel source address not set properly, from Or
Gerlitz.
4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.
5) sk->sk_frag.page not released properly in some cases, from Eric
Dumazet.
6) Fix RTNL deadlocks in nl80211, from Johannes Berg.
7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.
8) Cure improper inflight handling during AF_UNIX GC, from Andrey
Ulanov.
9) sch_dsmark doesn't write to packet headers properly, from Eric
Dumazet.
10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
Yeganeh.
11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.
12) Fix nametbl deadlock in tipc, from Ying Xue.
13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
Pressman.
14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.
15) Fix hashmap allocation handling, from Alexei Starovoitov.
16) nl_fib_input() needs stronger netlink message length checking, from
Eric Dumazet.
17) Fix double-free of sk->sk_filter during sock clone, from Daniel
Borkmann.
18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
net:ethernet:aquantia: Fix for RX checksum offload.
amd-xgbe: Fix the ECC-related bit position definitions
sfc: cleanup a condition in efx_udp_tunnel_del()
Bluetooth: btqcomsmd: fix compile-test dependency
inet: frag: release spinlock before calling icmp_send()
tcp: initialize icsk_ack.lrcvtime at session start time
genetlink: fix counting regression on ctrl_dumpfamily()
socket, bpf: fix sk_filter use after free in sk_clone_lock
ipv4: provide stronger user input validation in nl_fib_input()
bpf: fix hashmap extra_elems logic
enic: update enic maintainers
net: bcmgenet: remove bcmgenet_internal_phy_setup()
ipv6: make sure to initialize sockc.tsflags before first use
fjes: Do not load fjes driver if extended socket device is not power on.
fjes: Do not load fjes driver if system does not have extended socket device.
net/mlx5e: Count LRO packets correctly
net/mlx5e: Count GSO packets correctly
net/mlx5: Increase number of max QPs in default profile
net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
...
icsk_ack.lrcvtime has a 0 value at socket creation time.
tcpi_last_data_recv can have bogus value if no payload is ever received.
This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2ae0f17df1 ("genetlink: use idr to track families") replaced
if (++n < fams_to_skip)
continue;
into:
if (n++ < fams_to_skip)
continue;
This subtle change cause that on retry ctrl_dumpfamily() call we omit
one family that failed to do ctrl_fill_info() on previous call, because
cb->args[0] = n number counts also family that failed to do
ctrl_fill_info().
Patch fixes the problem and avoid confusion in the future just decrease
n counter when ctrl_fill_info() fail.
User visible problem caused by this bug is failure to get access to
some genetlink family i.e. nl80211. However problem is reproducible
only if number of registered genetlink families is big enough to
cause second call of ctrl_dumpfamily().
Cc: Xose Vazquez Perez <xose.vazquez@gmail.com>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.
sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.
The issue is that commit 278571baca ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.
Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.
Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parents sk_filter.
Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.
Fixes: 278571baca ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds a new sysctl accept_ra_rt_info_min_plen that
defines the minimum acceptable prefix length of Route Information
Options. The new sysctl is intended to be used together with
accept_ra_rt_info_max_plen to configure a range of acceptable
prefix lengths. It is useful to prevent misconfigurations from
unintentionally blackholing too much of the IPv6 address space
(e.g., home routers announcing RIOs for fc00::/7, which is
incorrect).
Signed-off-by: Joel Scherpelz <jscherpelz@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander reported a KMSAN splat caused by reads of uninitialized
field (tb_id_in) from user provided struct fib_result_nl
It turns out nl_fib_input() sanity tests on user input is a bit
wrong :
User can pretend nlh->nlmsg_len is big enough, but provide
at sendmsg() time a too small buffer.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the rtnl_dump_all to dump all netconf handlers that have been
registered. Allows userspace to send a dump request for PF_UNSPEC
and get all families.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would
jump to do_append_data, skipping the initialization of sockc.tsflags.
Fix the problem by moving sockc.tsflags initialization earlier.
The bug was detected with KMSAN.
Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, tipc_nametbl_unsubscribe() is called at subscriptions
reference count cleanup. Usually the subscriptions cleanup is
called at subscription timeout or at subscription cancel or at
subscriber delete.
We have ignored the possibility of this being called from other
locations, which causes deadlock as we try to grab the
tn->nametbl_lock while holding it already.
CPU1: CPU2:
---------- ----------------
tipc_nametbl_publish
spin_lock_bh(&tn->nametbl_lock)
tipc_nametbl_insert_publ
tipc_nameseq_insert_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
tipc_close_conn
tipc_subscrb_release_cb
tipc_subscrb_delete
tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>
CPU1: CPU2:
---------- ----------------
tipc_nametbl_stop
spin_lock_bh(&tn->nametbl_lock)
tipc_purge_publications
tipc_nameseq_remove_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
tipc_close_conn
tipc_subscrb_release_cb
tipc_subscrb_delete
tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>
In this commit, we advance the calling of tipc_nametbl_unsubscribe()
from the refcount cleanup to the intended callers.
Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user_mss is zero, it means use the default value. But the current
codes don't permit user set TCP_MAXSEG to the default value.
It would return the -EINVAL when val is zero.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added clone_execute() that both the sample and the recirc
action implementation can use.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of open flow 'clone' action, the OVS user space
can now translate the 'clone' action into kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.
While the sample action in the datpath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitation: First, there is a 3 level of
nesting restriction, enforced at the flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoid recursive call only if the sample
action list has a single userspace action.
The main optimization implemented in this series removes the static
nesting limit check, instead, implement the run time recursion limit
check, and recursion avoidance similar to that of the 'recirc' action.
This optimization solve both #1 and #2 issues above.
One related optimization attempts to avoid copying flow key as
long as the actions enclosed does not change the flow key. The
detection is performed only once at the flow downloading time.
Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The logic of allocating and copy key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse
as well. Refactor the logic into its own function clone_key().
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far both 'sample' and 'recirc'
actions happens to carry actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary, future
patch will need to pass in action lists that are not fully encoded
by themselves.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows reading of SK_MEMINFO_VARS via socket option. This way an
application can get all meminfo related information in single socket
option call instead of multiple calls.
Adds helper function, sk_get_meminfo(), and uses that for both
getsockopt and sock_diag_put_meminfo().
Suggested by Eric Dumazet.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to remove the unnecessary temporary variable 'err' from
sctp_association_init.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh notifications today carry pid 0 for nlmsg_pid
in all cases. This patch fixes it to carry calling process
pid when available. Applications (eg. quagga) rely on
nlmsg_pid to ignore notifications generated by their own
netlink operations. This patch follows the routing subsystem
which already sets this correctly.
Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net_cls controller controls the classid field of each socket which
is associated with the cgroup. Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51 ("cgroups: Allow dynamically changing net_classid").
While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations. However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css. This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.
On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds. Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.
Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse. Before bfc2cf6f61 ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.
As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those. The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used. This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.
The worst symptom is already fixed in upstream by bfc2cf6f61
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.
This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated. This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup. As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().
Reported-by: David Goode <dgoode@fb.com>
Fixes: 3b13758f51 ("cgroups: Allow dynamically changing net_classid")
Cc: stable@vger.kernel.org # v4.4+
Cc: Nina Schiff <ninasc@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have memory leaks of nf_conntrack_helper & expect_policy.
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We only allow runtime updates of expectation policies for timeout and
maximum number of expectations, otherwise reject the update.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
Consider the following situation which has been found in a test setup:
Gateway B has claimed client C and gateway A has the same backbone
network as B. C sends a broad- or multicast to B and directly after
this packet decides to send another packet to A due to a better TQ
value. B will forward the broad-/multicast into the backbone as it is
the responsible gw and after that A will claim C as it has been
chosen by C as the best gateway. If it now happens that A claims C
before it has received the broad-/multicast forwarded by B (due to
backbone topology or due to some delay in B when forwarding the
packet) we get a critical situation: in the current code A will
immediately unclaim C when receiving the multicast due to the
roaming client scenario although the position of C has not changed
in the mesh. If this happens the multi-/broadcast forwarded by B
will be sent back into the mesh by A and we have looping packets
until one of the gateways claims C again.
In order to prevent this, unclaiming of a client due to the roaming
client scenario is only done after a certain time is expired after
the last claim of the client. 100 ms are used here, which should be
slow enough for big backbones and slow gateways but fast enough not
to break the roaming client use case.
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Some of the bla debug messages are extended and additional messages are
added for easier bla debugging. Some debug messages introduced with the
dat changes in prior patches of this patch series have been changed to
be more compliant to other existing debug messages.
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Additional dropping of unicast packets received from another backbone gw if
the same backbone network before being forwarded to the same backbone again
is necessary. It was observed in a test setup that in rare cases these
frames lead to looping unicast traffic backbone->mesh->backbone.
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
If none of the backbone gateways in a bla setup has already knowledge of
the mac address searched for in an incoming ARP request from the backbone
an address resolution via the DHT of DAT is started. The gateway can send
several ARP requests to different DHT nodes and therefore can get several
replies. This patch assures that not all of the possible ARP replies are
returned to the backbone by checking the local DAT cache of the gateway.
If there is an entry in the local cache the gateway has already learned
the requested address and there is no need to forward the additional reply
to the backbone.
Furthermore it is checked if this gateway has claimed the source of the ARP
reply and only forwards it to the backbone if it has claimed the source or
if there is no claim at all.
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
If dat is enabled it must be made sure that only the backbone gw which has
claimed the remote destination for the ARP request answers the ARP request
directly if the MAC address is known due to the local dat table. This
prevents multiple ARP replies in a common backbone if more than one
gateway already knows the remote mac searched for in the ARP request.
Signed-off-by: Andreas Pape <apape@phoenixcontact.com>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
SOF_TIMESTAMPING_OPT_STATS can be enabled and disabled
while packets are collected on the error queue.
So, checking SOF_TIMESTAMPING_OPT_STATS in sk->sk_tsflags
is not enough to safely assume that the skb contains
OPT_STATS data.
Add a bit in sock_exterr_skb to indicate whether the
skb contains opt_stats data.
Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sock_recv_timestamp can be called for both normal skbs (for
receive timestamps) and for skbs on the error queue (for transmit
timestamps).
Commit 1c885808e4
(tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING)
assumes any skb passed to __sock_recv_timestamp are from
the error queue, containing OPT_STATS in the content of the skb.
This results in accessing invalid memory or generating junk
data.
To fix this, set skb->pkt_type to PACKET_OUTGOING for packets
on the error queue. This is safe because on the receive path
on local sockets skb->pkt_type is never set to PACKET_OUTGOING.
With that, copy OPT_STATS from a packet, only if its pkt_type
is PACKET_OUTGOING.
Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to fix the issue that sctp_prsctp_prune_sent forgot
to update q->out_qlen when removing a chunk from unsent queue.
Fixes: 8dbdf1f5b0 ("sctp: implement prsctp PRIO policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c86a773c78 ("sctp: add dst_pending_confirm flag") introduced
a temporary variable "confirm" in sctp_packet_transmit.
But it broke the rule that longer lines should be above shorter ones.
Besides, this variable is not necessary, so this patch is to just
remove it and use tp->dst_pending_confirm directly.
Fixes: c86a773c78 ("sctp: add dst_pending_confirm flag")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_cow(skb, sizeof(ip header)) is not very helpful in this context.
First we need to use pskb_may_pull() to make sure the ip header
is in skb linear part, then use skb_try_make_writable() to
address clones issues.
Fixes: 4c30719f4f ("[PKT_SCHED] dsmark: handle cloned and non-linear skb's")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
0 - layer 3 (default)
1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case, the flow dissector's field consistentification should handle the
address order thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wanted_features is a set of features which have to be enabled if a
hardware allows that.
Currently when a vlan device is created, its wanted_features is set to
current features of its base device.
The problem is that the base device can get new features and they are
not propagated to vlan-s of this device.
If we look at bonding devices, they doesn't have this problem and this
patch suggests to fix this issue by the same way how it works for bonding
devices.
We meet this problem, when we try to create a vlan device over a bonding
device. When a system are booting, real devices require time to be
initialized, so bonding devices created without slaves, then vlan
devices are created and only then ethernet devices are added to the
bonding device. As a result we have vlan devices with disabled
scatter-gather.
* create a bonding device
$ ip link add bond0 type bond
$ ethtool -k bond0 | grep scatter
scatter-gather: off
tx-scatter-gather: off [requested on]
tx-scatter-gather-fraglist: off [requested on]
* create a vlan device
$ ip link add link bond0 name bond0.10 type vlan id 10
$ ethtool -k bond0.10 | grep scatter
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off
* Add a slave device to bond0
$ ip link set dev eth0 master bond0
And now we can see that the bond0 device has got the scatter-gather
feature, but the bond0.10 hasn't got it.
[root@laptop linux-task-diag]# ethtool -k bond0 | grep scatter
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
[root@laptop linux-task-diag]# ethtool -k bond0.10 | grep scatter
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off
With this patch the vlan device will get all new features from the
bonding device.
Here is a call trace how features which are set in this patch reach
dev->wanted_features.
register_netdevice
vlan_dev_init
...
dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
NETIF_F_ALL_FCOE;
dev->features |= dev->hw_features;
...
dev->wanted_features = dev->features & dev->hw_features;
__netdev_update_features(dev);
vlan_dev_fix_features
...
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().
The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.
Commit 6209344f5a ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.
This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.
kernel BUG at net/unix/garbage.c:149!
RIP: 0010:[<ffffffff8717ebf4>] [<ffffffff8717ebf4>]
unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
Call Trace:
[<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
[<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
[<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
[<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
[<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
[<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
[<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
[<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
[<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
[<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
[<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
[<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
[<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
[<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
[<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
[<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
[<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
arch/x86/entry/common.c:259
[<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we'll leave the packets queued until releasing vsock device.
E.g., if guest is slow to start up, resulting ETIMEDOUT on connect, guest
will get the connect requests from failed host sockets.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can cancel a queued pkt later if necessary.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Tue, Mar 14, 2017 at 10:44:10AM +0100, Dmitry Vyukov wrote:
>
> Yes, please.
> Disregarding some reports is not a good way long term.
Please try this patch.
---8<---
Subject: netlink: Annotate nlk cb_mutex by protocol
Currently all occurences of nlk->cb_mutex are annotated by lockdep
as a single class. This causes a false lcokdep cycle involving
genl and crypto_user.
This patch fixes it by dividing cb_mutex into individual classes
based on the netlink protocol. As genl and crypto_user do not
use the same netlink protocol this breaks the false dependency
loop.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your
net-next tree. A couple of new features for nf_tables, and unsorted
cleanups and incremental updates for the Netfilter tree. More
specifically, they are:
1) Allow to check for TCP option presence via nft_exthdr, patch
from Phil Sutter.
2) Add symmetric hash support to nft_hash, from Laura Garcia Liebana.
3) Use pr_cont() in ebt_log, from Joe Perches.
4) Remove some dead code in arp_tables reported via static analysis
tool, from Colin Ian King.
5) Consolidate nf_tables expression validation, from Liping Zhang.
6) Consolidate set lookup via nft_set_lookup().
7) Remove unnecessary rcu read lock side in bridge netfilter, from
Florian Westphal.
8) Remove unused variable in nf_reject_ipv4, from Tahee Yoo.
9) Pass nft_ctx struct to object initialization indirections, from
Florian Westphal.
10) Add code to integrate conntrack helper into nf_tables, also from
Florian.
11) Allow to check if interface index or name exists via
NFTA_FIB_F_PRESENT, from Phil Sutter.
12) Simplify resolve_normal_ct(), from Florian.
13) Use per-limit spinlock in nft_limit and xt_limit, from Liping Zhang.
14) Use rwlock in nft_set_rbtree set, also from Liping Zhang.
15) One patch to remove a useless printk at netns init path in ipvs,
and several patches to document IPVS knobs.
16) Use refcount_t for reference counter in the Netfilter/IPVS code,
from Elena Reshetova.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The helper->expect_class_max must be set to the total number of
expect_policy minus 1, since we will use the statement "if (class >
helper->expect_class_max)" to validate the CTA_EXPECT_CLASS attr in
ctnetlink_alloc_expect.
So for compatibility, set the helper->expect_class_max to the
NFCTH_POLICY_SET_NUM attr's value minus 1.
Also: it's invalid when the NFCTH_POLICY_SET_NUM attr's value is zero.
1. this will result "expect_policy = kzalloc(0, GFP_KERNEL);";
2. we cannot set the helper->expect_class_max to a proper value.
So if nla_get_be32(tb[NFCTH_POLICY_SET_NUM]) is zero, report -EINVAL to
the userspace.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAljMQsUACgkQ18tUv7Cl
QOt0FA//eSieOojEm9uJIxfydrJY2VkPgqg0xmIxLhcMmXi/d4kO9GpS9YeJJZi4
r5oClq1afhXVR83JmNDCvIYUNwf5/lluckuzXZEYlC3qUbXjQt/Nn/FHfrqW8qXV
HJy4PVwV+BHnfU6Y7p14zzucGPrMeWZQJO+7mRpboe1jcizHOMdcw+Aim7pr44y6
BI3QcLPtQGY4CnPOEkpDNuEWtO7iMME3bRJOJ2lOWz5smG0KAQo80OTHGXIe4OqR
d+gHhoHZ2LbZwdbs+rsjAIMFsFrgXqZmXYbQCZ9SEsr4ysj3PesHPdGFrKXCZCSM
0MjlEcznGl6ooPDD9tO5Bi047Xhq2TlUWF+FIVYOdFur+7oIcJcnJB7epoYEQ2d2
6RMvddeKmEgW5Y77myIb3G6jTnk7S8dMq5aAGSyUmKoVhybfw0PGFMbZ2gDEpaTG
HweeaPmR7J0e+MZBiShTBH2zulFcI1qG3kowu/oKccU9jGi/uA7vkXOSkaxkSzST
+vS30JwArNOj7OFqhGZbi1YzoK0ixxxXLD4DaEDKKm4mOt7g1Zmb0QgVnGSx1V6X
Or4Y4xTKn0vCt3e61O9dsBRApBCEVSBJMgYb9Z+LUSdQIKoUj+sQPMzY3iGTefcx
r7qiUddBZerQ0CZCsRxXk/otJawGCO9XFuSY4CksvlReTeyl1Tc=
=JY3W
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"We have a handful of stable fixes to fix kernel warnings and other
bugs that have been around for a while. We've also found a few other
reference counting bugs and memory leaks since the initial 4.11 pull.
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable"
* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
pNFS: return status from nfs4_pnfs_ds_connect
NFSv4.1 respect server's max size in CREATE_SESSION
NFS prevent double free in async nfs4_exchange_id
nfs: make nfs4_cb_sv_ops static
xprtrdma: Squelch kbuild sparse complaint
NFS: fix the fault nrequests decreasing for nfs_inode COPY
NFSv4: fix a reference leak caused WARNING messages
nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
New complaint from kbuild for 4.9.y:
net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
comparison expression (different type sizes)
verbs.c:
489 max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
I can't reproduce this running sparse here. Likewise, "make W=1
net/sunrpc/xprtrdma/verbs.o" never indicated any issue.
A little poking suggests that because the range of its values is
small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
smaller than the width of an unsigned integer.
Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The memory for netdev_priv is allocated using kzalloc in alloc_netdev
(or alloc_netdev_mq respectively) so there is no need to set it to 0
again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The name of the function might change in which these messages are printed.
It is therefore better to let the compiler handle the insertion of the
correct function name.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
refcount_t type and corresponding API (see include/linux/refcount.h)
should be used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a wdev changes network namespace, its wdev ID will get
reassigned since NETDEV_REGISTER is called again, in the new
network namespace. Avoid that by checking if it was already
assigned before, and document why we do that.
Reported-and-tested-by: Arend Van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit b369e7fd41 ("tcp: make TCP_INFO more consistent") moved
lock_sock_fast() earlier in tcp_get_info()
This has the minor effect that jiffies value being sampled at the
beginning of tcp_get_info() is more likely to be off by one, and we
report big tcpi_last_data_sent values (like 0xFFFFFFFF).
Since we lock the socket, fetching tcp_time_stamp right before
doing the jiffies_to_msecs() calls is enough to remove these
wrong values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrei reported a false alarm of lockdep at net/bridge/br_fdb.c:109,
this is because in Andrei's case, a spin_bug() was already triggered
before this, therefore the debug_locks is turned off, lockdep_is_held()
is no longer accurate after that. We should use lockdep_assert_held_once()
instead of lockdep_is_held() to respect debug_locks.
Fixes: 410b3d48f5 ("bridge: fdb: add proper lock checks in searching functions")
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we receive a BUSY packet for a call we think we've just completed, the
packet is handed off to the connection processor to deal with - but the
connection processor doesn't expect a BUSY packet and so flags a protocol
error.
Fix this by simply ignoring the BUSY packet for the moment.
The symptom of this may appear as a system call failing with EPROTO. This
may be triggered by pressing ctrl-C under some circumstances.
This comes about we abort calls due to interruption by a signal (which we
shouldn't do, but that's going to be a large fix and mostly in fs/afs/).
What happens is that we abort the call and may also abort follow up calls
too (this needs offloading somehoe). So we see a transmission of something
like the following sequence of packets:
DATA for call N
ABORT call N
DATA for call N+1
ABORT call N+1
in very quick succession on the same channel. However, the peer may have
deferred the processing of the ABORT from the call N to a background thread
and thus sees the DATA message from the call N+1 coming in before it has
cleared the channel. Thus it sends a BUSY packet[*].
[*] Note that some implementations (OpenAFS, for example) mark the BUSY
packet with one plus the callNumber of the call prior to call N.
Ordinarily, this would be call N, but there's no requirement for the
calls on a channel to be numbered strictly sequentially (the number is
required to increase).
This is wrong and means that the callNumber in the BUSY packet should
be ignored (it really ought to be N+1 since that's what it's in
response to).
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anycast routes have the RTF_ANYCAST flag set, but when dumping routes
for userspace the route type is not set to RTN_ANYCAST. Make it so.
Fixes: 58c4fb86ea ("[IPV6]: Flag RTF_ANYCAST for anycast routes")
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.
After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine is not monotonically increasing,
anymore.
Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.
Remove the per-destination timestamp cache and all related code
paths.
Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
replace comma to semi colons in tcp_westwood_info().
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alive tracking of nexthops can account for a link twice if the carrier
goes down followed by an admin down of the same link rendering multipath
routes useless. This is similar to 79099aab38 for UNREGISTER events and
DOWN events.
Fix by tracking number of alive nexthops in mpls_ifdown similar to the
logic in mpls_ifup. Checking the flags per nexthop once after all events
have been processed is simpler than trying to maintian a running count
through all event combinations.
Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
per a comment from checkpatch:
WARNING: Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR>
Fixes: c89359a42e ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:
1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
machines
The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
There appear to be two underlying bugs causing these effects:
- The first issue causes long delays when the rate is slow and no
delay is configured (e.g., "rate 512kbit"). This is because SKBs are
not orphaned when no delay is configured, so orphaning does not
occur until *after* the rate-induced delay has been applied. For
this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
dramatically increases the measured bandwidth.
- The second issue is that rate-induced delays are not correctly
applied, allowing SKB delays to occur in parallel. The indended
approach is to compute the delay for an SKB and to add this delay to
the end of the current queue. However, the code does not detect
existing SKBs in the queue due to improperly testing sch->q.qlen,
which is nonzero even when packets exist only in the
rbtree. Consequently, new SKBs do not wait for the current queue to
empty. When packet delays vary significantly (e.g., if packet sizes
are different), then this also causes unintended reordering.
I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.
Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BATADV_PRINT_VID is not free of of possible side-effects. This can be
avoided when the the macro is converted to a simple inline function.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
An argument of a macro should not be evaluated multiple times. Otherwise
embedded operations in these arguments will be executed multiple times.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is not necessary to disable these code sections in case other kernel
features are disabled. Instead the IS_ENABLED tests can be added directly
in the code and the compiler can remove the unnecessary code parts during
its optimization run.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
- Keep fragments equally sized, avoids some problems with too small fragments,
by Sven Eckelmann
- Initialize gateway class correctly when BATMAN V is compiled in,
by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAljKiocWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSYaD/4o3vSRQbiu95rzaPZbZ8cfFRYz
FYS7b/SLV8Q8uxzCo05jimnYB9qGD8bZZlAPNhNZLqkSsTXc8vqfwPnyB9ytasrp
A8jefoMlO0A5w3AQxBErzg3wrYr5Gz6lqkgqZTv06Sb6xa7F/OEo9pMjdbCBX15D
G9y1FZdLZVhdVE8r9yUohbfcnZ4KkfBlViEXnjAcJoV2x4OOVGQjDdBGpaEQnwyk
wyGEFuIO2oWW6AlQXCbXomTUZ6bAzJ43lPbRDWTVdrVJg2RV65F4MsNDfLpXVBQV
GvmB33WVTzphQshxvhnjs2uWglnj8KGVcvpIp7GuRfGgpFJh9tbS2B0bITXCVxWU
+HkNja6QwObwhEvAWfINaTT9RR1kxWAQk6YB9VEq3RJoHufP49felOZ877oAkK8T
7CwazIe4rOhA+ISLUcUFk5W9B46DusPRyfhzrDZ6EM0P/ppGWpKAioawltLqX2P3
kBmIW8kB6YRHnTeCHJZTjdPdQkYeIAYVunnhLvRrmTbf0oyVh3+UnwgAVquwJj/5
WsEqNPsP0B++nsWmT+MPVOKj/JwcmtKfaDY2sor72aN6TZxq7MF9VGdcZiOu9neB
zILNcb0oIvUahb2SO2u/gH2E/4am+ED5StQmNxHdJf1A+iCYGXEy/yETgeIm51/S
TaHd1aH3VwmCR0sAKw==
=zGdm
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20170316' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- Keep fragments equally sized, avoids some problems with too small fragments,
by Sven Eckelmann
- Initialize gateway class correctly when BATMAN V is compiled in,
by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The code introduced by commit 2ccccf5fb4 ("net_sched: update
hierarchical backlog too") only sets prev_backlog in fq_codel_dequeue()
but not using that anywhere, remove that setting.
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a case for OVS_TUNNEL_KEY_ATTR_PAD to the switch statement
in ip_tun_from_nlattr in order to prevent the default case
returning an error.
Fixes: b46f6ded90 ("libnl: nla_put_be64(): align on a 64-bit area")
Signed-off-by: Kris Murphy <kriskend@linux.vnet.ibm.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit c3852ef7f2 ("ipv4: fib: Replay events when registering FIB
notifier") we dumped the FIB tables and replayed the events to the
passed notification block.
However, we merely sent a RULE_ADD notification in case custom rules
were in use. As explained in previous patches, this approach won't work
anymore. Instead, we should notify the caller about all the FIB rules
and let it act accordingly.
Upon registration to the FIB notification chain, replay a RULE_ADD
notification for each programmed FIB rule, custom or not. The integrity
of the dump is ensured by the mechanism introduced in the above
mentioned commit.
Prevent regressions by making sure current listeners correctly sanitize
the notified rules.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a FIB rule is added or removed, a notification is sent in the
FIB notification chain. However, listeners don't have a way to tell
which rule was added or removed.
This is problematic as we would like to give listeners the ability to
decide which action to execute based on the notified rule. Specifically,
offloading drivers should be able to determine if they support the
reflection of the notified FIB rule and flush their LPM tables in case
they don't.
Do that by adding a notifier info to these notifications and embed the
common FIB rule struct in it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when non-default (custom) FIB rules are used, devices capable
of layer 3 offloading flush their tables and let the kernel do the
forwarding instead.
When these devices' drivers are loaded they register to the FIB
notification chain, which lets them know about the existence of any
custom FIB rules. This is done by sending a RULE_ADD notification based
on the value of 'net->ipv4.fib_has_custom_rules'.
This approach is problematic when VRF offload is taken into account, as
upon the creation of the first VRF netdev, a l3mdev rule is programmed
to direct skbs to the VRF's table.
Instead of merely reading the above value and sending a single RULE_ADD
notification, we should iterate over all the FIB rules and send a
detailed notification for each, thereby allowing offloading drivers to
sanitize the rules they don't support and potentially flush their
tables.
While l3mdev rules are uniquely marked, the default rules are not.
Therefore, when they are being notified they might invoke offloading
drivers to unnecessarily flush their tables.
Solve this by adding an helper to check if a FIB rule is a default rule.
Namely, its selector should match all packets and its action should
point to the local, main or default tables.
As noted by David Ahern, uniquely marking the default rules is
insufficient. When using VRFs, it's common to avoid false hits by moving
the rule for the local table to just before the main table:
Default configuration:
$ ip rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
Common configuration with VRFs:
$ ip rule show
1000: from all lookup [l3mdev-table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At most it is used for debugging purpose, but I don't think
it is even useful for debugging, just remove it.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Use setup_timer() and setup_deferrable_timer() to set the data and
function timer fields. It makes the code cleaner and will allow for
easier change of the timer struct internals.
Signed-off-by: Ondřej Lysoněk <ondrej.lysonek@seznam.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <linux-wireless@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.
To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.
Cc: stable@vger.kernel.org
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()
TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
sk_common_release() at close time, thus leaking one (order-3) page.
iSCSI is using such sockets.
Fixes: 5640f76858 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a DSA switch driver cannot program an ageing time value due to it
being out-of-range, switchdev will raise a stack trace before failing.
To fix this, add ageing_time_min and ageing_time_max members to the
dsa_switch in order for the switch drivers to optionally specify their
supported ageing time limits.
The DSA core will now check for provided ageing time limits and return
-ERANGE from the switchdev prepare phase if the value is out-of-range.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ageing time is defined as unsigned int, so make
dsa_fastest_ageing_time return an unsigned int instead of int.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.
This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.
Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to allow for support of multiple hardware offload type
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure. This results in us being able to pass
values from 1 to 255 with all being treated the same. On retreiving the
value it is returned as 1 for anything 1 or greater being set.
With this change we are currently adding limited bounds checking by
defining an enum and using those values to limit the reported hardware
offloads.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree, a
rather large batch of fixes targeted to nf_tables, conntrack and bridge
netfilter. More specifically, they are:
1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
From Florian Westphal.
2) SCTP protocol tracker assumes that ICMP error messages contain the
checksum field, what results in packet drops. From Ying Xue.
3) Fix inconsistent handling of AH traffic from nf_tables.
4) Fix new bitmap set representation with big endian. Fix mismatches in
nf_tables due to incorrect big endian handling too. Both patches
from Liping Zhang.
5) Bridge netfilter doesn't honor maximum fragment size field, cap to
largest fragment seen. From Florian Westphal.
6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
bits are now used to store the ctinfo. From Steven Rostedt.
7) Fix element comments with the bitmap set type. Revert the flush
field in the nft_set_iter structure, not required anymore after
fixing up element comments.
8) Missing error on invalid conntrack direction from nft_ct, also from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.
Fixes: 6b26ba3a7d ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmgenet.c
net/core/sock.c
Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.
Signed-off-by: David S. Miller <davem@davemloft.net>
We should jump to invoke __nft_ct_set_destroy() instead of just
return error.
Fixes: edee4f1e92 ("netfilter: nft_ct: add zone id set support")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
from Steffen Klassert.
2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
Alexei Starovoitov.
3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
Michal Schmidt.
4) We can get a divide by zero when TCP socket are morphed into
listening state, fix from Eric Dumazet.
5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
skb_complete_tx_timestamp(). From Eric Dumazet.
6) Use after free in dccp_feat_activate_values(), also from Eric
Dumazet.
7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
Jarod Wilson.
8) Fix use after free in vrf_xmit(), from David Ahern.
9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
Alexey Kodanev.
10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.
11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.
12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.
13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.
14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.
15) Fix routing redirect races in dccp and tcp, basically the ICMP
handler can't modify the socket's cached route in it's locked by the
user at this moment. From Jon Maxwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
qed: Enable iSCSI Out-of-Order
qed: Correct out-of-bound access in OOO history
qed: Fix interrupt flags on Rx LL2
qed: Free previous connections when releasing iSCSI
qed: Fix mapping leak on LL2 rx flow
qed: Prevent creation of too-big u32-chains
qed: Align CIDs according to DORQ requirement
mlxsw: reg: Fix SPVMLR max record count
mlxsw: reg: Fix SPVM max record count
net: Resend IGMP memberships upon peer notification.
dccp: fix memory leak during tear-down of unsuccessful connection request
tun: fix premature POLLOUT notification on tun devices
dccp/tcp: fix routing redirect race
ucc/hdlc: fix two little issue
vxlan: fix ovs support
net: use net->count to check whether a netns is alive or not
bridge: drop netfilter fake rtable unconditionally
ipv6: avoid write to a possibly cloned skb
net: wimax/i2400m: fix NULL-deref at probe
isdn/gigaset: fix NULL-deref at probe
...
When we notify peers of potential changes, it's also good to update
IGMP memberships. For example, during VM migration, updating IGMP
memberships will redirect existing multicast streams to the VM at the
new location.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
silences the below warning:
net/core/lwtunnel.c: In function ‘lwtunnel_valid_encap_type_attr’:
net/core/lwtunnel.c:165:17: warning: variable ‘nla’ set but not used
[-Wunused-but-set-variable]
Fixes: 9ed59592e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When some errors occur, the scatter/gather list mapped to DMA addresses
should be handled.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function rds_ib_map_fmr is used only in the ib_fmr.c
file. As such, the static type is added to limit it in this file.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ib_dealloc_fmr will never be called. As such, it should
be removed.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a memory leak, which happens if the connection request
is not fulfilled between parsing the DCCP options and handling the SYN
(because e.g. the backlog is full), because we forgot to free the
list of ack vectors.
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:
#8 [] page_fault at ffffffff8163e648
[exception RIP: __tcp_ack_snd_check+74]
.
.
#9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8
Of course it may happen with other NIC drivers as well.
It's found the freed dst_entry here:
224 static bool tcp_in_quickack_mode(struct sock *sk)↩
225 {↩
226 ▹ const struct inet_connection_sock *icsk = inet_csk(sk);↩
227 ▹ const struct dst_entry *dst = __sk_dst_get(sk);↩
228 ↩
229 ▹ return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
230 ▹ ▹ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
231 }↩
But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.
All the vmcores showed 2 significant clues:
- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.
- All vmcores showed a postitive LockDroppedIcmps value, e.g:
LockDroppedIcmps 267
A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:
do_redirect()->__sk_dst_check()-> dst_release().
Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.
To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.
The dccp/IPv6 code is very similar in this respect, so fixing it there too.
As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().
Fixes: ceb3320610 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous idea was to check whether a net namespace is in
net_exit_list or not. It doesn't work, because net->exit_list is used in
__register_pernet_operations and __unregister_pernet_operations where
all namespaces are added to a temporary list to make cleanup in a error
case, so list_empty(&net->exit_list) always returns false.
Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Fixes: 002d8a1a6c ("net: skip genenerating uevents for network namespaces that are exiting")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey reported this kernel warning:
WARNING: CPU: 0 PID: 4114 at kernel/sched/core.c:7737 __might_sleep+0x149/0x1a0
do not call blocking ops when !TASK_RUNNING; state=1 set at
[<ffffffff813fcb22>] prepare_to_wait+0x182/0x530
The deeply nested alloc_skb is a problem.
Diagnosis: nesting is wrong. It makes zero sense. Fix it and the
implicit task state change problem automagically goes away.
alloc_skb() does not need to be in the "while" loop.
alloc_skb() does not need to be in the {prepare_to_wait/add_wait_queue ...
finish_wait/remove_wait_queue} block.
I claim that:
- alloc_tx() should only perform the "wait_for_decent_tx_drain" part
- alloc_skb() ought to be done directly in vcc_sendmsg
- alloc_skb() failure can be handled gracefully in vcc_sendmsg
- alloc_skb() may use a (m->msg_flags & MSG_DONTWAIT) dependent
GFP_{KERNEL / ATOMIC} flag
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-and-Tested-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).
Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.
In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andreas reports kernel oops during rmmod of the br_netfilter module.
Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
Problem is that br_netfilter has the nasty concept of adding a fake
rtable to skb->dst; this happens in a br_netfilter prerouting hook.
A second hook (in bridge LOCAL_IN) is supposed to remove these again
before the skb is handed up the stack.
However, on module unload hooks get unregistered which means an
skb could traverse the prerouting hook that attaches the fake_rtable,
while the 'fake rtable remove' hook gets removed from the hooklist
immediately after.
Fixes: 34666d467c ("netfilter: bridge: move br_netfilter out of the core")
Reported-by: Andreas Karis <akaris@redhat.com>
Debugged-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_fragment, in case skb has a fraglist, checks if the
skb is cloned. If it is, it will move to the 'slow path' and allocates
new skbs for each fragment.
However, right before entering the slowpath loop, it updates the
nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
to account for the fragment header that will be inserted in the new
ipv6-fragment skbs.
In case original skb is cloned this munges nexthdr value of another
skb. Avoid this by doing the nexthdr update for each of the new fragment
skbs separately.
This was observed with tcpdump on a bridge device where netfilter ipv6
reassembly is active: tcpdump shows malformed fragment headers as
the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Andreas Karis <akaris@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2759647247 ("ipv6: fix ECMP route replacement") introduced a
loop that removes all siblings of an ECMP route that is being
replaced. However, this loop doesn't stop when it has replaced
siblings, and keeps removing other routes with a higher metric.
We also end up triggering the WARN_ON after the loop, because after
this nsiblings < 0.
Instead, stop the loop when we have taken care of all routes with the
same metric as the route being replaced.
Reproducer:
===========
#!/bin/sh
ip netns add ns1
ip netns add ns2
ip -net ns1 link set lo up
for x in 0 1 2 ; do
ip link add veth$x netns ns2 type veth peer name eth$x netns ns1
ip -net ns1 link set eth$x up
ip -net ns2 link set veth$x up
done
ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \
nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2
ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256
ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048
echo "before replace, 3 routes"
ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
echo
ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \
nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2
echo "after replace, only 2 routes, metric 2048 is gone"
ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karel Rericha reported that in his test case, ICMP packets going through
boxes had normally about 5ms latency. But when running nft, actually
listing the sets with interval flags, latency would go up to 30-100ms.
This was observed when router throughput is from 600Mbps to 2Gbps.
This is because we use a single global spinlock to protect the whole
rbtree sets, so "dumping sets" will race with the "key lookup" inevitably.
But actually they are all _readers_, so it's ok to convert the spinlock
to rwlock to avoid competition between them. Also use per-set rwlock since
each set is independent.
Reported-by: Karel Rericha <karel@unitednetworks.cz>
Tested-by: Karel Rericha <karel@unitednetworks.cz>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The limit token is independent between each rules, so there's no
need to use a global spinlock.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
also mark init_conntrack noinline, in most cases resolve_normal_ct will
find an existing conntrack entry.
text data bss dec hex filename
16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o
16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit 1f48ff6c53.
This patch is not required anymore now that we keep a dummy list of
set elements in the bitmap set implementation, so revert this before
we forget this code has no clients.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of the actual interface index or name, set destination register
to just 1 or 0 depending on whether the lookup succeeded or not if
NFTA_FIB_F_PRESENT was set in userspace.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
this allows to assign connection tracking helpers to
connections via nft objref infrastructure.
The idea is to first specifiy a helper object:
table ip filter {
ct helper some-name {
type "ftp"
protocol tcp
l3proto ip
}
}
and then assign it via
nft add ... ct helper set "some-name"
helper assignment works for new conntracks only as we cannot expand the
conntrack extension area once it has been committed to the main conntrack
table.
ipv4 and ipv6 protocols are tracked stored separately so
we can also handle families that observe both ipv4 and ipv6 traffic.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
this is needed by the upcoming ct helper object type --
we'd like to be able use the table family (ip, ip6, inet) to figure
out which helper has to be requested.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Element comments may come without any prior set flag, so we have to keep
a list of dummy struct nft_set_ext to keep this information around. This
is only useful for set dumps to userspace. From the packet path, this
set type relies on the bitmap representation. This patch simplifies the
logic since we don't need to allocate the dummy nft_set_ext structure
anymore on the fly at the cost of increasing memory consumption because
of the list of dummy struct nft_set_ext.
Fixes: 665153ff57 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.
I triggered this on a 32bit machine with the following error:
BUG: unable to handle kernel NULL pointer dereference at 00000af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 0000000031962001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
OK ]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
Call Trace:
<SOFTIRQ>
? ipv6_skip_exthdr+0xac/0xcb
ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
nf_hook_slow+0x22/0xc7
nf_hook+0x9a/0xad [ipv6]
? ip6t_do_table+0x356/0x379 [ip6_tables]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
ip6_output+0xee/0x107 [ipv6]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
dst_output+0x36/0x4d [ipv6]
NF_HOOK.constprop.37+0xb2/0xba [ipv6]
? icmp6_dst_alloc+0x2c/0xfd [ipv6]
? local_bh_enable+0x14/0x14 [ipv6]
mld_sendpack+0x1c5/0x281 [ipv6]
? mark_held_locks+0x40/0x5c
mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
call_timer_fn+0x135/0x283
? detach_if_pending+0x55/0x55
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
__run_timers+0x111/0x14b
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
run_timer_softirq+0x1c/0x36
__do_softirq+0x185/0x37c
? test_ti_thread_flag.constprop.19+0xd/0xd
do_softirq_own_stack+0x22/0x28
</SOFTIRQ>
irq_exit+0x5a/0xa4
smp_apic_timer_interrupt+0x2a/0x34
apic_timer_interrupt+0x37/0x3c
By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
alignment as all cache line sizes are at least 8 bytes or more.
Fixes: a9e419dc7b ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>