Граф коммитов

7130 Коммитов

Автор SHA1 Сообщение Дата
Renato Westphal e2e8009ff7 tcp: remove improper preemption check in tcp_xmit_probe_skb()
Commit e520af48c7 introduced the following bug when setting the
TCP_REPAIR sockoption:

[ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
[ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
[ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
[ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
[ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
[ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
[ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
[ 2860.657065] Call Trace:
[ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
[ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
[ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
[ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
[ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
[ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
[ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
[ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
[ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
[ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75

Since tcp_xmit_probe_skb() can be called from process context, use
NET_INC_STATS() instead of NET_INC_STATS_BH().

Fixes: e520af48c7 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
Signed-off-by: Renato Westphal <renatow@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:29:26 -07:00
David S. Miller 36a28b2116 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains four Netfilter fixes for net, they are:

1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6.

2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking
   --accept-local, from Xin Long.

3) Wait for RCU grace period after dropping the pending packets in the
   nfqueue, from Florian Westphal.

4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:26:17 -07:00
Arad, Ronen b1974ed05e netlink: Rightsize IFLA_AF_SPEC size calculation
if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:15:20 -07:00
Yuchung Cheng 4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng 659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng 77c631273d tcp: add tcp_tsopt_ecr_before helper
a helper to prepare the main RACK patch

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:45 -07:00
Yuchung Cheng af82f4e848 tcp: remove tcp_mark_lost_retrans()
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:44 -07:00
Yuchung Cheng f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Yuchung Cheng 9e45a3e36b tcp: apply Kern's check on RTTs used for congestion control
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:41 -07:00
David S. Miller 26440c835f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-20 06:08:27 -07:00
Steffen Klassert ca064bd893 xfrm: Fix pmtu discovery for local generated packets.
Commit 044a832a77 ("xfrm: Fix local error reporting crash
with interfamily tunnels") moved the setting of skb->protocol
behind the last access of the inner mode family to fix an
interfamily crash. Unfortunately now skb->protocol might not
be set at all, so we fail dispatch to the inner address family.
As a reault, the local error handler is not called and the
mtu value is not reported back to userspace.

We fix this by setting skb->protocol on message size errors
before we call xfrm_local_error.

Fixes: 044a832a77 ("xfrm: Fix local error reporting crash with interfamily tunnels")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-19 10:30:05 +02:00
David S. Miller 371f1c7e0d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree. Most relevantly, updates for the nfnetlink_log to integrate with
conntrack, fixes for cttimeout and improvements for nf_queue core, they are:

1) Remove useless ifdef around static inline function in IPVS, from
   Eric W. Biederman.

2) Simplify the conntrack support for nfnetlink_queue: Merge
   nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
   to nfnetlink_queue.c

3) Use y2038 safe timestamp from nfnetlink_queue.

4) Get rid of dead function definition in nf_conntrack, from Flavio
   Leitner.

5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
   This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
   controls enabling both nfqueue and nflog integration with conntrack.
   The userspace application can request this via NFULNL_CFG_F_CONNTRACK
   configuration flag.

6) Remove unused netns variables in IPVS, from Eric W. Biederman and
   Simon Horman.

7) Don't put back the refcount on the cttimeout object from xt_CT on success.

8) Fix crash on cttimeout policy object removal. We have to flush out
   the cttimeout extension area of the conntrack not to refer to an unexisting
   object that was just removed.

9) Make sure rcu_callback completion before removing nfnetlink_cttimeout
   module removal.

10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
    nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.

11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
    requested. Again from Ken-ichirou MATSUZAWA.

12) Don't use pointer to previous hook when reinjecting traffic via
    nf_queue with NF_REPEAT verdict since it may be already gone. This
    also avoids a deadloop if the userspace application keeps returning
    NF_REPEAT.

13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.

14) Consolidate logger instance existence check in nfulnl_recv_config().

15) Fix broken atomicity when applying configuration updates to logger
    instances in nfnetlink_log.

16) Get rid of the .owner attribute in our hook object. We don't need
    this anymore since we're dropping pending packets that have escaped
    from the kernel when unremoving the hook. Patch from Florian Westphal.

17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
    assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.

18) Use static inline function instead of macros to define NF_HOOK() and
    NF_HOOK_COND() when no netfilter support in on, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:48:34 -07:00
Eric Dumazet dc6ef6be52 tcp: do not set queue_mapping on SYNACK
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:26:02 -07:00
Li RongQing 26fb342c73 ipconfig: send Client-identifier in DHCP requests
A dhcp server may provide parameters to a client from a pool of IP
addresses and using a shared rootfs, or provide a specific set of
parameters for a specific client, usually using the MAC address to
identify each client individually. The dhcp protocol also specifies
a client-id field which can be used to determine the correct
parameters to supply when no MAC address is available. There is
currently no way to tell the kernel to supply a specific client-id,
only the userspace dhcp clients support this feature, but this can
not be used when the network is needed before userspace is available
such as when the root filesystem is on NFS.

This patch is to be able to do something like "ip=dhcp,client_id_type,
client_id_value", as a kernel parameter to enable the kernel to
identify itself to the server.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 19:23:52 -07:00
Pablo Neira Ayuso f0a0a978b6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
This merge resolves conflicts with 75aec9df3a ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Conflicts:
	net/bridge/br_netfilter_hooks.c
2015-10-17 14:28:03 +02:00
Ian Morris c8d71d08aa netfilter: ipv4: whitespace around operators
This patch cleanses whitespace around arithmetical operators.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:23 +02:00
Ian Morris 24cebe3f29 netfilter: ipv4: code indentation
Use tabs instead of spaces to indent code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:15 +02:00
Ian Morris 6c28255b46 netfilter: ipv4: function definition layout
Use tabs instead of spaces to indent second line of parameters in
function definitions.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:10 +02:00
Ian Morris 27951a0168 netfilter: ipv4: ternary operator layout
Correct whitespace layout of ternary operators in the netfilter-ipv4
code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:04 +02:00
Ian Morris 19f0a60201 netfilter: ipv4: label placement
Whitespace cleansing: Labels should not be indented.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:17:41 +02:00
Florian Westphal 2ffbceb2b0 netfilter: remove hook owner refcounting
since commit 8405a8fff3 ("netfilter: nf_qeueue: Drop queue entries on
nf_unregister_hook") all pending queued entries are discarded.

So we can simply remove all of the owner handling -- when module is
removed it also needs to unregister all its hooks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:21:39 +02:00
David Ahern 51161aa98d net: Fix suspicious RCU usage in fib_rebalance
This command:
  ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2

generated this suspicious RCU usage message:

[ 63.249262]
[ 63.249939] ===============================
[ 63.251571] [ INFO: suspicious RCU usage. ]
[ 63.253250] 4.3.0-rc3+ #298 Not tainted
[ 63.254724] -------------------------------
[ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
[ 63.259450]
[ 63.259450] other info that might help us debug this:
[ 63.259450]
[ 63.262297]
[ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
[ 63.264647] 1 lock held by ip/2870:
[ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
[ 63.268858]
[ 63.268858] stack backtrace:
[ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
[ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
[ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
[ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
[ 63.283177] Call Trace:
[ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
[ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
[ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
[ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
[ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
[ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
[ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
[ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
[ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
[ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
[ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
[ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
[ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
[ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
[ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
[ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
[ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
[ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
[ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
[ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
[ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
[ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
[ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
[ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
[ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
[ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f

It looks like all of the code paths to fib_rebalance are under rtnl.

Fixes: 0e884c78ee ("ipv4: L3 hash-based multipath")
Cc: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:57:55 -07:00
Eric Dumazet ebb516af60 tcp/dccp: fix race at listener dismantle phase
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()

We need to test if listener is still active before queueing
a child in inet_csk_reqsk_queue_add()

Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:19 -07:00
Eric Dumazet f03f2e154f tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper
Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.

Fixes: 4bdc3d6614 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:18 -07:00
Eric Dumazet ef84d8ce5a Revert "inet: fix double request socket freeing"
This reverts commit c69736696c.

At the time of above commit, tcp_req_err() and dccp_req_err()
were dead code, as SYN_RECV request sockets were not yet in ehash table.

Real bug was fixed later in a different commit.

We need to revert to not leak a refcount on request socket.

inet_csk_reqsk_queue_drop_and_put() will be added
in following commit to make clean inet_csk_reqsk_queue_drop()
does not release the reference owned by caller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:17 -07:00
Eric Dumazet f985c65c90 tcp: avoid spurious SYN flood detection at listen() time
At listen() time, there is a small window where listener is visible with
a zero backlog, triggering a spurious "Possible SYN flooding on port"
message.

Nothing prevents us from setting the correct backlog.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14 19:06:32 -07:00
Eric Dumazet c2f34a65a6 tcp/dccp: fix potential NULL deref in __inet_inherit_port()
As we no longer hold listener lock in fast path, it is possible that a
child is created right after listener freed its bound port, if a close()
is done while incoming packets are processed.

__inet_inherit_port() must detect this and return an error,
so that caller can free the child earlier.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14 19:06:31 -07:00
Paolo Abeni 02a6d6136f Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"
Revert the commit e2ca690b65 ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:

    The IPv4 source address of an ICMP redirect should be the address
    that the end-host used when making its next-hop routing decision.

while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14 06:01:07 -07:00
Eric Dumazet 4bdc3d6614 tcp/dccp: fix behavior of stale SYN_RECV request sockets
When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
become stale, meaning 3WHS can not complete.

But current behavior is wrong :
incoming packets finding such stale sockets are dropped.

We need instead to cleanup the request socket and perform another
lookup :
- Incoming ACK will give a RST answer,
- SYN rtx might find another listener if available.
- We expedite cleanup of request sockets and old listener socket.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13 18:26:34 -07:00
Eric W. Biederman 19bcf9f203 ipv4: Pass struct net into ip_defrag and ip_check_defrag
The function ip_defrag is called on both the input and the output
paths of the networking stack.  In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.

So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:44:16 -07:00
Eric W. Biederman 37fcbab61b ipv4: Only compute net once in ip_call_ra_chain
ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.

Compute the network namespace and store it in a variable to make the
code clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:44:14 -07:00
Paolo Abeni e2ca690b65 ipv4/icmp: redirect messages can use the ingress daddr as source
This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.

The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:

Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.

If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:38:02 -07:00
Eric Dumazet ed53d0ab76 net: shrink struct sock and request_sock by 8 bytes
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:28:22 -07:00
Eric Dumazet 70da268b56 net: SO_INCOMING_CPU setsockopt() support
SO_INCOMING_CPU as added in commit 2c8c56e15d was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()

This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.

This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.

Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:28:20 -07:00
lucien cc4998febd netfilter: ipt_rpfilter: remove the nh_scope test in rpfilter_lookup_reverse
--accept-local  option works for res.type == RTN_LOCAL, which should be
from the local table, but there, the fib_info's nh->nh_scope =
RT_SCOPE_NOWHERE ( > RT_SCOPE_HOST). in fib_create_info().

	if (cfg->fc_scope == RT_SCOPE_HOST) {
		struct fib_nh *nh = fi->fib_nh;

		/* Local address is added. */
		if (nhs != 1 || nh->nh_gw)
			goto err_inval;
		nh->nh_scope = RT_SCOPE_NOWHERE;   <===
		nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
		err = -ENODEV;
		if (!nh->nh_dev)
			goto failure;

but in our rpfilter_lookup_reverse():

	if (dev_match || flags & XT_RPFILTER_LOOSE)
		return FIB_RES_NH(res).nh_scope <= RT_SCOPE_HOST;

if nh->nh_scope > RT_SCOPE_HOST, it will fail. --accept-local option
will never be passed.

it seems the test is bogus and can be removed to fix this issue.

	if (dev_match || flags & XT_RPFILTER_LOOSE)
		return FIB_RES_NH(res).nh_scope <= RT_SCOPE_HOST;

ipv6 does not have this issue.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:27:48 +02:00
Richard Sailer 7533ce3055 tcp: change type of alive from int to bool
The alive parameter of tcp_orphan_retries, indicates
whether the connection is assumed alive or not.
In the function and all places calling it is used as a boolean value.

Therefore this changes the type of alive to bool in the function
definition and all calling locations.

Since tcp_orphan_tries is a tcp_timer.c local function no change in
any other file or header is necessary.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 05:15:03 -07:00
Eric Dumazet 6bcfd7f8c2 tcp: fix RFS vs lockless listeners
Before recent TCP listener patches, we were updating listener
sk->sk_rxhash before the cloning of master socket.

children sk_rxhash was therefore correct after the normal 3WHS.

But with lockless listener, we no longer dirty/change listener sk_rxhash
as it would be racy.

We need to correctly update the child sk_rxhash, otherwise first data
packet wont hit correct cpu if RFS is used.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11 05:33:15 -07:00
David Ahern 28335a7445 net: Do not drop to make_route if oif is l3mdev
Commit deaa0a6a93 ("net: Lookup actual route when oif is VRF device")
exposed a bug in __ip_route_output_key_hash for VRF devices: on FIB lookup
failure if the oif is specified the current logic drops to make_route on
the assumption that the route tables are wrong. For VRF/L3 master devices
this leads to wrong dst entries and route lookups. For example:
    $ ip route ls table vrf-red
    unreachable default
    broadcast 10.2.1.0 dev eth1  proto kernel  scope link  src 10.2.1.2
    10.2.1.0/24 dev eth1  proto kernel  scope link  src 10.2.1.2
    local 10.2.1.2 dev eth1  proto kernel  scope host  src 10.2.1.2
    broadcast 10.2.1.255 dev eth1  proto kernel  scope link  src 10.2.1.2

    $ ip route get oif vrf-red 1.1.1.1
    1.1.1.1 dev vrf-red  src 10.0.0.2
        cache

With this patch:
    $  ip route get oif vrf-red 1.1.1.1
    RTNETLINK answers: No route to host

which is the correct response based on the default route

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 05:18:47 -07:00
Eric W. Biederman ede2059dba dst: Pass net into dst->output
The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:27:03 -07:00
Eric W. Biederman 33224b16ff ipv4, ipv6: Pass net into ip_local_out and ip6_local_out
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:27:02 -07:00
Eric W. Biederman cf91a99daa ipv4, ipv6: Pass net into __ip_local_out and __ip6_local_out
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:27:02 -07:00
Eric W. Biederman 77589ce0f8 ipv4: Cache net in ip_build_and_send_pkt and ip_queue_xmit
Compute net and store it in a variable in the functions
ip_build_and_send_pkt and ip_queue_xmit so that it does not need to be
recomputed next time it is needed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:59 -07:00
Eric W. Biederman f859b0f662 ipv4: Cache net in iptunnel_xmit
Store net in a variable in ip_tunnel_xmit so it does not need
to be recomputed when it is used again.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:59 -07:00
Eric W. Biederman e2cb77db08 ipv4: Merge ip_local_out and ip_local_out_sk
It is confusing and silly hiding a parameter so modify all of
the callers to pass in the appropriate socket or skb->sk if
no socket is known.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:57 -07:00
Eric W. Biederman b92dacd456 ipv4: Merge __ip_local_out and __ip_local_out_sk
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:57 -07:00
Eric W. Biederman 4ebdfba73c dst: Pass a sk into .local_out
For consistency with the other similar methods in the kernel pass a
struct sock into the dst_ops .local_out method.

Simplifying the socket passing case is needed a prequel to passing a
struct net reference into .local_out.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:55 -07:00
Eric W. Biederman 13206b6bff net: Pass net into dst_output and remove dst_output_okfn
Replace dst_output_okfn with dst_output

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:54 -07:00
Eric W. Biederman 850dcc4d4d ipv4: Fix ip_queue_xmit to pass sk into ip_local_out_sk
After a packet has been encapsulated by a tunnel we should use the
tunnel sockets local multicast loopback flag to control if the
encapsulated packet should be locally loopback back.

Pass sk into ip_local_out_sk so that in the rare case we are dealing
with a tunneled packet whose tunnel destination address is a multicast
address the kernel properly decides to loopback this packet.

In practice I don't think this matters as ip_queue_xmit is used by
tcp, l2tp and sctp none of which I am aware of uses ip level
multicasting as they are all point to point communications protocols.
Let's fix this before someone uses ip_queue_xmit for a tunnel protocol
that does use multicast.

Fixes: aad88724c9 ("ipv4: add a sock pointer to dst->output() path.")
Fixes: b0270e9101 ("ipv4: add a sock pointer to ip_queue_xmit()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:52 -07:00
Eric W. Biederman fd2874b3bb ipv4: Fix ip_local_out_sk by passing the sk into __ip_local_out_sk
In the rare case where sk != skb->sk ip_local_out_sk arranges
to call dst->output differently if the skb is queued or not.
This is a bug.

Fix this bug by passing the sk parameter of ip_local_out_sk through
from ip_local_out_sk to __ip_local_out_sk (skipping __ip_local_out).

Fixes: 7026b1ddb6 ("netfilter: Pass socket pointer down through okfn().")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:52 -07:00
Eric Dumazet acb4a6bfc8 tcp: ensure prior synack rtx behavior with small backlogs
Some applications use a listen() backlog of 1.

Prior kernels were silently enforcing a qlen_log of 4, so that we were
sending up to /proc/sys/net/ipv4/tcp_synack_retries SYNACK messages.

Fixes: ef547f2ac1 ("tcp: remove max_qlen_log")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 05:08:58 -07:00
Yuvaraja Mariappan 686a562449 net: ipv4: tcp.c Fixed an assignment coding style issue
Fixed an assignment coding style issue

Signed-off-by: Yuvaraja Mariappan <ymariappan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 05:01:04 -07:00
David Ahern deaa0a6a93 net: Lookup actual route when oif is VRF device
If the user specifies a VRF device in a get route query the custom route
pointing to the VRF device is returned:

    $ ip route ls table vrf-red
    unreachable default
    broadcast 10.2.1.0 dev eth1  proto kernel  scope link  src 10.2.1.2
    10.2.1.0/24 dev eth1  proto kernel  scope link  src 10.2.1.2
    local 10.2.1.2 dev eth1  proto kernel  scope host  src 10.2.1.2
    broadcast 10.2.1.255 dev eth1  proto kernel  scope link  src 10.2.1.2

    $ ip route get oif vrf-red 10.2.1.40
    10.2.1.40 dev vrf-red
        cache

Add the flags to skip the custom route and go directly to the FIB. With
this patch the actual route is returned:

    $ ip route get oif vrf-red 10.2.1.40
    10.2.1.40 dev eth1  src 10.2.1.2
        cache

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 04:31:16 -07:00
David Ahern bb191c3e87 net: Add l3mdev saddr lookup to raw_sendmsg
ping originated on box through a VRF device is showing up in tcpdump
without a source address:
    $ tcpdump -n -i vrf-blue
    08:58:33.311303 IP 0.0.0.0 > 10.2.2.254: ICMP echo request, id 2834, seq 1, length 64
    08:58:33.311562 IP 10.2.2.254 > 10.2.2.2: ICMP echo reply, id 2834, seq 1, length 64

Add the call to l3mdev_get_saddr to raw_sendmsg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 04:27:46 -07:00
David Ahern 8cbb512c92 net: Add source address lookup op for VRF
Add operation to l3mdev to lookup source address for a given flow.
Add support for the operation to VRF driver and convert existing
IPv4 hooks to use the new lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 04:27:44 -07:00
David Ahern 3ce58d8435 net: Refactor path selection in __ip_route_output_key_hash
VRF device needs the same path selection following lookup to set source
address. Rather than duplicating code, move existing code into a
function that is exported to modules.

Code move only; no functional change.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 04:27:44 -07:00
David Ahern 6e2895a8e3 net: Rename FLOWI_FLAG_VRFSRC to FLOWI_FLAG_L3MDEV_SRC
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07 04:27:42 -07:00
Peter Nørlund 0a837fe472 ipv4: Fix compilation errors in fib_rebalance
This fixes

net/built-in.o: In function `fib_rebalance':
fib_semantics.c:(.text+0x9df14): undefined reference to `__divdi3'

and

net/built-in.o: In function `fib_rebalance':
net/ipv4/fib_semantics.c:572: undefined reference to `__aeabi_ldivmod'

Fixes: 0e884c78ee ("ipv4: L3 hash-based multipath")

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 23:48:09 -07:00
Jiri Benc 181a4224ac ipv4: fix reply_dst leakage on arp reply
There are cases when the created metadata reply is not used. Ensure the
allocated memory is freed also in such cases.

Fixes: 63d008a4e9 ("ipv4: send arp replies to the correct tunnel")
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 04:05:15 -07:00
Eric Dumazet 2306c704ce inet: fix race in reqsk_queue_unlink()
reqsk_timer_handler() tests if icsk_accept_queue.listen_opt
is NULL at its beginning.

By the time it calls inet_csk_reqsk_queue_drop() and
reqsk_queue_unlink(), listener might have been closed and
inet_csk_listen_stop() had called reqsk_queue_yank_acceptq()
which sets icsk_accept_queue.listen_opt to NULL

We therefore need to correctly check listen_opt being NULL
after holding syn_wait_lock for proper synchronization.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Fixes: b357a364c5 ("inet: fix possible panic in reqsk_queue_unlink()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 04:04:09 -07:00
David S. Miller 40e106801e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next
Eric W. Biederman says:

====================
net: Pass net through ip fragmention

This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

This round focuses on passing net through ip fragmentation which we seem
to call from about everywhere.  That is the main ip output paths, the
bridge netfilter code, and openvswitch.  This has to happend at once
accross the tree as function pointers are involved.

First some prep work is done, then ipv4 and ipv6 are converted and then
temporary helper functions are removed.
====================

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 03:39:31 -07:00
Peter Nørlund 79a131592d ipv4: ICMP packet inspection for multipath
ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 03:00:04 -07:00
Peter Nørlund 0e884c78ee ipv4: L3 hash-based multipath
Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:59:21 -07:00
Eric Dumazet a1a5344ddb tcp: avoid two atomic ops for syncookies
inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.

These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:27 -07:00
Eric Dumazet 7656d842de tcp: fix fastopen races vs lockless listener
There are multiple races that need fixes :

1) skb_get() + queue skb + kfree_skb() is racy

An accept() can be done on another cpu, data consumed immediately.
tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
socket receive queue are private.

Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb

2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
for the same reasons.

3) We want to send the SYNACK before queueing child into accept queue,
otherwise we might reintroduce the ooo issue fixed in
commit 7c85af8810 ("tcp: avoid reorders for TFO passive connections")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:24 -07:00
Eric Dumazet e994b2f0fb tcp: do not lock listener to process SYN packets
Everything should now be ready to finally allow SYN
packets processing without holding listener lock.

Tested:

3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

Next bottleneck is the refcount taken on listener,
that could be avoided if we remove SLAB_DESTROY_BY_RCU
strict semantic for listeners, and use regular RCU.

    13.18%  [kernel]  [k] __inet_lookup_listener
     9.61%  [kernel]  [k] tcp_conn_request
     8.16%  [kernel]  [k] sha_transform
     5.30%  [kernel]  [k] inet_reqsk_alloc
     4.22%  [kernel]  [k] sock_put
     3.74%  [kernel]  [k] tcp_make_synack
     2.88%  [kernel]  [k] ipt_do_table
     2.56%  [kernel]  [k] memcpy_erms
     2.53%  [kernel]  [k] sock_wfree
     2.40%  [kernel]  [k] tcp_v4_rcv
     2.08%  [kernel]  [k] fib_table_lookup
     1.84%  [kernel]  [k] tcp_openreq_init_rwin

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:46 -07:00
Eric Dumazet 92d6f176fd tcp/dccp: add a reschedule point in inet_csk_listen_stop()
If a listener with thousands of children in accept queue
is dismantled, it can take a while to close all of them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:45 -07:00
Eric Dumazet ef547f2ac1 tcp: remove max_qlen_log
This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.

Also rounding to powers of two was not very friendly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:44 -07:00
Eric Dumazet 10cbc8f179 tcp/dccp: remove struct listen_sock
It is enough to check listener sk_state, no need for an extra
condition.

max_qlen_log can be moved into struct request_sock_queue

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet 079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet 2feda34192 tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument
This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet aa3a0c8ce6 tcp: get_openreq[46]() changes
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet 9cfd08601f tcp: remove BUG_ON() in tcp_check_req()
Once listener is lockless, its sk_state can change anytime.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:39 -07:00
Eric Dumazet ba8e275a45 tcp: cleanup tcp_v[46]_inbound_md5_hash()
We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.
Also add const attribute to the socket, as it might be the
unlocked listener for SYN packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:38 -07:00
Eric Dumazet 38cb52455c tcp: call sk_mark_napi_id() on the child, not the listener
This fixes a typo : We want to store the NAPI id on child socket.
Presumably nobody really uses busy polling, on short lived flows.

Fixes: 3d97379a67 ("tcp: move sk_mark_napi_id() at the right place")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:37 -07:00
Eric Dumazet 8d2675f1e4 tcp: move synflood_warned into struct request_sock_queue
long term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:37 -07:00
Eric Dumazet aac065c50a tcp: move qlen/young out of struct listen_sock
qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for upcoming lockless listener.

Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
Eric Dumazet fff1f3001c tcp: add a spinlock to protect struct request_sock_queue
struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock)

We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry with backlog notion
that the listener 'lock' carries.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
David S. Miller f6d3125fa3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/dsa/slave.c

net/dsa/slave.c simply had overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-02 07:21:25 -07:00
Pablo Neira Ayuso 6ece90f9a1 netfilter: fix Kconfig dependencies for nf_dup_ipv{4,6}
net/built-in.o: In function `nf_dup_ipv4': (.text+0xed24d): undefined reference to `nf_conntrack_untracked'
net/built-in.o: In function `nf_dup_ipv4': (.text+0xed267): undefined reference to `nf_conntrack_untracked'
net/built-in.o: In function `nf_dup_ipv6': (.text+0x158aef): undefined reference to `nf_conntrack_untracked'
net/built-in.o: In function `nf_dup_ipv6': (.text+0x158b09): undefined reference to `nf_conntrack_untracked'

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-01 00:19:54 +02:00
Eric W. Biederman 694869b3c5 ipv4: Pass struct net through ip_fragment
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-09-30 01:45:03 -05:00
David Ahern b84f787820 net: Initialize flow flags in input path
The fib_table_lookup tracepoint found 2 places where the flowi4_flags is
not initialized.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:52:32 -07:00
David S. Miller 4bf1b54f9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following pull request contains Netfilter/IPVS updates for net-next
containing 90 patches from Eric Biederman.

The main goal of this batch is to avoid recurrent lookups for the netns
pointer, that happens over and over again in our Netfilter/IPVS code. The idea
consists of passing netns pointer from the hook state to the relevant functions
and objects where this may be needed.

You can find more information on the IPVS updates from Simon Horman's commit
merge message:

c3456026ad ("Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next").

Exceptionally, this time, I'm not posting the patches again on netdev, Eric
already Cc'ed this mailing list in the original submission. If you need me to
make, just let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 21:46:21 -07:00
David Ahern 8e1ed7058b net: Replace calls to vrf_dev_get_rth
Replace calls to vrf_dev_get_rth with l3mdev_get_rtable.
The check on the flow flags is handled in the l3mdev operation.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 3236b0042b net: Replace vrf_dev_table and friends
Replace calls to vrf_dev_table and friends with l3mdev_fib_table
and kin.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 385add906b net: Replace vrf_master_ifindex{, _rcu} with l3mdev equivalents
Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.

The pattern:
    oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
    oif = l3mdev_fib_oif(dev);

And remove the now unused vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:33 -07:00
David Ahern 007979eaf9 net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER
Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the
netif_is_vrf and netif_index_is_vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 20:40:32 -07:00
Eric Dumazet 0536fcc039 tcp: prepare fastopen code for upcoming listener changes
While auditing TCP stack for upcoming 'lockless' listener changes,
I found I had to change fastopen_init_queue() to properly init the object
before publishing it.

Otherwise an other cpu could try to lock the spinlock before it gets
properly initialized.

Instead of adding appropriate barriers, just remove dynamic memory
allocations :
- Structure is 28 bytes on 64bit arches. Using additional 8 bytes
  for holding a pointer seems overkill.
- Two listeners can share same cache line and performance would suffer.

If we really want to save few bytes, we would instead dynamically allocate
whole struct request_sock_queue in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:10 -07:00
Eric Dumazet 2985aaac01 tcp: constify tcp_syn_flood_action() socket argument
tcp_syn_flood_action() will soon be called with unlocked socket.
In order to avoid SYN flood warning being emitted multiple times,
use xchg().
Extend max_qlen_log and synflood_warned fields in struct listen_sock
to u32

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:10 -07:00
Eric Dumazet f964629e33 tcp: constify tcp_v{4|6}_route_req() sock argument
These functions do not change the listener socket.
Goal is to make sure tcp_conn_request() is not messing with
listener in a racy way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet 3f684b4b1f tcp: cookie_init_sequence() cleanups
Some common IPv4/IPv6 code can be factorized.
Also constify cookie_init_sequence() socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet 0c27171e66 tcp/dccp: constify syn_recv_sock() method sock argument
We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet c28c6f0459 tcp: constify tcp_create_openreq_child() socket argument
This method does not touch the listener socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet 1ce31c9e08 inet: constify __inet_inherit_port() sock argument
socket is not touched, make it const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:08 -07:00
Eric Dumazet a2432c4fa5 inet: constify inet_csk_route_child_sock() socket argument
The socket points to the (shared) listener.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:08 -07:00
Eric Dumazet 72ab4a86f7 tcp: remove tcp_rcv_state_process() tcp_hdr argument
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet bda07a64c0 tcp: remove unused len argument from tcp_rcv_state_process()
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet a00e74442b tcp/dccp: constify send_synack and send_reset socket argument
None of these functions need to change the socket, make it
const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
David Ahern 0d7539603b net: Remove martian_source_keep_err goto label
err is initialized to -EINVAL when it is declared. It is not reset until
fib_lookup which is well after the 3 users of the martian_source jump. So
resetting err to -EINVAL at martian_source label is not needed.

Removing that line obviates the need for the martian_source_keep_err label
so delete it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:27:47 -07:00
Alexander Duyck 75fea73dce net: Swap ordering of tests in ip_route_input_mc
This patch just swaps the ordering of one of the conditional tests in
ip_route_input_mc.  Specifically it swaps the testing for the source
address to see if it is loopback, and the test to see if we allow a
loopback source address.

The reason for swapping these two tests is because it is much faster to
test if an address is loopback than it is to dereference several pointers
to get at the net structure to see if the use of loopback is allowed.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:27:47 -07:00