Граф коммитов

44791 Коммитов

Автор SHA1 Сообщение Дата
Eric Dumazet c8c8b12709 udp: under rx pressure, try to condense skbs
Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.

Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.

If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.

Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)

This patch gave me about 5 % increase in throughput in my tests.

skb_condense() helper could probably used in other contexts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:25:07 -05:00
Eric Dumazet 13bfff25c0 net: rfs: add a jump label
RFS is not commonly used, so add a jump label to avoid some conditionals
in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:18:35 -05:00
Simon Horman 7b684884fb net/sched: cls_flower: Support matching on ICMP type and code
Support matching on ICMP type and code.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: flower \
	indev eth0 ip_proto icmp type 8 code 0 action drop

tc filter add dev eth0 protocol ipv6 parent ffff: flower \
	indev eth0 ip_proto icmpv6 type 128 code 0 action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:47:08 -05:00
Simon Horman 972d3876fa flow dissector: ICMP support
Allow dissection of ICMP(V6) type and code. This should only occur
if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set.

There are currently no users of FLOW_DISSECTOR_KEY_ICMP.
A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by
the flower classifier.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:45:21 -05:00
Or Gerlitz faa3ffce78 net/sched: cls_flower: Add support for matching on flags
Add UAPI to provide set of flags for matching, where the flags
provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:32:50 -05:00
Zhang Shengju f91c58d68b icmp: correct return value of icmp_rcv()
Currently, icmp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:24:23 -05:00
Johan Hedberg a62da6f14d Bluetooth: SMP: Add support for H7 crypto function and CT2 auth flag
Bluetooth 5.0 introduces a new H7 key generation function that's used
when both sides of the pairing set the CT2 authentication flag to 1.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-12-08 07:50:24 +01:00
David S. Miller 5fccd64aa4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains a large Netfilter update for net-next,
to summarise:

1) Add support for stateful objects. This series provides a nf_tables
   native alternative to the extended accounting infrastructure for
   nf_tables. Two initial stateful objects are supported: counters and
   quotas. Objects are identified by a user-defined name, you can fetch
   and reset them anytime. You can also use a maps to allow fast lookups
   using any arbitrary key combination. More info at:

   http://marc.info/?l=netfilter-devel&m=148029128323837&w=2

2) On-demand registration of nf_conntrack and defrag hooks per netns.
   Register nf_conntrack hooks if we have a stateful ruleset, ie.
   state-based filtering or NAT. The new nf_conntrack_default_on sysctl
   enables this from newly created netnamespaces. Default behaviour is not
   modified. Patches from Florian Westphal.

3) Allocate 4k chunks and then use these for x_tables counter allocation
   requests, this improves ruleset load time and also datapath ruleset
   evaluation, patches from Florian Westphal.

4) Add support for ebpf to the existing x_tables bpf extension.
   From Willem de Bruijn.

5) Update layer 4 checksum if any of the pseudoheader fields is updated.
   This provides a limited form of 1:1 stateless NAT that make sense in
   specific scenario, eg. load balancing.

6) Add support to flush sets in nf_tables. This series comes with a new
   set->ops->deactivate_one() indirection given that we have to walk
   over the list of set elements, then deactivate them one by one.
   The existing set->ops->deactivate() performs an element lookup that
   we don't need.

7) Two patches to avoid cloning packets, thus speed up packet forwarding
   via nft_fwd from ingress. From Florian Westphal.

8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
   prevent infinite loops, patch from Dwip Banerjee. And one minor
   refactoring from Gao feng.

9) Revisit recent log support for nf_tables netdev families: One patch
   to ensure that we correctly handle non-ethernet packets. Another
   patch to add missing logger definition for netdev. Patches from
   Liping Zhang.

10) Three patches for nft_fib, one to address insufficient register
    initialization and another to solve incorrect (although harmless)
    byteswap operation. Moreover update xt_rpfilter and nft_fib to match
    lbcast packets with zeronet as source, eg. DHCP Discover packets
    (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.

11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
    Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
    been broken in many-cast mode for some little time, let's give them a
    chance by placing them at the same level as other existing protocols.
    Thus, users don't explicitly have to modprobe support for this and
    NAT rules work for them. Some people point to the lack of support in
    SOHO Linux-based routers that make deployment of new protocols harder.
    I guess other middleboxes outthere on the Internet are also to blame.
    Anyway, let's see if this has any impact in the midrun.

12) Skip software SCTP software checksum calculation if the NIC comes
    with SCTP checksum offload support. From Davide Caratti.

13) Initial core factoring to prepare conversion to hook array. Three
    patches from Aaron Conole.

14) Gao Feng made a wrong conversion to switch in the xt_multiport
    extension in a patch coming in the previous batch. Fix it in this
    batch.

15) Get vmalloc call in sync with kmalloc flags to avoid a warning
    and likely OOM killer intervention from x_tables. From Marcelo
    Ricardo Leitner.

16) Update Arturo Borrero's email address in all source code headers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07 19:16:46 -05:00
Pablo Neira Ayuso 73c25fb139 netfilter: nft_quota: allow to restore consumed quota
Allow to restore consumed quota, this is useful to restore the quota
state across reboots.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 14:40:53 +01:00
Willem de Bruijn 2c16d60332 netfilter: xt_bpf: support ebpf
Add support for attaching an eBPF object by file descriptor.

The iptables binary can be called with a path to an elf object or a
pinned bpf object. Also pass the mode and path to the kernel to be
able to return it later for iptables dump and save.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:32:35 +01:00
Marcelo Ricardo Leitner 5bad87348c netfilter: x_tables: avoid warn and OOM killer on vmalloc call
Andrey Konovalov reported that this vmalloc call is based on an
userspace request and that it's spewing traces, which may flood the logs
and cause DoS if abused.

Florian Westphal also mentioned that this call should not trigger OOM
killer.

This patch brings the vmalloc call in sync to kmalloc and disables the
warn trace on allocation failure and also disable OOM killer invocation.

Note, however, that under such stress situation, other places may
trigger OOM killer invocation.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:41 +01:00
Pablo Neira Ayuso 8411b6442e netfilter: nf_tables: support for set flushing
This patch adds support for set flushing, that consists of walking over
the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
This patch requires the following changes:

1) Add set->ops->deactivate_one() operation: This allows us to
   deactivate an element from the set element walk path, given we can
   skip the lookup that happens in ->deactivate().

2) Add a new nft_trans_alloc_gfp() function since we need to allocate
   transactions using GFP_ATOMIC given the set walk path happens with
   held rcu_read_lock.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:40 +01:00
Pablo Neira Ayuso 37df5301a3 netfilter: nft_set: introduce nft_{hash, rbtree}_deactivate_one()
This new function allows us to deactivate one single element, this is
required by the set flush command that comes in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:02 +01:00
Pablo Neira Ayuso 1a37ef769d netfilter: nf_tables: constify struct nft_ctx * parameter in nft_trans_alloc()
Context is not modified by nft_trans_alloc(), so constify it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:51 +01:00
Davide Caratti 3189a290f9 netfilter: nat: skip checksum on offload SCTP packets
SCTP GSO and hardware can do CRC32c computation after netfilter processing,
so we can avoid calling sctp_compute_checksum() on skb if skb->ip_summed
is equal to CHECKSUM_PARTIAL. Moreover, set skb->ip_summed to CHECKSUM_NONE
when the NAT code computes the CRC, to prevent offloaders from computing
it again (on ixgbe this resulted in a transmission with wrong L4 checksum).

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:50 +01:00
Liping Zhang 3b760dcb0f netfilter: rpfilter: bypass ipv4 lbcast packets with zeronet source
Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be
dropped incorrectly.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:50 +01:00
Pablo Neira Ayuso a9fea2a3c3 netfilter: nf_tables: allow to filter stateful object dumps by type
This patch adds the netlink code to filter out dump of stateful objects,
through the NFTA_OBJ_TYPE netlink attribute.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:49 +01:00
Pablo Neira Ayuso 63aea29060 netfilter: nft_objref: support for stateful object maps
This patch allows us to refer to stateful object dictionaries, the
source register indicates the key data to be used to look up for the
corresponding state object. We can refer to these maps through names or,
alternatively, the map transaction id. This allows us to refer to both
anonymous and named maps.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:48 +01:00
Pablo Neira Ayuso 8aeff920dc netfilter: nf_tables: add stateful object reference to set elements
This patch allows you to refer to stateful objects from set elements.
This provides the infrastructure to create maps where the right hand
side of the mapping is a stateful object.

This allows us to build dictionaries of stateful objects, that you can
use to perform fast lookups using any arbitrary key combination.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:47 +01:00
Pablo Neira Ayuso 1896531710 netfilter: nft_quota: add depleted flag for objects
Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag
indicates we have reached overquota.

Add pointer to table from nft_object, so we can use it when sending the
depletion notification to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:12 +01:00
Pablo Neira Ayuso 2599e98934 netfilter: nf_tables: notify internal updates of stateful objects
Introduce nf_tables_obj_notify() to notify internal state changes in
stateful objects. This is used by the quota object to report depletion
in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:57:20 +01:00
Pablo Neira Ayuso 43da04a593 netfilter: nf_tables: atomic dump and reset for stateful objects
This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
dump-and-reset of the stateful object. This also comes with add support
for atomic dump and reset for counter and quota objects.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:56:57 +01:00
Pablo Neira Ayuso 795595f68d netfilter: nft_quota: dump consumed quota
Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of
quota that has been already consumed. This allows us to restore the
internal state of the quota object between reboots as well as to monitor
how wasted it is.

This patch changes the logic to account for the consumed bytes, instead
of the bytes that remain to be consumed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:54:22 +01:00
Marc Kleine-Budde 332b05ca7a can: raw: raw_setsockopt: limit number of can_filter that can be set
This patch adds a check to limit the number of can_filters that can be
set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER
are not prevented resulting in a warning.

Reference: https://lkml.org/lkml/2016/12/2/230

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-12-07 10:45:57 +01:00
David S. Miller c63d352f05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-06 21:33:19 -05:00
Pablo Neira Ayuso c97d22e68b netfilter: nf_tables: add stateful object reference expression
This new expression allows us to refer to existing stateful objects from
rules.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:25 +01:00
Pablo Neira Ayuso 173705d9a2 netfilter: nft_quota: add stateful object type
Register a new quota stateful object type into the new stateful object
infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:24 +01:00
Pablo Neira Ayuso b1ce0ced10 netfilter: nft_counter: add stateful object type
Register a new percpu counter stateful object type into the stateful
object infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:23 +01:00
Pablo Neira Ayuso e50092404c netfilter: nf_tables: add stateful objects
This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.

This patch adds the generic infrastructure, follow up patches add
support for two stateful objects: counters and quotas.

This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:22 +01:00
Florian Westphal 3bf3276119 netfilter: add and use nf_fwd_netdev_egress
... so we can use current skb instead of working with a clone.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:22 +01:00
Gao Feng 1ed9887ee3 netfilter: xt_multiport: Fix wrong unmatch result with multiple ports
I lost one test case in the last commit for xt_multiport.
For example, the rule is "-m multiport --dports 22,80,443".
When first port is unmatched and the second is matched, the curent codes
could not return the right result.
It would return false directly when the first port is unmatched.

Fixes: dd2602d00f ("netfilter: xt_multiport: Use switch case instead
of multiple condition checks")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:20 +01:00
Pablo Neira Ayuso 1814096980 netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields
This patch adds a new flag that signals the kernel to update layer 4
checksum if the packet field belongs to the layer 4 pseudoheader. This
implicitly provides stateless NAT 1:1 that is useful under very specific
usecases.

Since rules mangling layer 3 fields that are part of the pseudoheader
may potentially convey any layer 4 packet, we have to deal with the
layer 4 checksum adjustment using protocol specific code.

This patch adds support for TCP, UDP and ICMPv6, since they include the
pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we
can skip it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:47:54 +01:00
Liping Zhang e0ffdbc78d netfilter: nft_fib_ipv4: initialize *dest to zero
Otherwise, if fib lookup fail, *dest will be filled with garbage value,
so reverse path filtering will not work properly:
 # nft add rule x prerouting fib saddr oif eq 0 drop

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:21 +01:00
Liping Zhang 11583438b7 netfilter: nft_fib: convert htonl to ntohl properly
Acctually ntohl and htonl are identical, so this doesn't affect
anything, but it is conceptually wrong.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:20 +01:00
Florian Westphal ae0ac0ed6f netfilter: x_tables: pack percpu counter allocations
instead of allocating each xt_counter individually, allocate 4k chunks
and then use these for counter allocation requests.

This should speed up rule evaluation by increasing data locality,
also speeds up ruleset loading because we reduce calls to the percpu
allocator.

As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
arches with 64k page size.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:19 +01:00
Florian Westphal f28e15bace netfilter: x_tables: pass xt_counters struct to counter allocator
Keeps some noise away from a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:18 +01:00
Florian Westphal 4d31eef517 netfilter: x_tables: pass xt_counters struct instead of packet counter
On SMP we overload the packet counter (unsigned long) to contain
percpu offset.  Hide this from callers and pass xt_counters address
instead.

Preparation patch to allocate the percpu counters in page-sized batch
chunks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:17 +01:00
Aaron Conole 679972f3be netfilter: convert while loops to for loops
This is to facilitate converting from a singly-linked list to an array
of elements.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:16 +01:00
Aaron Conole 0aa8c57a04 netfilter: introduce accessor functions for hook entries
This allows easier future refactoring.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:15 +01:00
Florian Westphal 834184b1f3 netfilter: defrag: only register defrag functionality if needed
nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.

This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.

Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.

We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.

There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.

The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see framents).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:00 +01:00
Fabian Frederick 3eb15f2828 sunrpc: use DEFINE_SPINLOCK()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-12-06 14:18:30 -05:00
Florian Westphal 343dfaa198 Revert "dctcp: update cwnd on congestion event"
Neal Cardwell says:
 If I am reading the code correctly, then I would have two concerns:
 1) Has that been tested? That seems like an extremely dramatic
    decrease in cwnd. For example, if the cwnd is 80, and there are 40
    ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
    calculations seem to suggest that after just 11 ACKs the cwnd would be
    down to a minimal value of 2 [..]
 2) That seems to contradict another passage in the draft [..] where it
    sazs:
       Just as specified in [RFC3168], DCTCP does not react to congestion
       indications more than once for every window of data.

Neal is right.  Fortunately we don't have to complicate this by testing
vs. current rtt estimate, we can just revert the patch.

Normal stack already handles this for us: receiving ACKs with ECE
set causes a call to tcp_enter_cwr(), from there on the ssthresh gets
adjusted and prr will take care of cwnd adjustment.

Fixes: 4780566784 ("dctcp: update cwnd on congestion event")
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 11:34:24 -05:00
Marcelo Ricardo Leitner dcb17d22e1 tcp: warn on bogus MSS and try to amend it
There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
    Dumazet's suggestion. As the skb may be backlogged and the interface
    gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
    David's suggestion

Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 11:01:19 -05:00
Eric Dumazet a297569fe0 net/udp: do not touch skb->peeked unless really needed
In UDP recvmsg() path we currently access 3 cache lines from an skb
while holding receive queue lock, plus another one if packet is
dequeued, since we need to change skb->next->prev

1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)

skb->peeked is only needed to make sure 0-length packets are properly
handled while MSG_PEEK is operated.

I had first the intent to remove skb->peeked but the "MSG_PEEK at
non-zero offset" support added by Sam Kumar makes this not possible.

This patch avoids one cache line miss during the locked section, when
skb->len and skb->peeked do not have to be read.

It also avoids the skb_set_peeked() cost for non empty UDP datagrams.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 10:41:49 -05:00
Herbert Xu ed5d7788a9 netlink: Do not schedule work from sk_destruct
It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.

Instead we should do the deferral prior to sk_free.

This patch does just that.

Fixes: 707693c8a4 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 19:43:42 -05:00
Daniel Borkmann 7bd509e311 bpf: add prog_digest and expose it via fdinfo/netlink
When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:          0
  flags:        02000002
  mnt_id:       11
  prog_type:    1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:      4096

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:11 -05:00
Daniel Borkmann 8d829bdb97 bpf, cls: consolidate prog deletion path
Commit 18cdb37ebf ("net: sched: do not use tcf_proto 'tp' argument from
call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also
remove it from the function as argument to not give a wrong impression. tp
is illegal to access from this callback, since it could already have been
freed.

Refactor the deletion code a bit, so that cls_bpf_destroy() can call into
the same code for prog deletion as cls_bpf_delete() op, instead of having
it unnecessarily duplicated.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:10 -05:00
Daniel Borkmann 1afaf661b2 bpf: remove type arg from __is_valid_{,xdp_}access
Commit d691f9e8d4 ("bpf: allow programs to write to certain skb
fields") pushed access type check outside of __is_valid_access()
to have different restrictions for socket filters and tc programs.
type is thus not used anymore within __is_valid_access() and should
be removed as a function argument. Same for __is_valid_xdp_access()
introduced by 6a773a15a1 ("bpf: add XDP prog type for early driver
filter").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:10 -05:00
Eric Dumazet 1c0d32fde5 net_sched: gen_estimator: complete rewrite of rate estimators
1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:21:59 -05:00
Hadar Hen Zion a6e1693129 net/sched: cls_flower: Set the filter Hardware device for all use-cases
Check if the returned device from tcf_exts_get_dev function supports tc
offload and in case the rule can't be offloaded, set the filter hw_dev
parameter to the original device given by the user.

The filter hw_device parameter should always be set by fl_hw_replace_filter
function, since this pointer is used by dump stats and destroy
filter for each flower rule (offloaded or not).

Fixes: 7091d8c705 ('net/sched: cls_flower: Add offload support using egress Hardware device')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:06:58 -05:00
Erik Nordmark 96d5822c1d ipv6: Allow IPv4-mapped address as next-hop
Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.

It is possible to configure IP interfaces with IPv4-mapped addresses, and
one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
to this fix the kernel returned an EINVAL when attempting to add an IPv6
route with an IPv4-mapped address as a nexthop/gateway.

RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
thus in order to support that type of address configuration the kernel
needs to allow IPv4-mapped addresses as nexthops.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 14:52:05 -05:00
Pan Bian c79e167c3c net: caif: remove ineffective check
The check of the return value of sock_register() is ineffective.
"if(!err)" seems to be a typo. It is better to propagate the error code
to the callers of caif_sktinit_module(). This patch removes the check
statment and directly returns the result of sock_register().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 14:48:48 -05:00
Al Viro 0b62fca262 switch getfrag callbacks to ..._full() primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:34:32 -05:00
Al Viro cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
David S. Miller c3543688ab Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-12-03

Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10
kernel):

 - Fix for a potential NULL deref in the ieee802154 netlink code
 - Fix for the ED values of the at86rf2xx driver
 - Documentation updates to ieee802154
 - Cleanups to u8 vs __u8 usage
 - Timer API usage cleanups in HCI drivers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:37:28 -05:00
Kees Cook 0eab121ef8 net: ping: check minimum size on ICMP header length
Prior to commit c0371da604 ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[<     inline     >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[<     inline     >] print_address_description mm/kasan/report.c:147
[<     inline     >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[<     inline     >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[<     inline     >] __sock_sendmsg_nosec net/socket.c:624
[<     inline     >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[<     inline     >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He <i@flanker017.me>
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:35:38 -05:00
Eric Dumazet 7aa5470c2c tcp: tsq: move tsq_flags close to sk_wmem_alloc
tsq_flags being in the same cache line than sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:24 -05:00
Eric Dumazet 12a59abc22 tcp: tcp_mtu_probe() is likely to exit early
Adding a likely() in tcp_mtu_probe() moves its code which used to
be inlined in front of tcp_write_xmit()

We still have a cache line miss to access icsk->icsk_mtup.enabled,
we will probably have to reorganize fields to help data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:23 -05:00
Eric Dumazet 75eefc6c59 tcp: tsq: add a shortcut in tcp_small_queue_check()
Always allow the two first skbs in write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:23 -05:00
Eric Dumazet a9b204d156 tcp: tsq: avoid one atomic in tcp_wfree()
Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We can schedule it only if our per cpu list was empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:23 -05:00
Eric Dumazet b223feb9de tcp: tsq: add shortcut in tcp_tasklet_func()
Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
for other cpus/threads entering tcp_write_xmit() to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give to ACK processing an increased
budget to reduce even more tcp_tasklet_func() amount of work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:22 -05:00
Eric Dumazet 408f0a6c21 tcp: tsq: remove one locked operation in tcp_wfree()
Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED
bits, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:22 -05:00
Eric Dumazet 40fc3423b9 tcp: tsq: add tsq_flags / tsq_enum
This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:22 -05:00
Pan Bian b59589635f net: bridge: set error code on failure
Function br_sysfs_addbr() does not set error code when the call
kobject_create_and_add() returns a NULL pointer. It may be better to
return "-ENOMEM" when kobject_create_and_add() fails.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188781

Signed-off-by: Pan Bian <bianpan2016@163.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:26:22 -05:00
Suraj Deshmukh 14dd3e1b97 net: af_mpls.c add space before open parenthesis
Adding space after switch keyword before open
parenthesis for readability purpose.

This patch fixes the checkpatch.pl warning:
space required before the open parenthesis '('

Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:25:55 -05:00
Alexander Duyck a52ca62c4a ipv4: Drop suffix update from resize code
It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same.  The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a3 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:15:58 -05:00
Alexander Duyck 1a239173cc ipv4: Drop leaf from suffix pull/push functions
It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it.  Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues.  Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix.  I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a3 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:15:58 -05:00
Florian Westphal 481fa37347 netfilter: conntrack: add nf_conntrack_default_on sysctl
This switch (default on) can be used to disable automatic registration
of connection tracking functionality in newly created network
namespaces.

This means that when net namespace goes down (or the tracker protocol
module is unloaded) we *might* have to unregister the hooks.

We can either add another per-netns variable that tells if
the hooks got registered by default, or, alternatively, just call
the protocol _put() function and have the callee deal with a possible
'extra' put() operation that doesn't pair with a get() one.

This uses the latter approach, i.e. a put() without a get has no effect.

Conntrack is still enabled automatically regardless of the new sysctl
setting if the new net namespace requires connection tracking, e.g. when
NAT rules are created.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:17:25 +01:00
Florian Westphal 0c66dc1ea3 netfilter: conntrack: register hooks in netns when needed by ruleset
This makes use of nf_ct_netns_get/put added in previous patch.
We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6
then implement use-count to track how many users (nft or xtables modules)
have a dependency on ipv4 and/or ipv6 connection tracking functionality.

When count reaches zero, the hooks are unregistered.

This delays activation of connection tracking inside a namespace until
stateful firewall rule or nat rule gets added.

This patch breaks backwards compatibility in the sense that connection
tracking won't be active anymore when the protocol tracker module is
loaded.  This breaks e.g. setups that ctnetlink for flow accounting and
the like, without any '-m conntrack' packet filter rules.

Followup patch restores old behavour and makes new delayed scheme
optional via sysctl.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:17:24 +01:00
Florian Westphal 20afd42397 netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressions
so that conntrack core will add the needed hooks in this namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:17:16 +01:00
Florian Westphal a357b3f80b netfilter: nat: add dependencies on conntrack module
MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the
conntrack module.

However, since the conntrack hooks are now registered in a lazy fashion
(i.e., only when needed) a symbol reference is not enough.

Thus, when something is added to a nat table, make sure that it will see
packets by calling nf_ct_netns_get() which will register the conntrack
hooks in the current netns.

An alternative would be to add these dependencies to the NAT table.

However, that has problems when using non-modular builds -- we might
register e.g. ipv6 conntrack before its initcall has run, leading to NULL
deref crashes since its per-netns storage has not yet been allocated.

Adding the dependency in the modules instead has the advantage that nat
table also does not register its hooks until rules are added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:16:51 +01:00
Florian Westphal ecb2421b5d netfilter: add and use nf_ct_netns_get/put
currently aliased to try_module_get/_put.
Will be changed in next patch when we add functions to make use of ->net
argument to store usercount per l3proto tracker.

This is needed to avoid registering the conntrack hooks in all netns and
later only enable connection tracking in those that need conntrack.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:16:50 +01:00
Florian Westphal a379854d91 netfilter: conntrack: remove unused init_net hook
since adf0516845 ("netfilter: remove ip_conntrack* sysctl compat code")
the only user (ipv4 tracker) sets this to an empty stub function.

After this change nf_ct_l3proto_pernet_register() is also empty,
but this will change in a followup patch to add conditional register
of the hooks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 21:16:41 +01:00
Davide Caratti 9b91c96c5d netfilter: conntrack: built-in support for UDPlite
CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y,
connection tracking support for UDPlite protocol is built-in into
nf_conntrack.ko.

footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
        net/ipv4/netfilter/nf_conntrack_ipv4.ko \
        net/ipv6/netfilter/nf_conntrack_ipv6.ko

(builtin)|| udplite|  ipv4  |  ipv6  |nf_conntrack
---------++--------+--------+--------+--------------
none     || 432538 | 828755 | 828676 | 6141434
UDPlite  ||   -    | 829649 | 829362 | 6498204

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:57:36 +01:00
Davide Caratti a85406afeb netfilter: conntrack: built-in support for SCTP
CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection
tracking support for SCTP protocol is built-in into nf_conntrack.ko.

footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
        net/ipv4/netfilter/nf_conntrack_ipv4.ko \
        net/ipv6/netfilter/nf_conntrack_ipv6.ko

(builtin)||  sctp  |  ipv4  |  ipv6  | nf_conntrack
---------++--------+--------+--------+--------------
none     || 498243 | 828755 | 828676 | 6141434
SCTP     ||   -    | 829254 | 829175 | 6547872

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:55:37 +01:00
Davide Caratti c51d39010a netfilter: conntrack: built-in support for DCCP
CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection
tracking support for DCCP protocol is built-in into nf_conntrack.ko.

footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
        net/ipv4/netfilter/nf_conntrack_ipv4.ko \
        net/ipv6/netfilter/nf_conntrack_ipv6.ko

(builtin)||  dccp  |  ipv4  |  ipv6  | nf_conntrack
---------++--------+--------+--------+--------------
none     || 469140 | 828755 | 828676 | 6141434
DCCP     ||   -    | 830566 | 829935 | 6533526

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:53:15 +01:00
Pablo Neira Ayuso f6b3ef5e38 Merge tag 'ipvs-for-v4.10' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
IPVS Updates for v4.10

please consider these enhancements to the IPVS for v4.10.

* Decrement the IP ttl in all the modes in order to prevent infinite
  route loops. Thanks to Dwip Banerjee.
* Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:46:16 +01:00
Liping Zhang a7647080d3 netfilter: nfnetlink_log: add "nf-logger-5-1" module alias name
So we can autoload nfnetlink_log.ko when the user adding nft log
group X rule in netdev family.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:34 +01:00
Liping Zhang 673ab46f34 netfilter: nf_log: do not assume ethernet header in netdev family
In netdev family, we will handle non ethernet packets, so using
eth_hdr(skb)->h_proto is incorrect.

Meanwhile, we can use socket(AF_PACKET...) to sending packets, so
skb->protocol is not always set in bridge family.

Add an extra parameter into nf_log_l2packet to solve this issue.

Fixes: 1fddf4bad0 ("netfilter: nf_log: add packet logging for netdev family")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:33 +01:00
Davide Caratti b8ad652f97 netfilter: built-in NAT support for UDPlite
CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT
support for UDPlite protocol is built-in into nf_nat.ko.

footprint test:

(nf_nat_proto_)           |udplite || nf_nat
--------------------------+--------++--------
no builtin                | 408048 || 2241312
UDPLITE builtin           |   -    || 2577256

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:32 +01:00
Davide Caratti 7a2dd28c70 netfilter: built-in NAT support for SCTP
CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT
support for SCTP protocol is built-in into nf_nat.ko.

footprint test:

(nf_nat_proto_)           | sctp   || nf_nat
--------------------------+--------++--------
no builtin                | 428344 || 2241312
SCTP builtin              |   -    || 2597032

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:31 +01:00
Davide Caratti 0c4e966eaf netfilter: built-in NAT support for DCCP
CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT
support for DCCP protocol is built-in into nf_nat.ko.

footprint test:

(nf_nat_proto_)           | dccp   || nf_nat
--------------------------+--------++--------
no builtin                | 409800 || 2241312
DCCP builtin              |   -    || 2578968

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:30 +01:00
Arturo Borrero Gonzalez cd72751468 netfilter: update Arturo Borrero Gonzalez email address
The email address has changed, let's update the copyright statements.

Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-04 20:45:25 +01:00
Pan Bian c66ebf2db5 net: dcb: set error code on failures
In function dcbnl_cee_fill(), returns the value of variable err on
errors. However, on some error paths (e.g. nla put fails), its value may
be 0. It may be better to explicitly set a negative errno to variable
err before returning.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188881

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 23:54:25 -05:00
Erik Nordmark adc176c547 ipv6 addrconf: Implemented enhanced DAD (RFC7527)
Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 23:21:37 -05:00
Ido Schimmel c3852ef7f2 ipv4: fib: Replay events when registering FIB notifier
Commit b90eb75494 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends a ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.

Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.

The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 19:29:35 -05:00
Ido Schimmel cacaad11f4 ipv4: fib: Allow for consistent FIB dumping
The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 19:29:35 -05:00
Ido Schimmel d3f706f68e ipv4: fib: Convert FIB notification chain to be atomic
In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 19:29:35 -05:00
Ido Schimmel b423cb1080 ipv4: fib: Export free_fib_info()
The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 19:29:35 -05:00
WANG Cong 548ed72246 act_mirred: fix a typo in get_dev
Fixes: 255cb30425 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 19:28:02 -05:00
Paolo Abeni 363dc73aca udp: be less conservative with sock rmem accounting
Before commit 850cbaddb5 ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this cause a performance regression
for some (small) values of rcvbuf.

This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 16:14:48 -05:00
David S. Miller a38b610094 Here is another batman-adv bugfix:
- fix checking for failed allocation of TVLV blocks in TT local data,
    by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlhBnMAWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQjfD/4i/eHZjGNwS9aTxh0QCfzM45JP
 xq5DmvCZrJXzGIt/g3vDI9WqpdJrnJsfUhCwzcn2NwCqBt8BKlGFT/NDXv5rxjBu
 5wHV9gZ5r6whMVyXiDgkYNW0B1M71rjgh9rxvH61Lyrdaq5RSwGVZb9dohRffGpo
 zTU+l+GiMJXzszA3Hi0ga+8/3xyEdwjMzSyBSK2xLdyHFBn8gg4EfAf1kJhS4rk8
 dbJUHFEvtg0wB7LkMlXe3DHkX/bRAi+PGU/YNk1x0x877ejfejIn7pp/GoweFgAW
 2cn4SxVIsN7sycZCQYar5iqjbQ/mO0v5fNTkxo8EGiNB7wHTr0RVFI/UPOP6KeS6
 2JR9qubg4kd6nU9yWMC8osbzye+GWmbPkvkFBrcHRkZpZDwsuaLJH2SbmWGt62kT
 mjpF2Uhsz+hn6AF5YhdDPAnBNP0iJWu25jWymCtHqCjieJsK0QjvAbXLMz45VWoW
 5rrvOBy1gI47/3gs2jitEXER9/6onai3vzoJ3cX7YYUh2ntiAuK3Lg5wGnrODm/W
 fyjpDcnwURBhiP4H+uoeiRvxFXpNeUBJiTolw+tl8n//adlLNepclA0OnGunx4GP
 qc5LPksYp1KWV6rIy3Nl+62XGLZkSGHcYzVb+uX/3Bng8HNmva7ceUggsVwIFsgi
 Isw24Dckx4bSB8P86w==
 =Y2ED
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161202' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is another batman-adv bugfix:

 - fix checking for failed allocation of TVLV blocks in TT local data,
   by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 16:13:27 -05:00
Eric Dumazet 12efa1fa43 net_sched: gen_estimator: account for timer drifts
Under heavy stress, timer used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.

Lets remember what was the last scheduled jiffies so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 16:12:17 -05:00
Alexey Dobriyan 6af2d5fff2 netns: fix net_generic() "id - 1" bloat
net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

		...
	ptr = ng->ptr[id - 1];
		...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
	function                                     old     new   delta
	tipc_nametbl_insert_publ                    1250    1270     +20
	nlmclnt_lookup_host                          686     703     +17
	nfsd4_encode_fattr                          5930    5941     +11
	nfs_get_client                              1050    1061     +11
	register_pernet_operations                   333     342      +9
	tcf_mirred_init                              843     849      +6
	tcf_bpf_init                                1143    1149      +6
	gss_setup_upcall                             990     994      +4
	idmap_name_to_id                             432     434      +2
	ops_init                                     274     275      +1
	nfsd_inject_forget_client                    259     260      +1
	nfs4_alloc_client                            612     613      +1
	tunnel_key_walker                            164     163      -1

		...

	tipc_bcbase_select_primary                   392     360     -32
	mac80211_hwsim_new_radio                    2808    2767     -41
	ipip6_tunnel_ioctl                          2228    2186     -42
	tipc_bcast_rcv                               715     672     -43
	tipc_link_build_proto_msg                   1140    1089     -51
	nfsd4_lock                                  3851    3796     -55
	tipc_mon_rcv                                1012     956     -56
	Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 15:59:58 -05:00
Alexey Dobriyan 9bfc7b9969 netns: add dummy struct inside "struct net_generic"
This is precursor to fixing "[id - 1]" bloat inside net_generic().

Name "s" is chosen to complement name "u" often used for dummy unions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 15:59:58 -05:00
Alexey Dobriyan 1a9a059203 netns: publish net_generic correctly
Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.

	memcpy
	rcu_assign_pointer
	kfree_rcu
	ng->ptr[id - 1] = data;

This bug was introduced with commit dec827d174
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)

How it didn't trigger for so long?
Well, you need quite specific set of conditions:

*) race window opens once per pernet subsystem addition
   (read: modprobe or boot)

*) not every pernet subsystem is eligible (need ->id and ->size)

*) not every pernet subsystem is vulnerable (need incorrect or absense
   of ordering of register_pernet_sybsys() and actually using net_generic())

*) to hide the bug even more, default is to preallocate 13 pointers which
   is actually quite a lot. You need IPv6, netfilter, bridging etc together
   loaded to trigger reallocation in the first place. Trimmed down
   config are OK.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 15:59:58 -05:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Eric Dumazet b98b0bc8c4 net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 8298193012 ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:10:14 -05:00
Michal Kubeček 3de81b7588 tipc: check minimum bearer MTU
Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:03:20 -05:00
David Ahern aa4c1037a3 bpf: Add support for reading socket family, type, protocol
Add socket family, type and protocol to bpf_sock allowing bpf programs
read-only access.

Add __sk_flags_offset[0] to struct sock before the bitfield to
programmtically determine the offset of the unsigned int containing
protocol and type.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:09 -05:00
David Ahern 6102365876 bpf: Add new cgroup attach type to enable sock modifications
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:08 -05:00
Artem Savkov 6b6ebb6b01 ip6_offload: check segs for NULL in ipv6_gso_segment.
segs needs to be checked for being NULL in ipv6_gso_segment() before calling
skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:

[   97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[   97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   97.825214] PGD 0 [   97.827047]
[   97.828540] Oops: 0000 [#1] SMP
[   97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
[   97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
[   97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
[   97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
[   97.947720] RIP: 0010:[<ffffffff816e52f9>]  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   97.956251] RSP: 0018:ffff88012fc43a10  EFLAGS: 00010207
[   97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
[   97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
[   97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
[   97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
[   97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
[   97.997198] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
[   98.005280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
[   98.018149] Stack:
[   98.020157]  00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
[   98.027584]  0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
[   98.035010]  ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
[   98.042437] Call Trace:
[   98.044879]  <IRQ> [   98.046803]  [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3]
[   98.053156]  [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130
[   98.059158]  [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110
[   98.064985]  [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0
[   98.070899]  [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70
[   98.077073]  [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0
[   98.082726]  [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690
[   98.088554]  [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50
[   98.094380]  [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20
[   98.099863]  [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
[   98.106907]  [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge]
[   98.113430]  [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60
[   98.118735]  [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0
[   98.124216]  [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge]
[   98.130480]  [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
[   98.137785]  [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge]
[   98.143701]  [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge]
[   98.150834]  [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge]
[   98.157355]  [<ffffffff8102fb89>] ? sched_clock+0x9/0x10
[   98.162662]  [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0
[   98.168403]  [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20
[   98.174926]  [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0
[   98.180580]  [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60
[   98.186494]  [<ffffffff815ee625>] process_backlog+0x95/0x140
[   98.192145]  [<ffffffff815edccd>] net_rx_action+0x16d/0x380
[   98.197713]  [<ffffffff8170cff1>] __do_softirq+0xd1/0x283
[   98.203106]  [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30
[   98.209107]  <EOI> [   98.211029]  [<ffffffff8108a5c0>] do_softirq+0x50/0x60
[   98.216166]  [<ffffffff815ec853>] netif_rx_ni+0x33/0x80
[   98.221386]  [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun]
[   98.227388]  [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun]
[   98.233129]  [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net]
[   98.239392]  [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net]
[   98.245916]  [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost]
[   98.251919]  [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost]
[   98.258440]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
[   98.264094]  [<ffffffff810a44d9>] kthread+0xd9/0xf0
[   98.268965]  [<ffffffff810a4400>] ? kthread_park+0x60/0x60
[   98.274444]  [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30
[   98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
[   98.299425] RIP  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
[   98.305612]  RSP <ffff88012fc43a10>
[   98.309094] CR2: 00000000000000cc
[   98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:34:58 -05:00
Sowmini Varadhan 721c7443dc RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
If some error is encountered in rds_tcp_init_net, make sure to
unregister_netdevice_notifier(), else we could trigger a panic
later on, when the modprobe from a netns fails.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:29:26 -05:00
Hadar Hen Zion 7091d8c705 net/sched: cls_flower: Add offload support using egress Hardware device
In order to support hardware offloading when the device given by the tc
rule is different from the Hardware underline device, extract the mirred
(egress) device from the tc action when a filter is added, using the new
tc_action_ops, get_dev().

Flower caches the information about the mirred device and use it for
calling ndo_setup_tc in filter change, update stats and delete.

Calling ndo_setup_tc of the mirred (egress) device instead of the
ingress device will allow a resolution between the software ingress
device and the underline hardware device.

The resolution will take place inside the offloading driver using
'egress_device' flag added to tc_to_netdev struct which is provided to
the offloading driver.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:37 -05:00
Hadar Hen Zion 255cb30425 net/sched: act_mirred: Add new tc_action_ops get_dev()
Adding support to a new tc_action_ops.
get_dev is a general option which allows to get the underline
device when trying to offload a tc rule.

In case of mirred action the returned device is the mirred (egress)
device.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Hadar Hen Zion 3036dab670 net/sched: cls_flower: Provide a filter to replace/destroy hardware filter functions
Instead of providing many arguments to fl_hw_{replace/destroy}_filter
functions, just provide cls_fl_filter struct that includes all the relevant
args.

This patches doesn't add any new functionality.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Hadar Hen Zion 796852197c net/sched: cls_flower: Try to offload only if skip_hw flag isn't set
Check skip_hw flag isn't set before calling
fl_hw_{replace/destroy}_filter and fl_hw_update_stats functions.

Replace the call to tc_should_offload with tc_can_offload.
tc_can_offload only checks if the device supports offloading, the check for
skip_hw flag is done earlier in the flow.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Florian Westphal 25429d7b7d tcp: allow to turn tcp timestamp randomization off
Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
flows, we can deduct the relative qdisc delays (think of fq pacing).
This should work even if we have one flow per remote peer."

Having random per flow (or host) offsets doesn't allow that anymore so add
a way to turn this off.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:49:59 -05:00
Florian Westphal 95a22caee3 tcp: randomize tcp timestamp offsets for each connection
jiffies based timestamps allow for easy inference of number of devices
behind NAT translators and also makes tracking of hosts simpler.

commit ceaa1fef65 ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).

So only two items are left:
 - add a tsoffset for request sockets
 - extend the tcp isn generator to also return another 32bit number
   in addition to the ISN.

Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.

Includes fixes from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:49:59 -05:00
Eli Cooper 80d1106aea Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
This reverts commit ae148b0858
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
dst_output() is called. It is no longer necessary to do it for each tunnel.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Eli Cooper b4e479a96f ipv6: Set skb->protocol properly for local output
When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Eli Cooper f418043910 ipv4: Set skb->protocol properly for local output
When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:34:22 -05:00
Philip Pettersson 84ac726023 packet: fix race condition in packet_set_ring
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:16:49 -05:00
Soheil Hassas Yeganeh 83a1a1a70e sock: reset sk_err for ICMP packets read from error queue
Only when ICMP packets are enqueued onto the error queue,
sk_err is also set. Before f5f99309fa (sock: do not set sk_err
in sock_dequeue_err_skb), a subsequent error queue read
would set sk_err to the next error on the queue, or 0 if empty.
As no error types other than ICMP set this field, sk_err should
not be modified upon dequeuing them.

Only for ICMP errors, reset the (racy) sk_err. Some applications,
like traceroute, rely on it and go into a futile busy POLLERR
loop otherwise.

In principle, sk_err has to be set while an ICMP error is queued.
Testing is_icmp_err_skb(skb_next) approximates this without
requiring a full queue walk. Applications that receive both ICMP
and other errors cannot rely on this legacy behavior, as other
errors do not set sk_err in the first place.

Fixes: f5f99309fa (sock: do not set sk_err in sock_dequeue_err_skb)
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:55:39 -05:00
Thomas Graf 3a0af8fd61 bpf: BPF for lightweight tunnel infrastructure
Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevent modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_
relatives to ease compatibility.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00
Thomas Graf efd8570081 route: Set lwtstate for local traffic and cached input dsts
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the otuput path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00
Thomas Graf 11b3d9c586 route: Set orig_output when redirecting to lwt on locally generated traffic
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311 ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00
Tobias Klauser 6919756caa net/rtnetlink: fix attribute name in nlmsg_size() comments
Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE}
instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments int nlmsg_size().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:34:59 -05:00
Sven Eckelmann c2d0f48a13 batman-adv: Check for alloc errors when preparing TT local data
batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the
new TVLV block. The caller is informed about this problem with the returned
length of 0. Not checking this value results in an invalid memory access
when either tt_data or tt_change is accessed.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-12-02 10:46:59 +01:00
NeilBrown 2c2ee6d20b sunrpc: Don't engage exponential backoff when connection attempt is rejected.
xs_connect() contains an exponential backoff mechanism so the repeated
connection attempts are delayed by longer and longer amounts.

This is appropriate when the connection failed due to a timeout, but
it not appropriate when a definitive "no" answer is received.  In such
cases, call_connect_status() imposes a minimum 3-second back-off, so
not having the exponetial back-off will never result in immediate
retries.

The current situation is a problem when the NFS server tries to
register with rpcbind but rpcbind isn't running.  All connection
attempts are made on the same "xprt" and as the connection is never
"closed", the exponential back delays successive attempts to register,
or de-register, different protocols.  This results in a multi-minute
delay with no benefit.

So, when call_connect_status() receives a definitive "no", use
xprt_conditional_disconnect() to cancel the previous connection attempt.
This will set XPRT_CLOSE_WAIT so that xprt->ops->close() calls xs_close()
which resets the reestablish_timeout.

To ensure xprt_conditional_disconnect() does the right thing, we
ensure that rq_connect_cookie is set before a connection attempt, and
allow xprt_conditional_disconnect() to complete even when the
transport is not fully connected.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:40:41 -05:00
Zhang Shengju 2934c9dbd3 rtnetlink: return the correct error code
Before this patch, function ndo_dflt_fdb_dump() will always return code
from uc fdb dump. The reture code of mc fdb dump is lost.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 14:36:03 -05:00
David S. Miller 7bbf91ce27 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-12-01

1) Change the error value when someone tries to run 32bit
   userspace on a 64bit host from -ENOTSUPP to the userspace
   exported -EOPNOTSUPP. Fix from Yi Zhao.

2) On inbound, ESN sequence numbers are already in network
   byte order. So don't try to convert it again, this fixes
   integrity verification for ESN. Fixes from Tobias Brunner.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 11:35:49 -05:00
David S. Miller 3d2dd617fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This is a large batch of Netfilter fixes for net, they are:

1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
   structure that allows to have several objects with the same key.
   Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
   expecting a return value similar to memcmp(). Change location of
   the nat_bysource field in the nf_conn structure to avoid zeroing
   this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
   to crashes. From Florian Westphal.

2) Don't allow malformed fragments go through in IPv6, drop them,
   otherwise we hit GPF, patch from Florian Westphal.

3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

5) Two patches from David Ahern to fix netfilter interaction with vrf.
   From David Ahern.

6) Fix element timeout calculation in nf_tables, we take milliseconds
   from userspace, but we use jiffies from kernelspace. Patch from
   Anders K.  Pedersen.

7) Missing validation length netlink attribute for nft_hash, from
   Laura Garcia.

8) Fix nf_conntrack_helper documentation, we don't default to off
   anymore for a bit of time so let's get this in sync with the code.

I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 11:04:41 -05:00
Chuck Lever fafedf8170 svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
No longer any need for the dprintk().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:16 -05:00
Chuck Lever 0725745020 svcrdma: Break up dprintk format in svc_rdma_accept()
The current code results in:

Nov  7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800,
 newxprt->sc_pd=ffff880852a7ce00#012    cm_id->device=ffff88084dd20000,
 sc_pd->device=ffff88084dd20000#012    cap.max_send_wr = 272#012
 cap.max_recv_wr = 34#012    cap.max_send_sge = 32#012
 cap.max_recv_sge = 32
Nov  7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000
 accepted with the following attributes:#012    local_ip        :
 10.0.0.5#012    local_port#011     : 20049#012    remote_ip       :
 10.0.0.2#012    remote_port     : 59909#012    max_sge         : 32#012
 max_sge_rd      : 30#012    sq_depth        : 272#012    max_requests    :
 32#012    ord             : 16

Split up the output over multiple dprintks and take the opportunity
to fix the display of IPv6 addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:16 -05:00
Chuck Lever f5426d37f6 svcrdma: Remove unused variable in rdma_copy_tail()
Clean up.

linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c: In function
 ‘rdma_copy_tail’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:376:6: warning:
 variable ‘ret’ set but not used [-Wunused-but-set-variable]
  int ret;
      ^

Fixes: a97c331f9a ("svcrdma: Handle additional inline content")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:15 -05:00
Chuck Lever 9a16a34d21 svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
Clean up.

/linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function
 ‘xprt_rdma_bc_allocate’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning:
 variable ‘rdma’ set but not used [-Wunused-but-set-variable]
  struct svcxprt_rdma *rdma;
                       ^

Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:14 -05:00
Chuck Lever 96a58f9c19 svcrdma: Remove svc_rdma_op_ctxt::wc_status
Clean up: Completion status is already reported in the individual
completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:14 -05:00
Chuck Lever dd6fd213b0 svcrdma: Remove DMA map accounting
Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.

However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.

A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:13 -05:00
Chuck Lever e4eb42cecc svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.

This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.

Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:13 -05:00
Chuck Lever 5fdca65314 svcrdma: Renovate sendto chunk list parsing
The current sendto code appears to support clients that provide only
one of a Read list, a Write list, or a Reply chunk. My reading of
that code is that it doesn't support the following cases:

 - Read list + Write list
 - Read list + Reply chunk
 - Write list + Reply chunk
 - Read list + Write list + Reply chunk

The protocol allows more than one Read or Write chunk in those
lists. Some clients do send a Read list and Reply chunk
simultaneously. NFSv4 WRITE uses a Read list for the data payload,
and a Reply chunk because the GETATTR result in the reply can
contain a large object like an ACL.

Generalize one of the sendto code paths needed to support all of
the above cases, and attempt to ensure that only one pass is done
through the RPC Call's transport header to gather chunk list
information for building the reply.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:12 -05:00
Chuck Lever 4d712ef1db svcauth_gss: Close connection when dropping an incoming message
S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:11 -05:00
Chuck Lever 1b9f700b8c svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
Logic copied from xs_setup_bc_tcp().

Fixes: 39a9beab5a ('rpc: share one xps between all backchannels')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30 17:31:11 -05:00
Lorenzo Colitti d109e61bfe net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
Commit e2d118a1cb ("net: inet: Support UID-based routing in IP
protocols.") made __build_flow_key call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Fixes: e2d118a1cb ("net: inet: Support UID-based routing in IP protocols.")
Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:53:59 -05:00
Hongxu Jia 17a49cd549 netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel
Since 09d9686047 ("netfilter: x_tables: do compat validation via
translate_table"), it used compatr structure to assign newinfo
structure.  In translate_compat_table of ip_tables.c and ip6_tables.c,
it used compatr->hook_entry to replace info->hook_entry and
compatr->underflow to replace info->underflow, but not do the same
replacement in arp_tables.c.

It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit
kernel.
--------------------------------------
root@qemux86-64:~# arptables -P INPUT ACCEPT
root@qemux86-64:~# arptables -P INPUT ACCEPT
ERROR: Policy for `INPUT' offset 448 != underflow 0
arptables: Incompatible with this kernel
--------------------------------------

Fixes: 09d9686047 ("netfilter: x_tables: do compat validation via translate_table")
Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-30 20:50:23 +01:00
Guillaume Nault 31e2f21fb3 l2tp: fix address test in __l2tp_ip6_bind_lookup()
The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
matches if addr is NULL or if addr != laddr.
But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with
the same address, so the ipv6_addr_equal() condition needs to be
inverted.

For better clarity and consistency with the rest of the expression, the
(!X || X == Y) notation is used instead of !(X && X != Y).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault df90e68861 l2tp: fix lookup for sockets not bound to a device in l2tp_ip
When looking up an l2tp socket, we must consider a null netdevice id as
wild card. There are currently two problems caused by
__l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0:

  * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
    never receives any packet. Since __l2tp_ip_bind_lookup() is called
    with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
    different from 'dif' so the socket doesn't match.

  * Two sockets, one bound to a device but not the other, can be bound
    to the same address. If the first socket binding to the address is
    the one that is also bound to a device, the second socket can bind
    to the same address without __l2tp_ip_bind_lookup() noticing the
    overlap.

To fix this issue, we need to consider that any null device index, be
it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
We also need to pass the input device index to __l2tp_ip_bind_lookup()
on reception so that sockets bound to a device never receive packets
from other devices.

This patch fixes l2tp_ip6 in the same way.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault d5e3a19093 l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
It's not enough to check for sockets bound to same address at the
beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
time, a socket with the same address could be bound before we take
the l2tp lock again.

This patch moves the lookup right before inserting the new socket, so
that no change can ever happen to the list between address lookup and
socket insertion.

Care is taken to avoid side effects on the socket in case of failure.
That is, modifications of the socket are done after the lookup, when
binding is guaranteed to succeed, and before releasing the l2tp lock,
so that concurrent lookups will always see fully initialised sockets.

For l2tp_ip, 'ret' is set to -EINVAL before checking the SOCK_ZAPPED
bit. Error code was mistakenly set to -EADDRINUSE on error by commit
32c231164b ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").
Using -EINVAL restores original behaviour.

For l2tp_ip6, the lookup is now always done with the correct bound
device. Before this patch, when binding to a link-local address, the
lookup was done with the original sk->sk_bound_dev_if, which was later
overwritten with addr->l2tp_scope_id. Lookup is now performed with the
final sk->sk_bound_dev_if value.

Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
already been checked at this point (this part of the code seems to have
been copy-pasted from net/ipv6/raw.c).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:08 -05:00
Guillaume Nault a3c18422a4 l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()
Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:07 -05:00
Guillaume Nault 0382a25af3 l2tp: lock socket before checking flags in connect()
Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:14:07 -05:00
Zhang Shengju 18502acd9a neigh: remove duplicate check for same neigh
Currently loop index 'idx' is used as the index in the neigh list of interest.
It's increased only when the neigh is dumped. It's not the absolute index in
the list. Because there is no info to record which neigh has already be scanned
by previous loop. This will cause the filtered out neighs to be scanned mulitple
times.

This patch make idx as the absolute index in the list, it will increase no matter
whether the neigh is filtered. This will prevent the above problem.

And this is in line with other dump functions.

v2:
 - take David Ahern's advice to do simple change

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 13:46:16 -05:00
Daniele Di Proietto f92a80a997 openvswitch: Fix skb leak in IPv6 reassembly.
If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it
means that we still have a reference to the skb.  We should free it
before returning from handle_fragments, as stated in the comment above.

Fixes: daaa7d647f ("netfilter: ipv6: avoid nf_iterate recursion")
CC: Florian Westphal <fw@strlen.de>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Joe Stringer <joe@ovn.org>
Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 11:00:45 -05:00
Daniel Borkmann 85de8576a0 bpf, xdp: allow to pass flags to dev_change_xdp_fd
Add an IFLA_XDP_FLAGS attribute that can be passed for setting up
XDP along with IFLA_XDP_FD, which eventually allows user space to
implement typical add/replace/delete logic for programs. Right now,
calling into dev_change_xdp_fd() will always replace previous programs.

When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more
graceful when requested by returning -EBUSY in case we try to
attach a new program, but we find that another one is already
attached. This will be used by upcoming front-end for iproute2 as
well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:27:20 -05:00
Francis Yan 1c885808e4 tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING
This patch exports the sender chronograph stats via the socket
SO_TIMESTAMPING channel. Currently we can instrument how long a
particular application unit of data was queued in TCP by tracking
SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
these sender chronograph stats exported simultaneously along with
these timestamps allow further breaking down the various sender
limitation.  For example, a video server can tell if a particular
chunk of video on a connection takes a long time to deliver because
TCP was experiencing small receive window. It is not possible to
tell before this patch without packet traces.

To prepare these stats, the user needs to set
SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
while requesting other SOF_TIMESTAMPING TX timestamps. When the
timestamps are available in the error queue, the stats are returned
in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:25 -05:00
Francis Yan efd9017416 tcp: export sender limits chronographs to TCP_INFO
This patch exports all the sender chronograph measurements collected
in the previous patches to TCP_INFO interface. Note that busy time
exported includes all the other sending limits (rwnd-limited,
sndbuf-limited). Internally the time unit is jiffy but externally
the measurements are in microseconds for future extensions.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:25 -05:00
Francis Yan b0f71bd3e1 tcp: instrument how long TCP is limited by insufficient send buffer
This patch measures the amount of time when TCP runs out of new data
to send to the network due to insufficient send buffer, while TCP
is still busy delivering (i.e. write queue is not empty). The goal
is to indicate either the send buffer autotuning or user SO_SNDBUF
setting has resulted network under-utilization.

The measurement starts conservatively by checking various conditions
to minimize false claims (i.e. under-estimation is more likely).
The measurement stops when the SOCK_NOSPACE flag is cleared. But it
does not account the time elapsed till the next application write.
Also the measurement only starts if the sender is still busy sending
data, s.t. the limit accounted is part of the total busy time.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:24 -05:00
Francis Yan 5615f88614 tcp: instrument how long TCP is limited by receive window
This patch measures the total time when the TCP stops sending because
the receiver's advertised window is not large enough. Note that
once the limit is lifted we are likely in the busy status if we
have data pending.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:24 -05:00
Francis Yan 0f87230d1a tcp: instrument how long TCP is busy sending
This patch measures TCP busy time, which is defined as the period
of time when sender has data (or FIN) to send. The time starts when
data is buffered and stops when the write queue is flushed by ACKs
or error events.

Note the busy time does not include SYN time, unless data is
included in SYN (i.e. Fast Open). It does include FIN time even
if the FIN carries no payload. Excluding pure FIN is possible but
would incur one additional test in the fast path, which may not
be worth it.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:24 -05:00
Francis Yan 05b055e891 tcp: instrument tcp sender limits chronographs
This patch implements the skeleton of the TCP chronograph
instrumentation on sender side limits:

	1) idle (unspec)
	2) busy sending data other than 3-4 below
	3) rwnd-limited
	4) sndbuf-limited

The limits are enumerated 'tcp_chrono'. Since a connection in
theory can idle forever, we do not track the actual length of this
uninteresting idle period. For the rest we track how long the sender
spends in each limit. At any point during the life time of a
connection, the sender must be in one of the four states.

If there are multiple conditions worthy of tracking in a chronograph
then the highest priority enum takes precedence over
the other conditions. So that if something "more interesting"
starts happening, stop the previous chrono and start a new one.

The time unit is jiffy(u32) in order to save space in tcp_sock.
This implies application must sample the stats no longer than every
49 days of 1ms jiffy.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:24 -05:00
vegard.nossum@oracle.com 5b3211dcd4 ieee802154: check device type
I've observed a NULL pointer dereference in ieee802154_del_iface() during
netlink fuzzing. It's the ->wpan_phy dereference here:

        phy = dev->ieee802154_ptr->wpan_phy;

My bet is that we're not checking that this is an IEEE802154 interface,
so let's do what ieee802154_nl_get_dev() is doing. (Maybe we should even
be calling this directly?)

Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Sergey Lapin <slapin@ossfans.org>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
2016-11-30 12:33:07 +01:00
Tobias Brunner a55e23864d esp6: Fix integrity verification when ESN are used
When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 000ae7b269 ("esp6: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 11:10:16 +01:00
Tobias Brunner 7c7fedd51c esp4: Fix integrity verification when ESN are used
When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 11:09:39 +01:00
Yi Zhao 83e2d0587a xfrm_user: fix return value from xfrm_user_rcv_msg
It doesn't support to run 32bit 'ip' to set xfrm objdect on 64bit host.
But the return value is unknown for user program:

ip xfrm policy list
RTNETLINK answers: Unknown error 524

Replace ENOTSUPP with EOPNOTSUPP:

ip xfrm policy list
RTNETLINK answers: Operation not supported

Signed-off-by: Yi Zhao <yi.zhao@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-30 10:58:53 +01:00
Johan Hovold 881eadabe7 net: dsa: slave: fix fixed-link phydev leaks
Make sure to deregister and free any fixed-link PHY registered using
of_phy_register_fixed_link() on slave-setup errors and on slave destroy.

Fixes: 0d8bcdd383 ("net: dsa: allow for more complex PHY setups")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Johan Hovold 3f65047c85 of_mdio: add helper to deregister fixed-link PHYs
Add helper to deregister fixed-link PHYs registered using
of_phy_register_fixed_link().

Convert the two drivers that care to deregister their fixed-link PHYs to
use the new helper, but note that most drivers currently fail to do so.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Johan Hovold 0d8f3c6715 net: dsa: slave: fix of-node leak and phy priority
Make sure to drop the reference taken by of_parse_phandle() before
returning from dsa_slave_phy_setup().

Note that this also modifies the PHY priority so that any fixed-link
node is only parsed when no phy-handle is given, which is in accordance
with the common scheme for this.

Fixes: 0d8bcdd383 ("net: dsa: allow for more complex PHY setups")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 23:17:02 -05:00
Arnaldo Carvalho de Melo a510887824 GSO: Reload iph after pskb_may_pull
As it may get stale and lead to use after free.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexander Duyck <aduyck@mirantis.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Fixes: cbc53e08a7 ("GSO: Add GSO type for fixed IPv4 ID")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:45:54 -05:00
Jiri Pirko 725cbb62e7 sched: cls_flower: remove from hashtable only in case skip sw flag is not set
Be symmetric to hashtable insert and remove filter from hashtable only
in case skip sw flag is not set.

Fixes: e69985c67c ("net/sched: cls_flower: Introduce support in SKIP SW flag")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:44:38 -05:00
Eric Dumazet 648f0c28df net/dccp: fix use-after-free in dccp_invalid_packet
pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
in dccp_invalid_packet() or risk use after free.

Bug found by Andrey Konovalov using syzkaller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 20:37:26 -05:00
Herbert Xu 707693c8a4 netlink: Call cb->done from a worker thread
The cb->done interface expects to be called in process context.
This was broken by the netlink RCU conversion.  This patch fixes
it by adding a worker struct to make the cb->done call where
necessary.

Fixes: 21e4902aea ("netlink: Lockless lookup with RCU grace...")
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 19:48:38 -05:00
Amir Vadai 95c2027bfe net/sched: pedit: make sure that offset is valid
Add a validation function to make sure offset is valid:
1. Not below skb head (could happen when offset is negative).
2. Validate both 'offset' and 'at'.

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 19:46:00 -05:00
Chuck Lever 3a72dc771c xprtrdma: Relocate connection helper functions
Clean up: Disentangle connection helpers from RPC-over-RDMA reply
decoding functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever c351f94387 xprtrdma: Update dprintk in rpcrdma_count_chunks
Clean up: offset and handle should be zero-filled, just like in the
chunk encoders.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 2f6922ca33 xprtrdma: Shorten QP access error message
Clean up: The convention for this type of warning message is not to
show the function name or "RPC:       ".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 6d6bf72de9 xprtrdma: Squelch "max send, max recv" messages at connect time
Clean up: This message was intended to be a dprintk, as it is on the
server-side.

Fixes: 87cfb9a0c8 ('xprtrdma: Client-side support for ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 289400af2b xprtrdma: Update documenting comment
Clean up: If reset fails, FRMRs are no longer abandoned, rather
they are released immediately. Update the comment to reflect this.

Fixes: 2ffc871a57 ('xprtrdma: Release orphaned MRs immediately')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever a100fda1a2 xprtrdma: Refactor FRMR invalidation
Clean up: After some recent updates, clarifications can be made to
the FRMR invalidation logic.

- Both the remote and local invalidation case mark the frmr INVALID,
  so make that a common path.

- Manage the WR list more "tastefully" by replacing the conditional
  that discriminates between the list head and ->next pointers.

- Use mw->mw_handle in all cases, since that has the same value as
  f->fr_mr->rkey, and is already in cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 48016dce46 xprtrdma: Avoid calls to ro_unmap_safe()
Micro-optimization: Most of the time, calls to ro_unmap_safe are
expensive no-ops. Call only when there is work to do.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 109b88ab9d xprtrdma: Address coverity complaint about wait_for_completion()
> ** CID 114101:  Error handling issues  (CHECKED_RETURN)
> /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()

Commit 5675add36e ("RPC/RDMA: harden connection logic against
missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
with these two call sites.

The original wait_for_completion() calls were added in the initial
commit of verbs.c, which was commit c56c65fb67 ("RPCRDMA: rpc rdma
verbs interface implementation"), but these returned void.

rpcrdma_create_id() is called by the RDMA connect worker, which
probably won't ever be interrupted. It is also called by
rpcrdma_ia_open which is in the synchronous mount path, and ^C is
possible there.

Add a bit of logic at those two call sites to return if the waits
return ERESTARTSYS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever ae09531d3c SUNRPC: Proper metric accounting when RPC is not transmitted
I noticed recently that during an xfstests on a krb5i mount, the
retransmit count for certain operations had gone negative, and the
backlog value became unreasonably large. I recall that Andy has
pointed this out to me in the past.

When call_refresh fails to find a valid credential for an RPC, the
RPC exits immediately without sending anything on the wire. This
leaves rq_ntrans, rq_xtime, and rq_rtt set to zero.

The solution for om_queue is to not add the to RPC's running backlog
queue total whenever rq_xtime is zero.

For om_ntrans, it's a bit more difficult. A zero rq_ntrans causes
om_ops to become larger than om_ntrans. The design of the RPC
metrics API assumes that ntrans will always be equal to or larger
than the ops count. The result is that when an RPC fails to find
credentials, the RPC operation's reported retransmit count, which is
computed in user space as the difference between ops and ntrans,
goes negative.

Ideally the kernel API should report a separate retransmit and
"exited before initial transmission" metric, so that user space can
sort out the difference properly.

To avoid kernel API changes and changes to the way rq_ntrans is used
when performing transport locking, account for untransmitted RPCs
so that om_ntrans keeps up with om_ops: always add one or more to
om_ntrans.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 5e9fc6a06b xprtrdma: Support for SG_GAP devices
Some devices (such as the Mellanox CX-4) can register, under a
single R_key, a set of memory regions that are not contiguous. When
this is done, all the segments in a Reply list, say, can then be
invalidated in a single LocalInv Work Request (or via Remote
Invalidation, which can invalidate exactly one R_key when completing
a Receive).

This means a single FastReg WR is used to register, and one or zero
LocalInv WRs can invalidate, the memory involved with RDMA transfers
on behalf of an RPC.

In addition, xprtrdma constructs some Reply chunks from three or
more segments. By registering them with SG_GAP, only one segment
is needed for the Reply chunk, allowing the whole chunk to be
invalidated remotely.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 8d38de6564 xprtrdma: Make FRWR send queue entry accounting more accurate
Verbs providers may perform house-keeping on the Send Queue during
each signaled send completion. It is necessary therefore for a verbs
consumer (like xprtrdma) to occasionally force a signaled send
completion if it runs unsignaled most of the time.

xprtrdma does not require signaled completions for Send or FastReg
Work Requests, but does signal some LocalInv Work Requests. To
ensure that Send Queue house-keeping can run before the Send Queue
is more than half-consumed, xprtrdma forces a signaled completion
on occasion by counting the number of Send Queue Entries it
consumes. It currently does this by counting each ib_post_send as
one Entry.

Commit c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
introduced the ability for frwr_op_unmap_sync to post more than one
Work Request with a single post_send. Thus the underlying assumption
of one Send Queue Entry per ib_post_send is no longer true.

Also, FastReg Work Requests are currently never signaled. They
should be signaled once in a while, just as Send is, to keep the
accounting of consumed SQEs accurate.

While we're here, convert the CQCOUNT macros to the currently
preferred kernel coding style, which is inline functions.

Fixes: c9918ff56d ("xprtrdma: Add ro_unmap_sync method for FRWR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Chuck Lever 62aee0e302 xprtrdma: Cap size of callback buffer resources
When the inline threshold size is set to large values (say, 32KB)
any NFSv4.1 CB request from the server gets a reply with status
NFS4ERR_RESOURCE.

Looks like there are some upper layer assumptions about the maximum
size of a reply (for example, in process_op). Cap the size of the
NFSv4 client's reply resources at a page.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 16:45:44 -05:00
Florian Westphal 9b57da0630 netfilter: ipv6: nf_defrag: drop mangled skb on ream error
Dmitry Vyukov reported GPF in network stack that Andrey traced down to
negative nh offset in nf_ct_frag6_queue().

Problem is that all network headers before fragment header are pulled.
Normal ipv6 reassembly will drop the skb when errors occur further down
the line.

netfilter doesn't do this, and instead passed the original fragment
along.  That was also fine back when netfilter ipv6 defrag worked with
cloned fragments, as the original, pristine fragment was passed on.

So we either have to undo the pull op, or discard such fragments.
Since they're malformed after all (e.g. overlapping fragment) it seems
preferrable to just drop them.

Same for temporary errors -- it doesn't make sense to accept (and
perhaps forward!) only some fragments of same datagram.

Fixes: 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Andrey Konovalov <andreyknvl@google.com>
Diagnosed-by: Eric Dumazet <Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-29 20:23:58 +01:00
Nikita Yushchenko 7a99cd6e21 net: dsa: fix unbalanced dsa_switch_tree reference counting
_dsa_register_switch() gets a dsa_switch_tree object either via
dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref
in returned object (resulting into caller not owning a reference),
while later path does create a new object (resulting into caller owning
a reference).

The rest of _dsa_register_switch() assumes that it owns a reference, and
calls dsa_put_dst().

This causes a memory breakage if first switch in the tree initialized
successfully, but second failed to initialize. In particular, freed
dsa_swith_tree object is left referenced by switch that was initialized,
and later access to sysfs attributes of that switch cause OOPS.

To fix, need to add kref_get() call to dsa_get_dst().

Fixes: 83c0afaec7 ("net: dsa: Add new binding implementation")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 16:15:45 -05:00
David Ahern 79dc7e3f1c net: handle no dst on skb in icmp6_send
Andrey reported the following while fuzzing the kernel with syzkaller:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800666d4200 task.stack: ffff880067348000
RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
Stack:
 ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
 ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
 ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
Call Trace:
 [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
 [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
 [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
 [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
 [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
 ...

icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
cases the dst->dev should be preferred for determining the L3 domain
if the dst has been set on the skb. Fallback to the skb->dev if it has
not. This covers the case reported here where icmp6_send is invoked on
Rx before the route lookup.

Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 16:13:01 -05:00
Julian Wollrath 4df21dfcf2 tcp: Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is set
Signed-off-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 12:15:00 -05:00
Sebastian Andrzej Siewior 9c6bafab03 net/iucv: Use explicit clean up labels in iucv_init()
Ursula suggested to use explicit labels for clean up in the error path
instead of one `out_free' label, which handles multiple exits, introduced
in commit 38b482929e ("net/iucv: Convert to hotplug state machine").

Suggested-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20161124161013.dukr42y2nwscosk6@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-28 17:29:04 +01:00
Daniel Borkmann d936377414 net, sched: respect rcu grace period on cls destruction
Roi reported a crash in flower where tp->root was NULL in ->classify()
callbacks. Reason is that in ->destroy() tp->root is set to NULL via
RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
this doesn't respect RCU grace period for them, and as a result, still
outstanding readers from tc_classify() will try to blindly dereference
a NULL tp->root.

The tp->root object is strictly private to the classifier implementation
and holds internal data the core such as tc_ctl_tfilter() doesn't know
about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
is only checked for NULL in ->get() callback, but nowhere else. This is
misleading and seemed to be copied from old classifier code that was not
cleaned up properly. For example, d3fa76ee6b ("[NET_SCHED]: cls_basic:
fix NULL pointer dereference") moved tp->root initialization into ->init()
routine, where before it was part of ->change(), so ->get() had to deal
with tp->root being NULL back then, so that was indeed a valid case, after
d3fa76ee6b, not really anymore. We used to set tp->root to NULL long
ago in ->destroy(), see 47a1a1d4be ("pkt_sched: remove unnecessary xchg()
in packet classifiers"); but the NULLifying was reintroduced with the
RCUification, but it's not correct for every classifier implementation.

In the cases that are fixed here with one exception of cls_cgroup, tp->root
object is allocated and initialized inside ->init() callback, which is always
performed at a point in time after we allocate a new tp, which means tp and
thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
handler, same for the tp which is kfree_rcu()'ed right when we return
from ->destroy() in tcf_destroy(). This means, the head object's lifetime
for such classifiers is always tied to the tp lifetime. The RCU callback
invocation for the two kfree_rcu() could be out of order, but that's fine
since both are independent.

Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
means that 1) we don't need a useless NULL check in fast-path and, 2) that
outstanding readers of that tp in tc_classify() can still execute under
respect with RCU grace period as it is actually expected.

Things that haven't been touched here: cls_fw and cls_route. They each
handle tp->root being NULL in ->classify() path for historic reasons, so
their ->destroy() implementation can stay as is. If someone actually
cares, they could get cleaned up at some point to avoid the test in fast
path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
!head should anyone actually be using/testing it, so it at least aligns with
cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
destruction (to a sleepable context) after RCU grace period as concurrent
readers might still access it. (Note that in this case we need to hold module
reference to keep work callback address intact, since we only wait on module
unload for all call_rcu()s to finish.)

This fixes one race to bring RCU grace period guarantees back. Next step
as worked on by Cong however is to fix 1e052be69d ("net_sched: destroy
proto tp when all filters are gone") to get the order of unlinking the tp
in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
RCU_INIT_POINTER() before tcf_destroy() and let the notification for
removal be done through the prior ->delete() callback. Both are independant
issues. Once we have that right, we can then clean tp->root up for a number
of classifiers by not making them RCU pointers, which requires a new callback
(->uninit) that is triggered from tp's RCU callback, where we just kfree()
tp->root from there.

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-28 10:47:35 -05:00
Daniel Borkmann c491680f8f bpf: reuse dev_is_mac_header_xmit for redirect
Commit dcf800344a ("net/sched: act_mirred: Refactor detection whether
dev needs xmit at mac header") added dev_is_mac_header_xmit(); since it's
also useful elsewhere, move it to if_arp.h and reuse it for BPF.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Daniel Borkmann 55556dd59d bpf: drop useless bpf_fd member from cls/act
After setup we don't need to keep user space fd number around anymore, as
it also has no useful meaning for anyone, just remove it.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Jon Paul Maloy 9590112241 tipc: fix link statistics counter errors
In commit e4bf4f7696 ("tipc: simplify packet sequence number
handling") we changed the internal representation of the packet
sequence number counters from u32 to u16, reflecting what is really
sent over the wire.

Since then some link statistics counters have been displaying incorrect
values, partially because the counters meant to be used as sequence
number snapshots are now used as direct counters, stored as u32, and
partially because some counter updates are just missing in the code.

In this commit we correct this in two ways. First, we base the
displayed packet sent/received values on direct counters instead
of as previously a calculated difference between current sequence
number and a snapshot. Second, we add the missing updates of the
counters.

This change is compatible with the current netlink API, and requires
no changes to the user space tools.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:35:55 -05:00
David S. Miller 33f8a0458b wireless-drivers-next patches for 4.10
Major changes:
 
 iwlwifi
 
 * finalize and enable dynamic queue allocation
 * use dev_coredumpmsg() to prevent locking the driver
 * small fix to pass the AID to the FW
 * use FW PS decisions with multi-queue
 
 ath9k
 
 * add device tree bindings
 * switch to use mac80211 intermediate software queues to reduce
   latency and fix bufferbloat
 
 wl18xx
 
 * allow scanning in AP mode
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJYOAC4AAoJEG4XJFUm622bUYkH/3SSYp6moSdKpVnVPx7ST7yK
 t9WHR9IMZFIhD6vq8AK6+8OQr1TgGjHfPu+WZj7CIl8nu53kcgPRi51gg1mndbNg
 9N3RbVp06nGbM2VnW8ZIpg3OLIXatZ4c9g3LFvvtyobYvWGJ6W4D79JdlmTG1ELr
 XAjInbxFsgon+CwqCMOaAJx8xYp42rBnPRZZvhOq9O33kRw8Umo9UQw0s1U2Vfgx
 prxQ6d0GxNAPEe8QiDw/vtBcXWFMOhQeDl8sK70ZcojSn1FY730NsIh/Y86PcQTK
 6TsvOL5gg+rd0ln8TZRAslnDrZBAhTEDqUzLQMRJ9VjEj5RFd8eLCSIzHfaroI8=
 =4qCH
 -----END PGP SIGNATURE-----

Merge tag 'wireless-drivers-next-for-davem-2016-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.10

Major changes:

iwlwifi

* finalize and enable dynamic queue allocation
* use dev_coredumpmsg() to prevent locking the driver
* small fix to pass the AID to the FW
* use FW PS decisions with multi-queue

ath9k

* add device tree bindings
* switch to use mac80211 intermediate software queues to reduce
  latency and fix bufferbloat

wl18xx

* allow scanning in AP mode
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:26:59 -05:00
David S. Miller 8eb4adf60b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-11-25

1) Fix a refcount leak in vti6.
   From Nicolas Dichtel.

2) Fix a wrong if statement in xfrm_sk_policy_lookup.
   From Florian Westphal.

3) The flowcache watermarks are per cpu. Take this into
   account when comparing to the threshold where we
   refusing new allocations. From Miroslav Urbanek.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:21:48 -05:00
Johan Hovold fd05d7b18c net: dsa: fix fixed-link-phy device leaks
Make sure to drop the reference taken by of_phy_find_device() when
registering and deregistering the fixed-link PHY-device.

Fixes: 39b0c70519 ("net: dsa: Allow configuration of CPU & DSA port
speeds/duplex")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:01:15 -05:00
David S. Miller 0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Jon Paul Maloy 6998cc6ec2 tipc: resolve connection flow control compatibility problem
In commit 10724cc7bb ("tipc: redesign connection-level flow control")
we replaced the previous message based flow control with one based on
1k blocks. In order to ensure backwards compatibility the mechanism
falls back to using message as base unit when it senses that the peer
doesn't support the new algorithm. The default flow control window,
i.e., how many units can be sent before the sender blocks and waits
for an acknowledge (aka advertisement) is 512. This was tested against
the previous version, which uses an acknowledge frequency of on ack per
256 received message, and found to work fine.

However, we missed the fact that versions older than Linux 3.15 use an
acknowledge frequency of 512, which is exactly the limit where a 4.6+
sender will stop and wait for acknowledge. This would also work fine if
it weren't for the fact that if the first sent message on a 4.6+ server
side is an empty SYNACK, this one is also is counted as a sent message,
while it is not counted as a received message on a legacy 3.15-receiver.
This leads to the sender always being one step ahead of the receiver, a
scenario causing the sender to block after 512 sent messages, while the
receiver only has registered 511 read messages. Hence, the legacy
receiver is not trigged to send an acknowledge, with a permanently
blocked sender as result.

We solve this deadlock by simply allowing the sender to send one more
message before it blocks, i.e., by a making minimal change to the
condition used for determining connection congestion.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 21:38:16 -05:00
Miroslav Lichvar 8006f6bf5e net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGS
The ETHTOOL_GLINKSETTINGS command is deprecating the ETHTOOL_GSET
command and likewise it shouldn't require the CAP_NET_ADMIN capability.

Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:23:30 -05:00
Jon Paul Maloy d876a4d2af tipc: improve sanity check for received domain records
In commit 35c55c9877 ("tipc: add neighbor monitoring framework") we
added a data area to the link monitor STATE messages under the
assumption that previous versions did not use any such data area.

For versions older than Linux 4.3 this assumption is not correct. In
those version, all STATE messages sent out from a node inadvertently
contain a 16 byte data area containing a string; -a leftover from
previous RESET messages which were using this during the setup phase.
This string serves no purpose in STATE messages, and should no be there.

Unfortunately, this data area is delivered to the link monitor
framework, where a sanity check catches that it is not a correct domain
record, and drops it. It also issues a rate limited warning about the
event.

Since such events occur much more frequently than anticipated, we now
choose to remove the warning in order to not fill the kernel log with
useless contents. We also make the sanity check stricter, to further
reduce the risk that such data is inavertently admitted.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
Jon Paul Maloy f79675563a tipc: fix compatibility bug in link monitoring
commit 817298102b ("tipc: fix link priority propagation") introduced a
compatibility problem between TIPC versions newer than Linux 4.6 and
those older than Linux 4.4. In versions later than 4.4, link STATE
messages only contain a non-zero link priority value when the sender
wants the receiver to change its priority. This has the effect that the
receiver resets itself in order to apply the new priority. This works
well, and is consistent with the said commit.

However, in versions older than 4.4 a valid link priority is present in
all sent link STATE messages, leading to cyclic link establishment and
reset on the 4.6+ node.

We fix this by adding a test that the received value should not only
be valid, but also differ from the current value in order to cause the
receiving link endpoint to reset.

Reported-by: Amar Nv <amar.nv005@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
Eric Dumazet f52dffe049 net: properly flush delay-freed skbs
Typical NAPI drivers use napi_consume_skb(skb) at TX completion time.
This put skb in a percpu special queue, napi_alloc_cache, to get bulk
frees.

It turns out the queue is not flushed and hits the NAPI_SKB_CACHE_SIZE
limit quite often, with skbs that were queued hundreds of usec earlier.
I measured this can take ~6000 nsec to perform one flush.

__kfree_skb_flush() can be called from two points right now :

1) From net_tx_action(), but only for skbs that were queued to
sd->completion_queue.

 -> Irrelevant for NAPI drivers in normal operation.

2) From net_rx_action(), but only under high stress or if RPS/RFS has a
pending action.

This patch changes net_rx_action() to perform the flush in all cases and
after more urgent operations happened (like kicking remote CPUS for
RPS/RFS).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 19:37:49 -05:00
Daniel Mack 33b486793c net: ipv4, ipv6: run cgroup eBPF egress programs
If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:26:04 -05:00
Daniel Mack c11cd3a6ec net: filter: run cgroup eBPF ingress programs
If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:26:04 -05:00
Daniel Mack 0e33661de4 bpf: add new prog type for cgroup socket filtering
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
David S. Miller fb09c8c524 linux-can-fixes-for-4.9-20161123
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEES2FAuYbJvAGobdVQPTuqJaypJWoFAlg1ppETHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRA9O6olrKklaqieB/4o9PaD8Rj38Piy7lJW1ahAxoUnY4AA
 Vu1eFvtidUswO5RV6mDOuqhTulzrMcPQZguW/S7eLZh6hWYVVLlgkrNLj/RpMXsH
 rqGRC/sL5ICL1q/ijYK6NJJ3+GFQhl92gG+wJxsQfETWVDKH13N3sWcEyBh0+C5P
 lnPFNVDVSy4bpkEgXAN/sfAvoHzW//34cnxTzlsd1COAWlxZ+HHgBAGp4kaYTpbF
 Vz3kuNPfDI7U+36quE8SUXe/R9HfqQBtfbFtaxha8vqH8Fw6MJYO0BUJVmtawTDq
 nBFvB/x+d0n1YeOgo5UD5bW9thItF57GEscWqYpTuhZ0jlPr5+CZeo14
 =FImK
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.9-20161123' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2016-11-23

this is a pull request for net/master.

The patch by Oliver Hartkopp for the broadcast manager (bcm) fixes the
CAN-FD support, which may cause an out-of-bounds access otherwise.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:17:12 -05:00
David S. Miller f9e154a0e6 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-11-23

Sorry about the late pull request for 4.9, but we have one more
important Bluetooth patch that should make it to the release. It fixes
connection creation for Bluetooth LE controllers that do not have a
public address (only a random one).

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:24:20 -05:00
Roman Mashak 19a8bb28d1 net sched filters: fix filter handle ID in tfilter_notify_chain()
Should pass valid filter handle, not the netlink flags.

Fixes: 30a391a13a ("net sched filters: pass netlink message flags in event notification")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:05:58 -05:00
Florian Fainelli 4b65246b42 ethtool: Protect {get, set}_phy_tunable with PHY device mutex
PHY drivers should be able to rely on the caller of {get,set}_tunable to
have acquired the PHY device mutex, in order to both serialize against
concurrent calls of these functions, but also against PHY state machine
changes. All ethtool PHY-level functions do this, except
{get,set}_tunable, so we make them consistent here as well.

We need to update the Microsemi PHY driver in the same commit to avoid
introducing either deadlocks, or lack of proper locking.

Fixes: 968ad9da7e ("ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE")
Fixes: 310d9ad57a ("net: phy: Add downshift get/set support in Microsemi PHYs driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:02:32 -05:00
Roi Dayan 59bfde01fa devlink: Add E-Switch inline mode control
Some HWs need the VF driver to put part of the packet headers on the
TX descriptor so the e-switch can do proper matching and steering.

The supported modes: none, link, network, transport.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:01:14 -05:00
Or Gerlitz 3df5b3c675 net: Add net-device param to the get offloaded stats ndo
Some drivers would need to check few internal matters for
that. To be used in downstream mlx5 commit.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 16:01:14 -05:00
Eric Dumazet f8071cde78 tcp: enhance tcp_collapse_retrans() with skb_shift()
In commit 2331ccc5b3 ("tcp: enhance tcp collapsing"),
we made a first step allowing copying right skb to left skb head.

Since all skbs in socket write queue are headless (but possibly the very
first one), this strategy often does not work.

This patch extends tcp_collapse_retrans() to perform frag shifting,
thanks to skb_shift() helper.

This helper needs to not BUG on non headless skbs, as callers are ok
with that.

Tested:

Following packetdrill test now passes :

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 write(4, ..., 200) = 200
   +0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1301:1401(100) ack 1

+.099 < . 1:1(0) ack 201 win 257
+.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401>
   +0 > P. 201:1001(800) ack 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 15:40:42 -05:00
Eric Dumazet 30c7be26fd udplite: call proper backlog handlers
In commits 93821778de ("udp: Fix rcv socket locking") and
f7ad74fef3 ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
__udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
was forgotten.

This leads to crashes if UDPlite header is pulled twice, which happens
starting from commit e6afc8ace6 ("udp: remove headers from UDP packets
before queueing")

Bug found by syzkaller team, thanks a lot guys !

Note that backlog use in UDP/UDPlite is scheduled to be removed starting
from linux-4.10, so this patch is only needed up to linux-4.9

Fixes: 93821778de ("udp: Fix rcv socket locking")
Fixes: f7ad74fef3 ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 15:32:14 -05:00
Paolo Abeni 764d3be6e4 ipv6: bump genid when the IFA_F_TENTATIVE flag is clear
When an ipv6 address has the tentative flag set, it can't be
used as source for egress traffic, while the associated route,
if any, can be looked up and even stored into some dst_cache.

In the latter scenario, the source ipv6 address selected and
stored in the cache is most probably wrong (e.g. with
link-local scope) and the entity using the dst_cache will
experience lack of ipv6 connectivity until said cache is
cleared or invalidated.

Overall this may cause lack of connectivity over most IPv6 tunnels
(comprising geneve and vxlan), if the first egress packet reaches
the tunnel before the DaD is completed for the used ipv6
address.

This patch bumps a new genid after that the IFA_F_TENTATIVE flag
is cleared, so that dst_cache will be invalidated on
next lookup and ipv6 connectivity restored.

Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 12:04:10 -05:00
Stefan Hajnoczi b911682318 VSOCK: add loopback to virtio_transport
The VMware VMCI transport supports loopback inside virtual machines.
This patch implements loopback for virtio-vsock.

Flow control is handled by the virtio-vsock protocol as usual.  The
sending process stops transmitting on a connection when the peer's
receive buffer space is exhausted.

Cathy Avery <cavery@redhat.com> noticed this difference between VMCI and
virtio-vsock when a test case using loopback failed.  Although loopback
isn't the main point of AF_VSOCK, it is useful for testing and
virtio-vsock must match VMCI semantics so that userspace programs run
regardless of the underlying transport.

My understanding is that loopback is not supported on the host side with
VMCI.  Follow that by implementing it only in the guest driver, not the
vhost host driver.

Cc: Jorgen Hansen <jhansen@vmware.com>
Reported-by: Cathy Avery <cavery@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-24 11:53:15 -05:00
Liping Zhang 49cdc4c749 netfilter: nft_range: add the missing NULL pointer check
Otherwise, kernel panic will happen if the user does not specify
the related attributes.

Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 14:43:35 +01:00
Anders K. Pedersen d3e2a1110c netfilter: nf_tables: fix inconsistent element expiration calculation
As Liping Zhang reports, after commit a8b1e36d0d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.

Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.

Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.

Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.

This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.

Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 14:43:34 +01:00
Florian Westphal 7223ecd466 netfilter: nat: switch to new rhlist interface
I got offlist bug report about failing connections and high cpu usage.
This happens because we hit 'elasticity' checks in rhashtable that
refuses bucket list exceeding 16 entries.

The nat bysrc hash unfortunately needs to insert distinct objects that
share same key and are identical (have same source tuple), this cannot
be avoided.

Switch to the rhlist interface which is designed for this.

The nulls_base is removed here, I don't think its needed:

A (unlikely) false positive results in unneeded port clash resolution,
a false negative results in packet drop during conntrack confirmation,
when we try to insert the duplicate into main conntrack hash table.

Tested by adding multiple ip addresses to host, then adding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

... and then creating multiple connections, from same source port but
different addresses:

for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null  & done

(all of these then get hashed to same bysource slot)

Then, to test that nat conflict resultion is working:

nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
nc -s 10.0.0.2 -p 1234 192.168.7.1 2000

tcp  .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
tcp  .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
[..]

-> nat altered source ports to 1024 and 1025, respectively.
This can also be confirmed on destination host which shows
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1024
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1025
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1234

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: 870190a9ec ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 14:43:34 +01:00
Florian Westphal 728e87b496 netfilter: nat: fix cmp return value
The comparator works like memcmp, i.e. 0 means objects are equal.
In other words, when objects are distinct they are treated as identical,
when they are distinct they are allegedly the same.

The first case is rare (distinct objects are unlikely to get hashed to
same bucket).

The second case results in unneeded port conflict resolutions attempts.

Fixes: 870190a9ec ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 14:43:33 +01:00
Laura Garcia Liebana abd66e9f3c netfilter: nft_hash: validate maximum value of u32 netlink hash attribute
Use the function nft_parse_u32_check() to fetch the value and validate
the u32 attribute into the hash len u8 field.

This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").

Fixes: cb1b69b0b1 ("netfilter: nf_tables: add hash expression")
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 14:40:03 +01:00
David Ahern 00b4422fe3 netfilter: Update nf_send_reset6 to consider L3 domain
nf_send_reset6 is not considering the L3 domain and lookups are sent
to the wrong table. For example consider the following output rule:

ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset

using perf to analyze lookups via the fib6_table_lookup tracepoint shows:

swapper     0 [001]   248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1:
        ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms])
        ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms])
        ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms])
        ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms])
        ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms])
        ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms])
        ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms])
        ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms])
                     528 nf_send_reset6 ([nf_reject_ipv6])

The lookup is directed to table 255 rather than the table associated with
the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain
from the dst currently attached to the skb.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 12:47:08 +01:00
David Ahern 6d8b49c3a3 netfilter: Update ip_route_me_harder to consider L3 domain
ip_route_me_harder is not considering the L3 domain and sending lookups
to the wrong table. For example consider the following output rule:

iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset

using perf to analyze lookups via the fib_table_lookup tracepoint shows:

vrf-test  1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
        ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
        ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
        ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
        ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
        ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])

and

vrf-test  1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
        ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
        ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
        ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
        ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
        ffffffff81499758 __fib_lookup ([kernel.kallsyms])
        ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
        ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
        ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
        ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])

In both cases the lookups are directed to table 255 rather than the
table associated with the device via the L3 domain. Update both
lookups to pull the L3 domain from the dst currently attached to the
skb.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-24 12:44:36 +01:00
WANG Cong a4cd0271ea net: revert "net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit"
This reverts commit 7c6ae610a1, because l2tp_xmit_skb() never
returns NET_XMIT_CN, it ignores the return value of l2tp_xmit_core().

Cc: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-23 20:18:36 -05:00
Zhang Shengju 93af205656 rtnetlink: fix the wrong minimal dump size getting from rtnl_calcit()
For RT netlink, calcit() function should return the minimal size for
netlink dump message. This will make sure that dump message for every
network device can be stored.

Currently, rtnl_calcit() function doesn't account the size of header of
netlink message, this patch will fix it.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-23 20:18:36 -05:00
Oliver Hartkopp 5499a6b22e can: bcm: fix support for CAN FD frames
Since commit 6f3b911d5f ("can: bcm: add support for CAN FD frames") the
CAN broadcast manager supports CAN and CAN FD data frames.

As these data frames are embedded in struct can[fd]_frames which have a
different length the access to the provided array of CAN frames became
dependend of op->cfsiz. By using a struct canfd_frame pointer for the array of
CAN frames the new offset calculation based on op->cfsiz was accidently applied
to CAN FD frame element lengths.

This fix makes the pointer to the arrays of the different CAN frame types a
void pointer so that the offset calculation in bytes accesses the correct CAN
frame elements.

Reference: http://marc.info/?l=linux-netdev&m=147980658909653

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-11-23 15:22:18 +01:00
Miroslav Urbanek 6b22648781 flowcache: Increase threshold for refusing new allocations
The threshold for OOM protection is too small for systems with large
number of CPUs. Applications report ENOBUFs on connect() every 10
minutes.

The problem is that the variable net->xfrm.flow_cache_gc_count is a
global counter while the variable fc->high_watermark is a per-CPU
constant. Take the number of CPUs into account as well.

Fixes: 6ad3122a08 ("flowcache: Avoid OOM condition under preasure")
Reported-by: Lukáš Koldrt <lk@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-23 06:37:09 +01:00
Sebastian Andrzej Siewior 38b482929e net/iucv: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke the
callbacks on the already online CPUs. The smp function calls in the
online/downprep callbacks are not required as the callback is guaranteed to
be invoked on the upcoming/outgoing cpu.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: rt@linuxtronix.de
Link: http://lkml.kernel.org/r/20161117183541.8588-13-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 23:34:40 +01:00
Johan Hedberg 39385cb5f3 Bluetooth: Fix using the correct source address type
The hci_get_route() API is used to look up local HCI devices, however
so far it has been incapable of dealing with anything else than the
public address of HCI devices. This completely breaks with LE-only HCI
devices that do not come with a public address, but use a static
random address instead.

This patch exteds the hci_get_route() API with a src_type parameter
that's used for comparing with the right address of each HCI device.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-11-22 22:50:46 +01:00
Eric Dumazet c9b8af1330 flow_dissect: call init_default_flow_dissectors() earlier
Andre Noll reported panics after my recent fix (commit 34fad54c25
"net: __skb_flow_dissect() must cap its return value")

After some more headaches, Alexander root caused the problem to
init_default_flow_dissectors() being called too late, in case
a network driver like IGB is not a module and receives DHCP message
very early.

Fix is to call init_default_flow_dissectors() much earlier,
as it is a core infrastructure and does not depend on another
kernel service.

Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andre Noll <maan@tuebingen.mpg.de>
Diagnosed-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 14:44:01 -05:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Linus Torvalds 27e7ab99db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

 2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
    CAVALLARO.

 4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

 5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

 6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

 7) If a driver does a napi_hash_del() explicitily and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

 8) Likewise, it is not necessary to invoke napi_hash_del() is we are
    also doing neif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
  tcp: zero ca_priv area when switching cc algorithms
  net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
  ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
  tipc: eliminate obsolete socket locking policy description
  rtnl: fix the loop index update error in rtnl_dump_ifinfo()
  l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  net: macb: add check for dma mapping error in start_xmit()
  rtnetlink: fix FDB size computation
  netns: fix get_net_ns_by_fd(int pid) typo
  af_unix: conditionally use freezable blocking calls in read
  net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
  net: ethernet: ti: cpsw: add missing sanity check
  net: ethernet: ti: cpsw: fix secondary-emac probe error path
  net: ethernet: ti: cpsw: fix of_node and phydev leaks
  net: ethernet: ti: cpsw: fix deferred probe
  net: ethernet: ti: cpsw: fix mdio device reference leak
  net: ethernet: ti: cpsw: fix bad register access in probe error path
  net: sky2: Fix shutdown crash
  cfg80211: limit scan results cache size
  net sched filters: pass netlink message flags in event notification
  ...
2016-11-21 13:26:28 -08:00
Florian Westphal e97991832a tcp: make undo_cwnd mandatory for congestion modules
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.

It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:20:17 -05:00
Florian Westphal 85f7e7508a tcp: add cwnd_undo functions to various tcp cc algorithms
congestion control algorithms that do not halve cwnd in their .ssthresh
should provide a .cwnd_undo rather than rely on current fallback which
assumes reno halving (and thus doubles the cwnd).

All of these do 'something else' in their .ssthresh implementation, thus
store the cwnd on loss and provide .undo_cwnd to restore it again.

A followup patch will remove the fallback and all algorithms will
need to provide a .cwnd_undo function.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:20:17 -05:00
Nikolay Aleksandrov aa2ae3e71c bridge: mcast: add MLDv2 querier support
This patch adds basic support for MLDv2 queries, the default is MLDv1
as before. A new multicast option - multicast_mld_version, adds the
ability to change it between 1 and 2 via netlink and sysfs.
The MLD option is disabled if CONFIG_IPV6 is disabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:16:58 -05:00
Nikolay Aleksandrov 5e9235853d bridge: mcast: add IGMPv3 query support
This patch adds basic support for IGMPv3 queries, the default is IGMPv2
as before. A new multicast option - multicast_igmp_version, adds the
ability to change it between 2 and 3 via netlink and sysfs. The option
struct member is in a 4 byte hole in net_bridge.

There also a few minor style adjustments in br_multicast_new_group and
br_multicast_add_group.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:16:58 -05:00
Florian Westphal 7082c5c3f2 tcp: zero ca_priv area when switching cc algorithms
We need to zero out the private data area when application switches
connection to different algorithm (TCP_CONGESTION setsockopt).

When congestion ops get assigned at connect time everything is already
zeroed because sk_alloc uses GFP_ZERO flag.  But in the setsockopt case
this contains whatever previous cc placed there.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:13:56 -05:00
Gao Feng 7c6ae610a1 net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packe is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.
So l2tp_eth_dev_xmit should add the NET_XMIT_CN check.

Signed-off-by: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 13:10:29 -05:00
Eric Dumazet d21dbdfe0a udp: avoid one cache line miss in recvmsg()
UDP_SKB_CB(skb)->partial_cov is located at offset 66 in skb,
requesting a cold cache line being read in cpu cache.

We can avoid this cache line miss for UDP sockets,
as partial_cov has a meaning only for UDPLite.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 11:26:59 -05:00
Jon Paul Maloy 51b9a31c42 tipc: eliminate obsolete socket locking policy description
The comment block in socket.c describing the locking policy is
obsolete, and does not reflect current reality. We remove it in this
commit.

Since the current locking policy is much simpler and follows a
mainstream approach, we see no need to add a new description.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:15:41 -05:00
Zhang Shengju 3f0ae05d6f rtnl: fix the loop index update error in rtnl_dump_ifinfo()
If the link is filtered out, loop index should also be updated. If not,
loop index will not be correct.

Fixes: dc599f76c2 ("net: Add support for filtering link dump by master device and kind")
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:14:30 -05:00
Alexey Dobriyan c72d8cdaa5 net: fix bogus cast in skb_pagelen() and use unsigned variables
1) cast to "int" is unnecessary:
   u8 will be promoted to int before decrementing,
   small positive numbers fit into "int", so their values won't be changed
   during promotion.

   Once everything is int including loop counters, signedness doesn't
   matter: 32-bit operations will stay 32-bit operations.

   But! Someone tried to make this loop smart by making everything of
   the same type apparently in an attempt to optimise it.
   Do the optimization, just differently.
   Do the cast where it matters. :^)

2) frag size is unsigned entity and sum of fragments sizes is also
   unsigned.

Make everything unsigned, leave no MOVSX instruction behind.

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
	function                                     old     new   delta
	skb_cow_data                                 835     834      -1
	ip_do_fragment                              2549    2548      -1
	ip6_fragment                                3130    3128      -2
	Total: Before=154865032, After=154865028, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:11:25 -05:00
Alexey Dobriyan e0d7924a4a net: make struct napi_alloc_cache::skb_count unsigned int
size_t is way too much for an integer not exceeding 64.

Space savings: 10 bytes!

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-10 (-10)
	function                                     old     new   delta
	napi_consume_skb                             165     163      -2
	__kfree_skb_flush                             56      53      -3
	__kfree_skb_defer                             97      92      -5
	Total: Before=154865639, After=154865629, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:11:25 -05:00
Guillaume Nault 32c231164b l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.

BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
 ffff880031d97838 ffffffff829f835b ffff88001b5a1640 ffff8800081b0ec0
 ffff8800081b15a0 ffff8800081b6d20 ffff880031d97860 ffffffff8174d3cc
 ffff880031d978f0 ffff8800081b0e80 ffff88001b5a1640 ffff880031d978e0
Call Trace:
 [<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
 [<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
 [<     inline     >] print_address_description mm/kasan/report.c:194
 [<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
 [<     inline     >] kasan_report mm/kasan/report.c:303
 [<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
 [<     inline     >] __write_once_size ./include/linux/compiler.h:249
 [<     inline     >] __hlist_del ./include/linux/list.h:622
 [<     inline     >] hlist_del_init ./include/linux/list.h:637
 [<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
 [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
 [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
 [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
 [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
 [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
 [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
 [<ffffffff813774f9>] task_work_run+0xf9/0x170
 [<ffffffff81324aae>] do_exit+0x85e/0x2a00
 [<ffffffff81326dc8>] do_group_exit+0x108/0x330
 [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
 [<ffffffff811b49af>] do_signal+0x7f/0x18f0
 [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
 [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
 [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
 [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
 [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
 [ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
 [ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
 [ 1116.897025] [<     inline     >] slab_post_alloc_hook mm/slab.h:417
 [ 1116.897025] [<     inline     >] slab_alloc_node mm/slub.c:2708
 [ 1116.897025] [<     inline     >] slab_alloc mm/slub.c:2716
 [ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
 [ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
 [ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
 [ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
 [ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
 [ 1116.897025] [<     inline     >] sock_create net/socket.c:1193
 [ 1116.897025] [<     inline     >] SYSC_socket net/socket.c:1223
 [ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
 [ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
 [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
 [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
 [ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
 [ 1116.897025] [<     inline     >] slab_free_hook mm/slub.c:1352
 [ 1116.897025] [<     inline     >] slab_free_freelist_hook mm/slub.c:1374
 [ 1116.897025] [<     inline     >] slab_free mm/slub.c:2951
 [ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
 [ 1116.897025] [<     inline     >] sk_prot_free net/core/sock.c:1369
 [ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
 [ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
 [ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
 [ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
 [ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
 [ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
 [ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
 [ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
 [ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
 [ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
 [ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
 [ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
 [ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
 [ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
 [ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
 [ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
 [ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
 [ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
 [ 1116.897025] [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
 [ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Memory state around the buggy address:
 ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                    ^
 ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

==================================================================

The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table.

Fixes: c51ce49735 ("l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:09:21 -05:00
David S. Miller f463c99b20 This feature patchset includes the following changes:
- 6 patches adding functionality to detect a WiFi interface under
    other virtual interfaces, like VLANs. They introduce a cache for
    the detected the WiFi configuration to avoid RTNL locking in
    critical sections. Patches have been prepared by Marek Lindner
    and Sven Eckelmann
 
  - Enable automatic module loading for genl requests, by Sven Eckelmann
 
  - Fix a potential race condition on interface removal. This is not
    happening very often in practice, but requires bigger changes to fix,
    so we are sending this to net-next. By Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlgwVFYWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZRaD/9YkdTsT8N629/C2yCrfvt2Zjav
 xj9+sGtmIq5xtSLkJaiQilz+ua8dCt99TbuzB8c48xXn9O41kejtv6kPE/YzYWVN
 dkLSO7cpJpT20hAAKD54iRv8m+Ed9ozgTPZd+Lu28fjwDc+FAzdmM1gAKx21Wtk6
 SLmRWguA/ezN+FWWLv4HYtThWuOCVOpkhx8Zk7wuzT2PQbryXOIqQ5JOgaKqm1PE
 iHFhsleaHJ74qnr6UReIZ/g3h27+RPGvhXtAtfo+HEupW4FTZowGr7C6Sm9BCpmU
 yMQ1DGckNbg51hz+irJH7nGT+9y6UP/mKNduMOW37JcOF2YKyDvXr+A7C+3Nmv/0
 F+AoFrDKp6vRBTgyKYvuL8zMvDn5mwCh/436/jIbqRvrCVGJQUY1IsS1yK+kPldy
 b/XamVCKAzxzVTumIDz5UCOAxMqaJmhLbasSoMqLZum4fuEU/CAZblKc/2lz/2h2
 o4Jpka3aGwGSIB+vZC0cat1a3RYKesxKuUmEIU7ZTnySOpP8FoiEZuz2qQhhlfNm
 fdGnL0YydBO4yOBBmSoSmS64hfvfdwZv9yuXt2NABXJDSD6lfJKNT3MGDlB9phLM
 OVzO5tQqP9AWel2iFSXafqtuxdqGvv+eFv7PqLGNRqHEr691AIFw10qugx9g4Bul
 dqlpQ7CoVC16LrOaQw==
 =Pdlv
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161119' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature patchset includes the following changes:

 - 6 patches adding functionality to detect a WiFi interface under
   other virtual interfaces, like VLANs. They introduce a cache for
   the detected the WiFi configuration to avoid RTNL locking in
   critical sections. Patches have been prepared by Marek Lindner
   and Sven Eckelmann

 - Enable automatic module loading for genl requests, by Sven Eckelmann

 - Fix a potential race condition on interface removal. This is not
   happening very often in practice, but requires bigger changes to fix,
   so we are sending this to net-next. By Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 11:13:05 -05:00
David S. Miller adda306744 Here are two batman-adv bugfix patches:
- Revert a splat on disabling interface which created another problem,
    by Sven Eckelmann
 
  - Fix error handling when the primary interface disappears during a
    throughput meter test, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlgwL/AWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoaS5D/9kfo4tKbtdC5t9uQ/GhfE9eNCR
 aLpdoeUl73d/IgGXquAGtFphKsYlaORVa3E/1EnAqN+v9E0q0MZ7GtmFQdOhFKu9
 SOOuo5U+Qz9lmVv9PQlXEk1mTfdA5EukKdbTyycs6rhTEqjt+Ji1JhhdAVTcGb5o
 O2qOXu9312Q7LFFBw828plThkDgATSA0KxvqCVJ9PO5Ewje0Nk1HuUxaSZqRDM6P
 e6i/V5Qh8PDnw/9J3QATo6SrHgM4kmhQwvJNHsLf84kQXhPvOxBvRTABEdl1FKZs
 rkoKOakkrzG/6udV+UGjQK0dpKUZYIjKyjVIFfRkGCwZixkThE4xVo5uYA84anPh
 72wOOeYkBY39XOEECaMCMvV5ECZ9ItphwoW/PkLdQOXIpVO8wHGQNLVkG0cpvOXO
 3FwQOTLr1szvHD3+iwuOXUUXbi9ku/3zytL18xOgFY0wYMyIN+wh0Y8RjtnnQ9dS
 T/qM3cazPU4joI12SAOahw7MFRVm52wOuuVIynXgjHFjIioOpQq/SYZ8lgBW/cQv
 UPu4kApTg976iULmhUjfSCkYrK2/KJqhvV/XOhiLOlQX8ghYJptScGmIsBKUYCId
 PZ0If3IHMXtkRoCSmB8lv+HQi/fvjDJpnZVWLyr1RZzdoNo/JtG4Fb0juXaU7Awr
 hDUHatWOj4N3KQftnQ==
 =DdJj
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161119' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are two batman-adv bugfix patches:

 - Revert a splat on disabling interface which created another problem,
   by Sven Eckelmann

 - Fix error handling when the primary interface disappears during a
   throughput meter test, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 11:11:52 -05:00
Jarno Rajahalme 5a21388126 af_packet: Use virtio_net_hdr_from_skb() directly.
Remove static function __packet_rcv_vnet(), which only called
virtio_net_hdr_from_skb() and BUG()ged out if an error code was
returned.  Instead, call virtio_net_hdr_from_skb() from the former
call sites of __packet_rcv_vnet() and actually use the error handling
code that is already there.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Jarno Rajahalme db60eb5f65 af_packet: Use virtio_net_hdr_to_skb().
Use the common virtio_net_hdr_to_skb() instead of open coding it.
Other call sites were changed by commit fd2a0437dc, but this one was
missed, maybe because it is split in two parts of the source code.

Interim comparisons of 'vnet_hdr->gso_type' still work as both the
vnet_hdr and skb notion of gso_type is zero when there is no gso.

Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Jarno Rajahalme 9403cd7cbb virtio_net: Do not clear memory for struct virtio_net_hdr twice.
virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Linus Torvalds aad931a30f One fix for an NFS/RDMA crash.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYLznTAAoJECebzXlCjuG+ThgP/0tQByIqOKGqUNcEE0MLT4/+
 E2/V/Basi3tnCIatAjYZiAN/rKnO5f4iuyh7PKG/bzlYluLv5MLq68HUia8EorN8
 LExNvZAwYhjEQwDTzhBSityLHWmCwy3G0yYsJQ1DnbUdh8wAKr7lj4R0sr8RGJc1
 GxgWnlhi2lAJSGYRa8tYHzh0tTXGOCoR8POKXFJ91PTx8gEO6VzULvbIQm0RLSow
 +LGW36ov/ChQtJzVJsfcW6Hf4wHFevrtVTPtLWckMEtRq/DJ7hS2btgc02hpqFZm
 MK7wywHT35LV+DvU6QPmwUUaf5IXJjWx0W7thOsjWbYMbAHC/0D3De8bgGaAI3B1
 nB+B96BpGrALyhTX2pXQiQxsavXBl37BOGl3Ft03WrAVI4aJsfkaWDRS2X1jxfXI
 zhGBN2vseoiJblie95hLIgvMtkRmOq4E44oNDiP9zKTwrIkISoz5jmvLHY/8Mj7E
 NCof2P+K6ays8ywD2DqHlJKmiGA7PdNT87ZeeS4ZFvEjWSd4S1pfa0R+jg5FVxZl
 Vl7QQX5D/Ep+sXszJin4dYQnl844+sVMVaj6CdQOK0udml81UZTRO5fvjNexs3e4
 4Zd/ymC/XGs6Hz3pbPeIkAd/MzXCK0zNojNAdZnicOMzQpG2sZ76SJRZQg0sTCFH
 EP6QTWxOog4lnDfML13E
 =F2DQ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Just one fix for an NFS/RDMA crash"

* tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux:
  sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
2016-11-18 16:32:21 -08:00
Sabrina Dubroca f82ef3e10a rtnetlink: fix FDB size computation
Add missing NDA_VLAN attribute's size.

Fixes: 1e53d5bb88 ("net: Pass VLAN ID to rtnl_fdb_notify.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 14:09:42 -05:00
David S. Miller 87305c4cd2 A few more bugfixes:
* limit # of scan results stored in memory - this is a long-standing bug
    Jouni and I only noticed while discussing other things in Santa Fe
  * revert AP_LINK_PS patch that was causing issues (Felix)
  * various A-MSDU/A-MPDU fixes for TXQ code (Felix)
  * interoperability workaround for peers with broken VHT capabilities
    (Filip Matusiak)
  * add bitrate definition for a VHT MCS that's supposed to be invalid
    but gets used by some hardware anyway (Thomas Pedersen)
  * beacon timer fix in hwsim (Benjamin Beichler)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYLrKJAAoJEGt7eEactAAdshYP/Rwi/k9WZ0RKQIA1QA0fnmwv
 TbUUvPHQ4LA39Czs0JwOPDIbter7TBzFXV5EK0y39jGm8St4a/oHuN072kzBYvob
 YbBzM0YKV2uQCAjH3AjWjS3RPOlTzfUqSRWgEqtgUTC5VZrpNfhd2r0q+jfwpQ7s
 +qUsdwqvVJMZKfOEngIXzNXdI95N1d8tpGwzkUqDbOJ+7amdriiJKyTsKEOMqREA
 0Mdhu6roEcMDVO+xt5ZkilmpLZOUEHAzaWkwm7d6VToWi7k3pwMiEZ4fthvHZHEZ
 6K2bn1UMLdKAExRcQBE3wrmn5jFfUbts1mhmoN1drHocI12epcgw5I4ZjEbfpqMB
 IBkOM+2hbSdUVfE4KVH6iubqiJwtHB8YSTdKFmMO4jPx7UshPF7YCny6XWC+EqQm
 ktjyG/lhlXJ0HaOKC/MAwd/KSzfH0eJWGNBDV5s/FIWiqu2oWn+ZIX7vGTpp4K1Z
 Rery/ZUiJGzVITNG1ka+GXq2tUvDXfkMApr+jFH4G6zioWovxB6+4uyMEeclvC5q
 zjgYEpZhbn2wQUNEJ8btDDgXkyDGh3UPjy93fO8fMEvv8d9CCOKTQkIUpjPp6jbZ
 EoLCTW0NNjnAmQRiUFF3vO2O7e3ZTJ4hyNEVkc7mJ2X3DllXRSelTIFSfZVHpb2k
 P+gY7uC0cAPZ4s6q8GSb
 =q2mI
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few more bugfixes:
 * limit # of scan results stored in memory - this is a long-standing bug
   Jouni and I only noticed while discussing other things in Santa Fe
 * revert AP_LINK_PS patch that was causing issues (Felix)
 * various A-MSDU/A-MPDU fixes for TXQ code (Felix)
 * interoperability workaround for peers with broken VHT capabilities
   (Filip Matusiak)
 * add bitrate definition for a VHT MCS that's supposed to be invalid
   but gets used by some hardware anyway (Thomas Pedersen)
 * beacon timer fix in hwsim (Benjamin Beichler)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 14:00:27 -05:00
WANG Cong 06a77b07e3 af_unix: conditionally use freezable blocking calls in read
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 13:58:39 -05:00
Raju Lakkaraju 65feddd5b9 ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
Adding validation support for the ETHTOOL_PHY_DOWNSHIFT. Functional
implementation needs to be done in the individual PHY drivers.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:12:14 -05:00
Raju Lakkaraju 968ad9da7e ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
Adding get_tunable/set_tunable function pointer to the phy_driver
structure, and uses these function pointers to implement the
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 12:12:14 -05:00
Alexey Dobriyan c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Eric Dumazet e68b6e50fa udp: enable busy polling for all sockets
UDP busy polling is restricted to connected UDP sockets.

This is because sk_busy_loop() only takes care of one NAPI context.

There are cases where it could be extended.

1) Some hosts receive traffic on a single NIC, with one RX queue.

2) Some applications use SO_REUSEPORT and associated BPF filter
   to split the incoming traffic on one UDP socket per RX
queue/thread/cpu

3) Some UDP sockets are used to send/receive traffic for one flow, but
they do not bother with connect()

This patch records the napi_id of first received skb, giving more
reach to busy polling.

Tested:

lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
lpaa24:~# echo 70 >/proc/sys/net/core/busy_read

lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done

Before patch :
   27867   28870   37324   41060   41215
   36764   36838   44455   41282   43843
After patch :
   73920   73213   70147   74845   71697
   68315   68028   75219   70082   73707

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:44:31 -05:00
Johannes Berg 9853a55ef1 cfg80211: limit scan results cache size
It's possible to make scanning consume almost arbitrary amounts
of memory, e.g. by sending beacon frames with random BSSIDs at
high rates while somebody is scanning.

Limit the number of BSS table entries we're willing to cache to
1000, limiting maximum memory usage to maybe 4-5MB, but lower
in practice - that would be the case for having both full-sized
beacon and probe response frames for each entry; this seems not
possible in practice, so a limit of 1000 entries will likely be
closer to 0.5 MB.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-18 08:44:44 +01:00
Florian Westphal 330e832abd xfrm: unbreak xfrm_sk_policy_lookup
if we succeed grabbing the refcount, then
  if (err && !xfrm_pol_hold_rcu)

will evaluate to false so this hits last else branch which then
sets policy to ERR_PTR(0).

Fixes: ae33786f73 ("xfrm: policy: only use rcu in xfrm_sk_policy_lookup")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-11-18 07:00:05 +01:00
Roman Mashak 30a391a13a net sched filters: pass netlink message flags in event notification
Userland client should be able to read an event, and reflect it back to
the kernel, therefore it needs to extract complete set of netlink flags.

For example, this will allow "tc monitor" to distinguish Add and Replace
operations.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 13:42:12 -05:00
Sowmini Varadhan 1a0e100fb2 RDS: TCP: Force every connection to be initiated by numerically smaller IP address
When 2 RDS peers initiate an RDS-TCP connection simultaneously,
there is a potential for "duelling syns" on either/both sides.
See commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") for a description of this
condition, and the arbitration logic which ensures that the
numerically large IP address in the TCP connection is bound to the
RDS_TCP_PORT ("canonical ordering").

The rds_connection should not be marked as RDS_CONN_UP until the
arbitration logic has converged for the following reason. The sender
may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
and since the sender removes all datagrams from the rds_connection's
cp_retrans queue based on TCP acks. If the TCP ack was sent from
a tcp socket that got reset as part of duel aribitration (but
before data was delivered to the receivers RDS socket layer),
the sender may end up prematurely freeing the datagram, and
the datagram is no longer reliably deliverable.

This patch remedies that condition by making sure that, upon
receipt of 3WH completion state change notification of TCP_ESTABLISHED
in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP
if, and only if, the IP addresses and ports for the connection are
canonically ordered. In all other cases, rds_tcp_state_change will
force an rds_conn_path_drop(), and rds_queue_reconnect() on
both peers will restart the connection to ensure canonical ordering.

A side-effect of enforcing this condition in rds_tcp_state_change()
is that rds_tcp_accept_one_path() can now be refactored for simplicity.
It is also no longer possible to encounter an RDS_CONN_UP connection in
the arbitration logic in rds_tcp_accept_one().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 13:35:18 -05:00
Sowmini Varadhan 905dd4184e RDS: TCP: Track peer's connection generation number
The RDS transport has to be able to distinguish between
two types of failure events:
(a) when the transport fails (e.g., TCP connection reset)
    but the RDS socket/connection layer on both sides stays
    the same
(b) when the peer's RDS layer itself resets (e.g., due to module
    reload or machine reboot at the peer)
In case (a) both sides must reconnect and continue the RDS messaging
without any message loss or disruption to the message sequence numbers,
and this is achieved by rds_send_path_reset().

In case (b) we should reset all rds_connection state to the
new incarnation of the peer. Examples of state that needs to
be reset are next expected rx sequence number from, or messages to be
retransmitted to, the new incarnation of the peer.

To achieve this, the RDS handshake probe added as part of
commit 5916e2c155 ("RDS: TCP: Enable multipath RDS for TCP")
is enhanced so that sender and receiver of the RDS ping-probe
will add a generation number as part of the RDS_EXTHDR_GEN_NUM
extension header. Each peer stores local and remote generation
numbers as part of each rds_connection. Changes in generation
number will be detected via incoming handshake probe ping
request or response and will allow the receiver to reset rds_connection
state.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 13:35:18 -05:00
Sowmini Varadhan 315ca6d98e RDS: TCP: set RDS_FLAG_RETRANSMITTED in cp_retrans list
As noted in rds_recv_incoming() sequence numbers on data packets
can decreas for the failover case, and the Rx path is equipped
to recover from this, if the RDS_FLAG_RETRANSMITTED is set
on the rds header of an incoming message with a suspect sequence
number.

The RDS_FLAG_RETRANSMITTED is predicated on the RDS_FLAG_RETRANSMITTED
flag in the rds_message, so make sure the flag is set on messages
queued for retransmission.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 13:35:18 -05:00
Eric Dumazet 29c58472ec net_sched: sch_fq: use hash_ptr()
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
and I chose hash_32().

Linus Torvalds and George Spelvin fixed this issue, so we can
use hash_ptr() to get more entropy on 64bit arches with Terabytes
of memory, and avoid the cast games.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 13:28:56 -05:00
Paolo Abeni b5c2d49544 ip6_tunnel: disable caching when the traffic class is inherited
If an ip6 tunnel is configured to inherit the traffic class from
the inner header, the dst_cache must be disabled or it will foul
the policy routing.

The issue is apprently there since at leat Linux-2.6.12-rc2.

Reported-by: Liam McBirnie <liam.mcbirnie@boeing.com>
Cc: Liam McBirnie <liam.mcbirnie@boeing.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 12:08:56 -05:00
WANG Cong cfc44a4d14 net: check dead netns for peernet2id_alloc()
Andrei reports we still allocate netns ID from idr after we destroy
it in cleanup_net().

cleanup_net():
  ...
  idr_destroy(&net->netns_ids);
  ...
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list);
      -> rollback_registered_many()
        -> rtmsg_ifinfo_build_skb()
         -> rtnl_fill_ifinfo()
           -> peernet2id_alloc()

After that point we should not even access net->netns_ids, we
should check the death of the current netns as early as we can in
peernet2id_alloc().

For net-next we can consider to avoid sending rtmsg totally,
it is a good optimization for netns teardown path.

Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Reported-by: Andrei Vagin <avagin@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 11:19:40 -05:00
Arnd Bergmann 10f3366b4d wireless: fix bogus maybe-uninitialized warning
The hostap_80211_rx() function is supposed to set up the mac addresses
for four possible cases, based on two bits of input data. For
some reason, gcc decides that it's possible that none of the these
four cases apply and the addresses remain uninitialized:

drivers/net/wireless/intersil/hostap/hostap_80211_rx.c: In function ‘hostap_80211_rx’:
arch/x86/include/asm/string_32.h:77:14: warning: ‘src’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/net/wireless/intel/ipw2x00/libipw_rx.c: In function ‘libipw_rx’:
arch/x86/include/asm/string_32.h:77:14: error: ‘dst’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
arch/x86/include/asm/string_32.h:78:22: error: ‘*((void *)&dst+4)’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

This warning is clearly nonsense, but changing the last case into
'default' makes it obvious to the compiler too, which avoids the
warning and probably leads to better object code too.

The same code is duplicated several times in the kernel, so this
patch uses the same workaround for all copies. The exact configuration
was hit only very rarely in randconfig builds and I only saw it
in three drivers, but I assume that all of them are potentially
affected, and it's better to keep the code consistent.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-11-17 08:46:38 +02:00
Andreas Gruenbacher 4a59015372 xattr: Fix setting security xattrs on sockfs
The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr.  Since commit 6c6ef9f2, this flag is checked for
setxattr support as well.  This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity.  The
smack security module relies on socket labels (xattrs).

Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.

We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity.  A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-17 00:00:23 -05:00
Xin Long 7fda702f93 sctp: use new rhlist interface on sctp transport rhashtable
Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
to hash a node to one chain. If in one host thousands of assocs connect
to one server with the same lport and different laddrs (although it's
not a normal case), all the transports would be hashed into the same
chain.

It may cause to keep returning -EBUSY when inserting a new node, as the
chain is too long and sctp inserts a transport node in a loop, which
could even lead to system hangs there.

The new rhlist interface works for this case that there are many nodes
with the same key in one chain. It puts them into a list then makes this
list be as a node of the chain.

This patch is to replace rhashtable_ interface with rhltable_ interface.
Since a chain would not be too long and it would not return -EBUSY with
this fix when inserting a node, the reinsert loop is also removed here.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 23:22:17 -05:00
Eric Dumazet 89c4b442b7 netpoll: more efficient locking
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED

Callers of netpoll_poll_unlock() have BH blocked between
the NAPI_STATE_SCHED being cleared and poll_lock is released.

We can avoid the spinlock which has no contention, and use cmpxchg()
on poll_owner which we need to set anyway.

This removes a possible lockdep violation after the cited commit,
since sk_busy_loop() re-enables BH before calling busy_poll_stop()

Fixes: 217f697436 ("net: busy-poll: allow preemption in sk_busy_loop()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 18:32:02 -05:00
Eric Dumazet 364b605573 net: busy-poll: return busypolling status to drivers
NAPI drivers use napi_complete_done() or napi_complete() when
they drained RX ring and right before re-enabling device interrupts.

In busy polling, we can avoid interrupts being delivered since
we are polling RX ring in a controlled loop.

Drivers can chose to use napi_complete_done() return value
to reduce interrupts overhead while busy polling is active.

This is optional, legacy drivers should work fine even
if not updated.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:40:58 -05:00
Eric Dumazet 217f697436 net: busy-poll: allow preemption in sk_busy_loop()
After commit 4cd13c21b2 ("softirq: Let ksoftirqd do its job"),
sk_busy_loop() needs a bit of care :
softirqs might be delayed since we do not allow preemption yet.

This patch adds preemptiom points in sk_busy_loop(),
and makes sure no unnecessary cache line dirtying
or atomic operations are done while looping.

A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL

This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED,
so that sk_busy_loop() does not have to grab it again.

Similarly, netpoll_poll_lock() is done one time.

This gives about 10 to 20 % improvement in various busy polling
tests, especially when many threads are busy polling in
configurations with large number of NIC queues.

This should allow experimenting with bigger delays without
hurting overall latencies.

Tested:
 On a 40Gb mlx4 NIC, 32 RX/TX queues.

 echo 70 >/proc/sys/net/core/busy_read
 for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done

    Before:      After:
 1:   90072   92819
 2:  157289  184007
 3:  235772  213504
 4:  344074  357513
 5:  394755  458267
 6:  461151  487819
 7:  549116  625963
 8:  544423  716219
 9:  720460  738446
10:  794686  837612
11:  915998  923960
12:  937507  925107
13: 1019677  971506
14: 1046831 1113650
15: 1114154 1148902
16: 1105221 1179263
17: 1266552 1299585
18: 1258454 1383817
19: 1341453 1312194
20: 1363557 1488487
21: 1387979 1501004
22: 1417552 1601683
23: 1550049 1642002
24: 1568876 1601915
25: 1560239 1683607
26: 1640207 1745211
27: 1706540 1723574
28: 1638518 1722036
29: 1734309 1757447
30: 1782007 1855436
31: 1724806 1888539
32: 1717716 1944297
33: 1778716 1869118
34: 1805738 1983466
35: 1815694 2020758
36: 1893059 2035632
37: 1843406 2034653
38: 1888830 2086580
39: 1972827 2143567
40: 1877729 2181851

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:40:57 -05:00
Alexander Duyck 3114cdfe66 ipv4: Fix memory leak in exception case for splitting tries
Fix a small memory leak that can occur where we leak a fib_alias in the
event of us not being able to insert it into the local table.

Fixes: 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:24:50 -05:00
Alexander Duyck 3b7093346b ipv4: Restore fib_trie_flush_external function and fix call ordering
The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up us splitting the table
if additional rules were added.  Specifically the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.

I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main.  This way we don't
call it for every rule change which is what was happening previously.

Fixes: 347e3b28c1 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:24:50 -05:00
David Lebrun 46738b1317 ipv6: sr: add option to control lwtunnel support
This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
support of encapsulation with the lightweight tunnels. When this option
is enabled, CONFIG_LWTUNNEL is automatically selected.

Fix commit 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")

Without a proper option to control lwtunnel support for SR-IPv6, if
CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
of seg6_iptunnel_init() failure with EOPNOTSUPP:

NET: Registered protocol family 10
IPv6: Attempt to unregister permanent protocol 6
IPv6: Attempt to unregister permanent protocol 136
IPv6: Attempt to unregister permanent protocol 17
NET: Unregistered protocol family 10

Tested (compiling, booting, and loading ipv6 module when relevant)
with possible combinations of CONFIG_IPV6={y,m,n},
CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 11:29:46 -05:00
Sabrina Dubroca b3cfaa31e3 rtnetlink: fix rtnl message size computation for XDP
rtnl_xdp_size() only considers the size of the actual payload attribute,
and misses the space taken by the attribute used for nesting (IFLA_XDP).

Fixes: d1fdd91386 ("rtnl: add option for setting link xdp prog")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 22:40:07 -05:00
Sabrina Dubroca 7e75f74a17 rtnetlink: fix rtnl_vfinfo_size
The size reported by rtnl_vfinfo_size doesn't match the space used by
rtnl_fill_vfinfo.

rtnl_vfinfo_size currently doesn't account for the nest attributes
used by statistics (added in commit 3b766cd832), nor for struct
ifla_vf_tx_rate (since commit ed616689a3, which added ifla_vf_rate
to the dump without removing ifla_vf_tx_rate, but replaced
ifla_vf_tx_rate with ifla_vf_rate in the size computation).

Fixes: 3b766cd832 ("net/core: Add reading VF statistics through the PF netdevice")
Fixes: ed616689a3 ("net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 22:40:07 -05:00
Andrey Vagin 319b0534b9 tcp: allow to enable the repair mode for non-listening sockets
The repair mode is used to get and restore sequence numbers and
data from queues. It used to checkpoint/restore connections.

Currently the repair mode can be enabled for sockets in the established
and closed states, but for other states we have to dump the same socket
properties, so lets allow to enable repair mode for these sockets.

The repair mode reveals nothing more for sockets in other states.

Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 22:28:50 -05:00
Pablo Neira 73e2d5e34b udp: restore UDPlite many-cast delivery
Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(),
otherwise udplite broadcast/multicast use the wrong table and it breaks.

Fixes: 2dc41cff75 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 22:14:27 -05:00
Florian Westphal 4780566784 dctcp: update cwnd on congestion event
draft-ietf-tcpm-dctcp-02 says:

... when the sender receives an indication of congestion
(ECE), the sender SHOULD update cwnd as follows:

         cwnd = cwnd * (1 - DCTCP.Alpha / 2)

So, lets do this and reduce cwnd more smoothly (and faster), as per
current congestion estimate.

Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 22:01:58 -05:00
Hangbin Liu 24803f38a5 igmp: do not remove igmp souce list info when set link down
In commit 24cf3af3fe ("igmp: call ip_mc_clear_src..."), we forgot to remove
igmpv3_clear_delrec() in ip_mc_down(), which also called ip_mc_clear_src().
This make us clear all IGMPv3 source filter info after NETDEV_DOWN.
Move igmpv3_clear_delrec() to ip_mc_destroy_dev() and then no need
ip_mc_clear_src() in ip_mc_destroy_dev().

On the other hand, we should restore back instead of free all source filter
info in igmpv3_del_delrec(). Or we will not able to restore IGMPv3 source
filter info after NETDEV_UP and NETDEV_POST_TYPE_CHANGE.

Fixes: 24cf3af3fe ("igmp: call ip_mc_clear_src() only when ...")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 19:51:16 -05:00
Paolo Abeni c915fe13cb udplite: fix NULL pointer dereference
The commit 850cbaddb5 ("udp: use it's own memory accounting schema")
assumes that the socket proto has memory accounting enabled,
but this is not the case for UDPLITE.
Fix it enabling memory accounting for UDPLITE and performing
fwd allocated memory reclaiming on socket shutdown.
UDP and UDPLITE share now the same memory accounting limits.
Also drop the backlog receive operation, since is no more needed.

Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Reported-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:59:38 -05:00
David S. Miller bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Felix Fietkau a786f96da0 mac80211: fix A-MSDU aggregation with fast-xmit + txq
A-MSDU aggregation alters the QoS header after a frame has been
enqueued, so it needs to be ready before enqueue and not overwritten
again afterwards

Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:30 +01:00
Felix Fietkau fff712cbe3 mac80211: remove bogus skb vif assignment
The call to ieee80211_txq_enqueue overwrites the vif pointer with the
codel enqueue time, so setting it just before that call makes no sense.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:21 +01:00
Felix Fietkau c1f4c9ede3 mac80211: update A-MPDU flag on tx dequeue
The sequence number counter is used to derive the starting sequence
number. Since that counter is updated on tx dequeue, the A-MPDU flag
needs to be up to date at the tme of dequeue as well.

This patch prevents sending more A-MPDU frames after the session has
been terminated and also ensures that aggregation starts right after the
session has been established

Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:37:12 +01:00
Pedersen, Thomas 8fdd136f22 cfg80211: add bitrate for 20MHz MCS 9
Some drivers (ath10k) report MCS 9 @ 20MHz, which
technically isn't defined. To get more meaningful value
than 0 out of this however, just extrapolate a bitrate
from ratio of MCS 7 and 9 in channels where it is allowed.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
[add a comment about it in the code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:34:00 +01:00
Felix Fietkau 6c18a6b4e7 Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"
This reverts commit c68df2e7be.

__sta_info_recalc_tim turns into a no-op if local->ops->set_tim is not
set. This prevents the beacon TIM bit from being set for all drivers
that do not implement this op (almost all of them), thus thoroughly
essential AP mode powersave functionality.

Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Fixes: c68df2e7be ("mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:32:09 +01:00
Filip Matusiak c8eaf3479e mac80211: Ignore VHT IE from peer with wrong rx_mcs_map
This is a workaround for VHT-enabled STAs which break the spec
and have the VHT-MCS Rx map filled in with value 3 for all eight
spacial streams, an example is AR9462 in AP mode.

As per spec, in section 22.1.1 Introduction to the VHT PHY
A VHT STA shall support at least single spactial stream VHT-MCSs
0 to 7 (transmit and receive) in all supported channel widths.

Some devices in STA mode will get firmware assert when trying to
associate, examples are QCA9377 & QCA6174.

Packet example of broken VHT Cap IE of AR9462:

Tag: VHT Capabilities (IEEE Std 802.11ac/D3.1)
    Tag Number: VHT Capabilities (IEEE Std 802.11ac/D3.1) (191)
    Tag length: 12
    VHT Capabilities Info: 0x00000000
    VHT Supported MCS Set
        Rx MCS Map: 0xffff
            .... .... .... ..11 = Rx 1 SS: Not Supported (0x0003)
            .... .... .... 11.. = Rx 2 SS: Not Supported (0x0003)
            .... .... ..11 .... = Rx 3 SS: Not Supported (0x0003)
            .... .... 11.. .... = Rx 4 SS: Not Supported (0x0003)
            .... ..11 .... .... = Rx 5 SS: Not Supported (0x0003)
            .... 11.. .... .... = Rx 6 SS: Not Supported (0x0003)
            ..11 .... .... .... = Rx 7 SS: Not Supported (0x0003)
            11.. .... .... .... = Rx 8 SS: Not Supported (0x0003)
        ...0 0000 0000 0000 = Rx Highest Long GI Data Rate (in Mb/s, 0 = subfield not in use): 0x0000
        Tx MCS Map: 0xffff
        ...0 0000 0000 0000 = Tx Highest Long GI Data Rate  (in Mb/s, 0 = subfield not in use): 0x0000

Signed-off-by: Filip Matusiak <filip.matusiak@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-11-15 14:18:43 +01:00
Dwip Banerjee 8d8e20e2d7 ipvs: Decrement ttl
We decrement the IP ttl in all the modes in order to prevent infinite
route loops. The changes were done based on Julian Anastasov's
suggestions in a prior thread.

The ttl based check/discard and the actual decrement are done in
__ip_vs_get_out_rt() and in __ip_vs_get_out_rt_v6(), for the IPv6
case. decrement_ttl() implements the actual functionality for the
two cases.

Signed-off-by: Dwip Banerjee <dwip@linux.vnet.ibm.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-11-15 09:49:20 +01:00
Gao Feng fe24a0c3a9 ipvs: Use IS_ERR_OR_NULL(svc) instead of IS_ERR(svc) || svc == NULL
This minor refactoring does not change the logic of function
ip_vs_genl_dump_dests.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2016-11-15 09:49:19 +01:00
Linus Torvalds e76d21c40b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
    from Alexander Duyck.

 2) Fix lockdep splats in iwlwifi, from Johannes Berg.

 3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
    is disabled, from Liping Zhang.

 4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.

 5) Disable UFO when path will apply IPSEC tranformations, from Jakub
    Sitnicki.

 6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.

 7) skb_checksum_help() should never actually use the value "0" for the
    resulting checksum, that has a special meaning, use CSUM_MANGLED_0
    instead. From Eric Dumazet.

 8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
    Yuval MIntz.

 9) Fix SCTP reference counting of associations and transports in
    sctp_diag. From Xin Long.

10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
    a previous layer or similar, so explicitly clear the ipv6 control
    block in the skb. From Eli Cooper.

11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
    Cong.

12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
    Shenai.

13) Fix potential access past the end of the skb page frag array in
    tcp_sendmsg(). From Eric Dumazet.

14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
    from David Ahern.

15) Don't return an error in tcp_sendmsg() if we wronte any bytes
    successfully, from Eric Dumazet.

16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
    but forgot to purge these unlock calls. From Eric Dumazet.

17) Fix memory leak in error path of __genl_register_family(). We leak
    the attrbuf, from WANG Cong.

18) cgroupstats netlink policy table is mis-sized, from WANG Cong.

19) Several XDP bug fixes in mlx5, from Saeed Mahameed.

20) Fix several device refcount leaks in network drivers, from Johan
    Hovold.

21) icmp6_send() should use skb dst device not skb->dev to determine L3
    routing domain. From David Ahern.

22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.

23) We leak new macvlan port in some cases of maclan_common_netlink()
    errors. Fix from Gao Feng.

24) Similar to the icmp6_send() fix, icmp_route_lookup() should
    determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
    Also from David Ahern.

25) Several fixes for route offloading and FIB notification handling in
    mlxsw driver, from Jiri Pirko.

26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.

27) Fix long standing regression in ipv4 redirect handling, wrt.
    validating the new neighbour's reachability. From Stephen Suryaputra
    Lin.

28) If sk_filter() trims the packet excessively, handle it reasonably in
    tcp input instead of exploding. From Eric Dumazet.

29) Fix handling of napi hash state when copying channels in sfc driver,
    from Bert Kenward.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
  mlxsw: spectrum_router: Flush FIB tables during fini
  net: stmmac: Fix lack of link transition for fixed PHYs
  sctp: change sk state only when it has assocs in sctp_shutdown
  bnx2: Wait for in-flight DMA to complete at probe stage
  Revert "bnx2: Reset device during driver initialization"
  ps3_gelic: fix spelling mistake in debug message
  net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
  ibmvnic: Fix size of debugfs name buffer
  ibmvnic: Unmap ibmvnic_statistics structure
  sfc: clear napi_hash state when copying channels
  mlxsw: spectrum_router: Correctly dump neighbour activity
  mlxsw: spectrum: Fix refcount bug on span entries
  bnxt_en: Fix VF virtual link state.
  bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
  Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
  tcp: take care of truncations done by sk_filter()
  ipv4: use new_gw for redirect neigh lookup
  r8152: Fix error path in open function
  net: bpqether.h: remove if_ether.h guard
  net: __skb_flow_dissect() must cap its return value
  ...
2016-11-14 14:15:53 -08:00
Xin Long 5bf35ddfee sctp: change sk state only when it has assocs in sctp_shutdown
Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
this sock has no connection (assoc), sk state would be changed to
SCTP_SS_CLOSING, which is not as we expect.

Besides, after that if users try to listen on this sock, kernel
could even panic when it dereference sctp_sk(sk)->bind_hash in
sctp_inet_listen, as bind_hash is null when sock has no assoc.

This patch is to move sk state change after checking sk assocs
is not empty, and also merge these two if() conditions and reduce
indent level.

Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 16:22:33 -05:00
WANG Cong d9dc8b0f8b net: fix sleeping for sk_wait_event()
Similar to commit 14135f30e3 ("inet: fix sleeping inside inet_wait_for_connect()"),
sk_wait_event() needs to fix too, because release_sock() is blocking,
it changes the process state back to running after sleep, which breaks
the previous prepare_to_wait().

Switch to the new wait API.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 13:17:21 -05:00
Scott Mayhew ea08e39230 sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
This fixes the following panic that can occur with NFSoRDMA.

general protection fault: 0000 [#1] SMP
Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
tg3 crct10dif_pclmul drm crct10dif_common
ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
dm_log dm_mod
CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Workqueue: events check_lifetime
task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
_raw_spin_lock_bh+0x17/0x50
RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
Call Trace:
[<ffffffff81557830>] lock_sock_nested+0x20/0x50
[<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
[<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
[<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
[<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
[<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
[<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
[<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
[<ffffffff815e8cef>] check_lifetime+0x25f/0x270
[<ffffffff810a7f3b>] process_one_work+0x17b/0x470
[<ffffffff810a8d76>] worker_thread+0x126/0x410
[<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
[<ffffffff810b052f>] kthread+0xcf/0xe0
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81696418>] ret_from_fork+0x58/0x90
[<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
RSP <ffff88031f587ba8>

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 10:30:58 -05:00
David S. Miller 7d384846b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains a second batch of Netfilter updates for
your net-next tree. This includes a rework of the core hook
infrastructure that improves Netfilter performance by ~15% according to
synthetic benchmarks. Then, a large batch with ipset updates, including
a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
couple of assorted updates.

Regarding the core hook infrastructure rework to improve performance,
using this simple drop-all packets ruleset from ingress:

        nft add table netdev x
        nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
        nft add rule netdev x y drop

And generating traffic through Jesper Brouer's
samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i
option. perf report shows nf_tables calls in its top 10:

    17.30%  kpktgend_0   [nf_tables]            [k] nft_do_chain
    15.75%  kpktgend_0   [kernel.vmlinux]       [k] __netif_receive_skb_core
    10.39%  kpktgend_0   [nf_tables_netdev]     [k] nft_do_chain_netdev

I'm measuring here an improvement of ~15% in performance with this
patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R)
Core(TM) i5-3320M CPU @ 2.60GHz 4-cores.

This rework contains more specifically, in strict order, these patches:

1) Remove compile-time debugging from core.

2) Remove obsolete comments that predate the rcu era. These days it is
   well known that a Netfilter hook always runs under rcu_read_lock().

3) Remove threshold handling, this is only used by br_netfilter too.
   We already have specific code to handle this from br_netfilter,
   so remove this code from the core path.

4) Deprecate NF_STOP, as this is only used by br_netfilter.

5) Place nf_state_hook pointer into xt_action_param structure, so
   this structure fits into one single cacheline according to pahole.
   This also implicit affects nftables since it also relies on the
   xt_action_param structure.

6) Move state->hook_entries into nf_queue entry. The hook_entries
   pointer is only required by nf_queue(), so we can store this in the
   queue entry instead.

7) use switch() statement to handle verdict cases.

8) Remove hook_entries field from nf_hook_state structure, this is only
   required by nf_queue, so store it in nf_queue_entry structure.

9) Merge nf_iterate() into nf_hook_slow() that results in a much more
   simple and readable function.

10) Handle NF_REPEAT away from the core, so far the only client is
    nf_conntrack_in() and we can restart the packet processing using a
    simple goto to jump back there when the TCP requires it.
    This update required a second pass to fix fallout, fix from
    Arnd Bergmann.

11) Set random seed from nft_hash when no seed is specified from
    userspace.

12) Simplify nf_tables expression registration, in a much smarter way
    to save lots of boiler plate code, by Liping Zhang.

13) Simplify layer 4 protocol conntrack tracker registration, from
    Davide Caratti.

14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
    to recent generalization of the socket infrastructure, from Arnd
    Bergmann.

15) Then, the ipset batch from Jozsef, he describes it as it follows:

* Cleanup: Remove extra whitespaces in ip_set.h
* Cleanup: Mark some of the helpers arguments as const in ip_set.h
* Cleanup: Group counter helper functions together in ip_set.h
* struct ip_set_skbinfo is introduced instead of open coded fields
  in skbinfo get/init helper funcions.
* Use kmalloc() in comment extension helper instead of kzalloc()
  because it is unnecessary to zero out the area just before
  explicit initialization.
* Cleanup: Split extensions into separate files.
* Cleanup: Separate memsize calculation code into dedicated function.
* Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
  together.
* Add element count to hash headers by Eric B Munson.
* Add element count to all set types header for uniform output
  across all set types.
* Count non-static extension memory into memsize calculation for
  userspace.
* Cleanup: Remove redundant mtype_expire() arguments, because
  they can be get from other parameters.
* Cleanup: Simplify mtype_expire() for hash types by removing
  one level of intendation.
* Make NLEN compile time constant for hash types.
* Make sure element data size is a multiple of u32 for the hash set
  types.
* Optimize hash creation routine, exit as early as possible.
* Make struct htype per ipset family so nets array becomes fixed size
  and thus simplifies the struct htype allocation.
* Collapse same condition body into a single one.
* Fix reported memory size for hash:* types, base hash bucket structure
  was not taken into account.
* hash:ipmac type support added to ipset by Tomasz Chilinski.
* Use setup_timer() and mod_timer() instead of init_timer()
  by Muhammad Falak R Wani, individually for the set type families.

16) Remove useless connlabel field in struct netns_ct, patch from
    Florian Westphal.

17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
    {ip,ip6,arp}tables code that uses this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 22:41:25 -05:00
Julia Lawall eb1a6bdc28 netfilter: x_tables: simplify IS_ERR_OR_NULL to NULL test
Since commit 7926dbfa4b ("netfilter: don't use
mutex_lock_interruptible()"), the function xt_find_table_lock can only
return NULL on an error.  Simplify the call sites and update the
comment before the function.

The semantic patch that change the code is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- ! IS_ERR_OR_NULL(t)
+ t

@@
expression t,e;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
- IS_ERR_OR_NULL(t)
+ !t

@@
expression t,e,e1;
@@

t = \(xt_find_table_lock(...)\|
      try_then_request_module(xt_find_table_lock(...),...)\)
... when != t=e
?- t ? PTR_ERR(t) : e1
+ e1
... when any

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-13 22:26:13 +01:00
Eric Dumazet ac6e780070 tcp: take care of truncations done by sk_filter()
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 12:30:02 -05:00
Stephen Suryaputra Lin 969447f226 ipv4: use new_gw for redirect neigh lookup
In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
and then the state of the neigh for the new_gw is checked. If the state
isn't valid then the redirected route is deleted. This behavior is
maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
is assigned to peer->redirect_learned.a4 before calling
ipv4_neigh_lookup().

After commit 5943634fc5 ("ipv4: Maintain redirect and PMTU info in
struct rtable again."), ipv4_neigh_lookup() is performed without the
rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
isn't zero, the function uses it as the key. The neigh is most likely
valid since the old_gw is the one that sends the ICMP redirect message.
Then the new_gw is assigned to fib_nh_exception. The problem is: the
new_gw ARP may never gets resolved and the traffic is blackholed.

So, use the new_gw for neigh lookup.

Changes from v1:
 - use __ipv4_neigh_lookup instead (per Eric Dumazet).

Fixes: 5943634fc5 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 12:24:44 -05:00
Jiri Benc 217ac77a3c openvswitch: allow L3 netdev ports
Allow ARPHRD_NONE interfaces to be added to ovs bridge.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 91820da6ae openvswitch: add Ethernet push and pop actions
It's not allowed to push Ethernet header in front of another Ethernet
header.

It's not allowed to pop Ethernet header if there's a vlan tag. This
preserves the invariant that L3 packet never has a vlan tag.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 0a6410fbde openvswitch: netlink: support L3 packets
Extend the ovs flow netlink protocol to support L3 packets. Packets without
OVS_KEY_ATTR_ETHERNET attribute specify L3 packets; for those, the
OVS_KEY_ATTR_ETHERTYPE attribute is mandatory.

Push/pop vlan actions are only supported for Ethernet packets.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 5108bbaddc openvswitch: add processing of L3 packets
Support receiving, extracting flow key and sending of L3 packets (packets
without an Ethernet header).

Note that even after this patch, non-Ethernet interfaces are still not
allowed to be added to bridges. Similarly, netlink interface for sending and
receiving L3 packets to/from user space is not in place yet.

Based on previous versions by Lorand Jakab and Simon Horman.

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 1560a074df openvswitch: support MPLS push and pop for L3 packets
Update Ethernet header only if there is one.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc e2d9d8358c openvswitch: pass mac_proto to ovs_vport_send
We'll need it to alter packets sent to ARPHRD_NONE interfaces.

Change do_output() to use the actual L2 header size of the packet when
deciding on the minimum cutlen. The assumption here is that what matters is
not the output interface hard_header_len but rather the L2 header of the
particular packet. For example, ARPHRD_NONE tunnels that encapsulate
Ethernet should get at least the Ethernet header.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 329f45bc4f openvswitch: add mac_proto field to the flow key
Use a hole in the structure. We support only Ethernet so far and will add
a support for L2-less packets shortly. We could use a bool to indicate
whether the Ethernet header is present or not but the approach with the
mac_proto field is more generic and occupies the same number of bytes in the
struct, while allowing later extensibility. It also makes the code in the
next patches more self explaining.

It would be nice to use ARPHRD_ constants but those are u16 which would be
waste. Thus define our own constants.

Another upside of this is that we can overload this new field to also denote
whether the flow key is valid. This has the advantage that on
refragmentation, we don't have to reparse the packet but can rely on the
stored eth.type. This is especially important for the next patches in this
series - instead of adding another branch for L2-less packets before calling
ovs_fragment, we can just remove all those branches completely.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Jiri Benc 738314a084 openvswitch: use hard_header_len instead of hardcoded ETH_HLEN
On tx, use hard_header_len while deciding whether to refragment or drop the
packet. That way, all combinations are calculated correctly:

* L2 packet going to L2 interface (the L2 header len is subtracted),
* L2 packet going to L3 interface (the L2 header is included in the packet
  lenght),
* L3 packet going to L3 interface.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 00:51:02 -05:00
Eric Dumazet 34fad54c25 net: __skb_flow_dissect() must cap its return value
After Tom patch, thoff field could point past the end of the buffer,
this could fool some callers.

If an skb was provided, skb->len should be the upper limit.
If not, hlen is supposed to be the upper limit.

Fixes: a6e544b0a8 ("flow_dissector: Jump to exit code in __skb_flow_dissect")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yibin Yang <yibyang@cisco.com
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:41:53 -05:00
Martin KaFai Lau 4e3264d21b bpf: Fix bpf_redirect to an ipip/ip6tnl dev
If the bpf program calls bpf_redirect(dev, 0) and dev is
an ipip/ip6tnl, it currently includes the mac header.
e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
of IP-IP.

The fix is to pull the mac header.  At ingress, skb_postpull_rcsum()
is not needed because the ethhdr should have been pulled once already
and then got pushed back just before calling the bpf_prog.
At egress, this patch calls skb_postpull_rcsum().

If bpf_redirect(dev, BPF_F_INGRESS) is called,
it also fails now because it calls dev_forward_skb() which
eventually calls eth_type_trans(skb, dev).  The eth_type_trans()
will set skb->type = PACKET_OTHERHOST because the mac address
does not match the redirecting dev->dev_addr.  The PACKET_OTHERHOST
will eventually cause the ip_rcv() errors out.  To fix this,
____dev_forward_skb() is added.

Joint work with Daniel Borkmann.

Fixes: cfc7381b30 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:38:07 -05:00
Linus Torvalds ef091b3cef Ceph's ->read_iter() implementation is incompatible with the new
generic_file_splice_read() code that went into -rc1.  Switch to the
 less efficient default_file_splice_read() for now; the proper fix is
 being held for 4.10.
 
 We also have a fix for a 4.8 regression and a trival libceph fixup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYJdjPAAoJEEp/3jgCEfOLzEoH/A3B1qqiqs2WoMn0O4pnEdcM
 TxaU46VOkYcK2wh/xbYAns2kZEXKgcCySv+kXn4l3Gh6/lXVxv4WexNqWdO1o6yN
 GqEufIH7yQM6QOE/hkwtUciBXmPfQMPxF14vvprYQuyu5Bs96mrphiAa7vX6Vbk5
 VhfE/j0shb8Q2QQj/Om0mWqM6JtOAlr5aFtEcJcodbCk1k8CptUcBsSoQ31PXMC7
 UcaBHh1VHGLvx9WeG1Rw1g9tc2LiUyu+UK0csolp51+amB7HezgfmzDQzHtXzBmm
 n90SQwonf0DrdWUGuQlOpHnREwxLSgN19s68FCjLc0jeMTP4b6TFEIUgFxiqWc4=
 =Ws5s
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "Ceph's ->read_iter() implementation is incompatible with the new
  generic_file_splice_read() code that went into -rc1.  Switch to the
  less efficient default_file_splice_read() for now; the proper fix is
  being held for 4.10.

  We also have a fix for a 4.8 regression and a trival libceph fixup"

* tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client:
  libceph: initialize last_linger_id with a large integer
  libceph: fix legacy layout decode with pool 0
  ceph: use default file splice read callback
2016-11-11 09:17:10 -08:00
Linus Torvalds ef5beed998 NFS client bugfixes for Linux 4.9
Bugfixes:
 - Trim extra slashes in v4 nfs_paths to fix tools that use this
 - Fix a -Wmaybe-uninitialized warnings
 - Fix suspicious RCU usages
 - Fix Oops when mounting multiple servers at once
 - Suppress a false-positive pNFS error
 - Fix a DMAR failure in NFS over RDMA
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
 UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
 iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
 glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
 h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
 PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
 r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
 k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
 ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
 QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
 sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
 jfSXsXmrwTniPCDWgKg3
 =hFod
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions in 4.9, and none are going to stable
  this time around.

  Bugfixes:
   - Trim extra slashes in v4 nfs_paths to fix tools that use this
   - Fix a -Wmaybe-uninitialized warnings
   - Fix suspicious RCU usages
   - Fix Oops when mounting multiple servers at once
   - Suppress a false-positive pNFS error
   - Fix a DMAR failure in NFS over RDMA"

* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
  fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
  NFS: Don't print a pNFS error if we aren't using pNFS
  NFS: Ignore connections that have cl_rpcclient uninitialized
  SUNRPC: Fix suspicious RCU usage
  NFSv4.1: work around -Wmaybe-uninitialized warning
  NFS: Trim extra slash in v4 nfs_path
2016-11-11 09:15:30 -08:00
Ilya Dryomov 264048afab libceph: initialize last_linger_id with a large integer
osdc->last_linger_id is a counter for lreq->linger_id, which is used
for watch cookies.  Starting with a large integer should ease the task
of telling apart kernel and userspace clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00