Commit Graph

85786 Commits

Author SHA1 Message Date
Aaron Conole 54f17bbc52 netfilter: nf_queue: whitespace cleanup
A future patch will modify the hook drop and outfn functions.  This will
cause the line lengths to take up too much space.  This is simply a
readability change.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 01:20:05 +02:00
Florian Westphal fe72926b79 netfilter: call nf_hook_state_init with rcu_read_lock held
This makes things simpler because we can store the head of the list
in the nf_state structure without worrying about concurrent add/delete
of hook elements from the list.

A future commit will make use of this to implement a simpler
linked-list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:49 +02:00
Florian Westphal c5136b15ea netfilter: bridge: add and use br_nf_hook_thresh
This replaces the last uses of NF_HOOK_THRESH().
A followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:48 +02:00
Liping Zhang 2462f3f4a7 netfilter: nf_queue: improve queue range support for bridge family
After commit ac28634456 ("netfilter: bridge: add nf_afinfo to enable
queuing to userspace"), we can queue packets to user space in the bridge
family. But when the user specifies a queue range, packets are only
delivered to the first queue number, because nfqueue_hash only supports
the ipv4 and ipv6 families. Now add support for the bridge family too.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:01 +02:00
Liping Zhang 8061bb5443 netfilter: nft_queue: add _SREG_QNUM attr to select the queue number
Currently, the user can specify the queue numbers via the _QUEUE_NUM and
_QUEUE_TOTAL attributes; this is enough in most situations.

But actually, it is not very flexible. For example:
  tcp dport 80 mapped to queue0
  tcp dport 81 mapped to queue1
  tcp dport 82 mapped to queue2
In order to do this, we must add 3 nft rules, and more
mappings mean more rules ...

So take one register to select the queue number; then we can add one
simple rule to map the queues, maybe like this:
  queue num tcp dport map { 80:0, 81:1, 82:2 ... }

Florian Westphal also proposed wider usage scenarios:
  queue num jhash ip saddr . ip daddr mod ...
  queue num meta cpu ...
  queue num meta mark ...

The last point is how to load the queue number from the sreg. We could
use *(u16*)&regs->data[reg] to load it, just like the nat expression
does to load its l4 port.

But we will cooperate with the hash, meta cpu, meta mark expressions and
so on. They all store their result as a u32, so casting it to a u16
pointer and dereferencing it would produce a wrong result on big-endian
systems.

So just keep it simple: treat the queue number as a u32, although a u16
would already be enough.
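
To illustrate the endianness pitfall above, here is a minimal user-space
sketch (not from the patch) of why reading a u32 register through a u16
pointer only works on little-endian machines:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t reg = 3;                 /* queue number stored as u32 */
		uint16_t lo = *(uint16_t *)&reg;  /* 3 on little endian, 0 on big endian */

		printf("queue=%u\n", lo);
		return 0;
	}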

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:29:50 +02:00
Laura Garcia Liebana 36b701fae1 netfilter: nf_tables: validate maximum value of u32 netlink attributes
Fetch and validate a u32 netlink attribute. This validation is
usually required when a u32 netlink attribute is being stored in a
field whose size is smaller.

This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
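
A minimal sketch of the kind of check this introduces (helper name and
exact semantics are illustrative, not copied from the patch):

	static int parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
	{
		*dest = ntohl(nla_get_be32(attr));
		if (*dest > max)
			return -ERANGE;
		return 0;
	}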

Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:29:02 +02:00
Laura Garcia Liebana 2b03bf7324 netfilter: nft_numgen: add number generation offset
Add support for an offset value for the incremental counter and random
modes. With this option the sysadmin is able to start the counter at a
certain value and then apply the generated number.

Example:

	meta mark set numgen inc mod 2 offset 100

This will generate marks with the series 100, 101, 100, 101, ...
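
The arithmetic behind the example is simply a modulus followed by the
offset; a user-space sketch (function name illustrative):

	#include <stdint.h>

	/* numgen inc mod 2 offset 100: cycles 100, 101, 100, 101, ... */
	static uint32_t numgen_inc(uint32_t *counter, uint32_t modulus, uint32_t offset)
	{
		return offset + ((*counter)++ % modulus);
	}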

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-22 16:33:05 +02:00
Florian Westphal 8e8118f893 netfilter: conntrack: remove packet hotpath stats
These counters sit in the hot path and do show up in perf; this is
especially true for 'found' and 'searched', which get incremented for
every packet processed.

Information like

searched=212030105
new=623431
found=333613
delete=623327

does not seem too helpful nowadays:

- on busy systems found and searched will overflow every few hours
(these are 32-bit integers); less busy ones every few days.

- for debugging there are better methods, such as iptables' trace target
and the conntrack log sysctls.  Nowadays we also have the perf tool.

This removes packet path stat counters except those that
are expected to be 0 (or close to 0) on a normal system, e.g.
'insert_failed' (race happened) or 'invalid' (proto tracker rejects).

The insert stat is retained for the ctnetlink case.
The found stat is retained for the tuple-is-taken check when NAT has to
determine if it needs to pick a different source address.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:59:39 +02:00
Liping Zhang 6bd14303a9 netfilter: nf_queue: get rid of dependency on IP6_NF_IPTABLES
hash_v6 is used by both nftables and ip6tables, so depending on
IP6_NF_IPTABLES is not appropriate.

Actually, it only parses the ipv6 header and computes a hash value, so
there is no side effect even if IPV6 is disabled; remove the dependency.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:46 +02:00
Pablo Neira Ayuso 71212c9b04 netfilter: nf_tables: don't drop IPv6 packets that cannot parse transport
This is overly conservative and not flexible at all, so better let them
go through and let the filtering policy decide what to do with them. We
use skb_header_pointer() all over the place so we would just fail to
match when trying to access fields from malformed traffic.
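
In other words, a failed header fetch only means the expression does not
match; a hedged sketch of the pattern (variable names illustrative):

	struct tcphdr _tcph;
	const struct tcphdr *th;

	/* returns NULL on malformed/truncated packets ... */
	th = skb_header_pointer(skb, thoff, sizeof(_tcph), &_tcph);
	if (!th)
		return;    /* ... so the expression simply fails to match */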

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:32 +02:00
Pablo Neira Ayuso 10151d7b03 netfilter: nf_tables_bridge: use nft_set_pktinfo_ipv{4, 6}_validate
Consolidate pktinfo setup and validation by using the new generic
functions so we converge to the netdev family codebase.

We only need a linear IPv4 and IPv6 header from the reject expression,
so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate()
to net/bridge/netfilter/nft_reject_bridge.c.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:15 +02:00
Pablo Neira Ayuso ddc8b6027a netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()
These functions are extracted from the netdev family, they initialize
the pktinfo structure and validate that the IPv4 and IPv6 headers are
well-formed given that these functions are called from a path where
layer 3 sanitization did not happen yet.

These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h
so that a follow up patch can reuse them from the bridge family too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:09 +02:00
Pablo Neira Ayuso 8df9e32e7e netfilter: nf_tables_ipv6: setup pktinfo transport field on failure to parse
Make sure the pktinfo protocol fields are initialized if we fail to
parse the transport header.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:04 +02:00
Pablo Neira Ayuso beac5afa2d netfilter: nf_tables: ensure proper initialization of nft_pktinfo fields
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization of all pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.

This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates whether transport protocol information is
available. Remaining fields are zeroed.

The meta expression has also been updated to check tprot_set first,
given that zero is a valid tprot value. Even a handcrafted packet may
come with the IPPROTO_RAW (255) protocol number, so we cannot rely on
that value to mean that tprot is unset.
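
A minimal sketch of how a consumer such as the meta expression uses the
new flag (illustrative, not the exact patch hunk):

	/* zero is a valid protocol number, so test the flag, not the value */
	if (!pkt->tprot_set)
		goto err;
	*dest = pkt->tprot;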

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:51:57 +02:00
Pablo Neira Ayuso dbd2be0646 netfilter: nft_dynset: allow to invert match criteria
The dynset expression matches if we can fit a new entry into the set.
If there is no room for it, then it breaks the rule evaluation.

This patch introduces the inversion flag so you can add rules to
explicitly drop packets that don't fit into the set. For example:

 # nft filter input flow table xyz size 4 { ip saddr timeout 120s counter } overflow drop

This is useful to provide a replacement for connlimit.

For the rule above, every new entry uses the IPv4 address as the key in
the set; the entry gets a timeout of 120 seconds that gets refreshed on
every packet seen. If we get a new flow and our set already contains 4
entries, then this packet is dropped.

You can already express this in positive logic, assuming default policy
to drop:

 # nft filter input flow table xyz size 4 { ip saddr timeout 10s counter } accept

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:49:50 +02:00
Laura Garcia Liebana 70ca767ea1 netfilter: nft_hash: Add hash offset value
Add support for applying an offset to the hash value. With this
feature, the sysadmin is able to generate a hash starting at a given
offset value.

Example:

	meta mark set jhash ip saddr mod 2 seed 0xabcd offset 100

This option generates marks, derived from the source address, in the
range 100 to 101.
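
The computation is the same modulus-plus-offset idea as for numgen; a
user-space sketch where 'hash' stands in for the jhash of the source
address with the given seed:

	#include <stdint.h>

	/* jhash ip saddr mod 2 seed 0xabcd offset 100 -> marks 100 or 101 */
	static uint32_t hash_with_offset(uint32_t hash, uint32_t modulus, uint32_t offset)
	{
		return (hash % modulus) + offset;
	}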

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
2016-09-12 18:37:12 +02:00
Liping Zhang 3c15b8e112 netfilter: nf_conntrack: remove unused ctl_table_path member in nf_conntrack_l3proto
After commit adf0516845 ("netfilter: remove ip_conntrack* sysctl
compat code"), the ctl_table_path member in struct nf_conntrack_l3proto{}
is not used anymore; remove it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-09 16:17:58 +02:00
Laura Garcia Liebana 0d9932b287 netfilter: nft_numgen: rename until attribute by modulus
The _until_ attribute is renamed to _modulus_ as the behaviour is similar to
other expressions with number limits (e.g. nft_hash).

Renaming is possible because there isn't a kernel release yet with these
changes.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:55:46 +02:00
Gao Feng c579a9e7d5 netfilter: gre: Use consistent GRE and PPTP header structures instead of the ones defined by netfilter
There are two existing structures which define the GRE and PPTP headers.
Use these two structures instead of the ones defined by netfilter to
keep consistent with other code.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:36:52 +02:00
Gao Feng ecc6569f35 netfilter: gre: Use consistent GRE_* macros instead of ones defined by netfilter.
There are already GRE_* macros in the kernel, so it is unnecessary to
define them again. Also remove some unused macros.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:36:48 +02:00
David S. Miller 0122c6d5fa RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV8yHKPSw1s6N8H32AQKbWA/+MEjzVzIOpp4edvAj5DzdO5pls3NvhUAg
 2sP15E9nOL9JHrOD5snCVSRd9ROEGRE0/S9WXUjeb2VKz7C2pmTixo+MjiJRMzAR
 TZZlE2Ydx0h+A7WiywKoTY4g7VICL+UC+4XcheMzTLNS2mzqKb2GGOm3BwvGaFeT
 RKRVIlwIHziShaIs7K7ZkfKxGaDIL/9x344uPfFHaDKb33aOBnTuY6HlFf5Yu2qm
 pzh+R7cBYJMvhEqd71ESYPSbSGnjBN5zRzjZDSvXeI/30k9Ee6mFastv4fdG3Mrk
 0WOLxml9yLTlJnfXeuN0T9B0C/ur4oD4hKEDREXVTxcTEXRq/VxSJ3cYxeM3DlAz
 U795lcveiYajRv7F73jcfNuaEENQg5HsZuaFs+CJgVxQpsqs9IOpEl9CNrVsvjmf
 9crgamUj34ehZ5lgsV/Qbm7OFk16dmQ59ImGClQFgsIU6hEWCP0VgwKXWTwP7YZD
 Ucp1zWp/XTAtbrRNdkye/Z0WE5QOtoUWzftPrf95TP7LFewp0DGZ+miBV/atSHbZ
 bADXR0SOtax8u9bfs1HYTadwHk1LJYuVXNYn/KN06c9q1WKuKH/qg0JAjv/KUnJm
 Nnx0TFh4kUg999+3CtxqmupiqkD5SCckCN2ggifgDCmQhEwr1CYrGKWpZvNMq+S6
 nVNTUr7PYSI=
 =RRYk
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160904-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Small fixes

Here's a set of small fix patches:

 (1) Fix some uninitialised variables.

 (2) Set the client call state before making it live by attaching it to the
     conn struct.

 (3) Randomise the epoch and starting client conn ID values, and don't
     change the epoch when the client conn ID rolls round.

 (4) Replace deprecated create_singlethread_workqueue() calls.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 13:46:26 -07:00
David S. Miller 60175ccdf4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree.  Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.

More specifically, they are:

1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
   transport area in dccp, sctp, tcp, udp and udplite protocol
   conntrackers, from Gao Feng.

2) Missing whitespace on error message in physdev match, from Hangbin Liu.

3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
   from Florian.

6) Rename nf_tables set implementation to nft_set_{name}.c

7) Introduce the hash expression to allow arbitrary hashing of selector
   concatenations, from Laura Garcia Liebana.

8) Remove ip_conntrack sysctl backward compatibility code, this code has
   been around for long time already, and we have two interfaces to do
   this already: nf_conntrack sysctl and ctnetlink.

9) Use nf_conntrack_get_ht() helper function whenever possible, instead
   of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

10) Add quota expression for nf_tables.

11) Add number generator expression for nf_tables, this supports
    incremental and random generators that can be combined with maps,
    very useful for load balancing purpose, again from Laura Garcia Liebana.

12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
    configuration, this is used by a follow up patch to perform better chain
    update validation.

14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
    nft_set_hash implementation to honor the NLM_F_EXCL flag.

15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
    patch from Florian Westphal.

16) Don't use the DYING bit to know if the conntrack event has been already
    delivered, instead a state variable to track event re-delivery
    states, also from Florian.

17) Remove the per-conntrack timer, use the workqueue approach that was
    discussed during the NFWS, from Florian Westphal.

18) Use the netlink conntrack table dump path to kill stale entries,
    again from Florian.

19) Add a garbage collector to get rid of stale conntracks, from
    Florian.

20) Reschedule garbage collector if eviction rate is high.

21) Get rid of the __nf_ct_kill_acct() helper.

22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

23) Make nf_log_set() interface assertive on unsupported families.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 12:45:26 -07:00
David Howells 5f2d9c4438 rxrpc: Randomise epoch and starting client conn ID values
Create a random epoch value rather than a time-based one on startup and set
the top bit to indicate that this is the case.

Also create a random starting client connection ID value.  This will be
incremented from here as new client connections are created.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Rosen, Rami dd19bde367 switchdev: Fix return value of switchdev_port_fdb_dump().
This patch fixes the return value of switchdev_port_fdb_dump() when
CONFIG_NET_SWITCHDEV is not set. This avoids the "warning: return makes
integer from pointer without a cast [-Wint-conversion]" seen with
several compiler versions when building with CONFIG_NET_SWITCHDEV unset.
The warning is due to commit d297653dd6
("rtnetlink: fdb dump: optimize by saving last interface markers").

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 11:00:21 -07:00
Alexei Starovoitov aa6a5f3cb2 perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When a program is attached, the overflow handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output
handler will not be called and the sampling event won't be stored in the
ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs, there is no ioctl to
detach the program or swap an already attached program. User space is
expected to close(perf_event_fd), like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.
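
A hedged user-space sketch of the resulting attach/detach lifecycle
(fd handling illustrative; assumes the existing PERF_EVENT_IOC_SET_BPF
ioctl is the attach path, as for kprobe+bpf):

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/perf_event.h>

	/* perf_fd: perf event fd, prog_fd: loaded BPF_PROG_TYPE_PERF_EVENT program */
	static int attach_bpf(int perf_fd, int prog_fd)
	{
		if (ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
			perror("PERF_EVENT_IOC_SET_BPF");
			return -1;
		}
		return 0;   /* detach by close(perf_fd); there is no detach ioctl */
	}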

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov 0515e5999a bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
    struct pt_regs regs;
    __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Vivien Didelot 04bed1434d net: dsa: remove ds_to_priv
Access the priv member of the dsa_switch structure directly, instead of
having an unnecessary helper.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:51:12 -07:00
Nikolay Aleksandrov b6cb5ac833 net: bridge: add per-port multicast flood flag
Add a per-port flag to control the unknown multicast flood, similar to the
unknown unicast flood flag and break a few long lines in the netlink flag
exports.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:48:33 -07:00
Roopa Prabhu d297653dd6 rtnetlink: fdb dump: optimize by saving last interface markers
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.
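
A sketch of the marker layout this implies (slot meanings illustrative;
cb->args is the scratch space netlink preserves between dump calls):

	/* resume state kept across netlink dump calls, instead of a single
	 * fdb index that forced a restart from the first device:
	 */
	cb->args[0] = hash_bucket;   /* netdev hash chain we stopped in */
	cb->args[1] = dev_index;     /* device within that chain       */
	cb->args[2] = fdb_index;     /* fdb entry within that device   */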

In the process of fixing the performance issue, this patch also
re-implements the fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real    1m11.791s
user    0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real    0m2.017s
user    0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:56:15 -07:00
Gao Feng 66fdd05e7a rps: flow_dissector: Add the const for the parameter of flow_keys_have_l4
Add const to the parameter of flow_keys_have_l4 for readability.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:51:08 -07:00
David Howells d001648ec7 rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook that indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

     int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent.  It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them.  They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
    	rxrpc_kernel_get_abort_code()
    	rxrpc_kernel_get_error_number()
    	rxrpc_kernel_free_skb()
    	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.  To process the queue internally, a temporary function,
temp_deliver_data() has been added.  This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:43:27 -07:00
Vivien Didelot 8df3025520 net: dsa: add MDB support
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 14:15:42 -07:00
Roopa Prabhu 14972cbd34 net: lwtunnel: Handle fragmentation
Today the mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is OK, but it could be
avoided if we did not do the mpls output redirect too early, i.e. we
could wait until ip fragmentation is done and then call mpls output
for each ip fragment.

To make this work we need:
1) the lwtunnel state to carry the encap headroom
2) to do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
David Howells 4de48af663 rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells 8324f0bcfb rxrpc: Provide a way for AFS to ask for the peer address of a call
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:

   void rxrpc_kernel_get_peer(struct rxrpc_call *call,
			      struct sockaddr_rxrpc *_srx);

In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.

Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells e34d4234b0 rxrpc: Trace rxrpc_call usage
Add a trace event for debugging rxrpc_call struct usage.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:02:36 +01:00
Gao Feng 779994fa36 netfilter: log: Check param to avoid overflow in nf_log_set
nf_log_set is an interface function, so it should do strict sanity
checks on its parameters. Convert the return value of nf_log_set to int
instead of void. When the pf is invalid, return -EOPNOTSUPP.
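
A hedged sketch of the shape of the check (the exact bound and the rest
of the function are not copied from the patch):

	int nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger)
	{
		if (pf >= NFPROTO_NUMPROTO)    /* reject unsupported families */
			return -EOPNOTSUPP;

		/* ... existing registration logic, returning 0 on success ... */
		return 0;
	}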

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:52:32 +02:00
Florian Westphal ad66713f5a netfilter: remove __nf_ct_kill_acct helper
After timer removal this just calls nf_ct_delete, so remove the __
prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:10 +02:00
Florian Westphal f330a7fdbe netfilter: conntrack: get rid of conntrack timer
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').

Remove the timer and use a 32-bit jiffies value containing the
timestamp until which the entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.
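
A hedged sketch of what such an expiry test looks like with a plain
32-bit deadline (helper name illustrative):

	/* expired once the signed difference to the stored deadline goes negative */
	static bool ct_is_expired(const struct nf_conn *ct)
	{
		return (s32)(ct->timeout - (u32)jiffies) <= 0;
	}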

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsolete entries (e.g. during
netlink dumps); this is needed to keep expired conntracks from hanging
around for too long when the lookup rate is low after a busy period.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal 616b14b469 netfilter: don't rely on DYING bit to detect when destroy event was sent
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the ecache worker in reliable event mode.

Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.

If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with other cpus.

Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
ecache extension.  The worker can then skip all entries that are in
a different state.  Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.
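
A hedged sketch of the tristate this describes (enum and state names
illustrative):

	/* per-conntrack destroy-event delivery state, kept in the ecache ext */
	enum ecache_destroy_state {
		DESTROY_EVENT_UNSENT,   /* nothing delivered yet               */
		DESTROY_EVENT_FAILED,   /* delivery failed, worker must retry  */
		DESTROY_EVENT_SENT,     /* delivered (or redelivered) already  */
	};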

Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with another cpu that tries to evict the same conntrack.

Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:08 +02:00
David S. Miller 6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Linus Torvalds 1f6a563ee0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Segregate namespaces properly in conntrack dumps, from Liping Zhang.

 2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet.

 3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.

 4) Fix use-after-free in tcp_xmit_retransmit_queue().

 5) Userspace header fixups (use of __u32, missing includes, etc.) from
    Mikko Rapeli.

 6) Further refinements to fragmentation wrt gso and tunnels, from
    Shmulik Ladkani.

 7) Trigger poll correctly for zero length UDP packets, from Eric
    Dumazet.

 8) TCP window scaling fix, also from Eric Dumazet.

 9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.

10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.

11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.

12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet.

13) Add new device ID to alx driver, from Owen Lin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
  Add Killer E2500 device ID in alx driver.
  net: smc91x: fix SMC accesses
  Documentation: networking: dsa: Remove platform device TODO
  net/mlx5: Increase number of ethtool steering priorities
  net/mlx5: Add error prints when validate ETS failed
  net/mlx5e: Fix memory leak if refreshing TIRs fails
  net/mlx5e: Add ethtool counter for TX xmit_more
  net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
  net/mlx5e: Don't wait for SQ completions on close
  net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
  net/mlx5e: Don't wait for RQ completions on close
  net/mlx5e: Limit UMR length to the device's limitation
  rhashtable: fix a memory leak in alloc_bucket_locks()
  sfc: fix potential stack corruption from running past stat bitmask
  team: loadbalance: push lacpdus to exact delivery
  net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
  8139cp: Fix one possible deadloop in cp_rx_poll
  i40e: Change some init flow for the client
  Revert "phy: IRQ cannot be shared"
  net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
  ...
2016-08-29 12:29:13 -07:00
Vidya Sagar Ravipati 5711a98221 net: ethtool: add support for 1000BaseX and missing 10G link modes
This patch enhances the ethtool link mode bitmap to include
missing interface modes for 1G/10G speeds.

Changes:
1000baseX is the mode introduced to cover all 1G fiber cases.
The individual modes under 1000baseX, i.e. 1000BASE-SX, 1000BASE-LX,
1000BASE-LX10 and 1000BASE-BX10, are not explicitly defined at this
moment.
10G CR, SR, LR and ER link modes are included for 10G speed.

Issue:
ethtool on a 1G/10G SFP port reports Base-T even though the port
supports 1000baseX and 10G CR, SR and LR modes.

root@tor-02$ ethtool swp1
Settings for swp1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
                                10000baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Current message level: 0x00000000 (0)

        Link detected: yes

After fix:
root@tor-02$ ethtool swp1
Settings for swp1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseX/Full
                                10000baseCR/Full
                                10000baseSR/Full
                                10000baseLR/Full
                                10000baseER/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Current message level: 0x00000000 (0)
        Link detected: yes

Signed-off-by: Vidya Sagar Ravipati <vidya@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:27:34 -04:00
Eric Dumazet c9c3321257 tcp: add tcp_add_backlog()
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.

While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.

The cause of the drops is skb->truesize overestimation, caused by:

- drivers allocating ~2048 (or more) bytes as a fragment to hold an
  Ethernet frame.

- various pskb_may_pull() calls bringing the headers into skb->head
  might have pulled all the frame content, but skb->truesize could
  not be lowered, as the stack has no idea of each fragment truesize.

The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.

Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.
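
A hedged sketch of the shape of such a backlog admission check (the
extra-room policy shown is illustrative, not the exact limit from the
patch):

	/* accept into the backlog with some slack beyond sk_rcvbuf, since
	 * skb->truesize routinely overestimates the memory really in use
	 */
	static bool add_to_backlog(struct sock *sk, struct sk_buff *skb)
	{
		u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;

		if (unlikely(sk_add_backlog(sk, skb, limit))) {
			__kfree_skb(skb);   /* still over the limit: drop */
			return true;
		}
		return false;
	}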

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:20:24 -04:00
Russell King 2fb04fdf30 net: smc91x: fix SMC accesses
Commit b70661c708 ("net: smc91x: use run-time configuration on all ARM
machines") broke some ARM platforms through several mistakes.  Firstly,
the access size must correspond to the following rule:

(a) at least one of 16-bit or 8-bit access size must be supported
(b) 32-bit accesses are optional, and may be enabled in addition to
    the above.

Secondly, it provides no emulation of 16-bit accesses, instead blindly
making 16-bit accesses even when the platform specifies that only 8-bit
is supported.

Reorganise smc91x.h so we can make use of the existing 16-bit access
emulation already provided - if 16-bit accesses are supported, use
16-bit accesses directly, otherwise if 8-bit accesses are supported,
use the provided 16-bit access emulation.  If neither, BUG().  This
exactly reflects the driver behaviour prior to the commit being fixed.
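
As a rough illustration of the selection rule (a sketch, not the
driver's actual macros):

	/* use the real 16-bit access when the bus supports it, otherwise
	 * emulate it with two 8-bit reads (sketch, little-endian layout)
	 */
	static inline u16 emulated_readw(void __iomem *base, unsigned int reg,
					 bool has_16bit)
	{
		if (has_16bit)
			return readw(base + reg);
		return readb(base + reg) | (readb(base + reg + 1) << 8);
	}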

Since the conversion incorrectly cut down the available access sizes on
several platforms, we also need to go through every platform and fix up
the overly-restrictive access size: Arnd assumed that if a platform can
perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
size needed to be specified - not so, all available access sizes must
be specified.

Fixing this likely also addresses some performance regressions: when a
platform appears not to support 8-bit accesses, 8-bit accesses have to
be emulated by performing a 16-bit read-modify-write access.

Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
accesses, which was broken by the original commit.

Fixes: b70661c708 ("net: smc91x: use run-time configuration on all ARM machines")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Tested-by: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:44:55 -04:00
Tom Herbert 96a5908347 kcm: Remove TCP specific references from kcm and strparser
kcm and strparser need to work with any type of stream socket, not just
TCP. Eliminate references to TCP and call the generic proto_ops
read_sock and peek_len functions. Also, in strp_init, check that the
socket supports the read_sock and peek_len proto_ops.
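
A hedged sketch of that strp_init check (error code illustrative):

	/* refuse sockets whose proto_ops lack the generic hooks we rely on */
	if (!sock->ops->read_sock || !sock->ops->peek_len)
		return -EAFNOSUPPORT;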

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Tom Herbert 3203558589 tcp: Set read_sock and peek_len proto_ops
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).
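
The stub mentioned above is essentially a one-liner; a sketch of its
likely shape:

	static int tcp_peek_len(struct socket *sock)
	{
		return tcp_inq(sock->sk);   /* bytes currently readable */
	}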

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Tom Herbert 0294b625ad net: Add read_sock proto_op
Add a new function to the proto_ops structure. This includes moving the
typedef for sk_read_actor into net.h and removing the definition from
tcp.h.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Linus Torvalds 25d0d91af7 Merge tag 'drm-fixes-for-4.8-rc4' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "A bunch of fixes covering i915, amdgpu, one tegra and some core DRM
  ones.  Nothing too strange at this point"

* tag 'drm-fixes-for-4.8-rc4' of git://people.freedesktop.org/~airlied/linux: (21 commits)
  drm/atomic: Don't potentially reset color_mgmt_changed on successive property updates.
  drm: Protect fb_defio in drivers with CONFIG_KMS_FBDEV_EMULATION
  drm/amdgpu: skip TV/CV in display parsing
  drm/amdgpu: avoid a possible array overflow
  drm/amdgpu: fix lru size grouping v2
  drm/tegra: dsi: Enhance runtime power management
  drm/i915: Fix botched merge that downgrades CSR versions.
  drm/i915/skl: Ensure pipes with changed wms get added to the state
  drm/i915/gen9: Only copy WM results for changed pipes to skl_hw
  drm/i915/skl: Add support for the SAGV, fix underrun hangs
  drm/i915/gen6+: Interpret mailbox error flags
  drm/i915: Reattach comment, complete type specification
  drm/i915: Unconditionally flush any chipset buffers before execbuf
  drm/i915/gen9: Drop invalid WARN() during data rate calculation
  drm/i915/gen9: Initialize intel_state->active_crtcs during WM sanitization (v2)
  drm: Reject page_flip for !DRIVER_MODESET
  drm/amdgpu: fix timeout value check in amd_sched_job_recovery
  drm/amdgpu: fix sdma_v2_4_ring_test_ib
  drm/amdgpu: fix amdgpu_move_blit on 32bit systems
  drm/radeon: fix radeon_move_blit on 32bit systems
  ...
2016-08-28 14:31:36 -07:00
Linus Torvalds af56ff27eb * ARM fixes:
** fixes for ITS init issues, error handling, IRQ leakage, race conditions
 ** An erratum workaround for timers
 ** Some removal of misleading use of errors and comments
 ** A fix for GICv3 on 32-bit guests
 * MIPS fix where the guest could wrongly map the first page of physical memory
 * x86 nested virtualization fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXtyfVAAoJEL/70l94x66Dhe4IAIOGI/OYVWU5IfUQ01oeRgD3
 7wN222OmyC/K0/hSZc7ndRdcQfr5ombgM9XsS/EbkcRacWxAUHDX2FaYMpKgjT2M
 Dnh2tJHuPz/4VtByGQ2fZ4hziK7amn18/MtPFCee+mIj0ya2fcWZ4qHVU8pKC6Ps
 mVVZ0kxXsdV4pw9y6XgBLz/4bTLeASKvhFZrWOnjJoa+GeH2MFwocS0xaEI0HwxP
 HVwcgoRdGXJuKUB9jE9FDWmWOgdoLnCG1bNUOvXKPcE0ZaFQDT4I4dImkBys3rqz
 jbqnhLrpGEY2ZC3Rj+VyD2MOXbYOOSi59GRwYmCkqD96ZarHxSu3PdyCxmIFWzM=
 =+4WK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:
   - fixes for ITS init issues, error handling, IRQ leakage, race
     conditions
   - an erratum workaround for timers
   - some removal of misleading use of errors and comments
   - a fix for GICv3 on 32-bit guests

  MIPS:
   - fix for where the guest could wrongly map the first page of
     physical memory

  x86:
   - nested virtualization fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  MIPS: KVM: Check for pfn noslot case
  kvm: nVMX: fix nested tsc scaling
  KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
  KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
  arm64: KVM: report configured SRE value to 32-bit world
  arm64: KVM: remove misleading comment on pmu status
  KVM: arm/arm64: timer: Workaround misconfigured timer interrupt
  arm64: Document workaround for Cortex-A72 erratum #853709
  KVM: arm/arm64: Change misleading use of is_error_pfn
  KVM: arm64: ITS: avoid re-mapping LPIs
  KVM: arm64: check for ITS device on MSI injection
  KVM: arm64: ITS: move ITS registration into first VCPU run
  KVM: arm64: vgic-its: Make updates to propbaser/pendbaser atomic
  KVM: arm64: vgic-its: Plug race in vgic_put_irq
  KVM: arm64: vgic-its: Handle errors from vgic_add_lpi
  KVM: arm64: ITS: return 1 on successful MSI injection
2016-08-27 15:51:50 -07:00