It's not allowed to push Ethernet header in front of another Ethernet
header.
It's not allowed to pop Ethernet header if there's a vlan tag. This
preserves the invariant that L3 packet never has a vlan tag.
Based on previous versions by Lorand Jakab and Simon Horman.
Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().
We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()
Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal PTP Hardware Clock (PHC) interface limits the resolution for
frequency adjustments to one part per billion. However, some hardware
devices allow finer adjustment, and making use of the increased resolution
improves synchronization measurably on such devices.
This patch adds an alternative method that allows finer frequency tuning
by passing the scaled ppm value to PHC drivers. This value comes from
user space, and it has a resolution of about 0.015 ppb. We also deprecate
the older method, anticipating its removal once existing drivers have been
converted over.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Ulrik De Bie <ulrik.debie-os@e2big.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no more users except from net/core/dev.c
napi_hash_add() can now be static.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tc_act macro addressed a non existing field, and was not used in the
kernel source.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary functions to compute and check the HMAC signature
of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
hmac(sha256).
In order to avoid dynamic memory allocation for each HMAC computation,
a per-cpu ring buffer is allocated for this purpose.
A new per-interface sysctl called seg6_require_hmac is added, allowing a
user-defined policy for processing HMAC-signed SR-enabled packets.
A value of -1 means that the HMAC field will always be ignored.
A value of 0 means that if an HMAC field is present, its validity will
be enforced (the packet is dropped is the signature is incorrect).
Finally, a value of 1 means that any SR-enabled packet that does not
contain an HMAC signature or whose signature is incorrect will be dropped.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new type of interfaceless lightweight tunnel (SEG6),
enabling the encapsulation and injection of SRH within locally emitted
packets and forwarded packets.
>From a configuration viewpoint, a seg6 tunnel would be configured as follows:
ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0
Any packet whose destination address is fc00::1 would thus be encapsulated
within an outer IPv6 header containing the SRH with three segments, and would
actually be routed to the first segment of the list. If `mode inline' was
specified instead of `mode encap', then the SRH would be directly inserted
after the IPv6 header without outer encapsulation.
The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
feature was made configurable because direct header insertion may break
several mechanisms such as PMTUD or IPSec AH.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary hooks and structures to provide support
for SR-IPv6 control plane, essentially the Generic Netlink commands
that will be used for userspace control over the Segment Routing
kernel structures.
The genetlink commands provide control over two different structures:
tunnel source and HMAC data. The tunnel source is the source address
that will be used by default when encapsulating packets into an
outer IPv6 header + SRH. If the tunnel source is set to :: then an
address of the outgoing interface will be selected as the source.
The HMAC commands currently just return ENOTSUPP and will be implemented
in a future patch.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement minimal support for processing of SR-enabled packets
as described in
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.
This patch implements the following operations:
- Intermediate segment endpoint: incrementation of active segment and rerouting.
- Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
and routing of inner packet.
- Cleanup flag support for SR-inlined packets: removal of SRH if we are the
penultimate segment endpoint.
A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
packets. Default is deny.
This patch does not provide support for HMAC-signed packets.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
The attributes L2TP_ATTR_UDP_ZERO_CSUM6_RX and
L2TP_ATTR_UDP_ZERO_CSUM6_TX are used as flags,
but is defined as a u8 in a comment.
This patch redocuments them as flags.
Adding nla_policy entries would break API, so not doing that.
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support steering rules which add encapsulation headers,
encap_id parameter is needed.
Add new mlx5_flow_act struct which holds action related parameter:
action, flow_tag and encap_id. Use mlx5_flow_act struct when adding a new
steering rule.
This patch doesn't change any functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating flow tables, allow the caller to specify creation flags.
Currently no flags are used and as such this patch doesn't add any new
functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current tunnel set action supports only IP addresses and key
options. Add UDP dst port option.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New encapsulation keys were added to the flower classifier, which allow
classification according to outer (encapsulation) headers attributes
such as key and IP addresses.
In order to expose those attributes outside flower, add
corresponding enums in the flow dissector.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed for drivers to pick the relevant action when offloading tunnel
key act.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of adding hooks inside stmmac_platform it is better to just use
the standard PM callbacks within the specific dwmac-driver. This only
used by the dwmac-rk driver.
This reverts commit cecbc5563a ("stmmac: allow to split suspend/resume
from init/exit callbacks").
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.
We would like to add more instrumentation points for TCP, and
avoiding grabing socket lock in tcp_getinfo() was too costly.
Being able to lock the socket allows to provide consistent set
of fields.
inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.
We can remove syncp added in commit d654976cbf
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.
Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted. This
value later caused much heartache for the bufferbloat community.
This patch merely moves the value into a defined constant.
[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.
Overall, this allows acquiring only once the receive queue
lock on dequeue.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr sinks vanilla patched
1 440 560
3 2150 2300
6 3650 3800
9 4450 4600
12 6250 6450
v1 -> v2:
- do rmem and allocated memory scheduling under the receive lock
- do bulk scheduling in first_packet_length() and in udp_destruct_sock()
- avoid the typdef for the dequeue callback
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can use it even after orphaining the skbuff.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BCM54810 PHY requires some semi-unique configuration, which results
in some additional configuration in addition to the standard config.
Also, some users of the BCM54810 require the PHY lanes to be swapped.
Since there is no way to detect this, add a device tree query to see if
it is applicable.
Inspired-by: Vikas Soni <vsoni@broadcom.com>
Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function to read the AUXCTL register for the BCM54xx. This
mirrors the bcm54xx_auxctl_write function already present in the code.
Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
seq_file users can only access const version of file pointer,
because the ->file member of struct seq_operations is marked
as such. Make parameter to debugfs_real_fops() const.
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Nicolai Stange <nicstange@gmail.com>
CC: Christian Lamparter <chunkeey@gmail.com>
CC: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.
Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
rtnetlink. The value INVALID_UID indicates no UID was
specified.
- Add a UID field to the flow structures.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.
Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().
Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:
1. Whenever sk_socket is non-null: sk_uid is the same as the UID
in sk_socket, i.e., matches the return value of sock_i_uid.
Specifically, the UID is set when userspace calls socket(),
fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
- For a socket that no longer has a sk_socket because
userspace has called close(): the previous UID.
- For a cloned socket (e.g., an incoming connection that is
established but on which userspace has not yet called
accept): the UID of the socket it was cloned from.
- For a socket that has never had an sk_socket: UID 0 inside
the user namespace corresponding to the network namespace
the socket belongs to.
Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.
Example usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 ip_proto sctp dst_port 80 \
action drop
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.
At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.
Tested using the test for IP_RECVFRAGSIZE plus
ip netns exec to ip addr add dev veth1 fc07::1/64
ip netns exec from ip addr add dev veth0 fc07::2/64
ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
ip netns exec from nc -q 1 -u fc07::1 6000 < payload
Both with and without enabling connection tracking
ip6tables -A INPUT -m state --state NEW -p udp -j LOG
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.
Tested:
Sent data over a veth pair of which the source has a small mtu.
Sent data using netcat, received using a dedicated process.
Verified that the cmsg IP_RECVFRAGSIZE is returned only when
data arrives fragmented, and in that cases matches the veth mtu.
ip link add veth0 type veth peer name veth1
ip netns add from
ip netns add to
ip link set dev veth1 netns to
ip netns exec to ip addr add dev veth1 192.168.10.1/24
ip netns exec to ip link set dev veth1 up
ip link set dev veth0 netns from
ip netns exec from ip addr add dev veth0 192.168.10.2/24
ip netns exec from ip link set dev veth0 up
ip netns exec from ip link set dev veth0 mtu 1300
ip netns exec from ethtool -K veth0 ufo off
dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old ethtool api (get_setting and set_setting) has generic mii
functions mii_ethtool_sset and mii_ethtool_gset.
To support the new ethtool api ({get|set}_link_ksettings), we add
two generics mii function mii_ethtool_{get|set}_link_ksettings_get.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:
1) Add fib lookup expression for nf_tables, from Florian Westphal. This
new expression provides a native replacement for iptables addrtype
and rp_filter matches. This is more flexible though, since we can
populate the kernel flowi representation to inquire fib to
accomodate new usecases, such as RTBH through skb mark.
2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
new expression allow you to access skbuff route metadata, more
specifically nexthop and classid fields.
3) Add notrack support for nf_tables, to skip conntracking, requested by
many users already.
4) Add boilerplate code to allow to use nf_log infrastructure from
nf_tables ingress.
5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
the xtables cluster match, from Liping Zhang.
6) Move socket lookup code into generic nf_socket_* infrastructure so
we can provide a native replacement for the xtables socket match.
7) Make sure nfnetlink_queue data that is updated on every packets is
placed in a different cache from read-only data, from Florian Westphal.
8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.
9) Start round robin number generation in nft_numgen from zero,
instead of n-1, for consistency with xtables statistics match,
patch from Liping Zhang.
10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
given we retry with a smaller allocation on failure, from Calvin Owens.
11) Cleanup xt_multiport to use switch(), from Gao feng.
12) Remove superfluous check in nft_immediate and nft_cmp, from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces an nftables rt expression for routing related data with support
for nexthop (i.e. the directly connected IP address that an outgoing packet
is sent to), which can be used either for matching or accounting, eg.
# nft add rule filter postrouting \
ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop
This will drop any traffic to 192.168.1.0/24 that is not routed via
192.168.0.1.
# nft add rule filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule ip6 filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
These rules count outgoing traffic per nexthop. Note that the timeout
releases an entry if no traffic is seen for this nexthop within 10 minutes.
# nft add rule inet filter postrouting \
ether type ip \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule inet filter postrouting \
ether type ip6 \
flow table acct { rt nexthop timeout 600s counter }
Same as above, but via the inet family, where the ether type must be
specified explicitly.
"rt classid" is also implemented identical to "meta rtclassid", since it
is more logical to have this match in the routing expression going forward.
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.
This patch adds the boiler plate code to register the netdev logging
family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).
Currently supports fetching output interface index/name and the
rtm_type associated with an address.
This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.
The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.
FIB result order for oif/oifname retrieval is as follows:
- if packet is local (skb has rtable, RTF_LOCAL set, this
will also catch looped-back multicast packets), set oif to
the loopback interface.
- if fib lookup returns an error, or result points to local,
store zero result. This means '--local' option of -m rpfilter
is not supported. It is possible to use 'fib type local' or add
explicit saddr/daddr matching rules to create exceptions if this
is really needed.
- store result in the destination register.
In case of multiple routes, search set for desired oif in case
strict matching is requested.
ipv4 and ipv6 behave fib expressions are supposed to behave the same.
[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
http://patchwork.ozlabs.org/patch/688615/
to address fallout from this patch after rebasing nf-next, that was
posted to address compilation warnings. --pablo ]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Systems with large pages (64KB pages for example) do not always have
huge quantity of memory.
A big SK_MEM_QUANTUM value leads to fewer interactions with the
global counters (like tcp_memory_allocated) but might trigger
memory pressure much faster, giving suboptimal TCP performance
since windows are lowered to ridiculous values.
Note that sysctl_mem units being in pages and in ABI, we also need
to change sk_prot_mem_limits() accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the common pimhdr structure and helpers to access it, also cleanup the
format of the header file.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, each interfaces assumes it receives an equal portion
of HW/FW resources, but this is wasteful - different partitions
[and specifically, parititions exposing different protocol support]
might require different resources.
Implement a new resource learning scheme where the information is
received directly from the management firmware [which has knowledge
of all of the functions and can serve as arbiter].
Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Management firmware is interested in various tidbits about
the driver - including the driver state & several configuration
related fields [MTU, primtary MAC, etc.].
This adds the necessray logic to update MFW with such configurations,
some of which are passed directly via qed while for others APIs
are provide so that qede would be able to later configure if needed.
This also introduces a new default configuration for MTU which would
replace the default inherited by being an ethernet device.
Signed-off-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting and using XPS when QoS via traffic
classes is enabled. With this change we will factor in the priority and
traffic class mapping of the packet and use that information to correctly
select the queue.
This allows us to define a set of queues for a given traffic class via
mqprio and then configure the XPS mapping for those queues so that the
traffic flows can avoid head-of-line blocking between the individual CPUs
if so desired.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysfs attribute for a Tx queue that allows us to determine the
traffic class for a given queue. This will allow us to more easily
determine this in the future. It is needed as XPS will take the traffic
class for a group of queues into account in order to avoid pulling traffic
from one traffic class into another.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions for configuring the traffic class to queue mappings have
other effects that need to be addressed. Instead of trying to export a
bunch of new functions just relocate the functions so that we can
instrument them directly with the functionality they will need.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codes use _IOC_TYPE(cmd) == 0x89 to check if the cmd is one
socket ioctl command like SIOCGIFHWADDR. But the literal number 0x89 may
confuse readers. So create one macro SOCK_IOC_TYPE to enhance the readability.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>