Before calling dev_hard_start_xmit(), upper layers tried
to cook optimal skb list based on BQL budget.
Problem is that GSO packets can end up comsuming more than
the BQL budget.
Breaking the loop is not useful, since requeued packets
are ahead of any packets still in the qdisc.
It is also more expensive, since next TX completion will
push these packets later, while skbs are not in cpu caches.
It is also a behavior difference with TSO packets, that can
break the BQL limit by a large amount.
Note that drivers should use __netdev_tx_sent_queue()
in order to have optimal xmit_more support, and avoid
useless atomic operations as shown in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just like with normal GRO processing, we have to initialize
skb->next to NULL when we unlink overflow packets from the
GRO hash lists.
Fixes: d4546c2509 ("net: Convert GRO SKB handling to list_head.")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
sk_msg BPF integration for the latter, from Daniel and John.
2) Enable BPF syscall side to indicate for maps that they do not support
a map lookup operation as opposed to just missing key, from Prashant.
3) Add bpftool map create command which after map creation pins the
map into bpf fs for further processing, from Jakub.
4) Add bpftool support for attaching programs to maps allowing sock_map
and sock_hash to be used from bpftool, from John.
5) Improve syscall BPF map update/delete path for map-in-map types to
wait a RCU grace period for pending references to complete, from Daniel.
6) Couple of follow-up fixes for the BPF socket lookup to get it
enabled also when IPv6 is compiled as a module, from Joe.
7) Fix a generic-XDP bug to handle the case when the Ethernet header
was mangled and thus update skb's protocol and data, from Jesper.
8) Add a missing BTF header length check between header copies from
user space, from Wenwen.
9) Minor fixups in libbpf to use __u32 instead u32 types and include
proper perf_event.h uapi header instead of perf internal one, from Yonghong.
10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
to bpftool's build, from Jiri.
11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate
SMT frames with adapter's firmware. Any SMT frame received from the RMC
via the Rx queue is queued back by the driver to the SMT Rx queue for
the firmware to process. Similarly the firmware uses the SMT Tx queue
to supply the driver with SMT frames which are queued back to the Tx
queue for the RMC to send to the ring.
When a network tap is attached to an FDDI interface handled by `defza'
any incoming SMT frames captured are queued to our usual processing of
network data received, which in turn delivers them to any listening
taps.
However the outgoing SMT frames produced by the firmware bypass our
network protocol stack and are therefore not delivered to taps. This in
turn means that taps are missing a part of network traffic sent by the
adapter, which may make it more difficult to track down network problems
or do general traffic analysis.
Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
a network tap is attached, with a newly-created `dev_nit_active' helper
wrapping the usual condition used in the transmit path.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 5aad1de5ea ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.
As Stefano described in commit e9fa1495d7 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
- if the local MTU is now lower than the PMTU, that PMTU is now
incorrect
- if the local MTU was the lowest value in the path, and is increased,
we might discover a higher PMTU
Similarly to what commit e9fa1495d7 did for IPv6, update PMTU in those
cases.
If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.
To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.
Fixes: 5aad1de5ea ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP can modify (and resize) the Ethernet header in the packet.
There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
are setup before reaching (netif_receive_)generic_xdp.
This bug was hit when XDP were popping VLAN headers (changing
eth->h_proto), as skb->protocol still contains VLAN-indication
(ETH_P_8021Q) causing invocation of skb_vlan_untag(skb), which corrupt
the packet (basically popping the VLAN again).
This patch catch if XDP changed eth header in such a way, that SKB
fields needs to be updated.
V2: on request from Song Liu, use ETH_HLEN instead of mac_len,
in __skb_push() as eth_type_trans() use ETH_HLEN in paired skb_pull_inline().
Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It documents what is happening, and eliminates the spurious list
pointer poisoning.
In the long term, in order to get proper list head debugging, we
might want to use the list poison value as the indicator that
an SKB is a singleton and not on a list.
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter. With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them. So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)
Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this. Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
The first intended user is lldpd.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.
This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().
Acked-by: Jason Wang <jasowang@redhat.com>
============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by swapper/0/1:
#0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
#1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
#2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
v2: move cpus_read_lock() out of __netif_set_xps_queue()
Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_set_mtu_ext is able to fail with a valid mtu value, at that
condition, extack._msg is not set and random since it is in stack,
then kernel will crash when print it.
Fixes: 7a4c53bee3 ("net: report invalid mtu value via netlink extack")
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BTF conflicts were simple overlapping changes.
The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar TC_ACT_REDIRECT, but with a slightly different
semantic:
- on ingress the mirred skbs are passed to the target device
network stack without any additional check not scrubbing.
- the rcu-protected stats provided via the tcf_result struct
are updated on error conditions.
This new tcfa_action value is not exposed to the user-space
and can be used only internally by clsact.
v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
a new action type instead
v2 -> v3:
- rename the new action value TC_ACT_REINJECT, update the
helper accordingly
- take care of uncloned reinjected packets in XDP generic
hook
v3 -> v4:
- renamed again the new action value (JiriP)
v4 -> v5:
- fix build error with !NET_CLS_ACT (kbuild bot)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an invalid MTU value is set through rtnetlink return extra error
information instead of putting message in kernel log. For other cases
where there is no visible API, keep the error report in the log.
Example:
# ip li set dev enp12s0 mtu 10000
Error: mtu greater than device maximum.
# ifconfig enp12s0 mtu 10000
SIOCSIFMTU: Invalid argument
# dmesg | tail -1
[ 2047.795467] enp12s0: mtu greater than device maximum
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix dev_change_tx_queue_len so it rolls back original value
upon a failure in dev_qdisc_change_tx_queue_len.
This is already done for notifirers' failures, share the code.
In case of failure in dev_qdisc_change_tx_queue_len, some tx queues
would still be of the new length, while they should be reverted.
Currently, the revert is not done, and is marked with a TODO label
in dev_qdisc_change_tx_queue_len, and should find some nice solution
to do it.
Yet it is still better to not apply the newly requested value.
Fixes: 48bfd55e7e ("net_sched: plug in qdisc ops change_tx_queue_len")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a patch by Sean Tranchetti.
Fixes: d4546c2509 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: David S. Miller <davem@davemloft.net>
FIELD_SIZEOF() is in bytes, but we want bits.
Fixes: d9f37d01e2 ("net: convert gro_count to bitmask")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro_hash size is 192 bytes, and uses 3 cache lines, if there is few
flows, gro_hash may be not fully used, so it is unnecessary to iterate
all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline.
convert gro_count to a bitmask, and rename it as gro_bitmask, each bit
represents a element of gro_hash, only flush a gro_hash element if the
related bit is set, to speed up napi_gro_flush().
and update gro_bitmask only if it will be changed, to reduce cache
update
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-15
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time. Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present). We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.
Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw). Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports. Remove
prog_attached member.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jeff Kirsher says:
====================
L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.
The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue, however it doesn't
really make sense to configure a macvlan interfaced as being multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.
The big changes in this set are:
Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
Disable XPS for single queue devices
Replace accel_priv with sb_dev in ndo_select_queue
Add sb_dev parameter to fallback function for ndo_select_queue
Consolidated ndo_select_queue functions that appeared to be duplicates
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
__netif_receive_skb_core can free the skb, so we have to use the dequeue-
enqueue model when calling it from __netif_receive_skb_list_core.
Fixes: 88eb1944e1 ("net: core: propagate SKB lists through packet_type lookup")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(),
do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus,
we cannot wait until after they return to remove the skb from the list;
instead, we remove it first and, in the pass case, add it to a sublist
afterwards.
In the case of enqueue_to_backlog() we have already decided not to pass
when we call the function, so we do not need a sublist.
Fixes: 7da517a3bc ("net: core: Another step of skb receive list processing")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.
The only driver that has any significant change in this patch set should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.
The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.
The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.
In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
After commit 07d78363dc ("net: Convert NAPI gro list into a small hash
table.")' there is 8 hash buckets, which allows more flows to be held for
merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still,
limit the hash table performance.
keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not
the total 8 skb
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generally the check should be very cheap, as the sk_buff_head is in cache.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__netif_receive_skb_core() does a depressingly large amount of per-packet
work that can't easily be listified, because the another_round looping
makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
* Hardware devices generally don't have an rx_handler
* Unless you're tcpdumping or something, there is usually only one ptype
* VLAN processing comes before the protocol ptype lookup, so doesn't force
a pt_prev deliver
so normally, __netif_receive_skb_core() will run straight through and pass
back the one ptype found in ptype_base[hash of skb->protocol].
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First example of a layer splitting the list (rather than merely taking
individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
but cuts on the other side of the given entry.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_receive_skb_list_internal() now processes a list and hands it
on to the next function.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve the performance of GRO receive by splitting flows into
multiple hash chains.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manage pending per-NAPI GRO packets via list_head.
Return an SKB pointer from the GRO receive handlers. When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.
Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.
Signed-off-by: David S. Miller <davem@davemloft.net>
if dev_get_valid_name failed, propagate its return code
and remove the setting err to ENODEV, it will be set to
0 again before dev_change_net_namespace exits.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsXFUEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIoomg//eRNpc6x9kxTijN670AC2uD0CBTlZ
2z6mHuJaOhG8bTxjZxQfUBoo6/eZJ2YC1yq6ornGFNzw4sfKsR/j86ujJim2HAmo
opUhziq3SILGEvjsxfPkREe/wb49jy0AA/WjZqciitB1ig8Hz7xzqi0lpNaEspFh
QJFB6XXkojWGFGrRzruAVJnPS+pDWoTQR0qafs3JWKnpeinpOdZnl1hPsysAEHt5
Ag8o4qS/P9xJM0khi7T+jWECmTyT/mtWqEtFcZ0o+JLOgt/EMvNX6DO4ETDiYRD2
mVChga9x5r78bRgNy2U8IlEWWa76WpcQAEODvhzbijX4RxMAmjsmLE+e+udZSnMZ
eCITl2f7ExxrL5SwNFC/5h7pAv0RJ+SOC19vcyeV4JDlQNNVjUy/aNKv5baV0aeg
EmkeobneMWxqHx52aERz8RF1in5pT8gLOYoYnWfNpcDEmjLrwhuZLX2asIzUEqrS
SoPJ8hxIDCxceHOWIIrz5Dqef7x28Dyi46w3QINC8bSy2RnR/H3q40DRegvXOGiS
9WcbbwbhnM4Kau413qKicGCvdqTVYdeyZqo7fVelSciD139Vk7pZotyom4MuU25p
fIyGfXa8/8gkl7fZ+HNkZbba0XWNfAZt//zT095qsp3CkhVnoybwe6OwG1xRqErq
W7OOQbS7vvN/KGo=
=10u6
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Another reasonable chunk of audit changes for v4.18, thirteen patches
in total.
The thirteen patches can mostly be broken down into one of four
categories: general bug fixes, accessor functions for audit state
stored in the task_struct, negative filter matches on executable
names, and extending the (relatively) new seccomp logging knobs to the
audit subsystem.
The main driver for the accessor functions from Richard are the
changes we're working on to associate audit events with containers,
but I think they have some standalone value too so I figured it would
be good to get them in now.
The seccomp/audit patches from Tyler apply the seccomp logging
improvements from a few releases ago to audit's seccomp logging;
starting with this patchset the changes in
/proc/sys/kernel/seccomp/actions_logged should apply to both the
standard kernel logging and audit.
As usual, everything passes the audit-testsuite and it happens to
merge cleanly with your tree"
[ Heh, except it had trivial merge conflicts with the SELinux tree that
also came in from Paul - Linus ]
* tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Fix wrong task in comparison of session ID
audit: use existing session info function
audit: normalize loginuid read access
audit: use new audit_context access funciton for seccomp_actions_logged
audit: use inline function to set audit context
audit: use inline function to get audit context
audit: convert sessionid unset to a macro
seccomp: Don't special case audited processes when logging
seccomp: Audit attempts to modify the actions_logged sysctl
seccomp: Configurable separator for the actions_logged string
seccomp: Separate read and write code for actions_logged sysctl
audit: allow not equal op for audit by executable
audit: add syscall information to FEATURE_CHANGE records
- replaceme the force_dma flag with a dma_configure bus method.
(Nipun Gupta, although one patch is іncorrectly attributed to me
due to a git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few cleanups
to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlsU1hwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPraxAAocC7JiFKW133/VugCtGA1x9uE8DPHealtsWTAeEq
KOOB3GxWMU2hKqQ4km5tcfdWoGJvvab6hmDXcitzZGi2JajO7Ae0FwIy3yvxSIKm
iH/ON7c4sJt8gKrXYsLVylmwDaimNs4a6xfODoCRgnWuovI2QrrZzupnlzPNsiOC
lv8ezzcW+Ay/gvDD/r72psO+w3QELETif/OzR/qTOtvLrVabM06eHmPQ8Wb98smu
/UPMMv6/3XwQnxpxpdyqN+p/gUdneXithzT261wTeZ+8gDXmcWBwHGcMBCimcoBi
FklW52moazIPIsTysqoNlVFsLGJTeS4p2D3BLAp5NwWYsLv+zHUVZsI1JY/8u5Ox
mM11LIfvu9JtUzaqD9SvxlxIeLhhYZZGnUoV3bQAkpHSQhN/xp2YXd5NWSo5ac2O
dch83+laZkZgd6ryw6USpt/YTPM/UHBYy7IeGGHX/PbmAke0ZlvA6Rae7kA5DG59
7GaLdwQyrHp8uGFgwze8P+R4POSk1ly73HHLBT/pFKnDD7niWCPAnBzuuEQGJs00
0zuyWLQyzOj1l6HCAcMNyGnYSsMp8Fx0fvEmKR/EYs8O83eJKXi6L9aizMZx4v1J
0wTolUWH6SIIdz474YmewhG5YOLY7mfe9E8aNr8zJFdwRZqwaALKoteRGUxa3f6e
zUE=
=6Acj
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- replace the force_dma flag with a dma_configure bus method. (Nipun
Gupta, although one patch is іncorrectly attributed to me due to a
git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few
cleanups to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
* tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping: (48 commits)
dma-direct: don't crash on device without dma_mask
nds32: use generic dma_noncoherent_ops
nds32: implement the unmap_sg DMA operation
nds32: consolidate DMA cache maintainance routines
x86/pci-dma: switch the VIA 32-bit DMA quirk to use the struct device flag
x86/pci-dma: remove the explicit nodac and allowdac option
x86/pci-dma: remove the experimental forcesac boot option
Documentation/x86: remove a stray reference to pci-nommu.c
core, dma-direct: add a flag 32-bit dma limits
dma-mapping: remove unused gfp_t parameter to arch_dma_alloc_attrs
dma-debug: check scatterlist segments
c6x: use generic dma_noncoherent_ops
arc: use generic dma_noncoherent_ops
arc: fix arc_dma_{map,unmap}_page
arc: fix arc_dma_sync_sg_for_{cpu,device}
arc: simplify arc_dma_sync_single_for_{cpu,device}
dma-mapping: provide a generic dma-noncoherent implementation
dma-mapping: simplify Kconfig dependencies
riscv: add swiotlb support
riscv: only enable ZONE_DMA32 for 64-bit
...