This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a small helper skb_postpush_rcsum() and fix up redirect locations
that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
a proper csum that covers also Ethernet header, f.e. since 2c26d34bbc
("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
also do skb_postpull_rcsum() after pulling Ethernet header off via
eth_type_trans().
When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
I can trigger the following csum issue with an IPv6 setup:
[ 505.144065] dummy1: hw csum failure
[...]
[ 505.144108] Call Trace:
[ 505.144112] <IRQ> [<ffffffff81372f08>] dump_stack+0x44/0x5c
[ 505.144134] [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
[ 505.144142] [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
[ 505.144149] [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
[ 505.144161] [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
[ 505.144170] [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
[ 505.144177] [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
[ 505.144189] [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
[ 505.144196] [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
[ 505.144204] [<ffffffff8164385d>] nf_iterate+0x5d/0x70
[ 505.144210] [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
[ 505.144218] [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
[ 505.144225] [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
[ 505.144232] [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
[ 505.144239] [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
[ 505.144245] [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
[ 505.144252] [<ffffffff8160ccff>] process_backlog+0x9f/0x140
[ 505.144259] [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
[...]
What happens is that on ingress, we push Ethernet header back in, either
from cls_bpf or right before skb_do_redirect(), but without updating csum.
The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
helper for the dev_forward_skb() case to correct the csum diff again.
Thanks to Hannes Frederic Sowa for the csum_partial() idea!
Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the details behind the cb[] access into a small helper to decouple
and make them generic for bpf_prog_run_save_cb()/bpf_prog_run_clear_cb()
that was introduced via commit ff936a04e5 ("bpf: fix cb access in socket
filter programs"). Also add a comment to better clarify what is done in
bpf_skb_cb().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) Release nf_tables objects on netns destructions via
nft_release_afinfo().
2) Destroy basechain and rules on netdevice removal in the new netdev
family.
3) Get rid of defensive check against removal of inactive objects in
nf_tables.
4) Pass down netns pointer to our existing nfnetlink callbacks, as well
as commit() and abort() nfnetlink callbacks.
5) Allow to invert limit expression in nf_tables, so we can throttle
overlimit traffic.
6) Add packet duplication for the netdev family.
7) Add forward expression for the netdev family.
8) Define pr_fmt() in conntrack helpers.
9) Don't leave nfqueue configuration on inconsistent state in case of
errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
him.
10) Skip queue option handling after unbind.
11) Return error on unknown both in nfqueue and nflog command.
12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.
13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
from Carlos Falgueras.
14) Add support for 64 bit byteordering changes nf_tables, from Florian
Westphal.
15) Add conntrack byte/packet counter matching support to nf_tables,
also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Make device_free and device_remove operations in the mdio device
structure, so the core code does not need to differentiate between
phy devices and generic mdio devices.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all devices on an MDIO bus are PHYs. Meaning not all MDIO drivers
are PHY drivers. Add support for generic MDIO drivers.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matching a driver to a device has both generic parts, and parts which
are specific to PHY devices. Move the PHY specific parts into
phy_device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than have each driver set the driver owner field, do it once in
the core code. This will also help with later changes, when the device
structure will move.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MDIO PM operations are really PHY device PM operations. So move
them into phy_device. This will be needed when we support devices on
the mdio bus which are not PHYs.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than have drivers directly manipulate the mii_bus structure,
provide and API for registering and unregistering devices on an MDIO
bus, and performing lookups.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all devices attached to an MDIO bus are phys. So add an
mdio_device structure to represent the generic parts of an mdio
device, and place this structure into the phy_device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Have mdio_alloc() create the array of interrupt numbers, and
initialize it to POLLING. This is what most MDIO drivers want, so
allowing code to be removed from the drivers.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many Ethernet drivers contain the same netdev_info() print statement
about the attached phy. Move it into the phy device code. Additionally
add a varargs function which can be used to append additional
information.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The address of the device can be determined from the phydev structure,
rather than passing it as a parameter.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a phydev_name() function, to help with moving some structure members
from phy_device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for moving some of the phy_device structure members,
add macros for printing errors and debug information.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are logically MDIO operations, not phy operations, so move them
into the mdio header.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Within phy.h, an address on an MII bus has been called both addr and
phy_id. phy_id is particularly confusion, since it also means the ID
found in register 3, if the device on the bus is a phy. Consistently
use addr.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A repeating pattern in drivers has become to use OF node information
and, if not found, platform specific host information to extract the
ethernet address for a given device.
Currently this is done with a call to of_get_mac_address() and then
some ifdef'd stuff for SPARC.
Consolidate this into a portable routine, and provide the
arch_get_platform_mac_address() weak function hook for all
architectures to implement if they want.
Signed-off-by: David S. Miller <davem@davemloft.net>
TX fast path uses ndo_start_xmit(), ndo_features_check() and
ndo_select_queue().
Move ndo_features_check() close to ndo_start_xmit() to increase
data locality.
All "struct net_device_ops" should now be using C99 initializers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data
instructions since it XORs A with X while all the others replace A with
some loaded value. All the BPF JITs fail to clear A if this is used as
the first instruction in a filter. This was found using american fuzzy
lop.
Add a helper to determine if A needs to be cleared given the first
instruction in a filter, and use this in the JITs. Except for ARM, the
rest have only been compile-tested.
Fixes: 3480593131 ("net: filter: get rid of BPF_S_* enum")
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix IBSS which got broken over time
* new USB id for bcm43242 dongle
* arp offload configuration through inet notifier
ath9k
* add random number generator support (CONFIG_ATH9K_HWRNG)
iwlwifi
* Make scan parameters low latency aware
* Fix in the NL80211_FEATURE_FULL_AP_CLIENT_STATE state case
* Fix enable injection mode (Chaya Rachel)
* Various cleanups (Dan / Julia / myself)
* Allow to stay more time on popular channels (David Spinadel)
* Bug fixes for D0i3 (Eliad / Luca)
* Fixes for GO uAPSD (myself)
* Start of TSO support (myself)
* Rate control bug fixes (Eyal / Gregory)
* Start the work on 9000 devices (Johannes / Sara / Oren)
* Start the work on a new Tx queue allocation model (Liad)
* Debug infrastructure enhancements (Golan)
mwifiex
* add a debugfs file for chip reset
* advertise SMS4 cipher suite
* increase ap and station interface limit to 3
* enable MSI support on newer pcie devices (8897 onwards)
rtlwifi
* fix lots of module parameter usage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQEcBAABAgAGBQJWi5RxAAoJEG4XJFUm622bujQH/Rti4fRBnJcRt5EZAv/pvdIx
AGPcYQQfFbFRjzxjrcY30AYCC6OoID9n7BNtELIWabA6dQHWGJd2aebZGH85O4Eh
xgAty5CRf9E4+9MNiv+OqkL64/E6/YmSeOYiNSWl6SgRE4VYmexLXttpppImmoER
zowmtqzLSN/4l4yX5p2BVy3NAXN0R1WjYKNUXAJvJj59814upXf+XrOKCghzHIs4
2NE/aFoscscKWCSS3+MIIW4q9QWw7f8YvAdrO6k47LLkjm4+C/I5BeAlHUOEQUS7
xyoB/7KoHXlzRen8LmoL9SsqopTL2tKEjE/1f0QgLKOHXqpPRb4iKBFjsR94HJg=
=aeCM
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-next-for-davem-2016-01-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
brcfmac
* fix IBSS which got broken over time
* new USB id for bcm43242 dongle
* arp offload configuration through inet notifier
ath9k
* add random number generator support (CONFIG_ATH9K_HWRNG)
iwlwifi
* Make scan parameters low latency aware
* Fix in the NL80211_FEATURE_FULL_AP_CLIENT_STATE state case
* Fix enable injection mode (Chaya Rachel)
* Various cleanups (Dan / Julia / myself)
* Allow to stay more time on popular channels (David Spinadel)
* Bug fixes for D0i3 (Eliad / Luca)
* Fixes for GO uAPSD (myself)
* Start of TSO support (myself)
* Rate control bug fixes (Eyal / Gregory)
* Start the work on 9000 devices (Johannes / Sara / Oren)
* Start the work on a new Tx queue allocation model (Liad)
* Debug infrastructure enhancements (Golan)
mwifiex
* add a debugfs file for chip reset
* advertise SMS4 cipher suite
* increase ap and station interface limit to 3
* enable MSI support on newer pcie devices (8897 onwards)
rtlwifi
* fix lots of module parameter usage
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A preparation step which adds support for reading the hardware
internal timer and the hardware timestamping from the CQE.
In addition, advertize device_frequency_khz HCA capability.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group. These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.
This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWgsv0AAoJEIqAPN1PVmxKvukP/3eJwA+chAUF89/fqwqFTaJN
fffLoOxx2OBIbTXD2VV36yw4bAo9tbKDAYBZiot3Ig7Kg0SeCJ5oPA9xCVnWPrEY
hxFAldvl+lWQs8nrUOgUZItiFBeUPdfW9YX/yKhUZVc+602nUG/e/+6x8B5MhIce
SAfgCyd0c16DApltP7sw1muyZMvsO6Ow6dyNzDUVYZuabvEhe3SLSj9KFJi7Thsp
h41Iv+bwPLhwF4RXGA6rei/gdEDSMRohprdj3uTDiTarGW+OpcAO0zWACS5m4eR0
zF19+HjGPUk/LpFRaU31xX0ZQQjOTmmfsOt4FBb3P7oJx47egycsadLYPexeG7nj
ruyS6ezlRX1I/tZsnyLNJK92mK5TXYLz2uJ8r2ii/BgPNE+AErB3zKCC+EjXzWhh
AvClGu5b88WJLxoq3I3l5evPwGhebGZ8N/1uiFsHOxvzKVLgxwOmNLRGN4XXxB2i
UbIHgBb6smsu/l+3q9R83kfoMaoWnr+OUIi2QPQVDt/K7t1LfsCuIhzcGSgo1VuW
fGlA1iu+CNDknofeCl4JDo2UXAETO4gdKWw87GXeUcbbraLUczZeO7FFLZqxbMYc
OCaPYshmVFeZRypYdRWDHw67ivj0/h+9iq4PP1XOROkRFH746dD/p4yamJwVi20B
samZ8VPwzgH3/ohQJyX3
=VFUH
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.5 pull request
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 71557a37ad ("[netdrvr] sh_eth: Add SH7619 support") added support
for the big-endian EDMAC descriptors. However, it was never used and never
worked right until the recent driver fixes. I think we now can just remove
this support, it was only burdening the driver from the start. It should be
easy to do without disturbing the SH platform code, at least for now...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Prevent XFRM per-cpu counter updates for one namespace from being
applied to another namespace. Fix from DanS treetman.
2) Fix RCU de-reference in iwl_mvm_get_key_sta_id(), from Johannes
Berg.
3) Remove ethernet header assumption in nft_do_chain_netdev(), from
Pablo Neira Ayuso.
4) Fix cpsw PHY ident with multiple slaves and fixed-phy, from Pascal
Speck.
5) Fix use after free in sixpack_close and mkiss_close.
6) Fix VXLAN fw assertion on bnx2x, from Yuval Mintz.
7) natsemi doesn't check for DMA mapping errors, from Alexey
Khoroshilov.
8) Fix inverted test in ip6addrlbl_get(), from ANdrey Ryabinin.
9) Missing initialization of needed_headroom in geneve tunnel driver,
from Paolo Abeni.
10) Fix conntrack template leak in openvswitch, from Joe Stringer.
11) Mission initialization of wq->flags in sock_alloc_inode(), from
Nicolai Stange.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close
net, socket, socket_wq: fix missing initialization of flags
drivers: net: cpsw: fix error return code
openvswitch: Fix template leak in error cases.
sctp: label accepted/peeled off sockets
sctp: use GFP_USER for user-controlled kmalloc
qlcnic: fix a loop exit condition better
net: cdc_ncm: avoid changing RX/TX buffers on MTU changes
geneve: initialize needed_headroom
ipv6: honor ifindex in case we receive ll addresses in router advertisements
addrconf: always initialize sysctl table data
ipv6/addrlabel: fix ip6addrlbl_get()
switchdev: bridge: Pass ageing time as clock_t instead of jiffies
sh_eth: fix 16-bit descriptor field access endianness too
veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
net: usb: cdc_ncm: Adding Dell DW5813 LTE AT&T Mobile Broadband Card
net: usb: cdc_ncm: Adding Dell DW5812 LTE Verizon Mobile Broadband Card
natsemi: add checks for dma mapping errors
rhashtable: Kill harmless RCU warning in rhashtable_walk_init
openvswitch: correct encoding of set tunnel action attributes
...
Ethernet PHYs can maintain statistics, for example errors while idle
and receive errors. Add an ethtool mechanism to retrieve these
statistics, using the same model as MAC statistics.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"Make the block layer great again.
Basically three amazing fixes in this pull request, split into 4
patches. Believe me, they should go into 4.4. Two of them fix a
regression, the third and last fixes an easy-to-trigger bug.
- Fix a bad irq enable through null_blk, for queue_mode=1 and using
timer completions. Add a block helper to restart a queue
asynchronously, and use that from null_blk. From me.
- Fix a performance issue in NVMe. Some devices (Intel Pxxxx) expose
a stripe boundary, and performance suffers if we cross it. We took
that into account for merging, but not for the newer splitting
code. Fix from Keith.
- Fix a kernel oops in lightnvm with multiple channels. From Matias"
* 'for-linus' of git://git.kernel.dk/linux-block:
lightnvm: wrong offset in bad blk lun calculation
null_blk: use async queue restart helper
block: add blk_start_queue_async()
block: Split bios on chunk boundaries
mod_zone_page_state() takes a "delta" integer argument. delta contains
the number of pages that should be added or subtracted from a struct
zone's vm_stat field.
If a zone is larger than 8TB this will cause overflows. E.g. for a
zone with a size slightly larger than 8TB the line
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
in mm/page_alloc.c:free_area_init_core() will result in a negative
result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
contain 0x8xxxxxxx pages which will be sign extended to a negative
value.
Fix this by changing the delta argument to long type.
This could fix an early boot problem seen on s390, where we have a 9TB
system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
completely empty, so it tries to reclaim pages in an endless loop.
This was seen on a heavily patched 3.10 kernel. One possible
explaination seem to be the overflows caused by mod_zone_page_state().
Unfortunately I did not have the chance to verify that this patch
actually fixes the problem, since I don't have access to the system
right now. However the overflow problem does exist anyway.
Given the description that a system with slightly less than 8TB does
work, this seems to be a candidate for the observed problem.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
microread platform_data header had an NXP header.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We currently only have an inline/sync helper to restart a stopped
queue. If drivers need an async version, they have to roll their
own. Add a generic helper instead.
Signed-off-by: Jens Axboe <axboe@fb.com>
NCM buffer sizes are negotiated with the device independently of
the network device MTU. The RX buffers are allocated by the
usbnet framework based on the rx_urb_size value set by cdc_ncm. A
single RX buffer can hold a number of MTU sized packets.
The default usbnet change_mtu ndo only modifies rx_urb_size if it
is equal to hard_mtu. And the cdc_ncm driver will set rx_urb_size
and hard_mtu independently of each other, based on dwNtbInMaxSize
and dwNtbOutMaxSize respectively. It was therefore assumed that
usbnet_change_mtu() would never touch rx_urb_size. This failed to
consider the case where dwNtbInMaxSize and dwNtbOutMaxSize happens
to be equal.
Fix by implementing an NCM specific change_mtu ndo, modifying the
netdev MTU without touching the buffer size settings.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three fixes this time, two in SES picked up by KASAN for various types of
buffer overrun. The first is a USB array which returns page 8 whatever is
asked for and causes us to overrun with incorrect data format assumptions and
the second is an invalid iteration of page 10 (the additional information
page). The final one is a reversion of a NULL deref fix which caused
suspend/resume not to be called in pairs leading to incorrect device operation
(Jens has queued a more proper fix for the problem in block).
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJWdLxTAAoJEDeqqVYsXL0MwOYH+wYb27NxfyA7+q7z/dFz+LhQ
B9RlUfnEw57vVz7KEwleqJ9uA2jprCQndMqRoelmWtxeu5CVUBbq/1ONDWvPX2ha
Prr3wVp+SbqbtzmvGQrQ8If7o4iS47fXtwUe5RRDBdfKMUfXs7LeVBgQrpZsqlkE
va6LNKVqzYW4sneC+CfWcwwyedLGeaphNBYygKtCm7SfEkbnfH5+zhWH9JWwtYXf
r8VCCUnmF69ocx4a7MZLnSAJuXfzaJl45c0nhRiHTiokW7KYuylJm0Zd1PYkhwhV
rQr53otJsdPTyZUjmeCdS6PBlGp/HVdYIOyKt5b4Ti2S71ij9R52YPY6BdtIWeQ=
=6New
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes this time, two in SES picked up by KASAN for various types
of buffer overrun. The first is a USB array which returns page 8
whatever is asked for and causes us to overrun with incorrect data
format assumptions and the second is an invalid iteration of page 10
(the additional information page).
The final fix is a reversion of a NULL deref fix which caused
suspend/resume not to be called in pairs leading to incorrect device
operation (Jens has queued a more proper fix for the problem in
block)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
ses: fix additional element traversal bug
Revert "SCSI: Fix NULL pointer dereference in runtime PM"
ses: Fix problems with simple enclosures
mmdebug.h uses BUILD_BUG_ON_INVALID(), assuming someone else included
linux/bug.h. Include it ourselves.
This saves build-failures such as:
arch/arm64/include/asm/pgtable.h: In function 'set_pte_at':
arch/arm64/include/asm/pgtable.h:281:3: error: implicit declaration of function 'BUILD_BUG_ON_INVALID' [-Werror=implicit-function-declaration]
VM_WARN_ONCE(!pte_young(pte),
Fixes: 02602a18c3 ("bug: completely remove code generated by disabled VM_BUG_ON()")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:
1) Two patches to include dependencies in our netfilter userspace
headers to resolve compilation problems, from Mikko Rapeli.
2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.
3) Remove duplicate include in the netfilter reject infrastructure,
from Stephen Hemminger.
4) Two patches to simplify the netfilter defragmentation code for IPv6,
patch from Florian Westphal.
5) Fix root ownership of /proc/net netfilter for unpriviledged net
namespaces, from Philip Whineray.
6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.
7) Add mangling support to our nf_tables payload expression, from
Patrick McHardy.
8) Introduce a new netlink-based tracing infrastructure for nf_tables,
from Florian Westphal.
9) Change setter functions in nfnetlink_log to be void, from
Rami Rosen.
10) Add netns support to the cttimeout infrastructure.
11) Add cgroup2 support to iptables, from Tejun Heo.
12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.
13) Add support for mangling pkttype in the nf_tables meta expression,
also from Florian.
BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of
people reported this... From Arnd Bergmann.
2) Don't init mutex twice in i40e driver, from Jesse Brandeburg.
3) Fix spurious EBUSY in rhashtable, from Herbert Xu.
4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas.
5) Fix race with work structure access in pppoe driver causing
corruptions, from Guillaume Nault.
6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb()
actually succeeded or not, from Sergei Shtylyov.
7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from
Bjørn Mork.
8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc.
9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo
Leitner.
10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven.
11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling
properly as well, from Jiri Benc.
12) Handle request sockets properly in xfrm layer, from Eric Dumazet.
13) Double stats update in ipv6 geneve transmit path, fix from Pravin B
Shelar.
14) sk->sk_policy[] needs RCU protection, and as a result
xfrm_policy_destroy() needs to free policies using an RCU grace
period, from Eric Dumazet.
15) SCTP needs to clone ipv6 tx options in order to avoid use after
free, from Eric Dumazet.
16) Missing kbuild export if ila.h, from Stephen Hemminger.
17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from
Tobias Klauser.
18) Validate protocol value range in ->create() methods, from Hannes
Frederic Sowa.
19) Fix early socket demux races that result in illegal dst reuse, from
Eric Dumazet.
20) Validate socket address length in pptp code, from WANG Cong.
21) skb_reorder_vlan_header() uses incorrect offset and can corrupt
packets, from Vlad Yasevich.
22) Fix memory leaks in nl80211 registry code, from Ola Olsson.
23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and
qlcnic. From Dan Carpenter.
24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for
example, AF_ALG will interpret it as an async call. From Tadeusz
Struk.
25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from
Eric Dumazet.
26) rhashtable enforces the minimum table size not early enough,
breaking how we calculate the per-cpu lock allocations. From
Herbert Xu.
27) Fix FCC port lockup in 82xx driver, from Martin Roth.
28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa.
29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and
sock_setsockopt() wrt. timestamp handling. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits)
net: check both type and procotol for tcp sockets
drivers: net: xgene: fix Tx flow control
tcp: restore fastopen with no data in SYN packet
af_unix: Revert 'lock_interruptible' in stream receive code
fou: clean up socket with kfree_rcu
82xx: FCC: Fixing a bug causing to FCC port lock-up
gianfar: Don't enable RX Filer if not supported
net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
rhashtable: Fix walker list corruption
rhashtable: Enforce minimum size on initial hash table
inet: tcp: fix inetpeer_set_addr_v4()
ipv6: automatically enable stable privacy mode if stable_secret set
net: fix uninitialized variable issue
bluetooth: Validate socket address length in sco_sock_bind().
net_sched: make qdisc_tree_decrease_qlen() work for non mq
ser_gigaset: remove unnecessary kfree() calls from release method
ser_gigaset: fix deallocation of platform device structure
ser_gigaset: turn nonsense checks into WARN_ON
ser_gigaset: fix up NULL checks
qlcnic: fix a timeout loop
...
Add ndo_ops to add/del UDP ports to a device that supports geneve
offload.
v2: Comment fix.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is code in ssb fetching "invariants" that is basically a set of
board specific data. Every host requires its own implementation of
reading function. In ssb we have support for PCI, PCMCIA & SDIO.
For some (historical?) reason code reading "invariants" for SoC was
placed in arch code and provided by a callback. This is not needed
nowadays, so lets move that into ssb. This way we keep all "invariants"
functions in a single module making code cleaner.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.
Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer. It does not include any implementation code.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the rhashtable_replace_fast function. This replaces one object in
the table with another atomically. The hashes of the new and old objects
must be equal.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add specifics and details the description of the interface between
the stack and drivers for doing checksum offload. This description
is meant to be as specific and complete as possible.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>