Граф коммитов

995498 Коммитов

Автор SHA1 Сообщение Дата
Eric Dumazet 8811f4a983 tcp: add sanity tests to TCP_QUEUE_SEQ
Qingyu Li reported a syzkaller bug where the repro
changes RCV SEQ _after_ restoring data in the receive queue.

mprotect(0x4aa000, 12288, PROT_READ)    = 0
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
recvfrom(3, NULL, 20, 0, NULL, NULL)    = -1 ECONNRESET (Connection reset by peer)

syslog shows:
[  111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0
[  111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0

This should not be allowed. TCP_QUEUE_SEQ should only be used
when queues are empty.

This patch fixes this case, and the tx path as well.

Fixes: ee9952831c ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005
Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:32:05 -08:00
Andrea Parri (Microsoft) 3946688edb hv_netvsc: Fix validation in netvsc_linkstatus_callback()
Contrary to the RNDIS protocol specification, certain (pre-Fe)
implementations of Hyper-V's vSwitch did not account for the status
buffer field in the length of an RNDIS packet; the bug was fixed in
newer implementations.  Validate the status buffer fields using the
length of the 'vmtransfer_page' packet (all implementations), that
is known/validated to be less than or equal to the receive section
size and not smaller than the length of the RNDIS message.

Reported-by: Dexuan Cui <decui@microsoft.com>
Suggested-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Fixes: 505e3f00c3 ("hv_netvsc: Add (more) validation for untrusted Hyper-V values")
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:30:52 -08:00
DENG Qingfang 9200f515c4 net: dsa: tag_mtk: fix 802.1ad VLAN egress
A different TPID bit is used for 802.1ad VLAN frames.

Reported-by: Ilario Gelmetti <iochesonome@gmail.com>
Fixes: f0af34317f ("net: dsa: mediatek: combine MediaTek tag with VLAN tag")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:29:43 -08:00
Willem de Bruijn b228c9b058 net: expand textsearch ts_state to fit skb_seq_state
The referenced commit expands the skb_seq_state used by
skb_find_text with a 4B frag_off field, growing it to 48B.

This exceeds container ts_state->cb, causing a stack corruption:

[   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack
is corrupted in: skb_find_text+0xc5/0xd0
[   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
[   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-2 04/01/2014
[   73.260078] Call Trace:
[   73.264677]  dump_stack+0x57/0x6a
[   73.267866]  panic+0xf6/0x2b7
[   73.270578]  ? skb_find_text+0xc5/0xd0
[   73.273964]  __stack_chk_fail+0x10/0x10
[   73.277491]  skb_find_text+0xc5/0xd0
[   73.280727]  string_mt+0x1f/0x30
[   73.283639]  ipt_do_table+0x214/0x410

The struct is passed between skb_find_text and its callbacks
skb_prepare_seq_read, skb_seq_read and skb_abort_seq read through
the textsearch interface using TS_SKB_CB.

I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.

skb->cb was increased from 40B to 48B after ts_state was introduced,
in commit 3e3850e989 ("[NETFILTER]: Fix xfrm lookup in
ip_route_me_harder/ip6_route_me_harder").

Increase ts_state.cb[] to 48 to fit the struct.

Also add a BUILD_BUG_ON to avoid a repeat.

The alternative is to directly add a dependency from textsearch onto
linux/skbuff.h, but I think the intent is textsearch to have no such
dependencies on its callers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
Fixes: 97550f6fa5 ("net: compound page support in skb_seq_read")
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:25:24 -08:00
Masanari Iida 2353db75c3 docs: networking: bonding.rst Fix a typo in bonding.rst
This patch fixes a spelling typo in bonding.rst.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:53:21 -08:00
David S. Miller 2eb4898255 linux-can-fixes-for-5.12-20210301
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmA8xfATHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqZzcB/9LC7BTXmD2S5YMUvbH8Dy1XQ1J/Ss5
 VPiT+6eNCxFhl9YzsWXPhhvNg7WYLa5SbWid/yht7UAxgiibtB7COaePeuzTSmtQ
 OaTp1J0QcTnaXVWv2GdUnYu0S3xpIyHmvkUQND7AlIfdQg3evKiN+J3JhrO30mVx
 zLhf+gtZUU48fY/DSD7YgwsRv6+L8FFRgxcPlz5QXS3VJOo3Tic4VEjGBWaGrn0U
 xxuRsjBIPXGBYAhru3TOWuDarLWmEFDL2m6s+hPZG1YLAuuAqLiDKdXKzLjwI8mh
 yfgcnaSS8Mdrd4vjPyhSmr3pULi1Ac8BFhQkGPysJIad0/25eiuMBUZI
 =iiwZ
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.12-20210301' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2021-03-01

this is a pull request of 6 patches for net/master.

The first 3 patches are by Joakim Zhang for the flexcan driver and fix
the probing and starting of the chip.

The next patch is by me, for the mcp251xfd driver and reverts the BQL
support. BQL support got mainline with rc1 and assumes that CAN frames
are always echoed, which is not the case. A proper fix requires
changes more changes and will be rolled out via linux-can-next later.

Oleksij Rempel's patch fixes the socket ref counting if socket was
closed before setting skb ownership.

Torin Cooper-Bennun's patch for the tcan4x5x driver fixes a race
condition, where the chip is first attached the bus and then the MRAM
is initialized, which may result in lost data.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:37:08 -08:00
David S. Miller 8a00946e1a Merge branch 'enetc-fixes'
Vladimir Oltean says:

====================
Fixes for NXP ENETC driver

This contains an assorted set of fixes collected over the past 2 weeks
on the enetc driver. Some are related to VLAN processing, some to
physical link settings, some are fixups of previous hardware workarounds,
and some are simply zero-day data path bugs that for some reason were
never caught or at least identified.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean 3a5d12c9be net: enetc: keep RX ring consumer index in sync with hardware
The RX rings have a producer index owned by hardware, where newly
received frame buffers are placed, and a consumer index owned by
software, where newly allocated buffers are placed, in expectation of
hardware being able to place frame data in them.

Hardware increments the producer index when a frame is received, however
it is not allowed to increment the producer index to match the consumer
index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
BDs. Whenever the producer index matches the value of the consumer
index, the ring has no unprocessed received frames and all BDs in the
ring have been initialized/prepared by software, i.e. hardware owns all
BDs in the ring.

The code uses the next_to_clean variable to keep track of the producer
index, and the next_to_use variable to keep track of the consumer index.

The RX rings are seeded from enetc_refill_rx_ring, which is called from
two places:

1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
   i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:

.ndo_open
-> enetc_open
   -> enetc_setup_bdrs
      -> enetc_setup_rxbdr
         -> enetc_refill_rx_ring

2. then during the data path processing, it is refilled with 16 buffers
   at a time:

enetc_msix
-> napi_schedule
   -> enetc_poll
      -> enetc_clean_rx_ring
         -> enetc_refill_rx_ring

There is just one problem: the initial seeding done during .ndo_open
updates just the producer index (ENETC_RBPIR) with 0, and the software
next_to_clean and next_to_use variables. Notably, it will not update the
consumer index to make the hardware aware of the newly added buffers.

Wait, what? So how does it work?

Well, the reset values of the producer index and of the consumer index
of a ring are both zero. As per the description in the second paragraph,
it means that the ring is full of buffers waiting for hardware to put
frames in them, which by coincidence is almost true, because we have in
fact seeded 511 buffers into the ring.

But will the hardware attempt to access the 512th entry of the ring,
which has an invalid BD in it? Well, no, because in order to do that, it
would have to first populate the first 511 entries, and the NAPI
enetc_poll will kick in by then. Eventually, after 16 processed slots
have become available in the RX ring, enetc_clean_rx_ring will call
enetc_refill_rx_ring and then will [ finally ] update the consumer index
with the new software next_to_use variable. From now on, the
next_to_clean and next_to_use variables are in sync with the producer
and consumer ring indices.

So the day is saved, right? Well, not quite. Freeing the memory
allocated for the rings is done in:

enetc_close
-> enetc_clear_bdrs
   -> enetc_clear_rxbdr
      -> this just disables the ring
-> enetc_free_rxtx_rings
   -> enetc_free_rx_ring
      -> sets next_to_clean and next_to_use to 0

but again, nothing is committed to the hardware producer and consumer
indices (yay!). The assumption is that the ring is disabled, so the
indices don't matter anyway, and it's the responsibility of the "open"
code path to set those up.

.. Except that the "open" code path does not set those up properly.

While initially, things almost work, during subsequent enetc_close ->
enetc_open sequences, we have problems. To be precise, the enetc_open
that is subsequent to enetc_close will again refill the ring with 511
entries, but it will leave the consumer index untouched. Untouched
means, of course, equal to the value it had before disabling the ring
and draining the old buffers in enetc_close.

But as mentioned, enetc_setup_rxbdr will at least update the producer
index though, through this line of code:

	enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);

so at this stage we'll have:

next_to_clean=0 (in hardware 0)
next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)

Again, the next_to_clean and producer index are in sync and set to
correct values, so the driver manages to limp on. Eventually, 16 ring
entries will be consumed by enetc_poll, and the savior
enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
update the hardware consumer ring based upon the new next_to_use.

So.. it works?
Well, by coincidence, it almost does, but there's a circumstance where
enetc_clean_rx_ring won't be there to save us. If the previous value of
the consumer index was 15, there's a problem, because the NAPI poll
sequence will only issue a refill when 16 or more buffers have been
consumed.

It's easiest to illustrate this with an example:

ip link set eno0 up
ip addr add 192.168.100.1/24 dev eno0
ping 192.168.100.1 -c 20 # ping this port from another board
ip link set eno0 down
ip link set eno0 up
ping 192.168.100.1 -c 20 # ping it again from the same other board

One by one:

1. ip link set eno0 up
-> calls enetc_setup_rxbdr:
   -> calls enetc_refill_rx_ring(511 buffers)
   -> next_to_clean=0 (in hw 0)
   -> next_to_use=511 (in hw 0)

2. ping 192.168.100.1 -c 20 # ping this port from another board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)

20 packets transmitted, 20 packets received, 0% packet loss

3. ip link set eno0 down
enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)

4. ip link set eno0 up
-> calls enetc_setup_rxbdr:
   -> calls enetc_refill_rx_ring(511 buffers)
   -> next_to_clean=0 (in hw 0)
   -> next_to_use=511 (in hw 15)

5. ping 192.168.100.1 -c 20 # ping it again from the same other board
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)

20 packets transmitted, 12 packets received, 40% packet loss

And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
to 15 for that to happen), no nothing. The hardware enters the condition where
the producer (14) + 1 is equal to the consumer (15) index, which makes it
believe it has no more free buffers to put packets in, so it starts discarding
them:

ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
NIC statistics:
     Rx ring  0 discarded frames: 8

Summarized, if the interface receives between 16 and 32 (mod 512) frames
and then there is a link flap, then the port will eventually die with no
way to recover. If it receives less than 16 (mod 512) frames, then the
initial NAPI poll [ before the link flap ] will not update the consumer
index in hardware (it will remain zero) which will be ok when the buffers
are later reinitialized. If more than 32 (mod 512) frames are received,
the initial NAPI poll has the chance to refill the ring twice, updating
the consumer index to at least 32. So after the link flap, the consumer
index is still wrong, but the post-flap NAPI poll gets a chance to
refill the ring once (because it passes through cleaned_cnt=15) and
makes the consumer index be again back in sync with next_to_use.

The solution to this problem is actually simple, we just need to write
next_to_use into the hardware consumer index at enetc_open time, which
always brings it back in sync after an initial buffer seeding process.

The simpler thing would be to put the write to the consumer index into
enetc_refill_rx_ring directly, but there are issues with the MDIO
locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
top-level and we use the unlocked enetc_wr_reg_hot, whereas in
enetc_open, the enetc_lock_mdio() is not taken at the top level, but
instead by each individual enetc_wr_reg, so we are forced to put an
additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
the code is left as a refactoring exercise.

Fixes: d4fd0404c1 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean 96a5223b91 net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr
The Station Interface Receive Interrupt Detect Register (SIRXIDR)
contains a 16-bit wide mask of 'interrupt detected' events for each ring
associated with a port. Bit i is write-1-to-clean for RX ring i.

I have no explanation whatsoever how this line of code came to be
inserted in the blamed commit. I checked the downstream versions of that
patch and none of them have it.

The somewhat comical aspect of it is that we're writing a binary number
to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring).
Since the RX rings have 512 buffer descriptors, we end up writing 511 to
this register, which is 0x1ff, so we are effectively clearing the
'interrupt detected' event for rings 0-8.

This register is not what is used for interrupt handling though - it
only provides a summary for the entire SI. The hardware provides one
separate Interrupt Detect Register per RX ring, which auto-clears upon
read. So there doesn't seem to be any adverse effect caused by this
bogus write.

There is, however, one reason why this should be handled as a bugfix:
next_to_clean _should_ be committed to hardware, just not to that
register, and this was obscuring the fact that it wasn't. This is fixed
in the next patch, and removing the bogus line now allows the fix patch
to be backported beyond that point.

Fixes: fd5736bf9f ("enetc: Workaround for MDIO register access issue")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean c76a97218d net: enetc: force the RGMII speed and duplex instead of operating in inband mode
The ENETC port 0 MAC supports in-band status signaling coming from a PHY
when operating in RGMII mode, and this feature is enabled by default.

It has been reported that RGMII is broken in fixed-link, and that is not
surprising considering the fact that no PHY is attached to the MAC in
that case, but a switch.

This brings us to the topic of the patch: the enetc driver should have
not enabled the optional in-band status signaling for RGMII unconditionally,
but should have forced the speed and duplex to what was resolved by
phylink.

Note that phylink does not accept the RGMII modes as valid for in-band
signaling, and these operate a bit differently than 1000base-x and SGMII
(notably there is no clause 37 state machine so no ACK required from the
MAC, instead the PHY sends extra code words on RXD[3:0] whenever it is
not transmitting something else, so it should be safe to leave a PHY
with this option unconditionally enabled even if we ignore it). The spec
talks about this here:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/138/RGMIIv1_5F00_3.pdf

Fixes: 71b77a7a27 ("enetc: Migrate to PHYLINK and PCS_LYNX")
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean a74dbce9d4 net: enetc: don't disable VLAN filtering in IFF_PROMISC mode
Quoting from the blamed commit:

    In promiscuous mode, it is more intuitive that all traffic is received,
    including VLAN tagged traffic. It appears that it is necessary to set
    the flag in PSIPVMR for that to be the case, so VLAN promiscuous mode is
    also temporarily enabled. On exit from promiscuous mode, the setting
    made by ethtool is restored.

Intuitive or not, there isn't any definition issued by a standards body
which says that promiscuity has anything to do with VLAN filtering - it
only has to do with accepting packets regardless of destination MAC address.

In fact people are already trying to use this misunderstanding/bug of
the enetc driver as a justification to transform promiscuity into
something it never was about: accepting every packet (maybe that would
be the "rx-all" netdev feature?):
https://lore.kernel.org/netdev/20201110153958.ci5ekor3o2ekg3ky@ipetronik.com/

This is relevant because there are use cases in the kernel (such as
tc-flower rules with the protocol 802.1Q and a vlan_id key) which do not
(yet) use the vlan_vid_add API to be compatible with VLAN-filtering NICs
such as enetc, so for those, disabling rx-vlan-filter is currently the
only right solution to make these setups work:
https://lore.kernel.org/netdev/CA+h21hoxwRdhq4y+w8Kwgm74d4cA0xLeiHTrmT-VpSaM7obhkg@mail.gmail.com/
The blamed patch has unintentionally introduced one more way for this to
work, which is to enable IFF_PROMISC, however this is non-portable
because port promiscuity is not meant to disable VLAN filtering.
Therefore, it could invite people to write broken scripts for enetc, and
then wonder why they are broken when migrating to other drivers that
don't handle promiscuity in the same way.

Fixes: 7070eea5e9 ("enetc: permit configuration of rx-vlan-filter with ethtool")
Cc: Markus Blöchl <Markus.Bloechl@ipetronik.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean 827b6fd046 net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets
When the enetc ports have rx-vlan-offload enabled, they report a TPID of
ETH_P_8021Q regardless of what was actually in the packet. When
rx-vlan-offload is disabled, packets have the proper TPID. Fix this
inconsistency by finishing the TODO left in the code.

Fixes: d4fd0404c1 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean 6d36ecdbc4 net: enetc: take the MDIO lock only once per NAPI poll cycle
The workaround for the ENETC MDIO erratum caused a performance
degradation of 82 Kpps (seen with IP forwarding of two 1Gbps streams of
64B packets). This is due to excessive locking and unlocking in the fast
path, which can be avoided.

By taking the MDIO read-side lock only once per NAPI poll cycle, we are
able to regain 54 Kpps (65%) of the performance hit. The rest of the
performance degradation comes from the TX data path, but unfortunately
it doesn't look like we can optimize that away easily, even with
netdev_xmit_more(), there just isn't any skb batching done, to help with
taking the MDIO lock less often than once per packet.

We need to change the register accessor type for enetc_get_tx_tstamp,
because it now runs under the enetc_lock_mdio as per the new call path
detailed below:

enetc_msix
-> napi_schedule
   -> enetc_poll
      -> enetc_lock_mdio
      -> enetc_clean_tx_ring
         -> enetc_get_tx_tstamp
      -> enetc_clean_rx_ring
      -> enetc_unlock_mdio

Fixes: fd5736bf9f ("enetc: Workaround for MDIO register access issue")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean 3222b5b613 net: enetc: initialize RFS/RSS memories for unused ports too
Michael reports that since linux-next-20210211, the AER messages for ECC
errors have started reappearing, and this time they can be reliably
reproduced with the first ping on one of his LS1028A boards.

$ ping 1[   33.258069] pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
72.16.0.1
PING [   33.267050] pcieport 0000:00:1f.0: AER: can't find device of ID0000
172.16.0.1 (172.16.0.1): 56 data bytes
64 bytes from 172.16.0.1: seq=0 ttl=64 time=17.124 ms
64 bytes from 172.16.0.1: seq=1 ttl=64 time=0.273 ms

$ devmem 0x1f8010e10 32
0xC0000006

It isn't clear why this is necessary, but it seems that for the errors
to go away, we must clear the entire RFS and RSS memory, not just for
the ports in use.

Sadly the code is structured in such a way that we can't have unified
logic for the used and unused ports. For the minimal initialization of
an unused port, we need just to enable and ioremap the PF memory space,
and a control buffer descriptor ring. Unused ports must then free the
CBDR because the driver will exit, but used ports can not pick up from
where that code path left, since the CBDR API does not reinitialize a
ring when setting it up, so its producer and consumer indices are out of
sync between the software and hardware state. So a separate
enetc_init_unused_port function was created, and it gets called right
after the PF memory space is enabled.

Fixes: 07bf34a50e ("net: enetc: initialize the RFS and RSS memories")
Reported-by: Michael Walle <michael@walle.cc>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Vladimir Oltean c646d10dda net: enetc: don't overwrite the RSS indirection table when initializing
After the blamed patch, all RX traffic gets hashed to CPU 0 because the
hashing indirection table set up in:

enetc_pf_probe
-> enetc_alloc_si_resources
   -> enetc_configure_si
      -> enetc_setup_default_rss_table

is overwritten later in:

enetc_pf_probe
-> enetc_init_port_rss_memory

which zero-initializes the entire port RSS table in order to avoid ECC errors.

The trouble really is that enetc_init_port_rss_memory really neads
enetc_alloc_si_resources to be called, because it depends upon
enetc_alloc_cbdr and enetc_setup_cbdr. But that whole enetc_configure_si
thing could have been better thought out, it has nothing to do in a
function called "alloc_si_resources", especially since its counterpart,
"free_si_resources", does nothing to unwind the configuration of the SI.

The point is, we need to pull out enetc_configure_si out of
enetc_alloc_resources, and move it after enetc_init_port_rss_memory.
This allows us to set up the default RSS indirection table after
initializing the memory.

Fixes: 07bf34a50e ("net: enetc: initialize the RFS and RSS memories")
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:34:47 -08:00
Yejune Deng 8bd2a05527 inetpeer: use div64_ul() and clamp_val() calculate inet_peer_threshold
In inet_initpeers(), struct inet_peer on IA32 uses 128 bytes in nowdays.
Get rid of the cascade and use div64_ul() and clamp_val() calculate that
will not need to be adjusted in the future as suggested by Eric Dumazet.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:32:12 -08:00
Pavel Skripkin 093b036aa9 net/qrtr: fix __netdev_alloc_skb call
syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER.
It was caused by a huge length value passed from userspace to qrtr_tun_write_iter(),
which tries to allocate skb. Since the value comes from the untrusted source
there is no need to raise a warning in __alloc_pages_nodemask().

[1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
Call Trace:
 __alloc_pages include/linux/gfp.h:511 [inline]
 __alloc_pages_node include/linux/gfp.h:524 [inline]
 alloc_pages_node include/linux/gfp.h:538 [inline]
 kmalloc_large_node+0x60/0x110 mm/slub.c:3999
 __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
 __kmalloc_reserve net/core/skbuff.c:150 [inline]
 __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
 __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
 netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
 qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
 qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
 call_write_iter include/linux/fs.h:1901 [inline]
 new_sync_write+0x426/0x650 fs/read_write.c:518
 vfs_write+0x791/0xa30 fs/read_write.c:605
 ksys_write+0x12d/0x250 fs/read_write.c:658
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+80dccaee7c6630fa9dcf@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Acked-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:24:03 -08:00
David S. Miller 5db4f74ec8 Merge branch 'sh_eth-masks'
Sergey Shtylyov says:

====================
Fix TRSCER masks in the Ether driver

Here are 3 patches against DaveM's 'net' repo. I'm fixing the TRSCER masks in
the driver to match the manuals...
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:22:34 -08:00
Sergey Shtylyov 165bc5a4f3 sh_eth: fix TRSCER mask for R7S9210
According  to the RZ/A2M Group User's Manual: Hardware, Rev. 2.00,
the TRSCER register has bit 9 reserved, hence we can't use the driver's
default TRSCER mask.  Add the explicit initializer for sh_eth_cpu_data::
trscer_err_mask for R7S9210.

Fixes: 6e0bb04d0e ("sh_eth: Add R7S9210 support")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:22:34 -08:00
Sergey Shtylyov 75be7fb7f9 sh_eth: fix TRSCER mask for R7S72100
According  to  the RZ/A1H Group, RZ/A1M Group User's Manual: Hardware,
Rev. 4.00, the TRSCER register has bit 9 reserved, hence we can't use
the driver's default TRSCER mask.  Add the explicit initializer for
sh_eth_cpu_data::trscer_err_mask for R7S72100.

Fixes: db893473d3 ("sh_eth: Add support for r7s72100")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:22:34 -08:00
Sergey Shtylyov 8c91bc3d44 sh_eth: fix TRSCER mask for SH771x
According  to  the SH7710, SH7712, SH7713 Group User's Manual: Hardware,
Rev. 3.00, the TRSCER register actually has only bit 7 valid (and named
differently), with all the other bits reserved. Apparently, this was not
the case with some early revisions of the manual as we have the other
bits declared (and set) in the original driver.  Follow the suit and add
the explicit sh_eth_cpu_data::trscer_err_mask initializer for SH771x...

Fixes: 86a74ff21a ("net: sh_eth: add support for Renesas SuperH Ethernet")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:22:34 -08:00
Tong Zhang a2bd45834e atm: lanai: dont run lanai_dev_close if not open
lanai_dev_open() can fail. When it fail, lanai->base is unmapped and the
pci device is disabled. The caller, lanai_init_one(), then tries to run
atm_dev_deregister(). This will subsequently call lanai_dev_close() and
use the already released MMIO area.

To fix this issue, set the lanai->base to NULL if open fail,
and test the flag in lanai_dev_close().

[    8.324153] lanai: lanai_start() failed, err=19
[    8.324819] lanai(itf 0): shutting down interface
[    8.325211] BUG: unable to handle page fault for address: ffffc90000180024
[    8.325781] #PF: supervisor write access in kernel mode
[    8.326215] #PF: error_code(0x0002) - not-present page
[    8.326641] PGD 100000067 P4D 100000067 PUD 100139067 PMD 10013a067 PTE 0
[    8.327206] Oops: 0002 [#1] SMP KASAN NOPTI
[    8.327557] CPU: 0 PID: 95 Comm: modprobe Not tainted 5.11.0-rc7-00090-gdcc0b49040c7 #12
[    8.328229] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-4
[    8.329145] RIP: 0010:lanai_dev_close+0x4f/0xe5 [lanai]
[    8.329587] Code: 00 48 c7 c7 00 d3 01 c0 e8 49 4e 0a c2 48 8d bd 08 02 00 00 e8 6e 52 14 c1 48 80
[    8.330917] RSP: 0018:ffff8881029ef680 EFLAGS: 00010246
[    8.331196] RAX: 000000000003fffe RBX: ffff888102fb4800 RCX: ffffffffc001a98a
[    8.331572] RDX: ffffc90000180000 RSI: 0000000000000246 RDI: ffff888102fb4000
[    8.331948] RBP: ffff888102fb4000 R08: ffffffff8115da8a R09: ffffed102053deaa
[    8.332326] R10: 0000000000000003 R11: ffffed102053dea9 R12: ffff888102fb48a4
[    8.332701] R13: ffffffffc00123c0 R14: ffff888102fb4b90 R15: ffff888102fb4b88
[    8.333077] FS:  00007f08eb9056a0(0000) GS:ffff88815b400000(0000) knlGS:0000000000000000
[    8.333502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.333806] CR2: ffffc90000180024 CR3: 0000000102a28000 CR4: 00000000000006f0
[    8.334182] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    8.334557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    8.334932] Call Trace:
[    8.335066]  atm_dev_deregister+0x161/0x1a0 [atm]
[    8.335324]  lanai_init_one.cold+0x20c/0x96d [lanai]
[    8.335594]  ? lanai_send+0x2a0/0x2a0 [lanai]
[    8.335831]  local_pci_probe+0x6f/0xb0
[    8.336039]  pci_device_probe+0x171/0x240
[    8.336255]  ? pci_device_remove+0xe0/0xe0
[    8.336475]  ? kernfs_create_link+0xb6/0x110
[    8.336704]  ? sysfs_do_create_link_sd.isra.0+0x76/0xe0
[    8.336983]  really_probe+0x161/0x420
[    8.337181]  driver_probe_device+0x6d/0xd0
[    8.337401]  device_driver_attach+0x82/0x90
[    8.337626]  ? device_driver_attach+0x90/0x90
[    8.337859]  __driver_attach+0x60/0x100
[    8.338065]  ? device_driver_attach+0x90/0x90
[    8.338298]  bus_for_each_dev+0xe1/0x140
[    8.338511]  ? subsys_dev_iter_exit+0x10/0x10
[    8.338745]  ? klist_node_init+0x61/0x80
[    8.338956]  bus_add_driver+0x254/0x2a0
[    8.339164]  driver_register+0xd3/0x150
[    8.339370]  ? 0xffffffffc0028000
[    8.339550]  do_one_initcall+0x84/0x250
[    8.339755]  ? trace_event_raw_event_initcall_finish+0x150/0x150
[    8.340076]  ? free_vmap_area_noflush+0x1a5/0x5c0
[    8.340329]  ? unpoison_range+0xf/0x30
[    8.340532]  ? ____kasan_kmalloc.constprop.0+0x84/0xa0
[    8.340806]  ? unpoison_range+0xf/0x30
[    8.341014]  ? unpoison_range+0xf/0x30
[    8.341217]  do_init_module+0xf8/0x350
[    8.341419]  load_module+0x3fe6/0x4340
[    8.341621]  ? vm_unmap_ram+0x1d0/0x1d0
[    8.341826]  ? ____kasan_kmalloc.constprop.0+0x84/0xa0
[    8.342101]  ? module_frob_arch_sections+0x20/0x20
[    8.342358]  ? __do_sys_finit_module+0x108/0x170
[    8.342604]  __do_sys_finit_module+0x108/0x170
[    8.342841]  ? __ia32_sys_init_module+0x40/0x40
[    8.343083]  ? file_open_root+0x200/0x200
[    8.343298]  ? do_sys_open+0x85/0xe0
[    8.343491]  ? filp_open+0x50/0x50
[    8.343675]  ? exit_to_user_mode_prepare+0xfc/0x130
[    8.343935]  do_syscall_64+0x33/0x40
[    8.344132]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.344401] RIP: 0033:0x7f08eb887cf7
[    8.344594] Code: 48 89 57 30 48 8b 04 24 48 89 47 38 e9 1d a0 02 00 48 89 f8 48 89 f7 48 89 d6 41
[    8.345565] RSP: 002b:00007ffcd5c98ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    8.345962] RAX: ffffffffffffffda RBX: 00000000008fea70 RCX: 00007f08eb887cf7
[    8.346336] RDX: 0000000000000000 RSI: 00000000008fd9e0 RDI: 0000000000000003
[    8.346711] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000001
[    8.347085] R10: 00007f08eb8eb300 R11: 0000000000000246 R12: 00000000008fd9e0
[    8.347460] R13: 0000000000000000 R14: 00000000008fddd0 R15: 0000000000000001
[    8.347836] Modules linked in: lanai(+) atm
[    8.348065] CR2: ffffc90000180024
[    8.348244] ---[ end trace 7fdc1c668f2003e5 ]---
[    8.348490] RIP: 0010:lanai_dev_close+0x4f/0xe5 [lanai]
[    8.348772] Code: 00 48 c7 c7 00 d3 01 c0 e8 49 4e 0a c2 48 8d bd 08 02 00 00 e8 6e 52 14 c1 48 80
[    8.349745] RSP: 0018:ffff8881029ef680 EFLAGS: 00010246
[    8.350022] RAX: 000000000003fffe RBX: ffff888102fb4800 RCX: ffffffffc001a98a
[    8.350397] RDX: ffffc90000180000 RSI: 0000000000000246 RDI: ffff888102fb4000
[    8.350772] RBP: ffff888102fb4000 R08: ffffffff8115da8a R09: ffffed102053deaa
[    8.351151] R10: 0000000000000003 R11: ffffed102053dea9 R12: ffff888102fb48a4
[    8.351525] R13: ffffffffc00123c0 R14: ffff888102fb4b90 R15: ffff888102fb4b88
[    8.351918] FS:  00007f08eb9056a0(0000) GS:ffff88815b400000(0000) knlGS:0000000000000000
[    8.352343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.352647] CR2: ffffc90000180024 CR3: 0000000102a28000 CR4: 00000000000006f0
[    8.353022] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    8.353397] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    8.353958] modprobe (95) used greatest stack depth: 26216 bytes left

Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:18:54 -08:00
Tong Zhang 4deb550bc3 atm: eni: dont release is never initialized
label err_eni_release is reachable when eni_start() fail.
In eni_start() it calls dev->phy->start() in the last step, if start()
fail we don't need to call phy->stop(), if start() is never called, we
neither need to call phy->stop(), otherwise null-ptr-deref will happen.

In order to fix this issue, don't call phy->stop() in label err_eni_release

[    4.875714] ==================================================================
[    4.876091] BUG: KASAN: null-ptr-deref in suni_stop+0x47/0x100 [suni]
[    4.876433] Read of size 8 at addr 0000000000000030 by task modprobe/95
[    4.876778]
[    4.876862] CPU: 0 PID: 95 Comm: modprobe Not tainted 5.11.0-rc7-00090-gdcc0b49040c7 #2
[    4.877290] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd94
[    4.877876] Call Trace:
[    4.878009]  dump_stack+0x7d/0xa3
[    4.878191]  kasan_report.cold+0x10c/0x10e
[    4.878410]  ? __slab_free+0x2f0/0x340
[    4.878612]  ? suni_stop+0x47/0x100 [suni]
[    4.878832]  suni_stop+0x47/0x100 [suni]
[    4.879043]  eni_do_release+0x3b/0x70 [eni]
[    4.879269]  eni_init_one.cold+0x1152/0x1747 [eni]
[    4.879528]  ? _raw_spin_lock_irqsave+0x7b/0xd0
[    4.879768]  ? eni_ioctl+0x270/0x270 [eni]
[    4.879990]  ? __mutex_lock_slowpath+0x10/0x10
[    4.880226]  ? eni_ioctl+0x270/0x270 [eni]
[    4.880448]  local_pci_probe+0x6f/0xb0
[    4.880650]  pci_device_probe+0x171/0x240
[    4.880864]  ? pci_device_remove+0xe0/0xe0
[    4.881086]  ? kernfs_create_link+0xb6/0x110
[    4.881315]  ? sysfs_do_create_link_sd.isra.0+0x76/0xe0
[    4.881594]  really_probe+0x161/0x420
[    4.881791]  driver_probe_device+0x6d/0xd0
[    4.882010]  device_driver_attach+0x82/0x90
[    4.882233]  ? device_driver_attach+0x90/0x90
[    4.882465]  __driver_attach+0x60/0x100
[    4.882671]  ? device_driver_attach+0x90/0x90
[    4.882903]  bus_for_each_dev+0xe1/0x140
[    4.883114]  ? subsys_dev_iter_exit+0x10/0x10
[    4.883346]  ? klist_node_init+0x61/0x80
[    4.883557]  bus_add_driver+0x254/0x2a0
[    4.883764]  driver_register+0xd3/0x150
[    4.883971]  ? 0xffffffffc0038000
[    4.884149]  do_one_initcall+0x84/0x250
[    4.884355]  ? trace_event_raw_event_initcall_finish+0x150/0x150
[    4.884674]  ? unpoison_range+0xf/0x30
[    4.884875]  ? ____kasan_kmalloc.constprop.0+0x84/0xa0
[    4.885150]  ? unpoison_range+0xf/0x30
[    4.885352]  ? unpoison_range+0xf/0x30
[    4.885557]  do_init_module+0xf8/0x350
[    4.885760]  load_module+0x3fe6/0x4340
[    4.885960]  ? vm_unmap_ram+0x1d0/0x1d0
[    4.886166]  ? ____kasan_kmalloc.constprop.0+0x84/0xa0
[    4.886441]  ? module_frob_arch_sections+0x20/0x20
[    4.886697]  ? __do_sys_finit_module+0x108/0x170
[    4.886941]  __do_sys_finit_module+0x108/0x170
[    4.887178]  ? __ia32_sys_init_module+0x40/0x40
[    4.887419]  ? file_open_root+0x200/0x200
[    4.887634]  ? do_sys_open+0x85/0xe0
[    4.887826]  ? filp_open+0x50/0x50
[    4.888009]  ? fpregs_assert_state_consistent+0x4d/0x60
[    4.888287]  ? exit_to_user_mode_prepare+0x2f/0x130
[    4.888547]  do_syscall_64+0x33/0x40
[    4.888739]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    4.889010] RIP: 0033:0x7ff62fcf1cf7
[    4.889202] Code: 48 89 57 30 48 8b 04 24 48 89 47 38 e9 1d a0 02 00 48 89 f8 48 89 f71
[    4.890172] RSP: 002b:00007ffe6644ade8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    4.890570] RAX: ffffffffffffffda RBX: 0000000000f2ca70 RCX: 00007ff62fcf1cf7
[    4.890944] RDX: 0000000000000000 RSI: 0000000000f2b9e0 RDI: 0000000000000003
[    4.891318] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000001
[    4.891691] R10: 00007ff62fd55300 R11: 0000000000000246 R12: 0000000000f2b9e0
[    4.892064] R13: 0000000000000000 R14: 0000000000f2bdd0 R15: 0000000000000001
[    4.892439] ==================================================================

Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:17:36 -08:00
Guangbin Huang d9032dba5a net: phy: fix save wrong speed and duplex problem if autoneg is on
If phy uses generic driver and autoneg is on, enter command
"ethtool -s eth0 speed 50" will not change phy speed actually, but
command "ethtool eth0" shows speed is 50Mb/s because phydev->speed
has been set to 50 and no update later.

And duplex setting has same problem too.

However, if autoneg is on, phy only changes speed and duplex according to
phydev->advertising, but not phydev->speed and phydev->duplex. So in this
case, phydev->speed and phydev->duplex don't need to be set in function
phy_ethtool_ksettings_set() if autoneg is on.

Fixes: 51e2a3846e ("PHY: Avoid unnecessary aneg restarts")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:16:16 -08:00
Jason A. Donenfeld 4372339efc net: always use icmp{,v6}_ndo_send from ndo_start_xmit
There were a few remaining tunnel drivers that didn't receive the prior
conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
memory corrution (see ee576c47db ("net: icmp: pass zeroed opts from
icmp{,v6}_ndo_send before sending") for details), there's even more
imperative to have these all converted. So this commit goes through the
remaining cases that I could find and does a boring translation to the
ndo variety.

The Fixes: line below is the merge that originally added icmp{,v6}_
ndo_send and converted the first batch of icmp{,v6}_send users. The
rationale then for the change applies equally to this patch. It's just
that these drivers were left out of the initial conversion because these
network devices are hiding in net/ rather than in drivers/net/.

Cc: Florian Westphal <fw@strlen.de>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Fixes: 803381f9f1 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:11:35 -08:00
DENG Qingfang 9eb8bc593a net: dsa: tag_rtl4_a: fix egress tags
Commit 86dd9868b8 has several issues, but was accepted too soon
before anyone could take a look.

- Double free. dsa_slave_xmit() will free the skb if the xmit function
  returns NULL, but the skb is already freed by eth_skb_pad(). Use
  __skb_put_padto() to avoid that.
- Unnecessary allocation. It has been done by DSA core since commit
  a3b0b64797.
- A u16 pointer points to skb data. It should be __be16 for network
  byte order.
- Typo in comments. "numer" -> "number".

Fixes: 86dd9868b8 ("net: dsa: tag_rtl4_a: Support also egress tags")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:09:55 -08:00
Jan Beulich 826d82170b xen-netback: use local var in xenvif_tx_check_gop() instead of re-calculating
shinfo already holds the result of skb_shinfo(skb) at this point - no
need to re-invoke the construct even twice.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 13:05:08 -08:00
Ioana Ciornei 73f476aa19 net: phy: ti: take into account all possible interrupt sources
The previous implementation of .handle_interrupt() did not take into
account the fact that all the interrupt status registers should be
acknowledged since multiple interrupt sources could be asserted.

Fix this by reading all the status registers before exiting with
IRQ_NONE or triggering the PHY state machine.

Fixes: 1d1ae3c6ca ("net: phy: ti: implement generic .handle_interrupt() callback")
Reported-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20210226153020.867852-1-ciorneiioana@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-03-01 11:46:55 -08:00
Torin Cooper-Bennun 2712625200 can: tcan4x5x: tcan4x5x_init(): fix initialization - clear MRAM before entering Normal Mode
This patch prevents a potentially destructive race condition. The
device is fully operational on the bus after entering Normal Mode, so
zeroing the MRAM after entering this mode may lead to loss of
information, e.g. new received messages.

This patch fixes the problem by first initializing the MRAM, then
bringing the device into Normale Mode.

Fixes: 5443c226ba ("can: tcan4x5x: Add tcan4x5x driver to the kernel")
Link: https://lore.kernel.org/r/20210226163440.313628-1-torin@maxiluxsystems.com
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Torin Cooper-Bennun <torin@maxiluxsystems.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 11:45:15 +01:00
Oleksij Rempel e940e0895a can: skb: can_skb_set_owner(): fix ref counting if socket was closed before setting skb ownership
There are two ref count variables controlling the free()ing of a socket:
- struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
- struct sock::sk_wmem_alloc - which accounts the memory allocated by
  the skbs in the send path.

In case there are still TX skbs on the fly and the socket() is closed,
the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack
clones an "echo" skb, calls sock_hold() on the original socket and
references it. This produces the following back trace:

| WARNING: CPU: 0 PID: 280 at lib/refcount.c:25 refcount_warn_saturate+0x114/0x134
| refcount_t: addition on 0; use-after-free.
| Modules linked in: coda_vpu(E) v4l2_jpeg(E) videobuf2_vmalloc(E) imx_vdoa(E)
| CPU: 0 PID: 280 Comm: test_can.sh Tainted: G            E     5.11.0-04577-gf8ff6603c617 #203
| Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
| Backtrace:
| [<80bafea4>] (dump_backtrace) from [<80bb0280>] (show_stack+0x20/0x24) r7:00000000 r6:600f0113 r5:00000000 r4:81441220
| [<80bb0260>] (show_stack) from [<80bb593c>] (dump_stack+0xa0/0xc8)
| [<80bb589c>] (dump_stack) from [<8012b268>] (__warn+0xd4/0x114) r9:00000019 r8:80f4a8c2 r7:83e4150c r6:00000000 r5:00000009 r4:80528f90
| [<8012b194>] (__warn) from [<80bb09c4>] (warn_slowpath_fmt+0x88/0xc8) r9:83f26400 r8:80f4a8d1 r7:00000009 r6:80528f90 r5:00000019 r4:80f4a8c2
| [<80bb0940>] (warn_slowpath_fmt) from [<80528f90>] (refcount_warn_saturate+0x114/0x134) r8:00000000 r7:00000000 r6:82b44000 r5:834e5600 r4:83f4d540
| [<80528e7c>] (refcount_warn_saturate) from [<8079a4c8>] (__refcount_add.constprop.0+0x4c/0x50)
| [<8079a47c>] (__refcount_add.constprop.0) from [<8079a57c>] (can_put_echo_skb+0xb0/0x13c)
| [<8079a4cc>] (can_put_echo_skb) from [<8079ba98>] (flexcan_start_xmit+0x1c4/0x230) r9:00000010 r8:83f48610 r7:0fdc0000 r6:0c080000 r5:82b44000 r4:834e5600
| [<8079b8d4>] (flexcan_start_xmit) from [<80969078>] (netdev_start_xmit+0x44/0x70) r9:814c0ba0 r8:80c8790c r7:00000000 r6:834e5600 r5:82b44000 r4:82ab1f00
| [<80969034>] (netdev_start_xmit) from [<809725a4>] (dev_hard_start_xmit+0x19c/0x318) r9:814c0ba0 r8:00000000 r7:82ab1f00 r6:82b44000 r5:00000000 r4:834e5600
| [<80972408>] (dev_hard_start_xmit) from [<809c6584>] (sch_direct_xmit+0xcc/0x264) r10:834e5600 r9:00000000 r8:00000000 r7:82b44000 r6:82ab1f00 r5:834e5600 r4:83f27400
| [<809c64b8>] (sch_direct_xmit) from [<809c6c0c>] (__qdisc_run+0x4f0/0x534)

To fix this problem, only set skb ownership to sockets which have still
a ref count > 0.

Fixes: 0ae89beb28 ("can: add destructor for self generated skbs")
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Andre Naujoks <nautsch2@gmail.com>
Link: https://lore.kernel.org/r/20210226092456.27126-1-o.rempel@pengutronix.de
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 11:45:04 +01:00
Marc Kleine-Budde 2afe72ead5 can: mcp251xfd: revert "can: mcp251xfd: add BQL support"
In the following 4 patches

| 99842c9685 can: dev: can_rx_offload_get_echo_skb(): extend to return can frame length
| 9420e1d495 can: dev: can_get_echo_skb(): extend to return can frame length
| 1dcb6e57db can: dev: can_put_echo_skb(): extend to handle frame_len
| f0ef72febc can: dev: extend struct can_skb_priv to hold CAN frame length

the CAN echo SKB support was extended to hold the CAN frame
length (which is the length of the CAN frame on the wire). It is meant
as a helper for BQL support, to avoid the re-calculation of the frame
length before sending it and on TX-completion.

However if the CAN frame is send without the request to be looped back
the SKB is discarded in can_put_echo_skb() and the subsequent
can_get_echo_skb() and can_rx_offload_get_echo_skb() return 0 for the
CAN frame length. This results in BQL stalling the TX queue after a
few packages.

Until the BQL helpers can_get_echo_skb() and
can_rx_offload_get_echo_skb() are fixed, revert the BQL support for
the mcp251xfd driver.

This reverts commit 4162e18e94.

Fixes: 4162e18e94 ("can: mcp251xfd: add BQL support")
Cc: Manivannan Sadhasivam <mani@kernel.org>
Cc: Thomas Kopp <thomas.kopp@microchip.com>
Link: https://lore.kernel.org/r/20210228083347.28580-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 11:44:45 +01:00
Joakim Zhang c63820045e can: flexcan: invoke flexcan_chip_freeze() to enter freeze mode
Invoke flexcan_chip_freeze() to enter freeze mode, since need poll
freeze mode acknowledge.

Fixes: e955cead03 ("CAN: Add Flexcan CAN controller driver")
Link: https://lore.kernel.org/r/20210218110037.16591-4-qiangqing.zhang@nxp.com
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 10:34:56 +01:00
Joakim Zhang ec15e27cc8 can: flexcan: enable RX FIFO after FRZ/HALT valid
RX FIFO enable failed could happen when do system reboot stress test:

[    0.303958] flexcan 5a8d0000.can: 5a8d0000.can supply xceiver not found, using dummy regulator
[    0.304281] flexcan 5a8d0000.can (unnamed net_device) (uninitialized): Could not enable RX FIFO, unsupported core
[    0.314640] flexcan 5a8d0000.can: registering netdev failed
[    0.320728] flexcan 5a8e0000.can: 5a8e0000.can supply xceiver not found, using dummy regulator
[    0.320991] flexcan 5a8e0000.can (unnamed net_device) (uninitialized): Could not enable RX FIFO, unsupported core
[    0.331360] flexcan 5a8e0000.can: registering netdev failed
[    0.337444] flexcan 5a8f0000.can: 5a8f0000.can supply xceiver not found, using dummy regulator
[    0.337716] flexcan 5a8f0000.can (unnamed net_device) (uninitialized): Could not enable RX FIFO, unsupported core
[    0.348117] flexcan 5a8f0000.can: registering netdev failed

RX FIFO should be enabled after the FRZ/HALT are valid. But the current
code enable RX FIFO and FRZ/HALT at the same time.

Fixes: e955cead03 ("CAN: Add Flexcan CAN controller driver")
Link: https://lore.kernel.org/r/20210218110037.16591-3-qiangqing.zhang@nxp.com
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 10:34:56 +01:00
Joakim Zhang 449052cfeb can: flexcan: assert FRZ bit in flexcan_chip_freeze()
Assert HALT bit to enter freeze mode, there is a premise that FRZ bit is
asserted. This patch asserts FRZ bit in flexcan_chip_freeze, although
the reset value is 1b'1. This is a prepare patch, later patch will
invoke flexcan_chip_freeze() to enter freeze mode, which polling freeze
mode acknowledge.

Fixes: b1aa1c7a21 ("can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze")
Link: https://lore.kernel.org/r/20210218110037.16591-2-qiangqing.zhang@nxp.com
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-01 10:34:56 +01:00
Jakub Kicinski 447621e373 Merge branch 'net-hns3-fixes-fot-net'
Huazhong Tan says:

====================
net: hns3: fixes fot -net

The patchset includes some fixes for the HNS3 ethernet driver.
====================

Link: https://lore.kernel.org/r/1614410693-8107-1-git-send-email-tanhuazhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 12:04:02 -08:00
Jian Shen b36fc875bc net: hns3: fix bug when calculating the TCAM table info
The function hclge_fd_convert_tuple() is used to convert tuples
and tuples mask to TCAM x and y.  But it misuses the source mac
as source mac mask when convert INNER_SRC_MAC, which may cause
the flow director rule works unexpectedly. So fix it.

Fixes: 1173286802 ("net: hns3: Add input key and action config support for flow director")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 12:04:00 -08:00
Jian Shen c75ec148a3 net: hns3: fix query vlan mask value error for flow director
Currently, the driver returns VLAN_VID_MASK for vlan mask field,
when get flow director rule information for rule doesn't use vlan.
It may cause the vlan mask value display as 0xf000 in this
case, like below:

estuary:/$ ethtool -u eth1
50 RX rings available
Total 1 rules

Filter: 2
Rule Type: TCP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 0 mask: 0xffff
VLAN EtherType: 0x0 mask: 0xffff
VLAN: 0x0 mask: 0xf000
User-defined: 0x1234 mask: 0x0
Action: Direct to queue 3

Fix it by return 0.

Fixes: 05c2314fe6 ("net: hns3: Add support for rule query of flow director")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 12:03:59 -08:00
Jian Shen ae85ddda0f net: hns3: fix error mask definition of flow director
Currently, some bit filed definitions of flow director TCAM
configuration command are incorrect. Since the wrong MSB is
always 0, and these fields are assgined in order, so it still works.

Fix it by redefine them.

Fixes: 1173286802 ("net: hns3: Add input key and action config support for flow director")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 12:03:59 -08:00
Daniel Borkmann 89e5c58fc1 net: Fix gro aggregation for udp encaps with zero csum
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
csum for the UDP header itself is 0. In that case, GRO aggregation does
not take place on the phys dev, but instead is deferred to the vxlan/geneve
driver (see trace below).

The reason is essentially that GRO aggregation bails out in udp_gro_receive()
for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e,
others) where for non-zero csums 2abb7cdc0d ("udp: Add support for doing
checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
and napi context has csum_valid set. This is however not the case for zero
UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).

At the same time 57c67ff4bd ("udp: additional GRO support") added matches
on !uh->check ^ !uh2->check as part to determine candidates for aggregation,
so it certainly is expected to handle zero csums in udp_gro_receive(). The
purpose of the check added via 662880f442 ("net: Allow GRO to use and set
levels of checksum unnecessary") seems to catch bad csum and stop aggregation
right away.

One way to fix aggregation in the zero case is to only perform the !csum_valid
check in udp_gro_receive() if uh->check is infact non-zero.

Before:

  [...]
  swapper     0 [008]   731.946506: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100400 len=1500   (1)
  swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100200 len=1500
  swapper     0 [008]   731.946507: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101100 len=1500
  swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101700 len=1500
  swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101b00 len=1500
  swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100600 len=1500
  swapper     0 [008]   731.946508: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100f00 len=1500
  swapper     0 [008]   731.946509: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100a00 len=1500
  swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100500 len=1500
  swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100700 len=1500
  swapper     0 [008]   731.946516: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101d00 len=1500   (2)
  swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101000 len=1500
  swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101c00 len=1500
  swapper     0 [008]   731.946517: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101400 len=1500
  swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100e00 len=1500
  swapper     0 [008]   731.946518: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497101600 len=1500
  swapper     0 [008]   731.946521: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff966497100800 len=774
  swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1)
  swapper     0 [008]   731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112  (2)
  [...]

  # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

   87380  16384  16384    20.01    13129.24

After:

  [...]
  swapper     0 [026]   521.862641: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479000 len=11286 (1)
  swapper     0 [026]   521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1)
  swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d478500 len=2898  (2)
  swapper     0 [026]   521.862650: net:netif_receive_skb: dev=enp10s0f0  skbaddr=0xffff93ab0d479f00 len=8490  (3)
  swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848  (2)
  swapper     0 [026]   521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440  (3)
  [...]

  # netperf -H 10.55.10.4 -t TCP_STREAM -l 20
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

   87380  16384  16384    20.01    24576.53

Fixes: 57c67ff4bd ("udp: additional GRO support")
Fixes: 662880f442 ("net: Allow GRO to use and set levels of checksum unnecessary")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 12:01:08 -08:00
Rafał Miłecki d313d16bba net: broadcom: bcm4908_enet: enable RX after processing packets
When receiving a lot of packets hardware may run out of free
descriptiors and stop RX ring. Enable it every time after handling
received packets.

Fixes: 4feffeadbc ("net: broadcom: bcm4908enet: add BCM4908 controller driver")
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210226132038.29849-1-zajec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 11:51:31 -08:00
Yinjun Zhang a4fc088ad4 ethtool: fix the check logic of at least one channel for RX/TX
The command "ethtool -L <intf> combined 0" may clean the RX/TX channel
count and skip the error path, since the attrs
tb[ETHTOOL_A_CHANNELS_RX_COUNT] and tb[ETHTOOL_A_CHANNELS_TX_COUNT]
are NULL in this case when recent ethtool is used.

Tested using ethtool v5.10.

Fixes: 7be92514b9 ("ethtool: check if there is at least one channel for TX/RX in the core")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Link: https://lore.kernel.org/r/20210225125102.23989-1-simon.horman@netronome.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28 11:49:07 -08:00
Jakub Kicinski eee7ede695 Merge branch 'bnxt_en-error-recovery-bug-fixes'
Michael Chan says:

====================
bnxt_en: Error recovery bug fixes.

Two error recovery related bug fixes for 2 corner cases.

Please queue patch #2 for -stable.  Thanks.
====================

Link: https://lore.kernel.org/r/1614332590-17865-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:50:25 -08:00
Edwin Peer 20d7d1c5c9 bnxt_en: reliably allocate IRQ table on reset to avoid crash
The following trace excerpt corresponds with a NULL pointer dereference
of 'bp->irq_tbl' in bnxt_setup_inta() on an Aarch64 system after many
device resets:

    Unable to handle kernel NULL pointer dereference at ... 000000d
    ...
    pc : string+0x3c/0x80
    lr : vsnprintf+0x294/0x7e0
    sp : ffff00000f61ba70 pstate : 20000145
    x29: ffff00000f61ba70 x28: 000000000000000d
    x27: ffff0000009c8b5a x26: ffff00000f61bb80
    x25: ffff0000009c8b5a x24: 0000000000000012
    x23: 00000000ffffffe0 x22: ffff000008990428
    x21: ffff00000f61bb80 x20: 000000000000000d
    x19: 000000000000001f x18: 0000000000000000
    x17: 0000000000000000 x16: ffff800b6d0fb400
    x15: 0000000000000000 x14: ffff800b7fe31ae8
    x13: 00001ed16472c920 x12: ffff000008c6b1c9
    x11: ffff000008cf0580 x10: ffff00000f61bb80
    x9 : 00000000ffffffd8 x8 : 000000000000000c
    x7 : ffff800b684b8000 x6 : 0000000000000000
    x5 : 0000000000000065 x4 : 0000000000000001
    x3 : ffff0a00ffffff04 x2 : 000000000000001f
    x1 : 0000000000000000 x0 : 000000000000000d
    Call trace:
    string+0x3c/0x80
    vsnprintf+0x294/0x7e0
    snprintf+0x44/0x50
    __bnxt_open_nic+0x34c/0x928 [bnxt_en]
    bnxt_open+0xe8/0x238 [bnxt_en]
    __dev_open+0xbc/0x130
    __dev_change_flags+0x12c/0x168
    dev_change_flags+0x20/0x60
    ...

Ordinarily, a call to bnxt_setup_inta() (not in trace due to inlining)
would not be expected on a system supporting MSIX at all. However, if
bnxt_init_int_mode() does not end up being called after the call to
bnxt_clear_int_mode() in bnxt_fw_reset_close(), then the driver will
think that only INTA is supported and bp->irq_tbl will be NULL,
causing the above crash.

In the error recovery scenario, we call bnxt_clear_int_mode() in
bnxt_fw_reset_close() early in the sequence. Ordinarily, we will
call bnxt_init_int_mode() in bnxt_hwrm_if_change() after we
reestablish communication with the firmware after reset.  However,
if the sequence has to abort before we call bnxt_init_int_mode() and
if the user later attempts to re-open the device, then it will cause
the crash above.

We fix it in 2 ways:

1. Check for bp->irq_tbl in bnxt_setup_int_mode(). If it is NULL, call
bnxt_init_init_mode().

2. If we need to abort in bnxt_hwrm_if_change() and cannot complete
the error recovery sequence, set the BNXT_STATE_ABORT_ERR flag.  This
will cause more drastic recovery at the next attempt to re-open the
device, including a call to bnxt_init_int_mode().

Fixes: 3bc7d4a352 ("bnxt_en: Add BNXT_STATE_IN_FW_RESET state.")
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:50:23 -08:00
Vasundhara Volam d20cd74521 bnxt_en: Fix race between firmware reset and driver remove.
The driver's error recovery reset sequence can take many seconds to
complete and only the critical sections are protected by rtnl_lock.
A recent change has introduced a regression in this sequence.

bnxt_remove_one() may be called while the recovery is in progress.
Normally, unregister_netdev() would cause bnxt_close_nic() to be
called and this would cause the error recovery to safely abort
with the BNXT_STATE_ABORT_ERR flag set in bnxt_close_nic().

Recently, we added bnxt_reinit_after_abort() to allow the user to
reopen the device after an aborted recovery.  This causes the
regression in the scenario described above because we would
attempt to re-open even after the netdev has been unregistered.

Fix it by checking the netdev reg_state in
bnxt_reinit_after_abort() and abort if it is unregistered.

Fixes: 6882c36cf8 ("bnxt_en: attempt to reinitialize after aborted reset")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:50:23 -08:00
Jakub Kicinski 82adc457ac Merge branch 'mlxsw-various-fixes'
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset contains various fixes for mlxsw.

Patch #1 fixes a race condition in a selftest. The race and fix are
explained in detail in the changelog.

Patch #2 re-adds a link mode that was wrongly removed, resulting in a
regression in some setups.

Patch #3 fixes a race condition in route installation with nexthop
objects.

Please consider patches #2 and #3 for stable.
====================

Link: https://lore.kernel.org/r/20210225165721.1322424-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:56 -08:00
Ido Schimmel dc860b88ce mlxsw: spectrum_router: Ignore routes using a deleted nexthop object
Routes are currently processed from a workqueue whereas nexthop objects
are processed in system call context. This can result in the driver not
finding a suitable nexthop group for a route and issuing a warning [1].

Fix this by ignoring such routes earlier in the process. The subsequent
deletion notification will be ignored as well.

[1]
 WARNING: CPU: 2 PID: 7754 at drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4853 mlxsw_sp_router_fib_event_work+0x1112/0x1e00 [mlxsw_spectrum]
 [...]
 CPU: 2 PID: 7754 Comm: kworker/u8:0 Not tainted 5.11.0-rc6-cq-20210207-1 #16
 Hardware name: Mellanox Technologies Ltd. MSN2100/SA001390, BIOS 5.6.5 05/24/2018
 Workqueue: mlxsw_core_ordered mlxsw_sp_router_fib_event_work [mlxsw_spectrum]
 RIP: 0010:mlxsw_sp_router_fib_event_work+0x1112/0x1e00 [mlxsw_spectrum]

Fixes: cdd6cfc54c ("mlxsw: spectrum_router: Allow programming routes with nexthop objects")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Alex Veber <alexve@nvidia.com>
Tested-by: Alex Veber <alexve@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:53 -08:00
Danielle Ratson ae9b24ddb6 mlxsw: spectrum_ethtool: Add an external speed to PTYS register
Currently, only external bits are added to the PTYS register, whereas
there is one external bit that is wrongly marked as internal, and so was
recently removed from the register.

Add that bit to the PTYS register again, as this bit is no longer
internal.

Its removal resulted in '100000baseLR4_ER4/Full' link mode no longer
being supported, causing a regression on some setups.

Fixes: 5bf01b571c ("mlxsw: spectrum_ethtool: Remove internal speeds from PTYS register")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reported-by: Eddie Shklaer <eddies@nvidia.com>
Tested-by: Eddie Shklaer <eddies@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:53 -08:00
Danielle Ratson edcbf5137f selftests: forwarding: Fix race condition in mirror installation
When mirroring to a gretap in hardware the device expects to be
programmed with the egress port and all the encapsulating headers. This
requires the driver to resolve the path the packet will take in the
software data path and program the device accordingly.

If the path cannot be resolved (in this case because of an unresolved
neighbor), then mirror installation fails until the path is resolved.
This results in a race that causes the test to sometimes fail.

Fix this by setting the neighbor's state to permanent, so that it is
always valid.

Fixes: b5b029399f ("selftests: forwarding: mirror_gre_bridge_1d_vlan: Add STP test")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:52 -08:00
Arjun Roy 2107d45f17 tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE)
getsockopt(TCP_ZEROCOPY_RECEIVE) has a bug where we read a
user-provided "len" field of type signed int, and then compare the
value to the result of an "offsetofend" operation, which is unsigned.

Negative values provided by the user will be promoted to large
positive numbers; thus checking that len < offsetofend() will return
false when the intention was that it return true.

Note that while len is originally checked for negative values earlier
on in do_tcp_getsockopt(), subsequent calls to get_user() re-read the
value from userspace which may have changed in the meantime.

Therefore, re-add the check for negative values after the call to
get_user in the handler code for TCP_ZEROCOPY_RECEIVE.

Fixes: c8856c0514 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Link: https://lore.kernel.org/r/20210225232628.4033281-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:47:15 -08:00
Heiko Thiery 6a4d7234ae net: fec: ptp: avoid register access when ipg clock is disabled
When accessing the timecounter register on an i.MX8MQ the kernel hangs.
This is only the case when the interface is down. This can be reproduced
by reading with 'phc_ctrl eth0 get'.

Like described in the change in 91c0d987a9
the igp clock is disabled when the interface is down and leads to a
system hang.

So we check if the ptp clock status before reading the timecounter
register.

Signed-off-by: Heiko Thiery <heiko.thiery@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20210225211514.9115-1-heiko.thiery@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26 15:45:18 -08:00