Fixes the following sparse warning:
net/8021q/vlan_core.c:168:6: warning:
symbol 'vlan_hw_filter_capable' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series contains updates to mlx5 core and mlx5e netdev drivers.
The main highlight of this series is the RX optimizations for striding RQ path,
introduced by Tariq.
First Four patches are trivial misc cleanups.
- Spelling mistake fix
- Dead code removal
- Warning messages
RX optimizations for striding RQ:
1) RX refactoring, cleanups and micro optimizations
- MTU calculation simplifications, obsoletes some WQEs-to-packets translation
functions and helps delete ~60 LOC.
- Do not busy-wait a pending UMR completion.
- post the new values of UMR WQE inline, instead of using a data pointer.
- use pre-initialized structures to save calculations in datapath.
2) Use linear SKB in Striding RQ "build_skb", (Using linear SKB has many advantages):
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
3) Support XDP over Striding RQ:
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.
Striding RQ is capable of a higher packet-rate than
conventional RQ.
Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).
XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate | 100GbE linerate | pct% |
--------------------------------------------------
| 64byte | 126.2 Mpps | 148.0 Mpps | 85% |
| 128byte | 80.0 Mpps | 84.8 Mpps | 94% |
| 256byte | 42.7 Mpps | 42.7 Mpps | 100% |
| 512byte | 23.4 Mpps | 23.4 Mpps | 100% |
--------------------------------------------------
4) Remove mlx5 page_ref bulking in Striding RQ and use page_ref_inc only when needed.
Without this bulking, we have:
- no atomic ops on WQE allocation or free
- one atomic op per SKB
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 atomic ops as before
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking have no atomic ops per packet at all.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Thanks,
saeed.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJavs5fAAoJEEg/ir3gV/o+iXQIAJQ4jcYb5V3AEPqUeiTNOH2h
e2yyXj2zXNTCl2cekmJriWfQjsA5YizaTNipHb1xR8pznAiIMGmiK5nr8idRY1Qh
M/awuoxJszj8a+z3SxrL/ilgf4HF/89YEt+5MnU/2ihBC3EGG0UbJ6TAC0cXMzmG
Xghi5omlCfsqQQkWooxPVSdRXERLsgzo5kjZ2Zpln/GJa0vVPmIV7ojoQQQzFCMf
eEQzqqEeOk4rk8Z2/5fdsWYjwa2XLnvtUtRBKX/hxCd2zYrFpGUxkzsT/Mikeu+Z
AZAJA4yfHs3dKS3T4CaKDBqhUVxdAsuecT9JlqzgLVEbmw7YypacrPT0TBBsJcI=
=qnbR
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2018-03-30
This series contains updates to mlx5 core and mlx5e netdev drivers.
The main highlight of this series is the RX optimizations for striding RQ path,
introduced by Tariq.
First Four patches are trivial misc cleanups.
- Spelling mistake fix
- Dead code removal
- Warning messages
RX optimizations for striding RQ:
1) RX refactoring, cleanups and micro optimizations
- MTU calculation simplifications, obsoletes some WQEs-to-packets translation
functions and helps delete ~60 LOC.
- Do not busy-wait a pending UMR completion.
- post the new values of UMR WQE inline, instead of using a data pointer.
- use pre-initialized structures to save calculations in datapath.
2) Use linear SKB in Striding RQ "build_skb", (Using linear SKB has many advantages):
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
3) Support XDP over Striding RQ:
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.
Striding RQ is capable of a higher packet-rate than
conventional RQ.
Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).
XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate | 100GbE linerate | pct% |
--------------------------------------------------
| 64byte | 126.2 Mpps | 148.0 Mpps | 85% |
| 128byte | 80.0 Mpps | 84.8 Mpps | 94% |
| 256byte | 42.7 Mpps | 42.7 Mpps | 100% |
| 512byte | 23.4 Mpps | 23.4 Mpps | 100% |
--------------------------------------------------
4) Remove mlx5 page_ref bulking in Striding RQ and use page_ref_inc only when needed.
Without this bulking, we have:
- no atomic ops on WQE allocation or free
- one atomic op per SKB
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 atomic ops as before
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking have no atomic ops per packet at all.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWr6a+fu3V2unywtrAQJDgQ/+Pdyv8EW7VtdPRNdePGJLh4WHVrCnSL64
hKzHV3gEt8PK7sJi53lkbvTYJ3hpB/SC9Li0TdOvph4ngH8pUer7HiXW9g2zrcy+
KsbhmWslMOH6uy+8f2yWoI0HmSz7XvzQATCKOJz5Lm7X0JfvSoKkhgn6VLxqG91C
p3IAYr5735/mqZIqs+FCIP2jLIPtCH/Plv18F9w847dSS1qsH6twAsUVjzn9CtJv
MHQ+Wo57tpoOwynnZO+CDLhZGdXJ5YIZ8wXNqu/EUaeAHXgqkpSd4IUZeguDpoJR
YrkvMf8cQlcBDohHBBmIVz9OOIZ8A7WbygNf7UPS9DGYjNVL9vKxF89xwAKYA5Ky
LYmUug8WYp2FIHxuWGomF15SrzbNNDfD/v7ZvjXmNxvJXXV+AbzNoXZifSAuyo2V
2bgolw22PbZAMgRvBr5OjtxGD65lpCD00ZJWBgFrWn8l3hdK/40NJW5s3nQCShmm
NCUAFV+nHvazfUNfcBAbVRPXCN8Pc0Cuspr6HJ3Wi4q5lnpJQY+hS+KFNizL6iyu
1fNRcqBF8KnVM5bE5Y27o96W/RhNdFudYB+sAH5IEGRqqG5YjkwvPgILvXhTH2kI
MDxHMFjRgvWijzLLPpBd9YAsPLddvE6WRlngqOrSp6ICP4o8Vmeb2mxgo4tbQBXu
3dIL1fEfhME=
=Lyzx
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-next-20180330' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fixes and more traces
Here are some patches that add some more tracepoints to AF_RXRPC and fix
some issues therein:
(1) Fix the use of VERSION packets to keep firewall routes open.
(2) Fix the incorrect current time usage in a tracepoint.
(3) Fix Tx ring annotation corruption.
(4) Fix accidental conversion of call-level abort into connection-level
abort.
(5) Fix calculation of resend time.
(6) Remove a couple of unused variables.
(7) Fix a bunch of checker warnings and an error. Note that not all
warnings can be quashed as checker doesn't seem to correctly handle
seqlocks.
(8) Fix a potential race between call destruction and socket/net
destruction.
(9) Add a tracepoint to track rxrpc_local refcounting.
(10) Fix an apparent leak of rxrpc_local objects.
(11) Add a tracepoint to track rxrpc_peer refcounting.
(12) Fix a leak of rxrpc_peer objects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The variables, msg and data, have the same value. This patch removes
the extra one.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than use an on-stack array to copy a broadcast address, use
the generic eth_broadcast_addr function to save a trivial amount of
object code.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai says:
====================
net_rwsem fixes
there is wext_netdev_notifier_call()->wireless_nlevent_flush()
netdevice notifier, which takes net_rwsem, so we can't take
net_rwsem in {,un}register_netdevice_notifier().
Since {,un}register_netdevice_notifier() is executed under
pernet_ops_rwsem, net_namespace_list can't change, while we
holding it, so there is no need net_rwsem in these functions [1/2].
The same is in [2/2]. We make callers of __rtnl_link_unregister()
take pernet_ops_rwsem, and close the race with setup_net()
and cleanup_net(), so __rtnl_link_unregister() does not need it.
This also fixes the problem of that __rtnl_link_unregister() does
not see initializing and exiting nets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This function calls call_netdevice_notifier(), which also
may take net_rwsem. So, we can't use net_rwsem here.
This patch makes callers of this functions take pernet_ops_rwsem,
like register_netdevice_notifier() does. This will protect
the modifications of net_namespace_list, and allows notifiers
to take it (they won't have to care about context).
Since __rtnl_link_unregister() is used on module load
and unload (which are not frequent operations), this looks
for me better, than make all call_netdevice_notifier()
always executing in "protected net_namespace_list" context.
Also, this fixes the problem we had a deal in 328fbe747a
"Close race between {un, }register_netdevice_notifier and ...",
and guarantees __rtnl_link_unregister() does not skip
exitting net.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions take net_rwsem, while wireless_nlevent_flush()
also takes it. But down_read() can't be taken recursive,
because of rw_semaphore design, which prevents it to be occupied
by only readers forever.
Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(),
net list can't change, so these down_read()/up_read() can be removed.
Fixes: f0b07bb151 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need for explicit calls of devm_kfree(), as the allocated
memory will be freed during driver's detach.
The driver core clears the driver data to NULL after device_release.
Thus, it is not needed to manually clear the device driver data to NULL.
So remove the unnecessary pci_set_drvdata() and devm_kfree().
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change nsim_devlink_setup to return any error back to the caller and
update nsim_init to handle it.
Requested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy says:
====================
tipc: slim down name table
We clean up and improve the name binding table:
- Replace the memory consuming 'sub_sequence/service range' array with
an RB tree.
- Introduce support for overlapping service sequences/ranges
v2: #1: Fixed a missing initialization reported by David Miller
#4: Obsoleted and replaced a few more macros to get a consistent
terminology in the API.
#5: Added new commit to fix a potential string overflow bug (it
is still only in net-next) reported by Arnd Bergmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc points out that the combined length of the fixed-length inputs to
l->name is larger than the destination buffer size:
net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
into a region of size between 26 and 58 [-Werror=format-overflow=]
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
(assuming 75) into a destination of size 60
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
A detailed analysis reveals that the theoretical maximum length of
a link name is:
max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
Since we also need space for a trailing zero we now set MAX_LINK_NAME
to 68.
Just to be on the safe side we also replace the sprintf() call with
snprintf().
Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address
hash values")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The three address type structs in the user API have names that in
reality reflect the specific, non-Linux environment where they were
originally created.
We now give them more intuitive names, in accordance with how TIPC is
described in the current documentation.
struct tipc_portid -> struct tipc_socket_addr
struct tipc_name -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range
To avoid confusion, we also update some commmets and macro names to
match the new terminology.
For compatibility, we add macros that map all old names to the new ones.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new RB tree structure for service ranges it becomes possible to
solve an old problem; - we can now allow overlapping service ranges in
the table.
When inserting a new service range to the tree, we use 'lower' as primary
key, and when necessary 'upper' as secondary key.
Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.
These changes guarantee that a well-formed publication/withdrawal item
from a peer node never will be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_nametbl_translate() function is ugly and hard to
follow. This can be improved somewhat by introducing a stack variable
for holding the publication list to be used and re-ordering the if-
clauses for selection of algorithm.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current design of the binding table has an unnecessary memory
consuming and complex data structure. It aggregates the service range
items into an array, which is expanded by a factor two every time it
becomes too small to hold a new item. Furthermore, the arrays never
shrink when the number of ranges diminishes.
We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.
This, along with a few name changes, improves both readability and
volume of the code, as well as reducing memory consumption and hopefully
improving cache hit rate.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov says:
====================
net: bridge: MTU handling changes
As previously discussed the recent changes break some setups and could lead
to packet drops. Thus the first patch reverts the behaviour for the bridge
to follow the minimum MTU but also keeps the ability to set the MTU to the
maximum (out of all ports) if vlan filtering is enabled. Patch 02 is the
bigger change in behaviour - we've always had trouble when configuring
bridges and their MTU which is auto tuning on port events
(add/del/changemtu), which means config software needs to chase it and fix
it after each such event, after patch 02 we allow the user to configure any
MTU (ETH_MIN/MAX limited) but once that is done the bridge stops auto
tuning and relies on the user to keep the MTU correct.
This should be compatible with cases that don't touch the MTU (or set it
to the same value), while allowing to configure the MTU and not worry
about it changing afterwards.
The patches are intentionally split like this, so that if they get accepted
and there are any complaints patch 02 can be reverted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As Roopa noted today the biggest source of problems when configuring
bridge and ports is that the bridge MTU keeps changing automatically on
port events (add/del/changemtu). That leads to inconsistent behaviour
and network config software needs to chase the MTU and fix it on each
such event. Let's improve on that situation and allow for the user to
set any MTU within ETH_MIN/MAX limits, but once manually configured it
is the user's responsibility to keep it correct afterwards.
In case the MTU isn't manually set - the behaviour reverts to the
previous and the bridge follows the minimum MTU.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently the bridge was changed to automatically set maximum MTU on port
events (add/del/changemtu) when vlan filtering is enabled, but that
actually changes behaviour in a way which breaks some setups and can lead
to packet drops. In order to still allow that maximum to be set while being
compatible, we add the ability for the user to tune the bridge MTU up to
the maximum when vlan filtering is enabled, but that has to be done
explicitly and all port events (add/del/changemtu) lead to resetting that
MTU to the minimum as before.
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev says:
====================
net: thunderx: implement DMAC filtering support
By default CN88XX BGX accepts all incoming multicast and broadcast
packets and filtering is disabled. The nic driver doesn't provide
an ability to change such behaviour.
This series is to implement DMAC filtering management for CN88XX
nic driver allowing user to enable/disable filtering and configure
specific MAC addresses to filter traffic.
Changes from v1:
build issues:
- update code in order to address compiler warnings;
checkpatch.pl reported issues:
- update code in order to fit 80 symbols length;
- update commit descriptions in order to fit 80 symbols length;
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The ndo_set_rx_mode() is called from atomic context which causes
messages response timeouts while VF to PF communication via MSIx.
To get rid of that we're copy passed mc list, parse flags and queue
handling of kernel request to ordered workqueue.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel calls ndo_set_rx_mode() callback from atomic context which
causes messaging timeouts between VF and PF (as they’re implemented via
MSIx). So in order to handle ndo_set_rx_mode() we need to get rid of it.
This commit implements necessary workqueue related structures to let VF
queue kernel request processing in non-atomic context later.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is to add message handling for ndo_set_rx_mode()
callback at PF side.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel calls ndo_set_rx_mode() callback supplying it will all necessary
info, such as device state flags, multicast mac addresses list and so on.
Since we have only 128 bits to communicate with PF we need to initiate
several requests to PF with small/short operation each based on input data.
So this commit implements following PF messages codes along with new
data structures for them:
NIC_MBOX_MSG_RESET_XCAST to flush all filters configured for this
particular network interface (VF)
NIC_MBOX_MSG_ADD_MCAST to add new MAC address to DMAC filter registers
for this particular network interface (VF)
NIC_MBOX_MSG_SET_XCAST to apply filtering configuration to filter control
register
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ThunderX NIC could be partitioned to up to 128 VFs and thus
represented to system. Each VF is mapped to pair BGX:LMAC, and each of VF
is configured by kernel individually. Eventually the bunch of VFs could be
mapped onto same pair BGX:LMAC and thus could cause several multicast
filtering configuration requests to LMAC with the same MAC addresses.
This commit is to add ThunderX NIC BGX filtering manipulation routines.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ThunderX NIC has two Ethernet Interfaces (BGX) each of them could has
up to four Logical MACs configured. Each of BGX has 32 filters to be
configured for filtering ingress packets. The number of filters available
to particular LMAC is from 8 (if we have four LMACs configured per BGX)
up to 32 (in case of only one LMAC is configured per BGX).
At the same time the NIC could present up to 128 VFs to OS as network
interfaces, each of them kernel will configure with set of MAC addresses
for filtering. So to prevent dupes in BGX filter registers from different
network interfaces it is required to cache and track all filter
configuration requests prior to applying them onto BGX filter registers.
This commit is to update LMAC structures with control fields to
allocate/releasing filters tracking list along with implementing
dmac array allocate/release per LMAC.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ThunderX NIC has set of registers which allows to configure
filter policy for ingress packets. There are three possible regimes
of filtering multicasts, broadcasts and unicasts: accept all, reject all
and accept filter allowed only.
Current implementation has enum with all of them and two generic macro
for enabling filtering et all (CAM_ACCEPT) and enabling/disabling
broadcast packets, which also should be corrected in order to represent
register bits properly. All these values are private for driver and
there is no need to ‘publish’ them via header file.
This commit is to move filtering register manipulation values from
header file into source with explicit assignment of exact register
values to them to be used while register configuring.
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl says:
====================
Meson8m2 support for dwmac-meson8b
The Meson8m2 SoC is an updated version of the Meson8 SoC. Some of the
peripherals are shared with Meson8b (for example the watchdog registers
and the internal temperature sensor calibration procedure).
Meson8m2 also seems to include the same Gigabit MAC register layout as
Meson8b.
The registers in the Amlogic dwmac "glue" seem identical between Meson8b
and Meson8m2. Manual testing seems to confirm this.
To be extra-safe a new compatible string is added because there's no
(public) documentation on the Meson8m2 SoC. This will allow us to
implement any SoC-specific variations later on (if needed).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Meson8m2 SoC uses a similar (potentially even identical) register
layout as the Meson8b and GXBB SoCs for the dwmac glue.
Add a new compatible string and update the module description to
indicate support for these SoCs.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Meson8m2 SoC uses a similar (potentially even identical) register
layout for the dwmac glue as Meson8b and GXBB. Unfortunately there is no
documentation available.
Testing shows that both, RMII and RGMII PHYs are working if they are
configured as on Meson8b. Add a new compatible string to the
documentation so differences (if there are any) between Meson8m2 and the
other SoCs can be taken care of within the driver.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon a new UMR post, check if the WQE buffer contains
a previous UMR WQE. If so, modify the dynamic fields
instead of a whole WQE overwrite. This saves a memcpy.
In current setting, after 2 WQ cycles (12 UMR posts),
this will always be the case.
No degradation sensed.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
All UMR WQEs of an RQ share many common fields. We use
pre-initialized structures to save calculations in datapath.
One field (xlt_offset) was the only reason we saved a pre-initialized
copy per WQE index.
Here we remove its initialization (move its calculation to datapath),
and reduce the number of copies to one-per-RQ.
A very small datapath calculation is added, it occurs once per a MPWQE
(i.e. once every 256KB), but reduces memory consumption and gives
better cache utilization.
Performance testing:
Tested packet rate, no degradation sensed.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add XDP support over Striding RQ.
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.
Upon a MPWQE free, do not release pages that are being
XDP xmit, they will be released upon completions.
Striding RQ is capable of a higher packet-rate than
conventional RQ.
A performance gain is expected for all cases that had
a HW packet-rate bottleneck. This is the case whenever
using many flows that distribute to many cores.
Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).
XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate | 100GbE linerate | pct% |
--------------------------------------------------
| 64byte | 126.2 Mpps | 148.0 Mpps | 85% |
| 128byte | 80.0 Mpps | 84.8 Mpps | 94% |
| 256byte | 42.7 Mpps | 42.7 Mpps | 100% |
| 512byte | 23.4 Mpps | 23.4 Mpps | 100% |
--------------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Make the xdp_xmit indication available for Striding RQ
by taking it out of the type-specific union.
This refactor is a preparation for a downstream patch that
adds XDP support over Striding RQ.
In addition, use a bitmap instead of a boolean for possible
future flags.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When modifying the page mapping of a HW memory region
(via a UMR post), post the new values inlined in WQE,
instead of using a data pointer.
This is a micro-optimization, inline UMR WQEs of different
rings scale better in HW.
In addition, this obsoletes a few control flows and helps
delete ~50 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Do not busy-wait a pending UMR completion. Under high HW load,
busy-waiting a delayed completion would fully utilize the CPU core
and mistakenly indicate a SW bottleneck.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Gets the process of a UMR WQE post in one function,
in preparation for a downstream patch that inlines
the WQE data.
No functional change here.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In Striding RQ, each WQE serves multiple packets
(hence called Multi-Packet WQE, MPWQE).
The size of a MPWQE is constant (currently 256KB).
Upon a ringparam set operation, we calculate the number of
MPWQEs per RQ. For this, first it is needed to determine the
number of packets that can reside within a single MPWQE.
In this patch we use the actual MTU size instead of ETH_DATA_LEN
for this calculation.
This implies that a change in MTU might require a change
in Striding RQ ring size.
In addition, this obsoletes some WQEs-to-packets translation
functions and helps delete ~60 LOC.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Knowing the MTU is required for RQ creation flow.
By our design, channels creation flow is totally isolated
from priv/netdev, and can be completed with access to
channels params and mdev.
Adding the MTU to the channels params helps preserving that.
In addition, we save it in RQ to make its access faster in
datapath checks.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
With ConnectX-4, we expect the force teardown to fail in case that
DC was enabled, therefore change the message from error to warning.
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
1. This function is not used anywhere in mlx5 driver
2. It has a memcpy statement that makes no sense and produces build
warning with gcc8
drivers/net/ethernet/mellanox/mlx5/core/transobj.c: In function 'mlx5_core_query_xsrq':
drivers/net/ethernet/mellanox/mlx5/core/transobj.c:347:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]
Fixes: 01949d0109 ("net/mlx5_core: Enable XRCs and SRQs when using ISSI > 0")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of looking for the EQ of the CQ, remove that redundant code and
use the eq pointer stored in the cq struct.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When a new client call is requested, an rxrpc_conn_parameters struct object
is passed in with a bunch of parameters set, such as the local endpoint to
use. A pointer to the target peer record is also placed in there by
rxrpc_get_client_conn() - and this is removed if and only if a new
connection object is allocated. Thus it leaks if a new connection object
isn't allocated.
Fix this by putting any peer object attached to the rxrpc_conn_parameters
object in the function that allocated it.
Fixes: 19ffa01c9c ("rxrpc: Use structs to hold connection params and protocol info")
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_local objects cannot be disposed of until all the connections that
point to them have been RCU'd as a connection object holds refcount on the
local endpoint it is communicating through. Currently, this can cause an
assertion failure to occur when a network namespace is destroyed as there's
no check that the RCU destructors for the connections have been run before
we start trying to destroy local endpoints.
The kernel reports:
rxrpc: AF_RXRPC: Leaked local 0000000036a41bc1 {5}
------------[ cut here ]------------
kernel BUG at ../net/rxrpc/local_object.c:439!
Fix this by keeping a count of the live connections and waiting for it to
go to zero at the end of rxrpc_destroy_all_connections().
Fixes: dee46364ce ("rxrpc: Add RCU destruction for connections and calls")
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_call structs don't pin sockets or network namespaces, but may attempt
to access both after their refcount reaches 0 so that they can detach
themselves from the network namespace. However, there's no guarantee that
the socket still exists at this point (so sock_net(&call->socket->sk) may
be invalid) and the namespace may have gone away if the call isn't pinning
a peer.
Fix this by (a) carrying a net pointer in the rxrpc_call struct and (b)
waiting for all calls to be destroyed when the network namespace goes away.
This was detected by checker:
net/rxrpc/call_object.c:634:57: warning: incorrect type in argument 1 (different address spaces)
net/rxrpc/call_object.c:634:57: expected struct sock const *sk
net/rxrpc/call_object.c:634:57: got struct sock [noderef] <asn:4>*<noident>
Fixes: 2baec2c3f8 ("rxrpc: Support network namespacing")
Signed-off-by: David Howells <dhowells@redhat.com>