WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Arnd Bergmann	ad2f99aedf	net: bridge: move bridge ioctls out of .ndo_do_ioctl Working towards obsoleting the .ndo_do_ioctl operation entirely, stop passing the SIOCBRADDIF/SIOCBRDELIF device ioctl commands into this callback. My first attempt was to add another ndo_siocbr() callback, but as there is only a single driver that takes these commands and there is already a hook mechanism to call directly into this driver, extend this hook instead, and use it for both the deviceless and the device specific ioctl commands. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: bridge@lists.linux-foundation.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-27 20:11:45 +01:00
Arnd Bergmann	561d835281	bridge: use ndo_siocdevprivate The bridge driver has an old set of ioctls using the SIOCDEVPRIVATE namespace that have never worked in compat mode and are explicitly forbidden already. Move them over to ndo_siocdevprivate and fix compat mode for these, because we can. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: bridge@lists.linux-foundation.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-27 20:11:43 +01:00
Vladimir Oltean	ee80dd2e89	net: bridge: add a helper for retrieving port VLANs from the data path Introduce a brother of br_vlan_get_info() which is protected by the RCU mechanism, as opposed to br_vlan_get_info() which relies on taking the write-side rtnl_mutex. This is needed for drivers which need to find out whether a bridge port has a VLAN configured or not. For example, certain DSA switches might not offer complete source port identification to the CPU on RX, just the VLAN in which the packet was received. Based on this VLAN, we cannot set an accurate skb->dev ingress port, but at least we can configure one that behaves the same as the correct one would (this is possible because DSA sets skb->offload_fwd_mark = 1). When we look at the bridge RX handler (br_handle_frame), we see that what matters regarding skb->dev is the VLAN ID and the port STP state. So we need to select an skb->dev that has the same bridge VLAN as the packet we're receiving, and is in the LEARNING or FORWARDING STP state. The latter is easy, but for the former, we should somehow keep a shadow list of the bridge VLANs on each port, and a lookup table between VLAN ID and the 'designated port for imprecise RX'. That is rather complicated to keep in sync properly (the designated port per VLAN needs to be updated on the addition and removal of a VLAN, as well as on the join/leave events of the bridge on that port). So, to avoid all that complexity, let's just iterate through our finite number of ports and ask the bridge, for each packet: "do you have this VLAN configured on this port?". Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-26 22:35:22 +01:00
Vladimir Oltean	f7cdb3ecc9	net: bridge: update BROPT_VLAN_ENABLED before notifying switchdev in br_vlan_filter_toggle SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING is notified by the bridge from two places: - nbp_vlan_init(), during bridge port creation - br_vlan_filter_toggle(), during a netlink/sysfs/ioctl change requested by user space If a switchdev driver uses br_vlan_enabled(br_dev) inside its handler for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING attribute notifier, different things will be seen depending on whether the bridge calls from the first path or the second: - in nbp_vlan_init(), br_vlan_enabled() reflects the current state of the bridge - in br_vlan_filter_toggle(), br_vlan_enabled() reflects the past state of the bridge This can lead in some cases to complications in driver implementation, which can be avoided if these could reliably use br_vlan_enabled(). Nothing seems to depend on this behavior, and it seems overall more straightforward for br_vlan_enabled() to return the proper value even during the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING notifier, so temporarily enable the bridge option, then revert it if the switchdev notifier failed. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-26 22:35:22 +01:00
Vladimir Oltean	c538115439	net: bridge: fix build when setting skb->offload_fwd_mark with CONFIG_NET_SWITCHDEV=n Switchdev support can be disabled at compile time, and in that case, struct sk_buff will not contain the offload_fwd_mark field. To make the code in br_forward.c work in both cases, we do what is done in other places and we create a helper function, with an empty shim definition, that is implemented by the br_switchdev.o translation module. This is always compiled if and only if CONFIG_NET_SWITCHDEV is y or m. Reported-by: kernel test robot <lkp@intel.com> Fixes: `472111920f` ("net: bridge: switchdev: allow the TX data plane forwarding to be offloaded") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-24 21:48:26 +01:00
Tobias Waldekranz	472111920f	net: bridge: switchdev: allow the TX data plane forwarding to be offloaded Allow switchdevs to forward frames from the CPU in accordance with the bridge configuration in the same way as is done between bridge ports. This means that the bridge will only send a single skb towards one of the ports under the switchdev's control, and expects the driver to deliver the packet to all eligible ports in its domain. Primarily this improves the performance of multicast flows with multiple subscribers, as it allows the hardware to perform the frame replication. The basic flow between the driver and the bridge is as follows: - When joining a bridge port, the switchdev driver calls switchdev_bridge_port_offload() with tx_fwd_offload = true. - The bridge sends offloadable skbs to one of the ports under the switchdev's control using skb->offload_fwd_mark = true. - The switchdev driver checks the skb->offload_fwd_mark field and lets its FDB lookup select the destination port mask for this packet. v1->v2: - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long - introduce a static key "br_switchdev_fwd_offload_used" to minimize the impact of the newly introduced feature on all the setups which don't have hardware that can make use of it - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache line access - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in __br_forward() - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge is being used - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP v2->v3: - replace the solution based on .ndo_dfwd_add_station with a solution based on switchdev_bridge_port_offload - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD v3->v4: rebase v4->v5: - make sure the static key is decremented on bridge port unoffload - more function and variable renaming and comments for them: br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload fwd_accel to tx_fwd_offload Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-23 16:32:37 +01:00
David S. Miller	5af84df962	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-23 16:13:06 +01:00
Vladimir Oltean	4e51bf44a0	net: bridge: move the switchdev object replay helpers to "push" mode Starting with commit `4f2673b3a2` ("net: bridge: add helper to replay port and host-joined mdb entries"), DSA has introduced some bridge helpers that replay switchdev events (FDB/MDB/VLAN additions and deletions) that can be lost by the switchdev drivers in a variety of circumstances: - an IP multicast group was host-joined on the bridge itself before any switchdev port joined the bridge, leading to the host MDB entries missing in the hardware database. - during the bridge creation process, the MAC address of the bridge was added to the FDB as an entry pointing towards the bridge device itself, but with no switchdev ports being part of the bridge yet, this local FDB entry would remain unknown to the switchdev hardware database. - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface, before any switchdev port joined that LAG, leading to the hardware database missing those entries. - a switchdev port left a LAG that is a bridge port, while the LAG remained part of the bridge, and all FDB/MDB/VLAN entries remained installed in the hardware database of the switchdev port. Also, since commit `0d2cfbd41c` ("net: bridge: ignore switchdev events for LAG ports which didn't request replay"), DSA introduced a method, based on a const void ctx, to ensure that two switchdev ports under the same LAG that is a bridge port do not see the same MDB/VLAN entry being replayed twice by the bridge, once for every bridge port that joins the LAG. With so many ordering corner cases being possible, it seems unreasonable to expect a switchdev driver writer to get it right from the first try. Therefore, now that DSA has experimented with the bridge replay helpers for a little bit, we can move the code to the bridge driver where it is more readily available to all switchdev drivers. To convert the switchdev object replay helpers from "pull mode" (where the driver asks for them) to a "push mode" (where the bridge offers them automatically), the biggest problem is that the bridge needs to be aware when a switchdev port joins and leaves, even when the switchdev is only indirectly a bridge port (for example when the bridge port is a LAG upper of the switchdev). Luckily, we already have a hook for that, in the form of the newly introduced switchdev_bridge_port_offload() and switchdev_bridge_port_unoffload() calls. These offer a natural place for hooking the object addition and deletion replays. Extend the above 2 functions with: - pointers to the switchdev atomic notifier (for FDB replays) and the blocking notifier (for MDB and VLAN replays). - the "const void ctx" argument required for drivers to be able to disambiguate between which port is targeted, when multiple ports are lowers of the same LAG that is a bridge port. Most of the drivers pass NULL to this argument, except the ones that support LAG offload and have the proper context check already in place in the switchdev blocking notifier handler. Also unexport the replay helpers, since nobody except the bridge calls them directly now. Note that: (a) we abuse the terminology slightly, because FDB entries are not "switchdev objects", but we count them as objects nonetheless. With no direct way to prove it, I think they are not modeled as switchdev objects because those can only be installed by the bridge to the hardware (as opposed to FDB entries which can be propagated in the other direction too). This is merely an abuse of terms, FDB entries are replayed too, despite not being objects. (b) the bridge does not attempt to sync port attributes to newly joined ports, just the countable stuff (the objects). The reason for this is simple: no universal and symmetric way to sync and unsync them is known. For example, VLAN filtering: what to do on unsync, disable or leave it enabled? Similarly, STP state, ageing timer, etc etc. What a switchdev port does when it becomes standalone again is not really up to the bridge's competence, and the driver should deal with it. On the other hand, replaying deletions of switchdev objects can be seen a matter of cleanup and therefore be treated by the bridge, hence this patch. We make the replay helpers opt-in for drivers, because they might not bring immediate benefits for them: - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), so br_vlan_replay() should not do anything for the new drivers on which we call it. The existing drivers where there was even a slight possibility for there to exist a VLAN on a bridge port before they join it are already guarded against this: mlxsw and prestera deny joining LAG interfaces that are members of a bridge. - br_fdb_replay() should now notify of local FDB entries, but I patched all drivers except DSA to ignore these new entries in commit `2c4eca3ef7` ("net: bridge: switchdev: include local flag in FDB notifications"). Driver authors can lift this restriction as they wish, and when they do, they can also opt into the FDB replay functionality. - br_mdb_replay() should fix a real issue which is described in commit `4f2673b3a2` ("net: bridge: add helper to replay port and host-joined mdb entries"). However most drivers do not offload the SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw offload this switchdev object, and I don't completely understand the way in which they offload this switchdev object anyway. So I'll leave it up to these drivers' respective maintainers to opt into br_mdb_replay(). So most of the drivers pass NULL notifier blocks for the replay helpers, except: - dpaa2-switch which was already acked/regression-tested with the helpers enabled (and there isn't much of a downside in having them) - ocelot which already had replay logic in "pull" mode - DSA which already had replay logic in "pull" mode An important observation is that the drivers which don't currently request bridge event replays don't even have the switchdev_bridge_port_{offload,unoffload} calls placed in proper places right now. This was done to avoid unnecessary rework for drivers which might never even add support for this. For driver writers who wish to add replay support, this can be used as a tentative placement guide: https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/ Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Vladimir Oltean	7105b50b7e	net: bridge: guard the switchdev replay helpers against a NULL notifier block There is a desire to make the object and FDB replay helpers optional when moving them inside the bridge driver. For example a certain driver might not offload host MDBs and there is no case where the replay helpers would be of immediate use to it. So it would be nice if we could allow drivers to pass NULL pointers for the atomic and blocking notifier blocks, and the replay helpers to do nothing in that case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Vladimir Oltean	2f5dc00f7a	net: bridge: switchdev: let drivers inform which bridge ports are offloaded On reception of an skb, the bridge checks if it was marked as 'already forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it is, it assigns the source hardware domain of that skb based on the hardware domain of the ingress port. Then during forwarding, it enforces that the egress port must have a different hardware domain than the ingress one (this is done in nbp_switchdev_allowed_egress). Non-switchdev drivers don't report any physical switch id (neither through devlink nor .ndo_get_port_parent_id), therefore the bridge assigns them a hardware domain of 0, and packets coming from them will always have skb->offload_fwd_mark = 0. So there aren't any restrictions. Problems appear due to the fact that DSA would like to perform software fallback for bonding and team interfaces that the physical switch cannot offload. +-- br0 ---+ / / \| \ / / \| \ / \| \| bond0 / \| \| / \ swp0 swp1 swp2 swp3 swp4 There, it is desirable that the presence of swp3 and swp4 under a non-offloaded LAG does not preclude us from doing hardware bridging beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high enough that software bridging between {swp0,swp1,swp2} and bond0 is not impractical. But this creates an impossible paradox given the current way in which port hardware domains are assigned. When the driver receives a packet from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to something. - If we set it to 0, then the bridge will forward it towards swp1, swp2 and bond0. But the switch has already forwarded it towards swp1 and swp2 (not to bond0, remember, that isn't offloaded, so as far as the switch is concerned, ports swp3 and swp4 are not looking up the FDB, and the entire bond0 is a destination that is strictly behind the CPU). But we don't want duplicated traffic towards swp1 and swp2, so it's not ok to set skb->offload_fwd_mark = 0. - If we set it to 1, then the bridge will not forward the skb towards the ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should have forwarded the skb there. So the real issue is that bond0 will be assigned the same hardware domain as {swp0,swp1,swp2}, because the function that assigns hardware domains to bridge ports, nbp_switchdev_add(), recurses through bond0's lower interfaces until it finds something that implements devlink (calls dev_get_port_parent_id with bool recurse = true). This is a problem because the fact that bond0 can be offloaded by swp3 and swp4 in our example is merely an assumption. A solution is to give the bridge explicit hints as to what hardware domain it should use for each port. Currently, the bridging offload is very 'silent': a driver registers a netdevice notifier, which is put on the netns's notifier chain, and which sniffs around for NETDEV_CHANGEUPPER events where the upper is a bridge, and the lower is an interface it knows about (one registered by this driver, normally). Then, from within that notifier, it does a bunch of stuff behind the bridge's back, without the bridge necessarily knowing that there's somebody offloading that port. It looks like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v call_netdevice_notifiers \| v dsa_slave_netdevice_event \| v oh, hey! it's for me! \| v .port_bridge_join What we do to solve the conundrum is to be less silent, and change the switchdev drivers to present themselves to the bridge. Something like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| hardware domain for v \| this port, and zero dsa_slave_netdevice_event \| if I got nothing. \| \| v \| oh, hey! it's for me! \| \| \| v \| .port_bridge_join \| \| \| +------------------------+ switchdev_bridge_port_offload(swp0, swp0) Then stacked interfaces (like bond0 on top of swp3/swp4) would be treated differently in DSA, depending on whether we can or cannot offload them. The offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| switchdev mark for v \| bond0. dsa_slave_netdevice_event \| Coincidentally (or not), \| \| bond0 and swp0, swp1, swp2 v \| all have the same switchdev hmm, it's not quite for me, \| mark now, since the ASIC but my driver has already \| is able to forward towards called .port_lag_join \| all these ports in hw. for it, because I have \| a port with dp->lag_dev == bond0. \| \| \| v \| .port_bridge_join \| for swp3 and swp4 \| \| \| +------------------------+ switchdev_bridge_port_offload(bond0, swp3) switchdev_bridge_port_offload(bond0, swp4) And the non-offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge waiting: call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload \| \| wasn't called, okay, I'll use a v \| hwdom of zero for this one. dsa_slave_netdevice_event : Then packets received on swp0 will \| : not be software-forwarded towards v : swp1, but they will towards bond0. it's not for me, but bond0 is an upper of swp3 and swp4, but their dp->lag_dev is NULL because they couldn't offload it. Basically we can draw the conclusion that the lowers of a bridge port can come and go, so depending on the configuration of lowers for a bridge port, it can dynamically toggle between offloaded and unoffloaded. Therefore, we need an equivalent switchdev_bridge_port_unoffload too. This patch changes the way any switchdev driver interacts with the bridge. From now on, everybody needs to call switchdev_bridge_port_offload and switchdev_bridge_port_unoffload, otherwise the bridge will treat the port as non-offloaded and allow software flooding to other ports from the same ASIC. Note that these functions lay the ground for a more complex handshake between switchdev drivers and the bridge in the future. For drivers that will request a replay of the switchdev objects when they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we place the call to switchdev_bridge_port_unoffload() strategically inside the NETDEV_PRECHANGEUPPER notifier's code path, and not inside NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers need the netdev adjacency lists to be valid, and that is only true in NETDEV_PRECHANGEUPPER. Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Tobias Waldekranz	8582661048	net: bridge: switchdev: recycle unused hwdoms Since hwdoms have only been used thus far for equality comparisons, the bridge has used the simplest possible assignment policy; using a counter to keep track of the last value handed out. With the upcoming transmit offloading, we need to perform set operations efficiently based on hwdoms, e.g. we want to answer questions like "has this skb been forwarded to any port within this hwdom?" Move to a bitmap-based allocation scheme that recycles hwdoms once all members leaves the bridge. This means that we can use a single unsigned long to keep track of the hwdoms that have received an skb. v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX) into a plain unsigned long. v2->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Tobias Waldekranz	f7cf972f93	net: bridge: disambiguate offload_fwd_mark Before this change, four related - but distinct - concepts where named offload_fwd_mark: - skb->offload_fwd_mark: Set by the switchdev driver if the underlying hardware has already forwarded this frame to the other ports in the same hardware domain. - nbp->offload_fwd_mark: An idetifier used to group ports that share the same hardware forwarding domain. - br->offload_fwd_mark: Counter used to make sure that unique IDs are used in cases where a bridge contains ports from multiple hardware domains. - skb->cb->offload_fwd_mark: The hardware domain on which the frame ingressed and was forwarded. Introduce the term "hardware forwarding domain" ("hwdom") in the bridge to denote a set of ports with the following property: If an skb with skb->offload_fwd_mark set, is received on a port belonging to hwdom N, that frame has already been forwarded to all other ports in hwdom N. By decoupling the name from "offload_fwd_mark", we can extend the term's definition in the future - e.g. to add constraints that describe expected egress behavior - without overloading the meaning of "offload_fwd_mark". - nbp->offload_fwd_mark thus becomes nbp->hwdom. - br->offload_fwd_mark becomes br->last_hwdom. - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight change in naming here mandates a slight change in behavior of the nbp_switchdev_frame_mark() function. Previously, it only set this value in skb->cb for packets with skb->offload_fwd_mark true (ones which were forwarded in hardware). Whereas now we always track the incoming hwdom for all packets coming from a switchdev (even for the packets which weren't forwarded in hardware, such as STP BPDUs, IGMP reports etc). As all uses of skb->cb->offload_fwd_mark were already gated behind checks of skb->offload_fwd_mark, this will not introduce any functional change, but it paves the way for future changes where the ingressing hwdom must be known for frames coming from a switchdev regardless of whether they were forwarded in hardware or not (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now always tracks which one). A typical example where this is relevant: the switchdev has a fixed configuration to trap STP BPDUs, but STP is not running on the bridge and the group_fwd_mask allows them to be forwarded. Say we have this setup: br0 / \| \ / \| \ swp0 swp1 swp2 A BPDU comes in on swp0 and is trapped to the CPU; the driver does not set skb->offload_fwd_mark. The bridge determines that the frame should be forwarded to swp{1,2}. It is imperative that forward offloading is _not_ allowed in this case, as the source hwdom is already "poisoned". Recording the source hwdom allows this case to be handled properly. v2->v3: added code comments v3->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-22 00:26:23 -07:00
Nikolay Aleksandrov	58d913a326	net: bridge: multicast: add context support for host-joined groups Adding bridge multicast context support for host-joined groups is easy because we only need the proper timer value. We pass the already chosen context and use its timer value. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 14:34:47 -07:00
Nikolay Aleksandrov	6567cb438a	net: bridge: multicast: add mdb context support Choose the proper bridge multicast context when user-spaces is adding mdb entries. Currently we require the vlan to be configured on at least one device (port or bridge) in order to add an mdb entry if vlan mcast snooping is enabled (vlan snooping implies vlan filtering). Note that we always allow deleting an entry, regardless of the vlan state. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 14:34:47 -07:00
Nikolay Aleksandrov	54cb43199e	net: bridge: multicast: fix igmp/mld port context null pointer dereferences With the recent change to use bridge/port multicast context pointers instead of bridge/port I missed to convert two locations which pass the port pointer as-is, but with the new model we need to verify the port context is non-NULL first and retrieve the port from it. The first location is when doing querier selection when a query is received, the second location is when leaving a group. The port context will be null if the packets originated from the bridge device (i.e. from the host). The fix is simple just check if the port context exists and retrieve the port pointer from it. Fixes: `adc47037a7` ("net: bridge: multicast: use multicast contexts instead of bridge or port") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-21 09:04:03 -07:00
Nikolay Aleksandrov	9dee572c38	net: bridge: vlan: add mcast snooping control Add a new global vlan option which controls whether multicast snooping is enabled or disabled for a single vlan. It controls the vlan private flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	9aba624d7c	net: bridge: vlan: notify when global options change Add support for global options notifications. They use only RTM_NEWVLAN since global options can only be set and are contained in a separate vlan global options attribute. Notifications are compressed in ranges where possible, i.e. the sequential vlan options are equal. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	743a53d963	net: bridge: vlan: add support for dumping global vlan options Add a new vlan options dump flag which causes only global vlan options to be dumped. The dumps are done only with bridge devices, ports are ignored. They support vlan compression if the options in sequential vlans are equal (currently always true). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	47ecd2dbd8	net: bridge: vlan: add support for global options We can have two types of vlan options depending on context: - per-device vlan options (split in per-bridge and per-port) - global vlan options The second type wasn't supported in the bridge until now, but we need them for per-vlan multicast support, per-vlan STP support and other options which require global vlan context. They are contained in the global bridge vlan context even if the vlan is not configured on the bridge device itself. This patch adds initial netlink attributes and support for setting these global vlan options, they can only be set (RTM_NEWVLAN) and the operation must use the bridge device. Since there are no such options yet it shouldn't have any functional effect. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	1e9ca45662	net: bridge: multicast: include router port vlan id in notifications Use the port multicast context to check if the router port is a vlan and in case it is include its vlan id in the notification. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	615cc23e62	net: bridge: multicast: add vlan querier and query support Add basic vlan context querier support, if the contexts passed to multicast_alloc_query are vlan then the query will be tagged. Also handle querier start/stop of vlan contexts. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	4cdd0d10f3	net: bridge: multicast: check if should use vlan mcast ctx Add helpers which check if the current bridge/port multicast context should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD processing, timers and new group addition. It is important for vlans to disable processing of timer/packet after the multicast_lock is obtained if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two cases when that flag is missing: - if the vlan is getting destroyed it will be removed and timers will be stopped - if the vlan mcast snooping is being disabled Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	eb1593a0b4	net: bridge: multicast: use the port group to port context helper We need to use the new port group to port context helper in places where we cannot pass down the proper context (i.e. functions that can be called by timers or outside the packet snooping paths). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	74edfd483d	net: bridge: multicast: add helper to get port mcast context from port group Add br_multicast_pg_to_port_ctx() which returns the proper port multicast context from either port or vlan based on bridge option and vlan flags. As the comment inside explains the locking is a bit tricky, we rely on the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change and we also require it to be held to call that helper. If we find the vlan under rcu and it still has the flag then we can be sure it will be alive until we unlock multicast_lock which should be enough. Note that the context might change from vlan to bridge between different calls to this helper as the mcast vlan knob requires only rtnl so it should be used carefully and for read-only/check purposes. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	f4b7002a70	net: bridge: add vlan mcast snooping knob Add a global knob that controls if vlan multicast snooping is enabled. The proper contexts (vlan or bridge-wide) will be chosen based on the knob when processing packets and changing bridge device state. Note that vlans have their individual mcast snooping enabled by default, but this knob is needed to turn on bridge vlan snooping. It is disabled by default. To enable the knob vlan filtering must also be enabled, it doesn't make sense to have vlan mcast snooping without vlan filtering since that would lead to inconsistencies. Disabling vlan filtering will also automatically disable vlan mcast snooping. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov	7b54aaaf53	net: bridge: multicast: add vlan state initialization and control Add helpers to enable/disable vlan multicast based on its flags, we need two flags because we need to know if the vlan has multicast enabled globally (user-controlled) and if it has it enabled on the specific device (bridge or port). The new private vlan flags are: - BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used when removing a vlan, toggling vlan mcast snooping and controlling single vlan (kernel-controlled, valid under RTNL and multicast_lock) - BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the vlan, used to control the bridge-wide vlan mcast snooping for a single vlan (user-controlled, can be checked under any context) Bridge vlan contexts are created with multicast snooping enabled by default to be in line with the current bridge snooping defaults. In order to actually activate per vlan snooping and context usage a bridge-wide knob will be added later which will default to disabled. If that knob is enabled then automatically all vlan snooping will be enabled. All vlan contexts are initialized with the current bridge multicast context defaults. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov	613d61dbef	net: bridge: vlan: add global and per-port multicast context Add global and per-port vlan multicast context, only initialized but still not used. No functional changes intended. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov	adc47037a7	net: bridge: multicast: use multicast contexts instead of bridge or port Pass multicast context pointers to multicast functions instead of bridge/port. This would make it easier later to switch these contexts to their per-vlan versions. The patch is basically search and replace, no functional changes. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov	d3d065c003	net: bridge: multicast: factor out bridge multicast context Factor out the bridge's global multicast context into a separate structure which will later be used for per-vlan global context. No functional changes intended. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov	9632233e7d	net: bridge: multicast: factor out port multicast context Factor out the port's multicast context into a separate structure which will later be shared for per-port,vlan context. No functional changes intended. We need the structure even if bridge multicast is not defined to pass down as pointer to forwarding functions. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-20 05:41:19 -07:00
Vladimir Oltean	cbb56b03ec	net: bridge: do not replay fdb entries pointing towards the bridge twice This simple script: ip link add br0 type bridge ip link set swp2 master br0 ip link set br0 address 00:01:02:03:04:05 ip link del br0 produces this result on a DSA switch: [ 421.306399] br0: port 1(swp2) entered blocking state [ 421.311445] br0: port 1(swp2) entered disabled state [ 421.472553] device swp2 entered promiscuous mode [ 421.488986] device swp2 left promiscuous mode [ 421.493508] br0: port 1(swp2) entered disabled state [ 421.886107] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 1 from fdb: -ENOENT [ 421.894374] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 0 from fdb: -ENOENT [ 421.943982] br0: port 1(swp2) entered blocking state [ 421.949030] br0: port 1(swp2) entered disabled state [ 422.112504] device swp2 entered promiscuous mode A very simplified view of what happens is: (1) the bridge port is created, and the bridge device inherits its MAC address (2) when joining, the bridge port (DSA) requests a replay of the addition of all FDB entries towards this bridge port and towards the bridge device itself. In fact, DSA calls br_fdb_replay() twice: br_fdb_replay(br, brport_dev); br_fdb_replay(br, br); DSA uses reference counting for the FDB entries. So the MAC address of the bridge is simply kept with refcount 2. When the bridge port leaves under normal circumstances, everything cancels out since the replay of the FDB entry deletion is also done twice per VLAN. (3) when the bridge MAC address changes, switchdev is notified of the deletion of the old address and of the insertion of the new one. But the old address does not really go away, since it had refcount 2, and the new address is added "only" with refcount 1. (4) when the bridge port leaves now, it will replay a deletion of the FDB entries pointing towards the bridge twice. Then DSA will complain that it can't delete something that no longer exists. It is clear that the problem is that the FDB entries towards the bridge are replayed too many times, so let's fix that problem. Fixes: `63c51453c8` ("net: dsa: replay the local bridge FDB entries pointing to the bridge dev too") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210719093916.4099032-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-07-20 13:08:45 +02:00
Nikolay Aleksandrov	000b7287b6	net: bridge: multicast: fix MRD advertisement router port marking race When an MRD advertisement is received on a bridge port with multicast snooping enabled, we mark it as a router port automatically, that includes adding that port to the router port list. The multicast lock protects that list, but it is not acquired in the MRD advertisement case leading to a race condition, we need to take it to fix the race. Cc: stable@vger.kernel.org Cc: linus.luessing@c0d3.blue Fixes: `4b3087c7e3` ("bridge: Snoop Multicast Router Advertisements") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-11 12:11:06 -07:00
Nikolay Aleksandrov	04bef83a33	net: bridge: multicast: fix PIM hello router port marking race When a PIM hello packet is received on a bridge port with multicast snooping enabled, we mark it as a router port automatically, that includes adding that port the router port list. The multicast lock protects that list, but it is not acquired in the PIM message case leading to a race condition, we need to take it to fix the race. Cc: stable@vger.kernel.org Fixes: `91b02d3d13` ("bridge: mcast: add router port on PIM hello message") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-11 12:11:06 -07:00
Wolfgang Bumiller	a019abd802	net: bridge: sync fdb to new unicast-filtering ports Since commit `2796d0c648` ("bridge: Automatically manage port promiscuous mode.") bridges with `vlan_filtering 1` and only 1 auto-port don't set IFF_PROMISC for unicast-filtering-capable ports. Normally on port changes `br_manage_promisc` is called to update the promisc flags and unicast filters if necessary, but it cannot distinguish between new ports and ones losing their promisc flag, and new ports end up not receiving the MAC address list. Fix this by calling `br_fdb_sync_static` in `br_add_if` after the port promisc flags are updated and the unicast filter was supposed to have been filled. Fixes: `2796d0c648` ("bridge: Automatically manage port promiscuous mode.") Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-02 13:34:07 -07:00
Vladimir Oltean	f851a721a6	net: bridge: allow br_fdb_replay to be called for the bridge device When a port joins a bridge which already has local FDB entries pointing to the bridge device itself, we would like to offload those, so allow the "dev" argument to be equal to the bridge too. The code already does what we need in that case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-29 10:46:23 -07:00
Tobias Waldekranz	6eb38bf8eb	net: bridge: switchdev: send FDB notifications for host addresses Treat addresses added to the bridge itself in the same way as regular ports and send out a notification so that drivers may sync it down to the hardware FDB. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-29 10:46:23 -07:00
Vladimir Oltean	3e19ae7c6f	net: bridge: use READ_ONCE() and WRITE_ONCE() compiler barriers for fdb->dst Annotate the writer side of fdb->dst: - fdb_create() - br_fdb_update() - fdb_add_entry() - br_fdb_external_learn_add() with WRITE_ONCE() and the reader side: - br_fdb_test_addr() - br_fdb_update() - fdb_fill_info() - fdb_add_entry() - fdb_delete_by_addr_and_port() - br_fdb_external_learn_add() - br_switchdev_fdb_notify() with compiler barriers such that the readers do not attempt to reload fdb->dst multiple times, leading to potentially different destination ports when the fdb entry is updated concurrently. This is especially important in read-side sections where fdb->dst is used more than once, but let's convert all accesses for the sake of uniformity. Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-29 10:46:22 -07:00
Horatiu Vultur	f7458934b0	net: bridge: mrp: Update the Test frames for MRA According to the standard IEC 62439-2, in case the node behaves as MRA and needs to send Test frames on ring ports, then these Test frames need to have an Option TLV and a Sub-Option TLV which has the type AUTO_MGR. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 15:46:10 -07:00
Vladimir Oltean	7e8c18586d	net: bridge: allow the switchdev replay functions to be called for deletion When a switchdev port leaves a LAG that is a bridge port, the switchdev objects and port attributes offloaded to that port are not removed: ip link add br0 type bridge ip link add bond0 type bond mode 802.3ad ip link set swp0 master bond0 ip link set bond0 master br0 bridge vlan add dev bond0 vid 100 ip link set swp0 nomaster VLAN 100 will remain installed on swp0 despite it going into standalone mode, because as far as the bridge is concerned, nothing ever happened to its bridge port. Let's extend the bridge vlan, fdb and mdb replay functions to take a 'bool adding' argument, and make DSA and ocelot call the replay functions with 'adding' as false from the switchdev unsync path, for the switch port that leaves the bridge. Note that this patch in itself does not salvage anything, because in the current pull mode of operation, DSA still needs to call the replay helpers with adding=false. This will be done in another patch. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 14:09:03 -07:00
Vladimir Oltean	bdf123b455	net: bridge: constify variables in the replay helpers Some of the arguments and local variables for the newly added switchdev replay helpers can be const, so let's make them so. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 14:09:03 -07:00
Vladimir Oltean	0d2cfbd41c	net: bridge: ignore switchdev events for LAG ports which didn't request replay There is a slight inconvenience in the switchdev replay helpers added recently, and this is when: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 bridge vlan add dev bond0 vid 100 ip link set swp0 master bond0 ip link set swp1 master bond0 Since the underlying driver (currently only DSA) asks for a replay of VLANs when swp0 and swp1 join the LAG because it is bridged, what will happen is that DSA will try to react twice on the VLAN event for swp0. This is not really a huge problem right now, because most drivers accept duplicates since the bridge itself does, but it will become a problem when we add support for replaying switchdev object deletions. Let's fix this by adding a blank void *ctx in the replay helpers, which will be passed on by the bridge in the switchdev notifications. If the context is NULL, everything is the same as before. But if the context is populated with a valid pointer, the underlying switchdev driver (currently DSA) can use the pointer to 'see through' the bridge port (which in the example above is bond0) and 'know' that the event is only for a particular physical port offloading that bridge port, and not for all of them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 14:09:03 -07:00
Vladimir Oltean	e887b2df62	net: bridge: include the is_local bit in br_fdb_replay Since commit `2c4eca3ef7` ("net: bridge: switchdev: include local flag in FDB notifications"), the bridge emits SWITCHDEV_FDB_ADD_TO_DEVICE events with the is_local flag populated (but we ignore it nonetheless). We would like DSA to start treating this bit, but it is still not populated by the replay helper, so add it there too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 14:09:03 -07:00
gushengxian	98534fce52	bridge: cfm: remove redundant return Return statements are not needed in Void function. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-22 10:35:15 -07:00
Jakub Kicinski	adc2e56ebe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-18 19:47:02 -07:00
Colin Ian King	040c12570e	net: bridge: remove redundant continue statement The continue statement at the end of a for-loop has no effect, invert the if expression and remove the continue. Addresses-Coverity: ("Continue has no effect") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-18 12:08:40 -07:00
Nikolay Aleksandrov	cfc579f9d8	net: bridge: fix vlan tunnel dst refcnt when egressing The egress tunnel code uses dst_clone() and directly sets the result which is wrong because the entry might have 0 refcnt or be already deleted, causing number of problems. It also triggers the WARN_ON() in dst_hold()[1] when a refcnt couldn't be taken. Fix it by using dst_hold_safe() and checking if a reference was actually taken before setting the dst. [1] dmesg WARN_ON log and following refcnt errors WARNING: CPU: 5 PID: 38 at include/net/dst.h:230 br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Modules linked in: 8021q garp mrp bridge stp llc bonding ipv6 virtio_net CPU: 5 PID: 38 Comm: ksoftirqd/5 Kdump: loaded Tainted: G W 5.13.0-rc3+ #360 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 RIP: 0010:br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Code: e8 85 bc 01 e1 45 84 f6 74 90 45 31 f6 85 db 48 c7 c7 a0 02 19 a0 41 0f 94 c6 31 c9 31 d2 44 89 f6 e8 64 bc 01 e1 85 db 75 02 <0f> 0b 31 c9 31 d2 44 89 f6 48 c7 c7 70 02 19 a0 e8 4b bc 01 e1 49 RSP: 0018:ffff8881003d39e8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa01902a0 RBP: ffff8881040c6700 R08: 0000000000000000 R09: 0000000000000001 R10: 2ce93d0054fe0d00 R11: 54fe0d00000e0000 R12: ffff888109515000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000401 FS: 0000000000000000(0000) GS:ffff88822bf40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f42ba70f030 CR3: 0000000109926000 CR4: 00000000000006e0 Call Trace: br_handle_vlan+0xbc/0xca [bridge] __br_forward+0x23/0x164 [bridge] deliver_clone+0x41/0x48 [bridge] br_handle_frame_finish+0x36f/0x3aa [bridge] ? skb_dst+0x2e/0x38 [bridge] ? br_handle_ingress_vlan_tunnel+0x3e/0x1c8 [bridge] ? br_handle_frame_finish+0x3aa/0x3aa [bridge] br_handle_frame+0x2c3/0x377 [bridge] ? __skb_pull+0x33/0x51 ? vlan_do_receive+0x4f/0x36a ? br_handle_frame_finish+0x3aa/0x3aa [bridge] __netif_receive_skb_core+0x539/0x7c6 ? __list_del_entry_valid+0x16e/0x1c2 __netif_receive_skb_list_core+0x6d/0xd6 netif_receive_skb_list_internal+0x1d9/0x1fa gro_normal_list+0x22/0x3e dev_gro_receive+0x55b/0x600 ? detach_buf_split+0x58/0x140 napi_gro_receive+0x94/0x12e virtnet_poll+0x15d/0x315 [virtio_net] __napi_poll+0x2c/0x1c9 net_rx_action+0xe6/0x1fb __do_softirq+0x115/0x2d8 run_ksoftirqd+0x18/0x20 smpboot_thread_fn+0x183/0x19c ? smpboot_unregister_percpu_thread+0x66/0x66 kthread+0x10a/0x10f ? kthread_mod_delayed_work+0xb6/0xb6 ret_from_fork+0x22/0x30 ---[ end trace 49f61b07f775fd2b ]--- dst_release: dst:00000000c02d677a refcnt:-1 dst_release underflow Cc: stable@vger.kernel.org Fixes: `11538d039a` ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-10 14:06:43 -07:00
Nikolay Aleksandrov	58e2071742	net: bridge: fix vlan tunnel dst null pointer dereference This patch fixes a tunnel_dst null pointer dereference due to lockless access in the tunnel egress path. When deleting a vlan tunnel the tunnel_dst pointer is set to NULL without waiting a grace period (i.e. while it's still usable) and packets egressing are dereferencing it without checking. Use READ/WRITE_ONCE to annotate the lockless use of tunnel_id, use RCU for accessing tunnel_dst and make sure it is read only once and checked in the egress path. The dst is already properly RCU protected so we don't need to do anything fancy than to make sure tunnel_id and tunnel_dst are read only once and checked in the egress path. Cc: stable@vger.kernel.org Fixes: `11538d039a` ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-10 14:06:43 -07:00
Horatiu Vultur	fcb3463585	net: bridge: mrp: Update ring transitions. According to the standard IEC 62439-2, the number of transitions needs to be counted for each transition 'between' ring state open and ring state closed and not from open state to closed state. Therefore fix this for both ring and interconnect ring. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-04 14:41:28 -07:00
Nigel Christian	ccc882f0d8	net: bridge: remove redundant assignment The variable br is assigned a value that is not being read after exiting case IFLA_STATS_LINK_XSTATS_SLAVE. The assignment is redundant and can be removed. Addresses-Coverity ("Unused value") Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-25 15:20:27 -07:00
Matteo Croce	30515832e9	net: bridge: fix build when IPv6 is disabled The br_ip6_multicast_add_router() prototype is defined only when CONFIG_IPV6 is enabled, but the function is always referenced, so there is this build error with CONFIG_IPV6 not defined: net/bridge/br_multicast.c: In function ‘__br_multicast_enable_port’: net/bridge/br_multicast.c:1743:3: error: implicit declaration of function ‘br_ip6_multicast_add_router’; did you mean ‘br_ip4_multicast_add_router’? [-Werror=implicit-function-declaration] 1743 \| br_ip6_multicast_add_router(br, port); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ \| br_ip4_multicast_add_router net/bridge/br_multicast.c: At top level: net/bridge/br_multicast.c:2804:13: warning: conflicting types for ‘br_ip6_multicast_add_router’ 2804 \| static void br_ip6_multicast_add_router(struct net_bridge *br, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ net/bridge/br_multicast.c:2804:13: error: static declaration of ‘br_ip6_multicast_add_router’ follows non-static declaration net/bridge/br_multicast.c:1743:3: note: previous implicit declaration of ‘br_ip6_multicast_add_router’ was here 1743 \| br_ip6_multicast_add_router(br, port); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this build error by moving the definition out of the #ifdef. Fixes: `a3c02e769e` ("net: bridge: mcast: split multicast router state for IPv4 and IPv6") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-14 10:38:59 -07:00
Nikolay Aleksandrov	bbc6f2cca7	net: bridge: fix br_multicast_is_router stub when igmp is disabled br_multicast_is_router takes two arguments when bridge IGMP is enabled and just one when it's disabled, fix the stub to take two as well. Fixes: `1a3065a268` ("net: bridge: mcast: prepare is-router function for mcast router split") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-14 10:35:20 -07:00
Linus Lüssing	3b85f9ba34	net: bridge: mcast: export multicast router presence adjacent to a port To properly support routable multicast addresses in batman-adv in a group-aware way, a batman-adv node needs to know if it serves multicast routers. This adds a function to the bridge to export this so that batman-adv can then make full use of the Multicast Router Discovery capability of the bridge. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	b7fb091654	net: bridge: mcast: add ip4+ip6 mcast router timers to mdb netlink Now that we have split the multicast router state into two, one for IPv4 and one for IPv6, also add individual timers to the mdb netlink router port dump. Leaving the old timer attribute for backwards compatibility. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	a3c02e769e	net: bridge: mcast: split multicast router state for IPv4 and IPv6 A multicast router for IPv4 does not imply that the same host also is a multicast router for IPv6 and vice versa. To reduce multicast traffic when a host is only a multicast router for one of these two protocol families, keep router state for IPv4 and IPv6 separately. Similar to how querier state is kept separately. For backwards compatibility for netlink and switchdev notifications these two will still only notify if a port switched from either no IPv4/IPv6 multicast router to any IPv4/IPv6 multicast router or the other way round. However a full netlink MDB router dump will now also include a multicast router timeout for both IPv4 and IPv6. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	ed2d35971a	net: bridge: mcast: split router port del+notify for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants split router port deletion and notification into two functions. When we disable a port for instance later we want to only send one notification to switchdev and netlink for compatibility and want to avoid sending one for IPv4 and one for IPv6. For that the split is needed. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	d9b8c4d8d9	net: bridge: mcast: prepare add-router function for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants move the protocol specific router list and timer access to ip4 wrapper functions. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	ee5fb2223e	net: bridge: mcast: prepare expiry functions for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants move the protocol specific timer access to an ip4 wrapper function. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	1a3065a268	net: bridge: mcast: prepare is-router function for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants make br_multicast_is_router() protocol family aware. Note that for now br_ip6_multicast_is_router() uses the currently still common ip4_mc_router_timer for now. It will be renamed to ip6_mc_router_timer later when the split is performed. While at it also renames the "1" and "2" constants in br_multicast_is_router() to the MDB_RTR_TYPE_TEMP_QUERY and MDB_RTR_TYPE_PERM enums. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	b19232effd	net: bridge: mcast: prepare query reception for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and as the br_multicast_mark_router() will be split for that remove the select querier wrapper and instead add ip4 and ip6 variants for br_multicast_query_received(). Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	ff391c5d98	net: bridge: mcast: prepare mdb netlink for mcast router split In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and to avoid IPv6 #ifdef clutter later add some inline functions for the protocol specific parts in the mdb router netlink code. Also the we need iterate over the port instead of router list to be able put one router port entry with both the IPv4 and IPv6 multicast router info later. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	44ebb081dc	net: bridge: mcast: add wrappers for router node retrieval In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and to avoid IPv6 #ifdef clutter later add two wrapper functions for router node retrieval in the payload forwarding code. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Linus Lüssing	ce6f709775	net: bridge: mcast: rename multicast router lists and timers In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants, rename the affected variable to the IPv4 version first to avoid some renames in later commits. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 14:04:31 -07:00
Zhang Zhengming	59259ff7a8	bridge: Fix possible races between assigning rx_handler_data and setting IFF_BRIDGE_PORT bit There is a crash in the function br_get_link_af_size_filtered, as the port_exists(dev) is true and the rx_handler_data of dev is NULL. But the rx_handler_data of dev is correct saved in vmcore. The oops looks something like: ... pc : br_get_link_af_size_filtered+0x28/0x1c8 [bridge] ... Call trace: br_get_link_af_size_filtered+0x28/0x1c8 [bridge] if_nlmsg_size+0x180/0x1b0 rtnl_calcit.isra.12+0xf8/0x148 rtnetlink_rcv_msg+0x334/0x370 netlink_rcv_skb+0x64/0x130 rtnetlink_rcv+0x28/0x38 netlink_unicast+0x1f0/0x250 netlink_sendmsg+0x310/0x378 sock_sendmsg+0x4c/0x70 __sys_sendto+0x120/0x150 __arm64_sys_sendto+0x30/0x40 el0_svc_common+0x78/0x130 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc In br_add_if(), we found there is no guarantee that assigning rx_handler_data to dev->rx_handler_data will before setting the IFF_BRIDGE_PORT bit of priv_flags. So there is a possible data competition: CPU 0: CPU 1: (RCU read lock) (RTNL lock) rtnl_calcit() br_add_slave() if_nlmsg_size() br_add_if() br_get_link_af_size_filtered() -> netdev_rx_handler_register ... // The order is not guaranteed ... -> dev->priv_flags \|= IFF_BRIDGE_PORT; // The IFF_BRIDGE_PORT bit of priv_flags has been set -> if (br_port_exists(dev)) { // The dev->rx_handler_data has NOT been assigned -> p = br_port_get_rcu(dev); .... -> rcu_assign_pointer(dev->rx_handler_data, rx_handler_data); ... Fix it in br_get_link_af_size_filtered, using br_port_get_check_rcu() and checking the return value. Signed-off-by: Zhang Zhengming <zhangzhengming@huawei.com> Reviewed-by: Zhao Lei <zhaolei69@huawei.com> Reviewed-by: Wang Xiaogang <wangxiaogang3@huawei.com> Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-29 15:33:17 -07:00
Linus Lüssing	9901408815	net: bridge: mcast: fix broken length + header check for MRDv6 Adv. The IPv6 Multicast Router Advertisements parsing has the following two issues: For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes, assuming MLDv2 Reports with at least one multicast address entry). When ipv6_mc_check_mld_msg() tries to parse an Multicast Router Advertisement its MLD length check will fail - and it will wrongly return -EINVAL, even if we have a valid MRD Advertisement. With the returned -EINVAL the bridge code will assume a broken packet and will wrongly discard it, potentially leading to multicast packet loss towards multicast routers. The second issue is the MRD header parsing in br_ip6_multicast_mrd_rcv(): It wrongly checks for an ICMPv6 header immediately after the IPv6 header (IPv6 next header type). However according to RFC4286, section 2 all MRD messages contain a Router Alert option (just like MLD). So instead there is an IPv6 Hop-by-Hop option for the Router Alert between the IPv6 and ICMPv6 header, again leading to the bridge wrongly discarding Multicast Router Advertisements. To fix these two issues, introduce a new return value -ENODATA to ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop option which is not an MLD but potentially an MRD packet. This also simplifies further parsing in the bridge code, as ipv6_mc_check_mld() already fully checks the ICMPv6 header and hop-by-hop option. These issues were found and fixed with the help of the mrdisc tool (https://github.com/troglobit/mrdisc). Fixes: `4b3087c7e3` ("bridge: Snoop Multicast Router Advertisements") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-27 14:02:06 -07:00
Florian Westphal	47a6959fa3	netfilter: allow to turn off xtables compat layer The compat layer needs to parse untrusted input (the ruleset) to translate it to a 64bit compatible format. We had a number of bugs in this department in the past, so allow users to turn this feature off. Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y to keep existing behaviour. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-04-26 18:16:56 +02:00
Florian Westphal	4c95e0728e	netfilter: ebtables: remove the 3 ebtables pointers from struct net ebtables stores the table internal data (what gets passed to the ebt_do_table() interpreter) in struct net. nftables keeps the internal interpreter format in pernet lists and passes it via the netfilter core infrastructure (priv pointer). Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup, then the ops->priv pointer is set to the table that is being registered. After that, the netfilter core passes this table info to the hookfn. This allows to remove the pointers from struct net. Same pattern can be applied to ip/ip6/arptables. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-04-26 03:20:07 +02:00
Vladimir Oltean	68f5c12abb	net: bridge: fix error in br_multicast_add_port when CONFIG_NET_SWITCHDEV=n When CONFIG_NET_SWITCHDEV is disabled, the shim for switchdev_port_attr_set inside br_mc_disabled_update returns -EOPNOTSUPP. This is not caught, and propagated to the caller of br_multicast_add_port, preventing ports from joining the bridge. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Fixes: `ae1ea84b33` ("net: bridge: propagate error code and extack from br_mc_disabled_update") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-21 13:13:33 -07:00
Jakub Kicinski	8203c7ce4e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/stmicro/stmmac/stmmac_main.c - keep the ZC code, drop the code related to reinit net/bridge/netfilter/ebtables.c - fix build after move to net_generic Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-17 11:08:07 -07:00
Vladimir Oltean	2c4eca3ef7	net: bridge: switchdev: include local flag in FDB notifications As explained in bugfix commit `6ab4c3117a` ("net: bridge: don't notify switchdev for local FDB addresses") as well as in this discussion: https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/ the switchdev notifiers for FDB entries managed to have a zero-day bug, which was that drivers would not know what to do with local FDB entries, because they were not told that they are local. The bug fix was to simply not notify them of those addresses. Let us now add the 'is_local' bit to bridge FDB entries, and make all drivers ignore these entries by their own choice. Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 15:15:45 -07:00
Tobias Waldekranz	e5b4b8988b	net: bridge: switchdev: refactor br_switchdev_fdb_notify Instead of having to add more and more arguments to br_switchdev_fdb_call_notifiers, get rid of it and build the info struct directly in br_switchdev_fdb_notify. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-16 15:15:45 -07:00
Florian Fainelli	ae1ea84b33	net: bridge: propagate error code and extack from br_mc_disabled_update Some Ethernet switches might only be able to support disabling multicast snooping globally, which is an issue for example when several bridges span the same physical device and request contradictory settings. Propagate the return value of br_mc_disabled_update() such that this limitation is transmitted correctly to user-space. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-14 14:32:05 -07:00
Florian Westphal	7ee3c61dcd	netfilter: bridge: add pre_exit hooks for ebtable unregistration Just like ip/ip6/arptables, the hooks have to be removed, then synchronize_rcu() has to be called to make sure no more packets are being processed before the ruleset data is released. Place the hook unregistration in the pre_exit hook, then call the new ebtables pre_exit function from there. Years ago, when first netns support got added for netfilter+ebtables, this used an older (now removed) netfilter hook unregister API, that did a unconditional synchronize_rcu(). Now that all is done with call_rcu, ebtable_{filter,nat,broute} pernet exit handlers may free the ebtable ruleset while packets are still in flight. This can only happens on module removal, not during netns exit. The new function expects the table name, not the table struct. This is because upcoming patch set (targeting -next) will remove all net->xt.{nat,filter,broute}_table instances, this makes it necessary to avoid external references to those member variables. The existing APIs will be converted, so follow the upcoming scheme of passing name + hook type instead. Fixes: `aee12a0a37` ("ebtables: remove nf_hook_register usage") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-04-10 21:16:54 +02:00
Florian Westphal	5b53951cfc	netfilter: ebtables: use net_generic infra ebtables currently uses net->xt.tables[BRIDGE], but upcoming patch will move net->xt.tables away from struct net. To avoid exposing x_tables internals to ebtables, use a private list instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-04-06 00:34:52 +02:00
Florian Westphal	77ccee96a6	netfilter: nf_log_bridge: merge with nf_log_syslog Provide bridge log support from nf_log_syslog. After the merge there is no need to load the "real packet loggers", all of them now reside in the same module. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-03-31 22:34:05 +02:00
David S. Miller	efd13b71a3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-25 15:31:22 -07:00
Felix Fietkau	26267bf9bb	netfilter: flowtable: bridge vlan hardware offload and switchdev The switch might have already added the VLAN tag through PVID hardware offload. Keep this extra VLAN in the flowtable but skip it on egress. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:48:39 -07:00
Felix Fietkau	bcf2766b13	net: bridge: resolve forwarding path for VLAN tag actions in bridge devices Depending on the VLAN settings of the bridge and the port, the bridge can either add or remove a tag. When vlan filtering is enabled, the fdb lookup also needs to know the VLAN tag/proto for the destination address To provide this, keep track of the stack of VLAN tags for the path in the lookup context Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:48:38 -07:00
Pablo Neira Ayuso	ec9d16bab6	net: bridge: resolve forwarding path for bridge devices Add .ndo_fill_forward_path for bridge devices. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:48:38 -07:00
Colin Ian King	ad248f7761	net: bridge: Fix missing return assignment from br_vlan_replay_one call The call to br_vlan_replay_one is returning an error return value but this is not being assigned to err and the following check on err is currently always false because err was initialized to zero. Fix this by assigning err. Addresses-Coverity: ("'Constant' variable guards dead code") Fixes: `22f67cdfae` ("net: bridge: add helper to replay VLANs installed on port") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:45:48 -07:00
Horatiu Vultur	b3cb91b97c	bridge: mrp: Disable roles before deleting the MRP instance When an MRP instance was created, the driver was notified that the instance is created and then in a different callback about role of the instance. But when the instance was deleted the driver was notified only that the MRP instance is deleted and not also that the role is disabled. This patch make sure that the driver is notified that the role is changed to disabled before the MRP instance is deleted to have similar callbacks with the creating of the instance. In this way it would simplify the logic in the drivers. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:14:08 -07:00
Vladimir Oltean	22f67cdfae	net: bridge: add helper to replay VLANs installed on port Currently this simple setup with DSA: ip link add br0 type bridge vlan_filtering 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 will not work because the bridge has created the PVID in br_add_if -> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1, but that was too early, since swp0 was not yet a lower of bond0, so it had no reason to act upon that notification. We need a helper in the bridge to replay the switchdev VLAN objects that were notified since the bridge port creation, because some of them may have been missed. As opposed to the br_mdb_replay function, the vg->vlan_list write side protection is offered by the rtnl_mutex which is sleepable, so we don't need to queue up the objects in atomic context, we can replay them right away. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:49:05 -07:00
Vladimir Oltean	04846f903b	net: bridge: add helper to replay port and local fdb entries When a switchdev port starts offloading a LAG that is already in a bridge and has an FDB entry pointing to it: ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp0 master bond0 the switchdev driver will have no idea that this FDB entry is there, because it missed the switchdev event emitted at its creation. Ido Schimmel pointed this out during a discussion about challenges with switchdev offloading of stacked interfaces between the physical port and the bridge, and recommended to just catch that condition and deny the CHANGEUPPER event: https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/ But in fact, we might need to deal with the hard thing anyway, which is to replay all FDB addresses relevant to this port, because it isn't just static FDB entries, but also local addresses (ones that are not forwarded but terminated by the bridge). There, we can't just say 'oh yeah, there was an upper already so I'm not joining that'. So, similar to the logic for replaying MDB entries, add a function that must be called by individual switchdev drivers and replays local FDB entries as well as ones pointing towards a bridge port. This time, we use the atomic switchdev notifier block, since that's what FDB entries expect for some reason. Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:49:05 -07:00
Vladimir Oltean	4f2673b3a2	net: bridge: add helper to replay port and host-joined mdb entries I have a system with DSA ports, and udhcpcd is configured to bring interfaces up as soon as they are created. I create a bridge as follows: ip link add br0 type bridge As soon as I create the bridge and udhcpcd brings it up, I also have avahi which automatically starts sending IPv6 packets to advertise some local services, and because of that, the br0 bridge joins the following IPv6 groups due to the code path detailed below: 33:33:ff:6d:c1:9c vid 0 33:33:00:00:00:6a vid 0 33:33:00:00:00:fb vid 0 br_dev_xmit -> br_multicast_rcv -> br_ip6_multicast_add_group -> __br_multicast_add_group -> br_multicast_host_join -> br_mdb_notify This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host hooked up, and switchdev will attempt to offload the host joined groups to an empty list of ports. Of course nobody offloads them. Then when we add a port to br0: ip link set swp0 master br0 the bridge doesn't replay the host-joined MDB entries from br_add_if, and eventually the host joined addresses expire, and a switchdev notification for deleting it is emitted, but surprise, the original addition was already completely missed. The strategy to address this problem is to replay the MDB entries (both the port ones and the host joined ones) when the new port joins the bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can be populated and only then attached to a bridge that you offload). However there are 2 possibilities: the addresses can be 'pushed' by the bridge into the port, or the port can 'pull' them from the bridge. Considering that in the general case, the new port can be really late to the party, and there may have been many other switchdev ports that already received the initial notification, we would like to avoid delivering duplicate events to them, since they might misbehave. And currently, the bridge calls the entire switchdev notifier chain, whereas for replaying it should just call the notifier block of the new guy. But the bridge doesn't know what is the new guy's notifier block, it just knows where the switchdev notifier chain is. So for simplification, we make this a driver-initiated pull for now, and the notifier block is passed as an argument. To emulate the calling context for mdb objects (deferred and put on the blocking notifier chain), we must iterate under RCU protection through the bridge's mdb entries, queue them, and only call them once we're out of the RCU read-side critical section. There was some opportunity for reuse between br_mdb_switchdev_host_port, br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev mdb object is created, so a helper was created. Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:49:05 -07:00
Vladimir Oltean	f1d42ea100	net: bridge: add helper to retrieve the current ageing time The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from: sysfs/ioctl/netlink -> br_set_ageing_time -> __set_ageing_time therefore not at bridge port creation time, so: (a) switchdev drivers have to hardcode the initial value for the address ageing time, because they didn't get any notification (b) that hardcoded value can be out of sync, if the user changes the ageing time before enslaving the port to the bridge We need a helper in the bridge, such that switchdev drivers can query the current value of the bridge ageing time when they start offloading it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:49:05 -07:00
Vladimir Oltean	c0e715bbd5	net: bridge: add helper for retrieving the current bridge port STP state It may happen that we have the following topology with DSA or any other switchdev driver with LAG offload: ip link add br0 type bridge stp_state 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 ip link set swp1 master bond0 STP decides that it should put bond0 into the BLOCKING state, and that's that. The ports that are actively listening for the switchdev port attributes emitted for the bond0 bridge port (because they are offloading it) and have the honor of seeing that switchdev port attribute can react to it, so we can program swp0 and swp1 into the BLOCKING state. But if then we do: ip link set swp2 master bond0 then as far as the bridge is concerned, nothing has changed: it still has one bridge port. But this new bridge port will not see any STP state change notification and will remain FORWARDING, which is how the standalone code leaves it in. We need a function in the bridge driver which retrieves the current STP state, such that drivers can synchronize to it when they may have missed switchdev events. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:49:05 -07:00
Vladimir Oltean	6ab4c3117a	net: bridge: don't notify switchdev for local FDB addresses As explained in this discussion: https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/ the switchdev notifiers for FDB entries managed to have a zero-day bug. The bridge would not say that this entry is local: ip link add br0 type bridge ip link set swp0 master br0 bridge fdb add dev swp0 00:01:02:03:04:05 master local and the switchdev driver would be more than happy to offload it as a normal static FDB entry. This is despite the fact that 'local' and non-'local' entries have completely opposite directions: a local entry is locally terminated and not forwarded, whereas a static entry is forwarded and not locally terminated. So, for example, DSA would install this entry on swp0 instead of installing it on the CPU port as it should. There is an even sadder part, which is that the 'local' flag is implicit if 'static' is not specified, meaning that this command produces the same result of adding a 'local' entry: bridge fdb add dev swp0 00:01:02:03:04:05 master I've updated the man pages for 'bridge', and after reading it now, it should be pretty clear to any user that the commands above were broken and should have never resulted in the 00:01:02:03:04:05 address being forwarded (this behavior is coherent with non-switchdev interfaces): https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443-1-olteanv@gmail.com/ If you're a user reading this and this is what you want, just use: bridge fdb add dev swp0 00:01:02:03:04:05 master static Because switchdev should have given drivers the means from day one to classify FDB entries as local/non-local, but didn't, it means that all drivers are currently broken. So we can just as well omit the switchdev notifications for local FDB entries, which is exactly what this patch does to close the bug in stable trees. For further development work where drivers might want to trap the local FDB entries to the host, we can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and selectively make drivers act upon that bit, while all the others ignore those entries if the 'is_local' bit is set. Fixes: `6b26b51b1d` ("net: bridge: Add support for notifying devices about FDB add/del") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 14:39:41 -07:00
Nikolay Aleksandrov	0353b4a96b	net: bridge: when suppression is enabled exclude RARP packets Recently we had an interop issue where RARP packets got suppressed with bridge neigh suppression enabled, but the check in the code was meant to suppress GARP. Exclude RARP packets from it which would allow some VMWare setups to work, to quote the report: "Those RARP packets usually get generated by vMware to notify physical switches when vMotion occurs. vMware may use random sip/tip or just use sip=tip=0. So the RARP packet sometimes get properly flooded by the vtep and other times get dropped by the logic" Reported-by: Amer Abdalamer <amer@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-22 13:30:24 -07:00
Vladimir Oltean	f5fcca89f5	net: bridge: declare br_vlan_tunnel_lookup argument tunnel_id as __be64 The only caller of br_vlan_tunnel_lookup, br_handle_ingress_vlan_tunnel, extracts the tunnel_id from struct ip_tunnel_info::struct ip_tunnel_key:: tun_id which is a __be64 value. The exact endianness does not seem to matter, because the tunnel id is just used as a lookup key for the VLAN group's tunnel hash table, and the value is not interpreted directly per se. Moreover, rhashtable_lookup_fast treats the key argument as a const void *. Therefore, there is no functional change associated with this patch, just one to silence "make W=1" builds. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-22 13:11:59 -07:00
Nikolay Aleksandrov	e09cf58205	net: bridge: mcast: factor out common allow/block EHT handling We hande EHT state change for ALLOW messages in INCLUDE mode and for BLOCK messages in EXCLUDE mode similarly - create the new set entries with the proper filter mode. We also handle EHT state change for ALLOW messages in EXCLUDE mode and for BLOCK messages in INCLUDE mode in a similar way - delete the common entries (current set and new set). Factor out all the common code as follows: - ALLOW/INCLUDE, BLOCK/EXCLUDE: call __eht_create_set_entries() - ALLOW/EXCLUDE, BLOCK/INCLUDE: call __eht_del_common_set_entries() The set entries creation can be reused in __eht_inc_exc() as well. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-16 11:57:57 -07:00
Nikolay Aleksandrov	6aa2c371c7	net: bridge: mcast: remove unreachable EHT code In the initial EHT versions there were common functions which handled allow/block messages for both INCLUDE and EXCLUDE modes, but later they were separated. It seems I've left some common code which cannot be reached because the filter mode is checked before calling the respective functions, i.e. the host filter is always in EXCLUDE mode when using __eht_allow_excl() and __eht_block_excl() thus we can drop the host_excl checks inside and simplify the code a bit. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-16 11:57:57 -07:00
Gustavo A. R. Silva	ecd1c6a51f	net: bridge: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Horatiu Vultur	cd605d455a	bridge: mrp: Update br_mrp to use new return values of br_mrp_switchdev Check the return values of the br_mrp_switchdev function. In case of: - BR_MRP_NONE, return the error to userspace, - BR_MRP_SW, continue with SW implementation, - BR_MRP_HW, continue without SW implementation, Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-16 14:47:46 -08:00
Horatiu Vultur	1a3ddb0b75	bridge: mrp: Extend br_mrp_switchdev to detect better the errors This patch extends the br_mrp_switchdev functions to be able to have a better understanding what cause the issue and if the SW needs to be used as a backup. There are the following cases: - when the code is compiled without CONFIG_NET_SWITCHDEV. In this case return success so the SW can continue with the protocol. Depending on the function, it returns 0 or BR_MRP_SW. - when code is compiled with CONFIG_NET_SWITCHDEV and the driver doesn't implement any MRP callbacks. In this case the HW can't run MRP so it just returns -EOPNOTSUPP. So the SW will stop further to configure the node. - when code is compiled with CONFIG_NET_SWITCHDEV and the driver fully supports any MRP functionality. In this case the SW doesn't need to do anything. The functions will return 0 or BR_MRP_HW. - when code is compiled with CONFIG_NET_SWITCHDEV and the HW can't run completely the protocol but it can help the SW to run it. For example, the HW can't support completely MRM role(can't detect when it stops receiving MRP Test frames) but it can redirect these frames to CPU. In this case it is possible to have a SW fallback. The SW will try initially to call the driver with sw_backup set to false, meaning that the HW should implement completely the role. If the driver returns -EOPNOTSUPP, the SW will try again with sw_backup set to false, meaning that the SW will detect when it stops receiving the frames but it needs HW support to redirect the frames to CPU. In case the driver returns 0 then the SW will continue to configure the node accordingly. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-16 14:47:46 -08:00
Horatiu Vultur	e1bd99d07e	bridge: mrp: Add 'enum br_mrp_hw_support' Add the enum br_mrp_hw_support that is used by the br_mrp_switchdev functions to allow the SW to detect the cases where HW can't implement the functionality or when SW is used as a backup. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-16 14:47:46 -08:00
Vladimir Oltean	c97f47e3c1	net: bridge: fix br_vlan_filter_toggle stub when CONFIG_BRIDGE_VLAN_FILTERING=n The prototype of br_vlan_filter_toggle was updated to include a netlink extack, but the stub definition wasn't, which results in a build error when CONFIG_BRIDGE_VLAN_FILTERING=n. Fixes: `9e781401cb` ("net: bridge: propagate extack through store_bridge_parm") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-15 13:15:10 -08:00
Vladimir Oltean	dcbdf1350e	net: bridge: propagate extack through switchdev_port_attr_set The benefit is the ability to propagate errors from switchdev drivers for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING and SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL attributes. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-14 17:38:11 -08:00
Vladimir Oltean	9e781401cb	net: bridge: propagate extack through store_bridge_parm The bridge sysfs interface stores parameters for the STP, VLAN, multicast etc subsystems using a predefined function prototype. Sometimes the underlying function being called supports a netlink extended ack message, and we ignore it. Let's expand the store_bridge_parm function prototype to include the extack, and just print it to console, but at least propagate it where applicable. Where not applicable, create a shim function in the br_sysfs_br.c file that discards the extra function argument. This patch allows us to propagate the extack argument to br_vlan_set_default_pvid, br_vlan_set_proto and br_vlan_filter_toggle, and from there, further up in br_changelink from br_netlink.c. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-14 17:38:11 -08:00
Vladimir Oltean	7a572964e0	net: bridge: remove __br_vlan_filter_toggle This function is identical with br_vlan_filter_toggle. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-14 17:38:11 -08:00
Vladimir Oltean	e18f4c18ab	net: switchdev: pass flags and mask to both {PRE_,}BRIDGE_FLAGS attributes This switchdev attribute offers a counterproductive API for a driver writer, because although br_switchdev_set_port_flag gets passed a "flags" and a "mask", those are passed piecemeal to the driver, so while the PRE_BRIDGE_FLAGS listener knows what changed because it has the "mask", the BRIDGE_FLAGS listener doesn't, because it only has the final value. But certain drivers can offload only certain combinations of settings, like for example they cannot change unicast flooding independently of multicast flooding - they must be both on or both off. The way the information is passed to switchdev makes drivers not expressive enough, and unable to reject this request ahead of time, in the PRE_BRIDGE_FLAGS notifier, so they are forced to reject it during the deferred BRIDGE_FLAGS attribute, where the rejection is currently ignored. This patch also changes drivers to make use of the "mask" field for edge detection when possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-12 17:08:04 -08:00
Vladimir Oltean	078bbb851e	net: bridge: don't print in br_switchdev_set_port_flag For the netlink interface, propagate errors through extack rather than simply printing them to the console. For the sysfs interface, we still print to the console, but at least that's one layer higher than in switchdev, which also allows us to silently ignore the offloading of flags if that is ever needed in the future. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-12 17:08:04 -08:00
Vladimir Oltean	304ae3bf1c	net: bridge: offload all port flags at once in br_setport If for example this command: ip link set swp0 type bridge_slave flood off mcast_flood off learning off succeeded at configuring BR_FLOOD and BR_MCAST_FLOOD but not at BR_LEARNING, there would be no attempt to revert the partial state in any way. Arguably, if the user changes more than one flag through the same netlink command, this one _should_ be all or nothing, which means it should be passed through switchdev as all or nothing. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-12 17:08:04 -08:00
David S. Miller	dc9d87581d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2021-02-10 13:30:12 -08:00
Horatiu Vultur	b2bdba1cbc	bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state The function br_mrp_port_switchdev_set_state was called both with MRP port state and STP port state, which is an issue because they don't match exactly. Therefore, update the function to be used only with STP port state and use the id SWITCHDEV_ATTR_ID_PORT_STP_STATE. The choice of using STP over MRP is that the drivers already implement SWITCHDEV_ATTR_ID_PORT_STP_STATE and already in SW we update the port STP state. Fixes: `9a9f26e8f7` ("bridge: mrp: Connect MRP API with the switchdev API") Fixes: `fadd409136` ("bridge: switchdev: mrp: Implement MRP API for switchdev") Fixes: `2f1a11ae11` ("bridge: mrp: Add MRP interface.") Reported-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-08 16:20:57 -08:00
Vladimir Oltean	8043c845b6	net: bridge: use switchdev for port flags set through sysfs too Looking through patchwork I don't see that there was any consensus to use switchdev notifiers only in case of netlink provided port flags but not sysfs (as a sort of deprecation, punishment or anything like that), so we should probably keep the user interface consistent in terms of functionality. http://patchwork.ozlabs.org/project/netdev/patch/20170605092043.3523-3-jiri@resnulli.us/ http://patchwork.ozlabs.org/project/netdev/patch/20170608064428.4785-3-jiri@resnulli.us/ Fixes: `3922285d96` ("net: bridge: Add support for offloading port attributes") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-08 15:43:19 -08:00
Jakub Kicinski	c273a20c30	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Remove indirection and use nf_ct_get() instead from nfnetlink_log and nfnetlink_queue, from Florian Westphal. 2) Add weighted random twos choice least-connection scheduling for IPVS, from Darby Payne. 3) Add a __hash placeholder in the flow tuple structure to identify the field to be included in the rhashtable key hash calculation. 4) Add a new nft_parse_register_load() and nft_parse_register_store() to consolidate register load and store in the core. 5) Statify nft_parse_register() since it has no more module clients. 6) Remove redundant assignment in nft_cmp, from Colin Ian King. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: remove redundant assignment of variable err netfilter: nftables: statify nft_parse_register() netfilter: nftables: add nft_parse_register_store() and use it netfilter: nftables: add nft_parse_register_load() and use it netfilter: flowtable: add hash offset field to tuple ipvs: add weighted random twos choice algorithm netfilter: ctnetlink: remove get_ct indirection ==================== Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-02-06 15:34:23 -08:00
Xu Wang	1697291dae	net: bridge: mcast: Use ERR_CAST instead of ERR_PTR(PTR_ERR()) Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)). net/bridge/br_multicast.c:1246:9-16: WARNING: ERR_CAST can be used with mp Generated by: scripts/coccinelle/api/err_cast.cocci Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Link: https://lore.kernel.org/r/20210204070549.83636-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-02-06 10:51:01 -08:00
Nikolay Aleksandrov	1e16f382ae	net: bridge: add warning comments to avoid extending sysfs We're moving to netlink-only options, so add comments in the bridge's sysfs files to warn against adding any new sysfs entries. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-29 21:33:02 -08:00
Nikolay Aleksandrov	7d0888d52f	net: bridge: mcast: drop hosts limit sysfs support We decided to stop adding new sysfs bridge options and continue with netlink only, so remove hosts limit sysfs support. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-29 21:33:02 -08:00
Jakub Kicinski	c358f95205	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/can/dev.c `b552766c87` ("can: dev: prevent potential information leak in can_fill_info()") `3e77f70e73` ("can: dev: move driver related infrastructure into separate subdir") `0a042c6ec9` ("can: dev: move netlink related code into seperate file") Code move. drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c `57ac4a31c4` ("net/mlx5e: Correctly handle changing the number of queues when the interface is down") `214baf2287` ("net/mlx5e: Support HTB offload") Adjacent code changes net/switchdev/switchdev.c `20776b465c` ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP") `ffb68fc58e` ("net: switchdev: remove the transaction structure from port object notifiers") `bae33f2b5a` ("net: switchdev: remove the transaction structure from port attributes") Transaction parameter gets dropped otherwise keep the fix. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-28 17:09:31 -08:00
Nikolay Aleksandrov	2dba407f99	net: bridge: multicast: make tracked EHT hosts limit configurable Add two new port attributes which make EHT hosts limit configurable and export the current number of tracked EHT hosts: - IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT: configure/retrieve current limit - IFLA_BRPORT_MCAST_EHT_HOSTS_CNT: current number of tracked hosts Setting IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT to 0 is currently not allowed. Note that we have to increase RTNL_SLAVE_MAX_TYPE to 38 minimum, I've increased it to 40 to have space for two more future entries. v2: move br_multicast_eht_set_hosts_limit() to br_multicast_eht.c, no functional change Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-27 17:40:35 -08:00
Nikolay Aleksandrov	89268b056e	net: bridge: multicast: add per-port EHT hosts limit Add a default limit of 512 for number of tracked EHT hosts per-port. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-27 17:40:35 -08:00
Pablo Neira Ayuso	345023b0db	netfilter: nftables: add nft_parse_register_store() and use it This new function combines the netlink register attribute parser and the store validation function. This update requires to replace: enum nft_registers dreg:8; in many of the expression private areas otherwise compiler complains with: error: cannot take address of bit-field ‘dreg’ when passing the register field as reference. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2021-01-27 23:16:02 +01:00
Nikolay Aleksandrov	3e841bacf7	net: bridge: multicast: fix br_multicast_eht_set_entry_lookup indentation Fix the messed up indentation in br_multicast_eht_set_entry_lookup(). Fixes: `baa74d39ca` ("net: bridge: multicast: add EHT source set handling functions") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210125082040.13022-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-26 16:26:50 -08:00
Jiapeng Zhong	8d21c882ab	bridge: Use PTR_ERR_OR_ZERO instead if(IS_ERR(...)) + PTR_ERR coccicheck suggested using PTR_ERR_OR_ZERO() and looking at the code. Fix the following coccicheck warnings: ./net/bridge/br_multicast.c:1295:7-13: WARNING: PTR_ERR_OR_ZERO can be used. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/1611542381-91178-1-git-send-email-abaci-bugfix@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-25 18:23:07 -08:00
Rasmus Villemoes	6781939054	net: mrp: move struct definitions out of uapi None of these are actually used in the kernel/userspace interface - there's a userspace component of implementing MRP, and userspace will need to construct certain frames to put on the wire, but there's no reason the kernel should provide the relevant definitions in a UAPI header. In fact, some of those definitions were broken until previous commit, so only keep the few that are actually referenced in the kernel code, and move them to the br_private_mrp.h header. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-23 12:38:42 -08:00
Nikolay Aleksandrov	d5a1022283	net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletes Mark groups which were deleted due to fast leave/EHT. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov	e87e4b5caa	net: bridge: multicast: handle block pg delete for all cases A block report can result in empty source and host sets for both include and exclude groups so if there are no hosts left we can safely remove the group. Pull the block group handling so it can cover both cases and add a check if EHT requires the delete. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov	c9739016a0	net: bridge: multicast: add EHT host filter_mode handling We should be able to handle host filter mode changing. For exclude mode we must create a zero-src entry so the group will be kept even without any S,G entries (non-zero source sets). That entry doesn't count to the entry limit and can always be created, its timer is refreshed on new exclude reports and if we change the host filter mode to include then it gets removed and we rely only on the non-zero source sets. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov	b66bf55bbc	net: bridge: multicast: optimize TO_INCLUDE EHT timeouts This is an optimization specifically for TO_INCLUDE which sends queries for the older entries and thus lowers the S,G timers to LMQT. If we have the following situation for a group in either include or exclude mode: - host A was interested in srcs X and Y, but is timing out - host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts to LMQT - host B sends BLOCK src Z after LMQT time has passed => since host B is the last host we can delete the group, but if we still have host A's EHT entries for X and Y (i.e. if they weren't lowered to LMQT previously) then we'll have to wait another LMQT time before deleting the group, with this optimization we can directly remove it regardless of the group mode as there are no more interested hosts Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov	ddc255d993	net: bridge: multicast: add EHT include and exclude handling Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to how the reports are processed we have 2 cases when the group is in include or exclude mode, these are processed as follows: - group include - is_include: create missing entries - to_include: flush existing entries and create a new set from the report, obviously if the src set is empty then we delete the group - group exclude - is_exclude: create missing entries - to_exclude: flush existing entries and create a new set from the report, any empty source set entries are removed If the group is in a different mode then we just flush all entries reported by the host and we create a new set with the new mode entries created from the report. If the report is include type, the source list is empty and the group has empty sources' set then we remove it. Any source set entries which are empty are removed as well. If the group is in exclude mode it can exist without any S,G entries (allowing for all traffic to pass). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov	474ddb37fa	net: bridge: multicast: add EHT allow/block handling Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how the reports are processed we have 2 cases when the group is in include or exclude mode, these are processed as follows: - group include - allow: create missing entries - block: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries, then possibly delete the whole group if there are no more S,G entries - group exclude - allow - host include: create missing entries - host exclude: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries - block - host include: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries, then possibly delete the whole group if there are no more S,G entries - host exclude: create missing entries Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	dba6b0a5ca	net: bridge: multicast: add EHT host delete function Now that we can delete set entries, we can use that to remove EHT hosts. Since the group's host set entries exist only when there are related source set entries we just have to flush all source set entries joined by the host set entry and it will be automatically removed. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	baa74d39ca	net: bridge: multicast: add EHT source set handling functions Add EHT source set and set-entry create, delete and lookup functions. These allow to manipulate source sets which contain their own host sets with entries which joined that S,G. We're limiting the maximum number of tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is the current maximum of S,G entries for a group. There's a per-set timer which will be used to destroy the whole set later. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	5b16328879	net: bridge: multicast: add EHT host handling functions Add functions to create, destroy and lookup an EHT host. These are per-host entries contained in the eht_host_tree in net_bridge_port_group which are used to store a list of all sources (S,G) entries joined for that group by each host, the host's current filter mode and total number of joined entries. No functional changes yet, these would be used in later patches. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	8f07b83119	net: bridge: multicast: add EHT structures and definitions Add EHT structures for tracking hosts and sources per group. We keep one set for each host which has all of the host's S,G entries, and one set for each multicast source which has all hosts that have joined that S,G. For each host, source entry we record the filter_mode and we keep an expiry timer. There is also one global expiry timer per source set, it is updated with each set entry update, it will be later used to lower the set's timer instead of lowering each entry's timer separately. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	e7cfcf2c18	net: bridge: multicast: calculate idx position without changing ptr We need to preserve the srcs pointer since we'll be passing it for EHT handling later. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	0ad57c99e8	net: bridge: multicast: __grp_src_block_incl can modify pg Prepare __grp_src_block_incl() for being able to cause a notification due to changes. Currently it cannot happen, but EHT would change that since we'll be deleting sources immediately. Make sure that if the pg is deleted we don't return true as that would cause the caller to access freed pg. This patch shouldn't cause any functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	54bea72196	net: bridge: multicast: pass host src address to IGMPv3/MLDv2 functions We need to pass the host address so later it can be used for explicit host tracking. No functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov	9e10b9e656	net: bridge: multicast: rename src_size to addr_size Rename src_size argument to addr_size in preparation for passing host address as an argument to IGMPv3/MLDv2 functions. No functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-22 19:39:56 -08:00
Menglong Dong	a98c0c4742	net: bridge: check vlan with eth_type_vlan() method Replace some checks for ETH_P_8021Q and ETH_P_8021AD with eth_type_vlan(). Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210117080950.122761-1-dong.menglong@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-18 14:27:33 -08:00
Vladimir Oltean	b7a9e0da2d	net: switchdev: remove vid_begin -> vid_end range from VLAN objects The call path of a switchdev VLAN addition to the bridge looks something like this today: nbp_vlan_init \| __br_vlan_set_default_pvid \| \| \| \| \| br_afspec \| \| \| \| \| \| \| v \| \| \| br_process_vlan_info \| \| \| \| \| \| \| v \| \| \| br_vlan_info \| \| \| / \ / \| \| / \ / \| \| / \ / \| \| / \ / v v v v v nbp_vlan_add br_vlan_add ------+ \| ^ ^ \| \| \| / \| \| \| \| / / / \| \ br_vlan_get_master/ / v \ ^ / / br_vlan_add_existing \ \| / / \| \ \| / / / \ \| / / / \ \| / / / \ \| / / / v \| \| v / __vlan_add / / \| / / \| / v \| / __vlan_vid_add \| / \ \| / v v v br_switchdev_port_vlan_add The ranges UAPI was introduced to the bridge in commit `bdced7ef78` ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec) have always been passed one by one, through struct bridge_vlan_info tmp_vinfo, to br_vlan_info. So the range never went too far in depth. Then Scott Feldman introduced the switchdev_port_bridge_setlink function in commit `47f8328bb1` ("switchdev: add new switchdev bridge setlink"). That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made full use of the range. But switchdev_port_bridge_setlink was called like this: br_setlink -> br_afspec -> switchdev_port_bridge_setlink Basically, the switchdev and the bridge code were not tightly integrated. Then commit `41c498b935` ("bridge: restore br_setlink back to original") came, and switchdev drivers were required to implement .ndo_bridge_setlink = switchdev_port_bridge_setlink for a while. In the meantime, commits such as `0944d6b5a2` ("bridge: try switchdev op first in __vlan_vid_add/del") finally made switchdev penetrate the br_vlan_info() barrier and start to develop the call path we have today. But remember, br_vlan_info() still receives VLANs one by one. Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit `29ab586c3d` ("net: switchdev: Remove bridge bypass support from switchdev") so that drivers would not implement .ndo_bridge_setlink any longer. The switchdev_port_bridge_setlink also got deleted. This refactoring removed the parallel bridge_setlink implementation from switchdev, and left the only switchdev VLAN objects to be the ones offloaded from __vlan_vid_add (basically RX filtering) and __vlan_add (the latter coming from commit `9c86ce2c1a` ("net: bridge: Notify about bridge VLANs")). That is to say, today the switchdev VLAN object ranges are not used in the kernel. Refactoring the above call path is a bit complicated, when the bridge VLAN call path is already a bit complicated. Let's go off and finish the job of commit `29ab586c3d` by deleting the bogus iteration through the VLAN ranges from the drivers. Some aspects of this feature never made too much sense in the first place. For example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID flag supposed to mean, when a port can obviously have a single pvid? This particular configuration _is_ denied as of commit `6623c60dc2` ("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API perspective, the driver still has to play pretend, and only offload the vlan->vid_end as pvid. And the addition of a switchdev VLAN object can modify the flags of another, completely unrelated, switchdev VLAN object! (a VLAN that is PVID will invalidate the PVID flag from whatever other VLAN had previously been offloaded with switchdev and had that flag. Yet switchdev never notifies about that change, drivers are supposed to guess). Nonetheless, having a VLAN range in the API makes error handling look scarier than it really is - unwinding on errors and all of that. When in reality, no one really calls this API with more than one VLAN. It is all unnecessary complexity. And despite appearing pretentious (two-phase transactional model and all), the switchdev API is really sloppy because the VLAN addition and removal operations are not paired with one another (you can add a VLAN 100 times and delete it just once). The bridge notifies through switchdev of a VLAN addition not only when the flags of an existing VLAN change, but also when nothing changes. There are switchdev drivers out there who don't like adding a VLAN that has already been added, and those checks don't really belong at driver level. But the fact that the API contains ranges is yet another factor that prevents this from being addressed in the future. Of the existing switchdev pieces of hardware, it appears that only Mellanox Spectrum supports offloading more than one VLAN at a time, through mlxsw_sp_port_vlan_set. I have kept that code internal to the driver, because there is some more bookkeeping that makes use of it, but I deleted it from the switchdev API. But since the switchdev support for ranges has already been de facto deleted by a Mellanox employee and nobody noticed for 4 years, I'm going to assume it's not a biggie. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-11 16:00:56 -08:00
Menglong Dong	efb5b338da	net: bridge: fix misspellings using codespell tool Some typos are found out by codespell tool: $ codespell ./net/bridge/ ./net/bridge/br_stp.c:604: permanant ==> permanent ./net/bridge/br_stp.c:605: persistance ==> persistence ./net/bridge/br.c:125: underlaying ==> underlying ./net/bridge/br_input.c:43: modue ==> mode ./net/bridge/br_mrp.c:828: Determin ==> Determine ./net/bridge/br_mrp.c:848: Determin ==> Determine ./net/bridge/br_mrp.c:897: Determin ==> Determine Fix typos found by codespell. Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20210108025332.52480-1-dong.menglong@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-09 13:54:47 -08:00
Vladimir Oltean	90dc8fd360	net: bridge: notify switchdev of disappearance of old FDB entry upon migration Currently the bridge emits atomic switchdev notifications for dynamically learnt FDB entries. Monitoring these notifications works wonders for switchdev drivers that want to keep their hardware FDB in sync with the bridge's FDB. For example station A wants to talk to station B in the diagram below, and we are concerned with the behavior of the bridge on the DUT device: DUT +-------------------------------------+ \| br0 \| \| +------+ +------+ +------+ +------+ \| \| \| \| \| \| \| \| \| \| \| \| \| swp0 \| \| swp1 \| \| swp2 \| \| eth0 \| \| +-------------------------------------+ \| \| \| Station A \| \| \| \| +--+------+--+ +--+------+--+ \| \| \| \| \| \| \| \| \| \| swp0 \| \| \| \| swp0 \| \| Another \| +------+ \| \| +------+ \| Another switch \| br0 \| \| br0 \| switch \| +------+ \| \| +------+ \| \| \| \| \| \| \| \| \| \| \| swp1 \| \| \| \| swp1 \| \| +--+------+--+ +--+------+--+ \| Station B Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has the following property: frames injected from its control interface bypass the internal address analyzer logic, and therefore, this hardware does not learn from the source address of packets transmitted by the network stack through it. So, since bridging between eth0 (where Station B is attached) and swp0 (where Station A is attached) is done in software, the switchdev hardware will never learn the source address of Station B. So the traffic towards that destination will be treated as unknown, i.e. flooded. This is where the bridge notifications come in handy. When br0 on the DUT sees frames with Station B's MAC address on eth0, the switchdev driver gets these notifications and can install a rule to send frames towards Station B's address that are incoming from swp0, swp1, swp2, only towards the control interface. This is all switchdev driver private business, which the notification makes possible. All is fine until someone unplugs Station B's cable and moves it to the other switch: DUT +-------------------------------------+ \| br0 \| \| +------+ +------+ +------+ +------+ \| \| \| \| \| \| \| \| \| \| \| \| \| swp0 \| \| swp1 \| \| swp2 \| \| eth0 \| \| +-------------------------------------+ \| \| \| Station A \| \| \| \| +--+------+--+ +--+------+--+ \| \| \| \| \| \| \| \| \| \| swp0 \| \| \| \| swp0 \| \| Another \| +------+ \| \| +------+ \| Another switch \| br0 \| \| br0 \| switch \| +------+ \| \| +------+ \| \| \| \| \| \| \| \| \| \| \| swp1 \| \| \| \| swp1 \| \| +--+------+--+ +--+------+--+ \| Station B Luckily for the use cases we care about, Station B is noisy enough that the DUT hears it (on swp1 this time). swp1 receives the frames and delivers them to the bridge, who enters the unlikely path in br_fdb_update of updating an existing entry. It moves the entry in the software bridge to swp1 and emits an addition notification towards that. As far as the switchdev driver is concerned, all that it needs to ensure is that traffic between Station A and Station B is not forever broken. If it does nothing, then the stale rule to send frames for Station B towards the control interface remains in place. But Station B is no longer reachable via the control interface, but via a port that can offload the bridge port learning attribute. It's just that the port is prevented from learning this address, since the rule overrides FDB updates. So the rule needs to go. The question is via what mechanism. It sure would be possible for this switchdev driver to keep track of all addresses which are sent to the control interface, and then also listen for bridge notifier events on its own ports, searching for the ones that have a MAC address which was previously sent to the control interface. But this is cumbersome and inefficient. Instead, with one small change, the bridge could notify of the address deletion from the old port, in a symmetrical manner with how it did for the insertion. Then the switchdev driver would not be required to monitor learn/forget events for its own ports. It could just delete the rule towards the control interface upon bridge entry migration. This would make hardware address learning be possible again. Then it would take a few more packets until the hardware and software FDB would be in sync again. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-01-07 15:34:45 -08:00
Wang Hai	989a1db06e	net: bridge: Fix a warning when del bridge sysfs I got a warining report: br_sysfs_addbr: can't create group bridge4/bridge ------------[ cut here ]------------ sysfs group 'bridge' not found for kobject 'bridge4' WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline] WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270 Modules linked in: iptable_nat ... Call Trace: br_dev_delete+0x112/0x190 net/bridge/br_if.c:384 br_dev_newlink net/bridge/br_netlink.c:1381 [inline] br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362 __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562 netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0x139/0x170 net/socket.c:671 ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353 ___sys_sendmsg+0xf8/0x170 net/socket.c:2407 __sys_sendmsg+0xd3/0x190 net/socket.c:2440 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In br_device_event(), if the bridge sysfs fails to be added, br_device_event() should return error. This can prevent warining when removing bridge sysfs that do not exist. Fixes: `bb900b27a2` ("bridge: allow creating bridge devices with netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 18:27:49 -08:00
Jakub Kicinski	7bca5021a4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap. 2) Use atomic_inc_return() instead of atomic_add_return() in IPVS, from Yejune Deng. 3) Simplify check for overquota in xt_nfacct, from Kaixu Xia. 4) Move nfnl_acct_list away from struct net, from Miao Wang. 5) Pass actual sk in reject actions, from Jan Engelhardt. 6) Add timeout and protoinfo to ctnetlink destroy events, from Florian Westphal. 7) Four patches to generalize set infrastructure to support for multiple expressions per set element. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: netlink support for several set element expressions netfilter: nftables: generalize set extension to support for several expressions netfilter: nftables: move nft_expr before nft_set netfilter: nftables: generalize set expressions support netfilter: ctnetlink: add timeout and protoinfo to destroy events netfilter: use actual socket sk for REJECT action netfilter: nfnl_acct: remove data from struct net netfilter: Remove unnecessary conversion to bool ipvs: replace atomic_add_return() netfilter: nft_reject_bridge: fix build errors due to code movement ==================== Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 15:43:21 -08:00
Jakub Kicinski	46d5e62dd3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-11 22:29:38 -08:00
Joseph Huang	851d0a73c9	bridge: Fix a deadlock when enabling multicast snooping When enabling multicast snooping, bridge module deadlocks on multicast_lock if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2 network. The deadlock was caused by the following sequence: While holding the lock, br_multicast_open calls br_multicast_join_snoopers, which eventually causes IP stack to (attempt to) send out a Listener Report (in igmp6_join_group). Since the destination Ethernet address is a multicast address, br_dev_xmit feeds the packet back to the bridge via br_multicast_rcv, which in turn calls br_multicast_add_group, which then deadlocks on multicast_lock. The fix is to move the call br_multicast_join_snoopers outside of the critical section. This works since br_multicast_join_snoopers only deals with IP and does not modify any multicast data structures of the bridge, so there's no need to hold the lock. Steps to reproduce: 1. sysctl net.ipv6.conf.all.force_mld_version=1 2. have another querier 3. ip link set dev bridge type bridge mcast_snooping 0 && \ ip link set dev bridge type bridge mcast_snooping 1 < deadlock > A typical call trace looks like the following: [ 936.251495] _raw_spin_lock+0x5c/0x68 [ 936.255221] br_multicast_add_group+0x40/0x170 [bridge] [ 936.260491] br_multicast_rcv+0x7ac/0xe30 [bridge] [ 936.265322] br_dev_xmit+0x140/0x368 [bridge] [ 936.269689] dev_hard_start_xmit+0x94/0x158 [ 936.273876] __dev_queue_xmit+0x5ac/0x7f8 [ 936.277890] dev_queue_xmit+0x10/0x18 [ 936.281563] neigh_resolve_output+0xec/0x198 [ 936.285845] ip6_finish_output2+0x240/0x710 [ 936.290039] __ip6_finish_output+0x130/0x170 [ 936.294318] ip6_output+0x6c/0x1c8 [ 936.297731] NF_HOOK.constprop.0+0xd8/0xe8 [ 936.301834] igmp6_send+0x358/0x558 [ 936.305326] igmp6_join_group.part.0+0x30/0xf0 [ 936.309774] igmp6_group_added+0xfc/0x110 [ 936.313787] __ipv6_dev_mc_inc+0x1a4/0x290 [ 936.317885] ipv6_dev_mc_inc+0x10/0x18 [ 936.321677] br_multicast_open+0xbc/0x110 [bridge] [ 936.326506] br_multicast_toggle+0xec/0x140 [bridge] Fixes: `4effd28c12` ("bridge: join all-snoopers multicast address") Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201204235628.50653-1-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-07 17:14:43 -08:00
Zhang Changzhong	ee4f52a8de	net: bridge: vlan: fix error return code in __vlan_add() Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: `f8ed289fab` ("bridge: vlan: use br_vlan_(get\|put)_master to deal with refcounts") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 15:41:06 -08:00
Jakub Kicinski	55fd59b003	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: drivers/net/ethernet/ibm/ibmvnic.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-03 15:44:09 -08:00
Danielle Ratson	22ec19f3ae	bridge: switchdev: Notify about VLAN protocol changes Drivers that support bridge offload need to be notified about changes to the bridge's VLAN protocol so that they could react accordingly and potentially veto the change. Add a new switchdev attribute to communicate the change to drivers. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-01 15:21:13 -08:00
Antoine Tenart	44f64f23ba	netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the hooks as, while it's an expected value for a bridge, routing expects PACKET_HOST. The change is undone later on after hook traversal. This can be seen with pairs of functions updating skb>pkt_type and then reverting it to its original value: For hook NF_INET_PRE_ROUTING: setup_pre_routing / br_nf_pre_routing_finish For hook NF_INET_FORWARD: br_nf_forward_ip / br_nf_forward_finish But the third case where netfilter does this, for hook NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing but never reverted. A comment says: /* We assume any code from br_dev_queue_push_xmit onwards doesn't care * about the value of skb->pkt_type. */ But when having a tunnel (say vxlan) attached to a bridge we have the following call trace: br_nf_pre_routing br_nf_pre_routing_ipv6 br_nf_pre_routing_finish br_nf_forward_ip br_nf_forward_finish br_nf_post_routing <- pkt_type is updated to PACKET_HOST br_nf_dev_queue_xmit <- but not reverted to its original value vxlan_xmit vxlan_xmit_one skb_tunnel_check_pmtu <- a check on pkt_type is performed In this specific case, this creates issues such as when an ICMPv6 PTB should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST and returns early). If the comment is right and no one cares about the value of skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting it to its original value should be safe. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-28 11:46:51 -08:00
Horatiu Vultur	bfd042321a	bridge: mrp: Implement LC mode for MRP Extend MRP to support LC mode(link check) for the interconnect port. This applies only to the interconnect ring. Opposite to RC mode(ring check) the LC mode is using CFM frames to detect when the link goes up or down and based on that the userspace will need to react. One advantage of the LC mode over RC mode is that there will be fewer frames in the normal rings. Because RC mode generates InTest on all ports while LC mode sends CFM frame only on the interconnect port. All 4 nodes part of the interconnect ring needs to have the same mode. And it is not possible to have running LC and RC mode at the same time on a node. Whenever the MIM starts it needs to detect the status of the other 3 nodes in the interconnect ring so it would send a frame called InLinkStatus, on which the clients needs to reply with their link status. This patch adds InLinkStatus frame type and extends existing rules on how to forward this frame. Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-25 13:33:35 -08:00
Randy Dunlap	fd2d6bc4c2	netfilter: nft_reject_bridge: fix build errors due to code movement Fix build errors in net/bridge/netfilter/nft_reject_bridge.ko by selecting NF_REJECT_IPV4, which provides the missing symbols. ERROR: modpost: "nf_reject_skb_v4_tcp_reset" [net/bridge/netfilter/nft_reject_bridge.ko] undefined! ERROR: modpost: "nf_reject_skb_v4_unreach" [net/bridge/netfilter/nft_reject_bridge.ko] undefined! Fixes: `fa538f7cf0` ("netfilter: nf_reject: add reject skbuff creation helpers") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-11-22 13:44:51 +01:00
Heiner Kallweit	7609ecb2ed	net: bridge: switch to net core statistics counters handling Use netdev->tstats instead of a member of net_bridge for storing a pointer to the per-cpu counters. This allows us to use core functionality for statistics handling. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/9bad2be2-fd84-7c6e-912f-cee433787018@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-21 14:40:50 -08:00
Jakub Kicinski	56495a2442	Merge https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-19 19:08:46 -08:00
Heiner Kallweit	281cc2843b	net: bridge: replace struct br_vlan_stats with pcpu_sw_netstats Struct br_vlan_stats duplicates pcpu_sw_netstats (apart from br_vlan_stats not defining an alignment requirement), therefore switch to using the latter one. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/04d25c3d-c5f6-3611-6d37-c2f40243dae2@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-18 17:23:16 -08:00
Heiner Kallweit	7a30ecc923	net: bridge: add missing counters to ndo_get_stats64 callback In br_forward.c and br_input.c fields dev->stats.tx_dropped and dev->stats.multicast are populated, but they are ignored in ndo_get_stats64. Fixes: `28172739f0` ("net: fix 64 bit counters on 32 bit arches") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-16 15:47:50 -08:00
Horatiu Vultur	0169b82054	bridge: mrp: Use hlist_head instead of list_head for mrp Replace list_head with hlist_head for MRP list under the bridge. There is no need for a circular list when a linear list will work. This will also decrease the size of 'struct net_bridge'. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20201106215049.1448185-1-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-09 16:42:12 -08:00
Jakub Kicinski	b65ca4c388	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Move existing bridge packet reject infra to nf_reject_{ipv4,ipv6}.c from Jose M. Guisado. 2) Consolidate nft_reject_inet initialization and dump, also from Jose. 3) Add the netdev reject action, from Jose. 4) Allow to combine the exist flag and the destroy command in ipset, from Joszef Kadlecsik. 5) Expose bucket size parameter for hashtables, also from Jozsef. 6) Expose the init value for reproducible ipset listings, from Jozsef. 7) Use __printf attribute in nft_request_module, from Andrew Lunn. 8) Allow to use reject from the inet ingress chain. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nft_reject_inet: allow to use reject from inet ingress netfilter: nftables: Add __printf() attribute netfilter: ipset: Expose the initval hash parameter to userspace netfilter: ipset: Add bucketsize parameter to all hash types netfilter: ipset: Support the -exist flag with the destroy command netfilter: nft_reject: add reject verdict support for netdev netfilter: nft_reject: unify reject init and dump into nft_reject netfilter: nf_reject: add reject skbuff creation helpers ==================== Link: https://lore.kernel.org/r/20201104141149.30082-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-04 18:05:56 -08:00
Vladimir Oltean	c43fd36f7f	net: bridge: mcast: fix stub definition of br_multicast_querier_exists The commit cited below has changed only the functional prototype of br_multicast_querier_exists, but forgot to do that for the stub prototype (the one where CONFIG_BRIDGE_IGMP_SNOOPING is disabled). Fixes: `955062b03f` ("net: bridge: mcast: add support for raw L2 multicast groups") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201101000845.190009-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-31 17:23:19 -07:00
Jose M. Guisado Gomez	312ca575a5	netfilter: nft_reject: unify reject init and dump into nft_reject Bridge family is using the same static init and dump function as inet. This patch removes duplicate code unifying these functions body into nft_reject.c so they can be reused in the rest of families supporting reject verdict. Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-31 10:40:42 +01:00
Jose M. Guisado Gomez	fa538f7cf0	netfilter: nf_reject: add reject skbuff creation helpers Adds reject skbuff creation helper functions to ipv4/6 nf_reject infrastructure. Use these functions for reject verdict in bridge family. Can be reused by all different families that support reject and will not inject the reject packet through ip local out. Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-31 10:40:22 +01:00
Vladimir Oltean	0e761ac08f	net: bridge: explicitly convert between mdb entry state and port group flags When creating a new multicast port group, there is implicit conversion between the __u8 state member of struct br_mdb_entry and the unsigned char flags member of struct net_bridge_port_group. This implicit conversion relies on the fact that MDB_PERMANENT is equal to MDB_PG_FLAGS_PERMANENT. Let's be more explicit and convert the state to flags manually. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201028234815.613226-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-30 17:58:16 -07:00
Nikolay Aleksandrov	955062b03f	net: bridge: mcast: add support for raw L2 multicast groups Extend the bridge multicast control and data path to configure routes for L2 (non-IP) multicast groups. The uapi struct br_mdb_entry union u is extended with another variant, mac_addr, which does not change the structure size, and which is valid when the proto field is zero. To be compatible with the forwarding code that is already in place, which acts as an IGMP/MLD snooping bridge with querier capabilities, we need to declare that for L2 MDB entries (for which there exists no such thing as IGMP/MLD snooping/querying), that there is always a querier. Otherwise, these entries would be flooded to all bridge ports and not just to those that are members of the L2 multicast group. Needless to say, only permanent L2 multicast groups can be installed on a bridge port. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201028233831.610076-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-30 17:49:19 -07:00
Henrik Bjoernlund	b6d0425b81	bridge: cfm: Netlink Notifications. This is the implementation of Netlink notifications out of CFM. Notifications are initiated whenever a state change happens in CFM. IFLA_BRIDGE_CFM: Points to the CFM information. IFLA_BRIDGE_CFM_MEP_STATUS_INFO: This indicate that the MEP instance status are following. IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO: This indicate that the peer MEP status are following. CFM nested attribute has the following attributes in next level. IFLA_BRIDGE_CFM_MEP_STATUS_INSTANCE: The MEP instance number of the delivered status. The type is NLA_U32. IFLA_BRIDGE_CFM_MEP_STATUS_OPCODE_UNEXP_SEEN: The MEP instance received CFM PDU with unexpected Opcode. The type is NLA_U32 (bool). IFLA_BRIDGE_CFM_MEP_STATUS_VERSION_UNEXP_SEEN: The MEP instance received CFM PDU with unexpected version. The type is NLA_U32 (bool). IFLA_BRIDGE_CFM_MEP_STATUS_RX_LEVEL_LOW_SEEN: The MEP instance received CCM PDU with MD level lower than configured level. This frame is discarded. The type is NLA_U32 (bool). IFLA_BRIDGE_CFM_CC_PEER_STATUS_INSTANCE: The MEP instance number of the delivered status. The type is NLA_U32. IFLA_BRIDGE_CFM_CC_PEER_STATUS_PEER_MEPID: The added Peer MEP ID of the delivered status. The type is NLA_U32. IFLA_BRIDGE_CFM_CC_PEER_STATUS_CCM_DEFECT: The CCM defect status. The type is NLA_U32 (bool). True means no CCM frame is received for 3.25 intervals. IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL. IFLA_BRIDGE_CFM_CC_PEER_STATUS_RDI: The last received CCM PDU RDI. The type is NLA_U32 (bool). IFLA_BRIDGE_CFM_CC_PEER_STATUS_PORT_TLV_VALUE: The last received CCM PDU Port Status TLV value field. The type is NLA_U8. IFLA_BRIDGE_CFM_CC_PEER_STATUS_IF_TLV_VALUE: The last received CCM PDU Interface Status TLV value field. The type is NLA_U8. IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEEN: A CCM frame has been received from Peer MEP. The type is NLA_U32 (bool). This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. IFLA_BRIDGE_CFM_CC_PEER_STATUS_TLV_SEEN: A CCM frame with TLV has been received from Peer MEP. The type is NLA_U32 (bool). This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEQ_UNEXP_SEEN: A CCM frame with unexpected sequence number has been received from Peer MEP. The type is NLA_U32 (bool). When a sequence number is not one higher than previously received then it is unexpected. This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:44 -07:00
Henrik Bjoernlund	e77824d81d	bridge: cfm: Netlink GET status Interface. This is the implementation of CFM netlink status get information interface. Add new nested netlink attributes. These attributes are used by the user space to get status information. GETLINK: Request filter RTEXT_FILTER_CFM_STATUS: Indicating that CFM status information must be delivered. IFLA_BRIDGE_CFM: Points to the CFM information. IFLA_BRIDGE_CFM_MEP_STATUS_INFO: This indicate that the MEP instance status are following. IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO: This indicate that the peer MEP status are following. CFM nested attribute has the following attributes in next level. GETLINK RTEXT_FILTER_CFM_STATUS: IFLA_BRIDGE_CFM_MEP_STATUS_INSTANCE: The MEP instance number of the delivered status. The type is u32. IFLA_BRIDGE_CFM_MEP_STATUS_OPCODE_UNEXP_SEEN: The MEP instance received CFM PDU with unexpected Opcode. The type is u32 (bool). IFLA_BRIDGE_CFM_MEP_STATUS_VERSION_UNEXP_SEEN: The MEP instance received CFM PDU with unexpected version. The type is u32 (bool). IFLA_BRIDGE_CFM_MEP_STATUS_RX_LEVEL_LOW_SEEN: The MEP instance received CCM PDU with MD level lower than configured level. This frame is discarded. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_PEER_STATUS_INSTANCE: The MEP instance number of the delivered status. The type is u32. IFLA_BRIDGE_CFM_CC_PEER_STATUS_PEER_MEPID: The added Peer MEP ID of the delivered status. The type is u32. IFLA_BRIDGE_CFM_CC_PEER_STATUS_CCM_DEFECT: The CCM defect status. The type is u32 (bool). True means no CCM frame is received for 3.25 intervals. IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL. IFLA_BRIDGE_CFM_CC_PEER_STATUS_RDI: The last received CCM PDU RDI. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_PEER_STATUS_PORT_TLV_VALUE: The last received CCM PDU Port Status TLV value field. The type is u8. IFLA_BRIDGE_CFM_CC_PEER_STATUS_IF_TLV_VALUE: The last received CCM PDU Interface Status TLV value field. The type is u8. IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEEN: A CCM frame has been received from Peer MEP. The type is u32 (bool). This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. IFLA_BRIDGE_CFM_CC_PEER_STATUS_TLV_SEEN: A CCM frame with TLV has been received from Peer MEP. The type is u32 (bool). This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEQ_UNEXP_SEEN: A CCM frame with unexpected sequence number has been received from Peer MEP. The type is u32 (bool). When a sequence number is not one higher than previously received then it is unexpected. This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:44 -07:00
Henrik Bjoernlund	5e312fc0e7	bridge: cfm: Netlink GET configuration Interface. This is the implementation of CFM netlink configuration get information interface. Add new nested netlink attributes. These attributes are used by the user space to get configuration information. GETLINK: Request filter RTEXT_FILTER_CFM_CONFIG: Indicating that CFM configuration information must be delivered. IFLA_BRIDGE_CFM: Points to the CFM information. IFLA_BRIDGE_CFM_MEP_CREATE_INFO: This indicate that MEP instance create parameters are following. IFLA_BRIDGE_CFM_MEP_CONFIG_INFO: This indicate that MEP instance config parameters are following. IFLA_BRIDGE_CFM_CC_CONFIG_INFO: This indicate that MEP instance CC functionality parameters are following. IFLA_BRIDGE_CFM_CC_RDI_INFO: This indicate that CC transmitted CCM PDU RDI parameters are following. IFLA_BRIDGE_CFM_CC_CCM_TX_INFO: This indicate that CC transmitted CCM PDU parameters are following. IFLA_BRIDGE_CFM_CC_PEER_MEP_INFO: This indicate that the added peer MEP IDs are following. CFM nested attribute has the following attributes in next level. GETLINK RTEXT_FILTER_CFM_CONFIG: IFLA_BRIDGE_CFM_MEP_CREATE_INSTANCE: The created MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CREATE_DOMAIN: The created MEP domain. The type is u32 (br_cfm_domain). It must be BR_CFM_PORT. This means that CFM frames are transmitted and received directly on the port - untagged. Not in a VLAN. IFLA_BRIDGE_CFM_MEP_CREATE_DIRECTION: The created MEP direction. The type is u32 (br_cfm_mep_direction). It must be BR_CFM_MEP_DIRECTION_DOWN. This means that CFM frames are transmitted and received on the port. Not in the bridge. IFLA_BRIDGE_CFM_MEP_CREATE_IFINDEX: The created MEP residence port ifindex. The type is u32 (ifindex). IFLA_BRIDGE_CFM_MEP_DELETE_INSTANCE: The deleted MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CONFIG_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CONFIG_UNICAST_MAC: The configured MEP unicast MAC address. The type is 6u8 (array). This is used as SMAC in all transmitted CFM frames. IFLA_BRIDGE_CFM_MEP_CONFIG_MDLEVEL: The configured MEP unicast MD level. The type is u32. It must be in the range 1-7. No CFM frames are passing through this MEP on lower levels. IFLA_BRIDGE_CFM_MEP_CONFIG_MEPID: The configured MEP ID. The type is u32. It must be in the range 0-0x1FFF. This MEP ID is inserted in any transmitted CCM frame. IFLA_BRIDGE_CFM_CC_CONFIG_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_CONFIG_ENABLE: The Continuity Check (CC) functionality is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL: The CC expected receive interval of CCM frames. The type is u32 (br_cfm_ccm_interval). This is also the transmission interval of CCM frames when enabled. IFLA_BRIDGE_CFM_CC_CONFIG_EXP_MAID: The CC expected receive MAID in CCM frames. The type is CFM_MAID_LENGTHu8. This is MAID is also inserted in transmitted CCM frames. IFLA_BRIDGE_CFM_CC_PEER_MEP_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_PEER_MEPID: The CC Peer MEP ID added. The type is u32. When a Peer MEP ID is added and CC is enabled it is expected to receive CCM frames from that Peer MEP. IFLA_BRIDGE_CFM_CC_RDI_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_RDI_RDI: The RDI that is inserted in transmitted CCM PDU. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_CCM_TX_DMAC: The transmitted CCM frame destination MAC address. The type is 6*u8 (array). This is used as DMAC in all transmitted CFM frames. IFLA_BRIDGE_CFM_CC_CCM_TX_SEQ_NO_UPDATE: The transmitted CCM frame update (increment) of sequence number is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_PERIOD: The period of time where CCM frame are transmitted. The type is u32. The time is given in seconds. SETLINK IFLA_BRIDGE_CFM_CC_CCM_TX must be done before timeout to keep transmission alive. When period is zero any ongoing CCM frame transmission will be stopped. IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV: The transmitted CCM frame update with Interface Status TLV is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV_VALUE: The transmitted Interface Status TLV value field. The type is u8. IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV: The transmitted CCM frame update with Port Status TLV is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV_VALUE: The transmitted Port Status TLV value field. The type is u8. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	2be665c394	bridge: cfm: Netlink SET configuration Interface. This is the implementation of CFM netlink configuration set information interface. Add new nested netlink attributes. These attributes are used by the user space to create/delete/configure CFM instances. SETLINK: IFLA_BRIDGE_CFM: Indicate that the following attributes are CFM. IFLA_BRIDGE_CFM_MEP_CREATE: This indicate that a MEP instance must be created. IFLA_BRIDGE_CFM_MEP_DELETE: This indicate that a MEP instance must be deleted. IFLA_BRIDGE_CFM_MEP_CONFIG: This indicate that a MEP instance must be configured. IFLA_BRIDGE_CFM_CC_CONFIG: This indicate that a MEP instance Continuity Check (CC) functionality must be configured. IFLA_BRIDGE_CFM_CC_PEER_MEP_ADD: This indicate that a CC Peer MEP must be added. IFLA_BRIDGE_CFM_CC_PEER_MEP_REMOVE: This indicate that a CC Peer MEP must be removed. IFLA_BRIDGE_CFM_CC_CCM_TX: This indicate that the CC transmitted CCM PDU must be configured. IFLA_BRIDGE_CFM_CC_RDI: This indicate that the CC transmitted CCM PDU RDI must be configured. CFM nested attribute has the following attributes in next level. SETLINK RTEXT_FILTER_CFM_CONFIG: IFLA_BRIDGE_CFM_MEP_CREATE_INSTANCE: The created MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CREATE_DOMAIN: The created MEP domain. The type is u32 (br_cfm_domain). It must be BR_CFM_PORT. This means that CFM frames are transmitted and received directly on the port - untagged. Not in a VLAN. IFLA_BRIDGE_CFM_MEP_CREATE_DIRECTION: The created MEP direction. The type is u32 (br_cfm_mep_direction). It must be BR_CFM_MEP_DIRECTION_DOWN. This means that CFM frames are transmitted and received on the port. Not in the bridge. IFLA_BRIDGE_CFM_MEP_CREATE_IFINDEX: The created MEP residence port ifindex. The type is u32 (ifindex). IFLA_BRIDGE_CFM_MEP_DELETE_INSTANCE: The deleted MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CONFIG_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_MEP_CONFIG_UNICAST_MAC: The configured MEP unicast MAC address. The type is 6u8 (array). This is used as SMAC in all transmitted CFM frames. IFLA_BRIDGE_CFM_MEP_CONFIG_MDLEVEL: The configured MEP unicast MD level. The type is u32. It must be in the range 1-7. No CFM frames are passing through this MEP on lower levels. IFLA_BRIDGE_CFM_MEP_CONFIG_MEPID: The configured MEP ID. The type is u32. It must be in the range 0-0x1FFF. This MEP ID is inserted in any transmitted CCM frame. IFLA_BRIDGE_CFM_CC_CONFIG_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_CONFIG_ENABLE: The Continuity Check (CC) functionality is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL: The CC expected receive interval of CCM frames. The type is u32 (br_cfm_ccm_interval). This is also the transmission interval of CCM frames when enabled. IFLA_BRIDGE_CFM_CC_CONFIG_EXP_MAID: The CC expected receive MAID in CCM frames. The type is CFM_MAID_LENGTHu8. This is MAID is also inserted in transmitted CCM frames. IFLA_BRIDGE_CFM_CC_PEER_MEP_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_PEER_MEPID: The CC Peer MEP ID added. The type is u32. When a Peer MEP ID is added and CC is enabled it is expected to receive CCM frames from that Peer MEP. IFLA_BRIDGE_CFM_CC_RDI_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_RDI_RDI: The RDI that is inserted in transmitted CCM PDU. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_INSTANCE: The configured MEP instance number. The type is u32. IFLA_BRIDGE_CFM_CC_CCM_TX_DMAC: The transmitted CCM frame destination MAC address. The type is 6*u8 (array). This is used as DMAC in all transmitted CFM frames. IFLA_BRIDGE_CFM_CC_CCM_TX_SEQ_NO_UPDATE: The transmitted CCM frame update (increment) of sequence number is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_PERIOD: The period of time where CCM frame are transmitted. The type is u32. The time is given in seconds. SETLINK IFLA_BRIDGE_CFM_CC_CCM_TX must be done before timeout to keep transmission alive. When period is zero any ongoing CCM frame transmission will be stopped. IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV: The transmitted CCM frame update with Interface Status TLV is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV_VALUE: The transmitted Interface Status TLV value field. The type is u8. IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV: The transmitted CCM frame update with Port Status TLV is enabled or disabled. The type is u32 (bool). IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV_VALUE: The transmitted Port Status TLV value field. The type is u8. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	dc32cbb3db	bridge: cfm: Kernel space implementation of CFM. CCM frame RX added. This is the third commit of the implementation of the CFM protocol according to 802.1Q section 12.14. Functionality is extended with CCM frame reception. The MEP instance now contains CCM based status information. Most important is the CCM defect status indicating if correct CCM frames are received with the expected interval. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	a806ad8ee2	bridge: cfm: Kernel space implementation of CFM. CCM frame TX added. This is the second commit of the implementation of the CFM protocol according to 802.1Q section 12.14. Functionality is extended with CCM frame transmission. Interface is extended with these functions: br_cfm_cc_rdi_set() br_cfm_cc_ccm_tx() br_cfm_cc_config_set() A MEP Continuity Check feature can be configured by br_cfm_cc_config_set() The Continuity Check parameters can be configured to be used when transmitting CCM. A MEP can be configured to start or stop transmission of CCM frames by br_cfm_cc_ccm_tx() The CCM will be transmitted for a selected period in seconds. Must call this function before timeout to keep transmission alive. A MEP transmitting CCM can be configured with inserted RDI in PDU by br_cfm_cc_rdi_set() Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	86a14b79e1	bridge: cfm: Kernel space implementation of CFM. MEP create/delete. This is the first commit of the implementation of the CFM protocol according to 802.1Q section 12.14. It contains MEP instance create, delete and configuration. Connectivity Fault Management (CFM) comprises capabilities for detecting, verifying, and isolating connectivity failures in Virtual Bridged Networks. These capabilities can be used in networks operated by multiple independent organizations, each with restricted management access to each others equipment. CFM functions are partitioned as follows: - Path discovery - Fault detection - Fault verification and isolation - Fault notification - Fault recovery Interface consists of these functions: br_cfm_mep_create() br_cfm_mep_delete() br_cfm_mep_config_set() br_cfm_cc_config_set() br_cfm_cc_peer_mep_add() br_cfm_cc_peer_mep_remove() A MEP instance is created by br_cfm_mep_create() -It is the Maintenance association End Point described in 802.1Q section 19.2. -It is created on a specific level (1-7) and is assuring that no CFM frames are passing through this MEP on lower levels. -It initiates and validates CFM frames on its level. -It can only exist on a port that is related to a bridge. -Attributes given cannot be changed until the instance is deleted. A MEP instance can be deleted by br_cfm_mep_delete(). A created MEP instance has attributes that can be configured by br_cfm_mep_config_set(). A MEP Continuity Check feature can be configured by br_cfm_cc_config_set() The Continuity Check Receiver state machine can be enabled and disabled. According to 802.1Q section 19.2.8 A MEP can have Peer MEPs added and removed by br_cfm_cc_peer_mep_add() and br_cfm_cc_peer_mep_remove() The Continuity Check feature can maintain connectivity status on each added Peer MEP. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	f323aa54be	bridge: cfm: Add BRIDGE_CFM to Kconfig. This makes it possible to include or exclude the CFM protocol according to 802.1Q section 12.14. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Henrik Bjoernlund	90c628dd47	net: bridge: extend the process of special frames This patch extends the processing of frames in the bridge. Currently MRP frames needs special processing and the current implementation doesn't allow a nice way to process different frame types. Therefore try to improve this by adding a list that contains frame types that need special processing. This list is iterated for each input frame and if there is a match based on frame type then these functions will be called and decide what to do with the frame. It can process the frame then the bridge doesn't need to do anything or don't process so then the bridge will do normal forwarding. Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 18:39:43 -07:00
Timothée COCAULT	63137bc588	netfilter: ebtables: Fixes dropping of small packets in bridge nat Fixes an error causing small packets to get dropped. skb_ensure_writable expects the second parameter to be a length in the ethernet payload.=20 If we want to write the ethernet header (src, dst), we should pass 0. Otherwise, packets with small payloads (< ETH_ALEN) will get dropped. Fixes: `c1a8311679` ("netfilter: bridge: convert skb_make_writable to skb_ensure_writable") Signed-off-by: Timothée COCAULT <timothee.cocault@orange.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:53 +02:00
Heiner Kallweit	f3f04f0f3a	net: bridge: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/d1c3ff29-5691-9d54-d164-16421905fa59@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Jakub Kicinski	9d49aea13f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 15:44:50 -07:00
Henrik Bjoernlund	b6c02ef549	bridge: Netlink interface fix. This commit is correcting NETLINK br_fill_ifinfo() to be able to handle 'filter_mask' with multiple flags asserted. Fixes: `36a8e8e265` ("bridge: Extend br_fill_ifinfo to return MPR status") Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:05:07 -07:00
David S. Miller	8b0308fe31	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-05 18:40:01 -07:00
Taehee Yoo	eff7423365	net: core: introduce struct netdev_nested_priv for nested interface infrastructure Functions related to nested interface infrastructure such as netdev_walk_all_{ upper \| lower }_dev() pass both private functions and "data" pointer to handle their own things. At this point, the data pointer type is void *. In order to make it easier to expand common variables and functions, this new netdev_nested_priv structure is added. In the following patch, a new member variable will be added into this struct to fix the lockdep issue. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-28 15:00:15 -07:00
Nikolay Aleksandrov	f2f3729fb6	net: bridge: fdb: don't flush ext_learn entries When a user-space software manages fdb entries externally it should set the ext_learn flag which marks the fdb entry as externally managed and avoids expiring it (they're treated as static fdbs). Unfortunately on events where fdb entries are flushed (STP down, netlink fdb flush etc) these fdbs are also deleted automatically by the bridge. That in turn causes trouble for the managing user-space software (e.g. in MLAG setups we lose remote fdb entries on port flaps). These entries are completely externally managed so we should avoid automatically deleting them, the only exception are offloaded entries (i.e. BR_FDB_ADDED_BY_EXT_LEARN + BR_FDB_OFFLOADED). They are flushed as before. Fixes: `eb100e0e24` ("net: bridge: allow to add externally learned entries from user-space") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-28 12:47:43 -07:00
Nikolay Aleksandrov	7470558240	net: bridge: mcast: remove only S,G port groups from sg_port hash We should remove a group from the sg_port hash only if it's an S,G entry. This makes it correct and more symmetric with group add. Also since *,G groups are not added to that hash we can hide a bug. Fixes: `085b53c8be` ("net: bridge: mcast: add sg_port rhashtable") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-25 16:50:19 -07:00
Nikolay Aleksandrov	36cfec7359	net: bridge: mcast: when forwarding handle filter mode and blocked flag We need to avoid forwarding to ports in MCAST_INCLUDE filter mode when the mdst entry is a *,G or when the port has the blocked flag. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:35 -07:00
Nikolay Aleksandrov	094b82fd53	net: bridge: mcast: handle host state Since host joins are considered as EXCLUDE {} joins we need to reflect that in all of *,G ports' S,G entries. Since the S,Gs can have host_joined == true only set automatically we can safely set it to false when removing all automatically added entries upon S,G delete. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	9116ffbf1d	net: bridge: mcast: add support for blocked port groups When excluding S,G entries we need a way to block a particular S,G,port. The new port group flag is managed based on the source's timer as per RFCs 3376 and 3810. When a source expires and its port group is in EXCLUDE mode, it will be blocked. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	8266a0491e	net: bridge: mcast: handle port group filter modes We need to handle group filter mode transitions and initial state. To change a port group's INCLUDE -> EXCLUDE mode (or when we have added a new port group in EXCLUDE mode) we need to add that port to all of ,G ports' S,G entries for proper replication. When the EXCLUDE state is changed from IGMPv3 report, br_multicast_fwd_filter_exclude() must be called after the source list processing because the assumption is that all of the group's S,G entries will be created before transitioning to EXCLUDE mode, i.e. most importantly its blocked entries will already be added so it will not get automatically added to them. The transition EXCLUDE -> INCLUDE happens only when a port group timer expires, it requires us to remove that port from all of ,G ports' S,G entries where it was automatically added previously. Finally when we are adding a new S,G entry we must add all of ,G's EXCLUDE ports to it. In order to distinguish automatically added ,G EXCLUDE ports we have a new port group flag - MDB_PG_FLAGS_STAR_EXCL. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	b08123684b	net: bridge: mcast: install S,G entries automatically based on reports This patch adds support for automatic install of S,G mdb entries based on the port group's source list and the source entry's timer. Once installed the S,G will be used when forwarding packets if the approprate multicast/mld versions are set. A new source flag called BR_SGRP_F_INSTALLED denotes if the source has a forwarding mdb entry installed. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	085b53c8be	net: bridge: mcast: add sg_port rhashtable To speedup S,G forward handling we need to be able to quickly find out if a port is a member of an S,G group. To do that add a global S,G port rhashtable with key: source addr, group addr, protocol, vid (all br_ip fields) and port pointer. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	8f8cb77e0b	net: bridge: mcast: add rt_protocol field to the port group struct We need to be able to differentiate between pg entries created by user-space and the kernel when we start generating S,G entries for IGMPv3/MLDv2's fast path. User-space entries are created by default as RTPROT_STATIC and the kernel entries are RTPROT_KERNEL. Later we can allow user-space to provide the entry rt_protocol so we can differentiate between who added the entries specifically (e.g. clag, admin, frr etc). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	7d07a68c25	net: bridge: mcast: when igmpv3/mldv2 are enabled lookup (S,G) first, then (,G) If (S,G) entries are enabled (igmpv3/mldv2) then look them up first. If there isn't a present (S,G) entry then try to find (,G). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	88d4bd1804	net: bridge: mdb: add support for add/del/dump of entries with source Add new mdb attributes (MDBE_ATTR_SOURCE for setting, MDBA_MDB_EATTR_SOURCE for dumping) to allow add/del and dump of mdb entries with a source address (S,G). New S,G entries are created with filter mode of MCAST_INCLUDE. The same attributes are used for IPv4 and IPv6, they're validated and parsed based on their protocol. S,G host joined entries which are added by user are not allowed yet. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	9c4258c78a	net: bridge: mdb: add support to extend add/del commands Since the MDB add/del code expects an exact struct br_mdb_entry we can't really add any extensions, thus add a new nested attribute at the level of MDBA_SET_ENTRY called MDBA_SET_ENTRY_ATTRS which will be used to pass all new options via netlink attributes. This patch doesn't change anything functionally since the new attribute is not used yet, only parsed. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	eab3227b12	net: bridge: mcast: rename br_ip's u member to dst Since now we have src in br_ip, u no longer makes sense so rename it to dst. No functional changes. v2: fix build with CONFIG_BATMAN_ADV_MCAST CC: Marek Lindner <mareklindner@neomailbox.ch> CC: Simon Wunderlich <sw@simonwunderlich.de> CC: Antonio Quartulli <a@unstable.cc> CC: Sven Eckelmann <sven@narfation.org> CC: b.a.t.m.a.n@lists.open-mesh.org Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	deb965662d	net: bridge: mcast: use br_ip's src for src groups and querier address Now that we have src and dst in br_ip it is logical to use the src field for the cases where we need to work with a source address such as querier source address and group source address. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	83f7398ea5	net: bridge: mdb: use extack in br_mdb_add() and br_mdb_add_group() Pass and use extack all the way down to br_mdb_add_group(). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	7eea629d07	net: bridge: mdb: move all port and bridge checks to br_mdb_add To avoid doing duplicate device checks and searches (the same were done in br_mdb_add and __br_mdb_add) pass the already found port to __br_mdb_add and pull the bridge's netif_running and enabled multicast checks to br_mdb_add. This would also simplify the future extack errors. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
Nikolay Aleksandrov	2ac95dfe25	net: bridge: mdb: use extack in br_mdb_parse() We can drop the pr_info() calls and just use extack to return a meaningful error to user-space when br_mdb_parse() fails. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-23 13:24:34 -07:00
David S. Miller	3ab0a7a0c3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-22 16:45:34 -07:00
Vladimir Oltean	99f62a7460	net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU When calling the RCU brother of br_vlan_get_pvid(), lockdep warns: ============================= WARNING: suspicious RCU usage 5.9.0-rc3-01631-g13c17acb8e38-dirty #814 Not tainted ----------------------------- net/bridge/br_private.h:1054 suspicious rcu_dereference_protected() usage! Call trace: lockdep_rcu_suspicious+0xd4/0xf8 __br_vlan_get_pvid+0xc0/0x100 br_vlan_get_pvid_rcu+0x78/0x108 The warning is because br_vlan_get_pvid_rcu() calls nbp_vlan_group() which calls rtnl_dereference() instead of rcu_dereference(). In turn, rtnl_dereference() calls rcu_dereference_protected() which assumes operation under an RCU write-side critical section, which obviously is not the case here. So, when the incorrect primitive is used to access the RCU-protected VLAN group pointer, READ_ONCE() is not used, which may cause various unexpected problems. I'm sad to say that br_vlan_get_pvid() and br_vlan_get_pvid_rcu() cannot share the same implementation. So fix the bug by splitting the 2 functions, and making br_vlan_get_pvid_rcu() retrieve the VLAN groups under proper locking annotations. Fixes: `7582f5b70f` ("bridge: add br_vlan_get_pvid_rcu()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-21 17:37:44 -07:00
Randy Dunlap	4bbd026cb9	net: bridge: delete duplicated words Drop repeated words in net/bridge/. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: bridge@lists.linux-foundation.org Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-18 14:12:43 -07:00
Nikolay Aleksandrov	d5bf31ddd8	net: bridge: mcast: don't ignore return value of __grp_src_toex_excl When we're handling TO_EXCLUDE report in EXCLUDE filter mode we should not ignore the return value of __grp_src_toex_excl() as we'll miss sending notifications about group changes. Fixes: `5bf1e00b68` ("net: bridge: mcast: support for IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-16 17:13:25 -07:00
Alexandra Winter	d05e8e68b0	bridge: Add SWITCHDEV_FDB_FLUSH_TO_BRIDGE notifier so the switchdev can notifiy the bridge to flush non-permanent fdb entries for this port. This is useful whenever the hardware fdb of the switchdev is reset, but the netdev and the bridgeport are not deleted. Note that this has the same effect as the IFLA_BRPORT_FLUSH attribute. CC: Jiri Pirko <jiri@resnulli.us> CC: Ivan Vecera <ivecera@redhat.com> CC: Roopa Prabhu <roopa@nvidia.com> CC: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-15 13:21:47 -07:00
Ido Schimmel	12913f7459	bridge: mcast: Fix incomplete MDB dump Each MDB entry is encoded in a nested netlink attribute called 'MDBA_MDB_ENTRY'. In turn, this attribute contains another nested attributed called 'MDBA_MDB_ENTRY_INFO', which encodes a single port group entry within the MDB entry. The cited commit added the ability to restart a dump from a specific port group entry. However, on failure to add a port group entry to the dump the entire MDB entry (stored in 'nest2') is removed, resulting in missing port group entries. Fix this by finalizing the MDB entry with the partial list of already encoded port group entries. Fixes: `5205e919c9` ("net: bridge: mcast: add support for src list and filter mode dumping") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-11 14:49:47 -07:00
David S. Miller	d85427e3c8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT, from Michael Zhou. 2) do_ip_vs_set_ctl() dereferences uninitialized value, from Peilin Ye. 3) Support for userdata in tables, from Jose M. Guisado. 4) Do not increment ct error and invalid stats at the same time, from Florian Westphal. 5) Remove ct ignore stats, also from Florian. 6) Add ct stats for clash resolution, from Florian Westphal. 7) Bump reference counter bump on ct clash resolution only, this is safe because bucket lock is held, again from Florian. 8) Use ip_is_fragment() in xt_HMARK, from YueHaibing. 9) Add wildcard support for nft_socket, from Balazs Scheidler. 10) Remove superfluous IPVS dependency on iptables, from Yaroslav Bolyukin. 11) Remove unused definition in ebt_stp, from Wang Hai. 12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT in selftests/net, from Fabian Frederick. 13) Add userdata support for nft_object, from Jose M. Guisado. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-09 11:21:19 -07:00
Nikolay Aleksandrov	071445c605	net: bridge: mcast: fix unused br var when lockdep isn't defined Stephen reported the following warning: net/bridge/br_multicast.c: In function 'br_multicast_find_port': net/bridge/br_multicast.c:1818:21: warning: unused variable 'br' [-Wunused-variable] 1818 \| struct net_bridge *br = mp->br; \| ^~ It happens due to bridge's mlock_dereference() when lockdep isn't defined. Silence the warning by annotating the variable as __maybe_unused. Fixes: `0436862e41` ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-08 20:11:57 -07:00
Wang Hai	36c3be8a2c	netfilter: ebt_stp: Remove unused macro BPDU_TYPE_TCN BPDU_TYPE_TCN is never used after it was introduced. So better to remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-09-08 12:56:38 +02:00
Nikolay Aleksandrov	e12cec65b5	net: bridge: mcast: destroy all entries via gc Since each entry type has timers that can be running simultaneously we need to make sure that entries are not freed before their timers have finished. In order to do that generalize the src gc work to mcast gc work and use a callback to free the entries (mdb, port group or src). v3: add IPv6 support v2: force mcast gc on port del to make sure all port group timers have finished before freeing the bridge port Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov	23550b8313	net: bridge: mcast: improve IGMPv3/MLDv2 query processing When an IGMPv3/MLDv2 query is received and we're operating in such mode then we need to avoid updating group timers if the suppress flag is set. Also we should update only timers for groups in exclude mode. v3: add IPv6/MLDv2 support Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov	109865fe12	net: bridge: mcast: support for IGMPV3/MLDv2 BLOCK_OLD_SOURCES report We already have all necessary helpers, so process IGMPV3/MLDv2 BLOCK_OLD_SOURCES as per the RFCs. v3: add IPv6/MLDv2 support v2: directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov	5bf1e00b68	net: bridge: mcast: support for IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report In order to process IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report types we need new helpers which allow us to mark entries based on their timer state and to query only marked entries. v3: add IPv6/MLDv2 support, fix other_query checks v2: directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov	e6231bca6a	net: bridge: mcast: support for IGMPV3/MLDv2 MODE_IS_INCLUDE/EXCLUDE report In order to process IGMPV3/MLDv2_MODE_IS_INCLUDE/EXCLUDE report types we need some new helpers which allow us to set/clear flags for all current entries and later delete marked entries after the report sources have been processed. v3: add IPv6/MLDv2 support v2: drop flag helpers and directly do flag bit operations Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-09-07 13:16:35 -07:00

... 2 3 4 5 6 ...

2388 Коммитов