author    Jakub Kicinski <kuba@kernel.org>    2025-09-11 19:02:53 -0700
committer Jakub Kicinski <kuba@kernel.org>    2025-09-11 19:02:53 -0700
commit    bf2650d0599c999cf2fdffa721b00e25989918fb (patch)
tree      c8a9ac70ba6bc098ee0aadca33511b087ab2ba45 /net
parent    net: devmem: expose tcp_recvmsg_locked errors (diff)
parent    selftests: forwarding: Add test for BR_BOOLOPT_FDB_LOCAL_VLAN_0 (diff)
Merge branch 'bridge-allow-keeping-local-fdb-entries-only-on-vlan-0'
Petr Machata says:

====================
bridge: Allow keeping local FDB entries only on VLAN 0

The bridge FDB contains one local entry per port per VLAN, for the MAC of the port in question, and likewise for the bridge itself. This allows the bridge to locally receive and punt "up" any packets whose destination MAC address matches that of one of the bridge interfaces or of the bridge itself.

The number of these local "service" FDB entries grows linearly with the number of bridge-global VLAN memberships, but that in turn tends to grow quadratically with the number of ports and per-port VLAN memberships. While this does not cause issues during forwarding lookups, it does make dumps impractically slow.

As an example, with 100 interfaces, each on 4K VLANs, a full dump of an FDB that contains just these 400K local entries takes 6.5 s. That is _without_ considering iproute2 formatting overhead; this is just how long it takes to walk the FDB (repeatedly), serialize it into netlink messages, and parse the messages back in userspace. This illustrates that with a growing number of ports and VLANs, the time required to dump this repetitive information blows up. Arguably 4K VLANs per interface is not a very realistic configuration, but modern switches can instead have several hundred interfaces, and we have fielded requests for >1K VLAN memberships per port among customers.

FDB entries are currently all kept on a single linked list, and dumping walks this list and dumps the entries in order. When the message buffer is full, the iteration is cut short and later restarted. Of course, to restart the iteration, it is first necessary to walk the already-dumped front part of the list before dumping can begin again. So one possibility is to organize the FDB entries in a different structure more amenable to walk restarts.

One option is to walk the hash table directly. The advantage is that no auxiliary structure needs to be introduced.
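The restart cost of the linked-list walk described above can be modeled with a small userspace sketch (all names here are illustrative, not kernel code): each batch must first re-walk the already-dumped prefix, so total work grows roughly as n²/(2·batch) node visits rather than n.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative model of the bridge FDB dump: entries on one linked
 * list, dumped in fixed-size batches; each restart re-walks the
 * already-dumped prefix. Hypothetical sketch, not the kernel code. */
struct entry { struct entry *next; };

/* Dump the whole list in batches of `batch`; return total node visits. */
static long dump_visits(struct entry *head, long batch)
{
    long done = 0, visits = 0;

    for (;;) {
        struct entry *e = head;
        long skip, in_batch = 0;

        /* Re-walk the already-dumped prefix to find the cursor. */
        for (skip = 0; skip < done; skip++) {
            visits++;
            e = e->next;
        }
        if (!e)
            return visits;
        /* "Serialize" one message buffer's worth of entries. */
        while (e && in_batch < batch) {
            visits++;
            in_batch++;
            e = e->next;
        }
        done += in_batch;
    }
}

static struct entry *make_list(long n)
{
    struct entry *head = NULL;

    while (n--) {
        struct entry *e = malloc(sizeof(*e));
        e->next = head;
        head = e;
    }
    return head;
}
```

For 1000 entries in batches of 100, the model does 6500 visits versus 2000 when everything fits in one batch; at the 400K-entry scale from the cover letter the prefix re-walks dominate.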
With a rough sketch of this approach, the above scenario gets dumped in not quite 3 s, saving over 50 % of the time. However, hash table iteration requires maintaining an active cursor that must be collected when the dump is aborted. It looks like that would require changes in the NDO protocol to allow running this cleanup. Moreover, on a hash table resize the iteration is simply restarted. FDB dumps are currently not guaranteed to correspond to any one particular state: entries can be missed, or be duplicated. But with hash table iteration we would get that plus the much less graceful resize behavior, where swaths of the FDB are duplicated.

Another option is to maintain the FDB entries in a red-black tree. We have a PoC of this approach on hand, and the above scenario is dumped in about 2.5 s. Still not as snappy as we would like, but better than the hash table. However, the savings come at the expense of more expensive insertion, and require locking during dumps, which blocks insertion.

The upside of these approaches is that they provide benefits whatever the FDB contents. But it does not seem like either of these is workable. However, we intend to clean up the RB tree PoC and present it for consideration later on, in case the trade-offs are considered acceptable.

Yet another option might be to use in-kernel FDB filtering, and to filter out the local entries when dumping. Unfortunately, this does not help all that much either, because the linked-list walk still needs to happen. Also, with the obvious filtering interface built around ndm_flags / ndm_state filtering, one cannot exclude pure local entries in one query. One needs to dump all non-local entries first, and then get the permanent entries in another run that filters local & added_by_user. I.e. one needs to pay the iteration overhead twice, and then integrate the results in userspace. To get significant savings, one would need a very specific knob like "dump, but skip/only include local entries".
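The two-pass filtering workaround can be sketched as predicates over the entry flag bits (a userspace model; the flag names mirror the bridge's BR_FDB_LOCAL / BR_FDB_ADDED_BY_USER bits, but nothing here is kernel code):

```c
#include <stdbool.h>

/* Illustrative model of flag-based FDB dump filtering. */
#define F_LOCAL          (1u << 0)
#define F_ADDED_BY_USER  (1u << 1)

/* A "pure local" service entry is local and not user-added. */
static bool pure_local(unsigned int flags)
{
    return (flags & F_LOCAL) && !(flags & F_ADDED_BY_USER);
}

/* Pass 1 of the workaround: dump everything that is not local. */
static bool pass_non_local(unsigned int flags)
{
    return !(flags & F_LOCAL);
}

/* Pass 2: dump entries that are local but were added by the user. */
static bool pass_user_local(unsigned int flags)
{
    return (flags & F_LOCAL) && (flags & F_ADDED_BY_USER);
}

/* The union of the two passes is exactly "not pure local" -- but each
 * pass still pays the full linked-list walk, hence twice the cost. */
static bool two_pass_selects(unsigned int flags)
{
    return pass_non_local(flags) || pass_user_local(flags);
}
```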
But if we are adding a local-specific knob, maybe let's have an option to just not duplicate these entries in the first place. All this FDB duplication is there merely to make things snappy during forwarding. But high-radix switches with thousands of VLANs typically do not process much traffic in the SW datapath at all; they rather offload the vast majority of it. So we could exchange some of the runtime performance for a neater FDB.

To that end, this patchset introduces a new bridge option, BR_BOOLOPT_FDB_LOCAL_VLAN_0, which, when enabled, has local FDB entries installed only on VLAN 0, instead of duplicating them across all VLANs. Then, to maintain the local termination behavior, on an FDB miss the bridge does a second lookup on VLAN 0.

Enabling this option changes the bridge behavior in expected ways. Since the entries are only kept on VLAN 0, FDB get, flush and dump will not perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects forwarding on all VLANs.

This patchset is loosely based on a privately circulated patch by Nikolay Aleksandrov.

The patchset progresses as follows:

- Patch #1 introduces a bridge option to enable the above feature. Then patches #2 to #5 gradually patch the bridge to do the right thing when the option is enabled. Finally, patch #6 adds the UAPI knob and the code for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest libraries.
- Patch #10 contains a new selftest.
====================

Link: https://patch.msgid.link/cover.1757004393.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
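The fallback lookup added in the br_input.c hunk below can be modeled in userspace: on a miss for (mac, vid), retry on VLAN 0, but accept the hit only if it is a pure local entry (BR_FDB_LOCAL set, BR_FDB_ADDED_BY_USER clear). The flat-array table here is purely illustrative (the kernel uses an rhashtable):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace sketch of the FDB_LOCAL_VLAN_0 lookup fallback. */
#define F_LOCAL          (1u << 0)
#define F_ADDED_BY_USER  (1u << 1)

struct fdb_entry {
    unsigned char mac[6];
    unsigned short vid;
    unsigned int flags;
};

static const struct fdb_entry *
fdb_find(const struct fdb_entry *tbl, size_t n,
         const unsigned char *mac, unsigned short vid)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].vid == vid && !memcmp(tbl[i].mac, mac, 6))
            return &tbl[i];
    return NULL;
}

/* Lookup with the VLAN 0 fallback enabled: mirrors the flag checks in
 * the br_handle_frame_finish() hunk. */
static const struct fdb_entry *
fdb_find_fallback(const struct fdb_entry *tbl, size_t n,
                  const unsigned char *mac, unsigned short vid)
{
    const struct fdb_entry *dst = fdb_find(tbl, n, mac, vid);

    if (!dst && vid) {
        dst = fdb_find(tbl, n, mac, 0);
        /* Only pure local entries may terminate via VLAN 0. */
        if (dst && (!(dst->flags & F_LOCAL) ||
                    (dst->flags & F_ADDED_BY_USER)))
            dst = NULL;
    }
    return dst;
}
```

A user-added entry on VLAN 0 is deliberately rejected by the fallback: it only stands in for the duplicated local service entries, not for arbitrary VLAN 0 state.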
Diffstat (limited to 'net')
-rw-r--r--  net/bridge/br.c          22
-rw-r--r--  net/bridge/br_fdb.c     114
-rw-r--r--  net/bridge/br_input.c     8
-rw-r--r--  net/bridge/br_private.h   3
-rw-r--r--  net/bridge/br_vlan.c     10
5 files changed, 147 insertions, 10 deletions
diff --git a/net/bridge/br.c b/net/bridge/br.c
index c683baa3847f..512872a2ef81 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -259,6 +259,23 @@ static struct notifier_block br_switchdev_blocking_notifier = {
.notifier_call = br_switchdev_blocking_event,
};
+static int
+br_toggle_fdb_local_vlan_0(struct net_bridge *br, bool on,
+ struct netlink_ext_ack *extack)
+{
+ int err;
+
+ if (br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0) == on)
+ return 0;
+
+ err = br_fdb_toggle_local_vlan_0(br, on, extack);
+ if (err)
+ return err;
+
+ br_opt_toggle(br, BROPT_FDB_LOCAL_VLAN_0, on);
+ return 0;
+}
+
/* br_boolopt_toggle - change user-controlled boolean option
*
* @br: bridge device
@@ -287,6 +304,9 @@ int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, bool on,
case BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION:
br_opt_toggle(br, BROPT_MDB_OFFLOAD_FAIL_NOTIFICATION, on);
break;
+ case BR_BOOLOPT_FDB_LOCAL_VLAN_0:
+ err = br_toggle_fdb_local_vlan_0(br, on, extack);
+ break;
default:
/* shouldn't be called with unsupported options */
WARN_ON(1);
@@ -307,6 +327,8 @@ int br_boolopt_get(const struct net_bridge *br, enum br_boolopt_id opt)
return br_opt_get(br, BROPT_MST_ENABLED);
case BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION:
return br_opt_get(br, BROPT_MDB_OFFLOAD_FAIL_NOTIFICATION);
+ case BR_BOOLOPT_FDB_LOCAL_VLAN_0:
+ return br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0);
default:
/* shouldn't be called with unsupported options */
WARN_ON(1);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 902694c0ce64..58d22e2b85fc 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -459,6 +459,9 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
struct net_bridge_fdb_entry *f;
struct net_bridge *br = p->br;
struct net_bridge_vlan *v;
+ bool local_vlan_0;
+
+ local_vlan_0 = br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0);
spin_lock_bh(&br->hash_lock);
vg = nbp_vlan_group(p);
@@ -468,11 +471,11 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
/* delete old one */
fdb_delete_local(br, p, f);
- /* if this port has no vlan information
- * configured, we can safely be done at
- * this point.
+ /* if this port has no vlan information configured, or
+ * local entries are only kept on VLAN 0, we can safely
+ * be done at this point.
*/
- if (!vg || !vg->num_vlans)
+ if (!vg || !vg->num_vlans || local_vlan_0)
goto insert;
}
}
@@ -481,7 +484,7 @@ insert:
/* insert new address, may fail if invalid address or dup. */
fdb_add_local(br, p, newaddr, 0);
- if (!vg || !vg->num_vlans)
+ if (!vg || !vg->num_vlans || local_vlan_0)
goto done;
/* Now add entries for every VLAN configured on the port.
@@ -500,6 +503,9 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
struct net_bridge_vlan_group *vg;
struct net_bridge_fdb_entry *f;
struct net_bridge_vlan *v;
+ bool local_vlan_0;
+
+ local_vlan_0 = br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0);
spin_lock_bh(&br->hash_lock);
@@ -511,7 +517,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
fdb_add_local(br, NULL, newaddr, 0);
vg = br_vlan_group(br);
- if (!vg || !vg->num_vlans)
+ if (!vg || !vg->num_vlans || local_vlan_0)
goto out;
/* Now remove and add entries for every VLAN configured on the
* bridge. This function runs under RTNL so the bitmap will not
@@ -576,6 +582,102 @@ void br_fdb_cleanup(struct work_struct *work)
mod_delayed_work(system_long_wq, &br->gc_work, work_delay);
}
+static void br_fdb_delete_locals_per_vlan_port(struct net_bridge *br,
+ struct net_bridge_port *p)
+{
+ struct net_bridge_vlan_group *vg;
+ struct net_bridge_vlan *v;
+ struct net_device *dev;
+
+ if (p) {
+ vg = nbp_vlan_group(p);
+ dev = p->dev;
+ } else {
+ vg = br_vlan_group(br);
+ dev = br->dev;
+ }
+
+ list_for_each_entry(v, &vg->vlan_list, vlist)
+ br_fdb_find_delete_local(br, p, dev->dev_addr, v->vid);
+}
+
+static void br_fdb_delete_locals_per_vlan(struct net_bridge *br)
+{
+ struct net_bridge_port *p;
+
+ ASSERT_RTNL();
+
+ list_for_each_entry(p, &br->port_list, list)
+ br_fdb_delete_locals_per_vlan_port(br, p);
+
+ br_fdb_delete_locals_per_vlan_port(br, NULL);
+}
+
+static int br_fdb_insert_locals_per_vlan_port(struct net_bridge *br,
+ struct net_bridge_port *p,
+ struct netlink_ext_ack *extack)
+{
+ struct net_bridge_vlan_group *vg;
+ struct net_bridge_vlan *v;
+ struct net_device *dev;
+ int err;
+
+ if (p) {
+ vg = nbp_vlan_group(p);
+ dev = p->dev;
+ } else {
+ vg = br_vlan_group(br);
+ dev = br->dev;
+ }
+
+ list_for_each_entry(v, &vg->vlan_list, vlist) {
+ if (!br_vlan_should_use(v))
+ continue;
+
+ err = br_fdb_add_local(br, p, dev->dev_addr, v->vid);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int br_fdb_insert_locals_per_vlan(struct net_bridge *br,
+ struct netlink_ext_ack *extack)
+{
+ struct net_bridge_port *p;
+ int err;
+
+ ASSERT_RTNL();
+
+ list_for_each_entry(p, &br->port_list, list) {
+ err = br_fdb_insert_locals_per_vlan_port(br, p, extack);
+ if (err)
+ goto rollback;
+ }
+
+ err = br_fdb_insert_locals_per_vlan_port(br, NULL, extack);
+ if (err)
+ goto rollback;
+
+ return 0;
+
+rollback:
+ NL_SET_ERR_MSG_MOD(extack, "fdb_local_vlan_0 toggle: FDB entry insertion failed");
+ br_fdb_delete_locals_per_vlan(br);
+ return err;
+}
+
+int br_fdb_toggle_local_vlan_0(struct net_bridge *br, bool on,
+ struct netlink_ext_ack *extack)
+{
+ if (!on)
+ return br_fdb_insert_locals_per_vlan(br, extack);
+
+ br_fdb_delete_locals_per_vlan(br);
+ return 0;
+}
+
static bool __fdb_flush_matches(const struct net_bridge *br,
const struct net_bridge_fdb_entry *f,
const struct net_bridge_fdb_flush_desc *desc)
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 5f6ac9bf1527..67b4c905e49a 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -202,6 +202,14 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
break;
case BR_PKT_UNICAST:
dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, vid);
+ if (unlikely(!dst && vid &&
+ br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0))) {
+ dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, 0);
+ if (dst &&
+ (!test_bit(BR_FDB_LOCAL, &dst->flags) ||
+ test_bit(BR_FDB_ADDED_BY_USER, &dst->flags)))
+ dst = NULL;
+ }
break;
default:
break;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 8de0904b9627..16be5d250402 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -487,6 +487,7 @@ enum net_bridge_opts {
BROPT_MCAST_VLAN_SNOOPING_ENABLED,
BROPT_MST_ENABLED,
BROPT_MDB_OFFLOAD_FAIL_NOTIFICATION,
+ BROPT_FDB_LOCAL_VLAN_0,
};
struct net_bridge {
@@ -843,6 +844,8 @@ void br_fdb_find_delete_local(struct net_bridge *br,
void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr);
void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr);
void br_fdb_cleanup(struct work_struct *work);
+int br_fdb_toggle_local_vlan_0(struct net_bridge *br, bool on,
+ struct netlink_ext_ack *extack);
void br_fdb_delete_by_port(struct net_bridge *br,
const struct net_bridge_port *p, u16 vid, int do_all);
struct net_bridge_fdb_entry *br_fdb_find_rcu(struct net_bridge *br,
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 939a3aa78d5c..ae911220cb3c 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -331,10 +331,12 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags,
/* Add the dev mac and count the vlan only if it's usable */
if (br_vlan_should_use(v)) {
- err = br_fdb_add_local(br, p, dev->dev_addr, v->vid);
- if (err) {
- br_err(br, "failed insert local address into bridge forwarding table\n");
- goto out_filt;
+ if (!br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0)) {
+ err = br_fdb_add_local(br, p, dev->dev_addr, v->vid);
+ if (err) {
+ br_err(br, "failed insert local address into bridge forwarding table\n");
+ goto out_filt;
+ }
}
vg->num_vlans++;
}