<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/net/openvswitch, branch v6.14</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.14</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.14'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-03-13T09:29:34Z</updated>
<entry>
<title>Revert "openvswitch: switch to per-action label counting in conntrack"</title>
<updated>2025-03-13T09:29:34Z</updated>
<author>
<name>Xin Long</name>
<email>lucien.xin@gmail.com</email>
</author>
<published>2025-03-08T18:05:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1063ae07383c0ddc5bcce170260c143825846b03'/>
<id>urn:sha1:1063ae07383c0ddc5bcce170260c143825846b03</id>
<content type='text'>
Currently, ovs_ct_set_labels() is only called for confirmed conntrack
entries (ct) within ovs_ct_commit(). However, if the conntrack entry
does not have the labels_ext extension, attempting to allocate it in
ovs_ct_get_conn_labels() for a confirmed entry triggers a warning in
nf_ct_ext_add():

  WARN_ON(nf_ct_is_confirmed(ct));

This happens when the conntrack entry is created externally before OVS
increments net-&gt;ct.labels_used. The issue has become more likely since
commit fcb1aa5163b1 ("openvswitch: switch to per-action label counting
in conntrack"), which changed to use per-action label counting and
increment net-&gt;ct.labels_used when a flow with ct action is added.

Since there’s no straightforward way to fully resolve this issue at the
moment, this reverts the commit to avoid breaking existing use cases.

Fixes: fcb1aa5163b1 ("openvswitch: switch to per-action label counting in conntrack")
Reported-by: Jianbo Liu &lt;jianbol@nvidia.com&gt;
Signed-off-by: Xin Long &lt;lucien.xin@gmail.com&gt;
Acked-by: Aaron Conole &lt;aconole@redhat.com&gt;
Link: https://patch.msgid.link/1bdeb2f3a812bca016a225d3de714427b2cd4772.1741457143.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>net: openvswitch: remove misbehaving actions length check</title>
<updated>2025-03-13T09:29:18Z</updated>
<author>
<name>Ilya Maximets</name>
<email>i.maximets@ovn.org</email>
</author>
<published>2025-03-08T00:45:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a1e64addf3ff9257b45b78bc7d743781c3f41340'/>
<id>urn:sha1:a1e64addf3ff9257b45b78bc7d743781c3f41340</id>
<content type='text'>
The actions length check is unreliable and produces different results
depending on the initial length of the provided netlink attribute and
the composition of the actual actions inside of it.  For example, a
user can add 4088 empty clone() actions without triggering -EMSGSIZE,
on attempt to add 4089 such actions the operation will fail with the
-EMSGSIZE verdict.  However, if another 16 KB of other actions will
be *appended* to the previous 4089 clone() actions, the check passes
and the flow is successfully installed into the openvswitch datapath.

The reason for a such a weird behavior is the way memory is allocated.
When ovs_flow_cmd_new() is invoked, it calls ovs_nla_copy_actions(),
that in turn calls nla_alloc_flow_actions() with either the actual
length of the user-provided actions or the MAX_ACTIONS_BUFSIZE.  The
function adds the size of the sw_flow_actions structure and then the
actually allocated memory is rounded up to the closest power of two.

So, if the user-provided actions are larger than MAX_ACTIONS_BUFSIZE,
then MAX_ACTIONS_BUFSIZE + sizeof(*sfa) rounded up is 32K + 24 -&gt; 64K.
Later, while copying individual actions, we look at ksize(), which is
64K, so this way the MAX_ACTIONS_BUFSIZE check is not actually
triggered and the user can easily allocate almost 64 KB of actions.

However, when the initial size is less than MAX_ACTIONS_BUFSIZE, but
the actions contain ones that require size increase while copying
(such as clone() or sample()), then the limit check will be performed
during the reserve_sfa_size() and the user will not be allowed to
create actions that yield more than 32 KB internally.

This is one part of the problem.  The other part is that it's not
actually possible for the userspace application to know beforehand
if the particular set of actions will be rejected or not.

Certain actions require more space in the internal representation,
e.g. an empty clone() takes 4 bytes in the action list passed in by
the user, but it takes 12 bytes in the internal representation due
to an extra nested attribute, and some actions require less space in
the internal representations, e.g. set(tunnel(..)) normally takes
64+ bytes in the action list provided by the user, but only needs to
store a single pointer in the internal implementation, since all the
data is stored in the tunnel_info structure instead.

And the action size limit is applied to the internal representation,
not to the action list passed by the user.  So, it's not possible for
the userpsace application to predict if the certain combination of
actions will be rejected or not, because it is not possible for it to
calculate how much space these actions will take in the internal
representation without knowing kernel internals.

All that is causing random failures in ovs-vswitchd in userspace and
inability to handle certain traffic patterns as a result.  For example,
it is reported that adding a bit more than a 1100 VMs in an OpenStack
setup breaks the network due to OVS not being able to handle ARP
traffic anymore in some cases (it tries to install a proper datapath
flow, but the kernel rejects it with -EMSGSIZE, even though the action
list isn't actually that large.)

Kernel behavior must be consistent and predictable in order for the
userspace application to use it in a reasonable way.  ovs-vswitchd has
a mechanism to re-direct parts of the traffic and partially handle it
in userspace if the required action list is oversized, but that doesn't
work properly if we can't actually tell if the action list is oversized
or not.

Solution for this is to check the size of the user-provided actions
instead of the internal representation.  This commit just removes the
check from the internal part because there is already an implicit size
check imposed by the netlink protocol.  The attribute can't be larger
than 64 KB.  Realistically, we could reduce the limit to 32 KB, but
we'll be risking to break some existing setups that rely on the fact
that it's possible to create nearly 64 KB action lists today.

Vast majority of flows in real setups are below 100-ish bytes.  So
removal of the limit will not change real memory consumption on the
system.  The absolutely worst case scenario is if someone adds a flow
with 64 KB of empty clone() actions.  That will yield a 192 KB in the
internal representation consuming 256 KB block of memory.  However,
that list of actions is not meaningful and also a no-op.  Real world
very large action lists (that can occur for a rare cases of BUM
traffic handling) are unlikely to contain a large number of clones and
will likely have a lot of tunnel attributes making the internal
representation comparable in size to the original action list.
So, it should be fine to just remove the limit.

Commit in the 'Fixes' tag is the first one that introduced the
difference between internal representation and the user-provided action
lists, but there were many more afterwards that lead to the situation
we have today.

Fixes: 7d5437c709de ("openvswitch: Add tunneling interface.")
Signed-off-by: Ilya Maximets &lt;i.maximets@ovn.org&gt;
Reviewed-by: Aaron Conole &lt;aconole@redhat.com&gt;
Link: https://patch.msgid.link/20250308004609.2881861-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>openvswitch: use RCU protection in ovs_vport_cmd_fill_info()</title>
<updated>2025-02-11T02:09:09Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2025-02-07T13:58:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=90b2f49a502fa71090d9f4fe29a2f51fe5dff76d'/>
<id>urn:sha1:90b2f49a502fa71090d9f4fe29a2f51fe5dff76d</id>
<content type='text'>
ovs_vport_cmd_fill_info() can be called without RTNL or RCU.

Use RCU protection and dev_net_rcu() to avoid potential UAF.

Fixes: 9354d4520342 ("openvswitch: reliable interface indentification in port dumps")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Link: https://patch.msgid.link/20250207135841.1948589-6-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>openvswitch: fix lockup on tx to unregistering netdev with carrier</title>
<updated>2025-01-11T02:20:49Z</updated>
<author>
<name>Ilya Maximets</name>
<email>i.maximets@ovn.org</email>
</author>
<published>2025-01-09T12:21:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=47e55e4b410f7d552e43011baa5be1aab4093990'/>
<id>urn:sha1:47e55e4b410f7d552e43011baa5be1aab4093990</id>
<content type='text'>
Commit in a fixes tag attempted to fix the issue in the following
sequence of calls:

    do_output
    -&gt; ovs_vport_send
       -&gt; dev_queue_xmit
          -&gt; __dev_queue_xmit
             -&gt; netdev_core_pick_tx
                -&gt; skb_tx_hash

When device is unregistering, the 'dev-&gt;real_num_tx_queues' goes to
zero and the 'while (unlikely(hash &gt;= qcount))' loop inside the
'skb_tx_hash' becomes infinite, locking up the core forever.

But unfortunately, checking just the carrier status is not enough to
fix the issue, because some devices may still be in unregistering
state while reporting carrier status OK.

One example of such device is a net/dummy.  It sets carrier ON
on start, but it doesn't implement .ndo_stop to set the carrier off.
And it makes sense, because dummy doesn't really have a carrier.
Therefore, while this device is unregistering, it's still easy to hit
the infinite loop in the skb_tx_hash() from the OVS datapath.  There
might be other drivers that do the same, but dummy by itself is
important for the OVS ecosystem, because it is frequently used as a
packet sink for tcpdump while debugging OVS deployments.  And when the
issue is hit, the only way to recover is to reboot.

Fix that by also checking if the device is running.  The running
state is handled by the net core during unregistering, so it covers
unregistering case better, and we don't really need to send packets
to devices that are not running anyway.

While only checking the running state might be enough, the carrier
check is preserved.  The running and the carrier states seem disjoined
throughout the code and different drivers.  And other core functions
like __dev_direct_xmit() check both before attempting to transmit
a packet.  So, it seems safer to check both flags in OVS as well.

Fixes: 066b86787fa3 ("net: openvswitch: fix race on port output")
Reported-by: Friedrich Weber &lt;f.weber@proxmox.com&gt;
Closes: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-January/053423.html
Signed-off-by: Ilya Maximets &lt;i.maximets@ovn.org&gt;
Tested-by: Friedrich Weber &lt;f.weber@proxmox.com&gt;
Reviewed-by: Aaron Conole &lt;aconole@redhat.com&gt;
Link: https://patch.msgid.link/20250109122225.4034688-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: convert to nla_get_*_default()</title>
<updated>2024-11-11T18:32:06Z</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2024-11-08T10:41:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a885a6b2d37eaaae08323583bdb1928c8a2935fc'/>
<id>urn:sha1:a885a6b2d37eaaae08323583bdb1928c8a2935fc</id>
<content type='text'>
Most of the original conversion is from the spatch below,
but I edited some and left out other instances that were
either buggy after conversion (where default values don't
fit into the type) or just looked strange.

    @@
    expression attr, def;
    expression val;
    identifier fn =~ "^nla_get_.*";
    fresh identifier dfn = fn ## "_default";
    @@
    (
    -if (attr)
    -  val = fn(attr);
    -else
    -  val = def;
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  val = def;
    -else
    -  val = fn(attr);
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  return def;
    -return fn(attr);
    +return dfn(attr, def);
    |
    -attr ? fn(attr) : def
    +dfn(attr, def)
    |
    -!attr ? def : fn(attr)
    +dfn(attr, def)
    )

Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Reviewed-by: Toke Høiland-Jørgensen &lt;toke@kernel.org&gt;
Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>openvswitch: Pass on secpath details for internal port rx.</title>
<updated>2024-11-06T01:38:25Z</updated>
<author>
<name>Aaron Conole</name>
<email>aconole@redhat.com</email>
</author>
<published>2024-11-01T20:47:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7d1c2d517f503c63aac3775b51ec96210a6e6ef9'/>
<id>urn:sha1:7d1c2d517f503c63aac3775b51ec96210a6e6ef9</id>
<content type='text'>
Clearing the secpath for internal ports will cause packet drops when
ipsec offload or early SW ipsec decrypt are used.  Systems that rely
on these will not be able to actually pass traffic via openvswitch.

There is still an open issue for a flow miss packet - this is because
we drop the extensions during upcall and there is no facility to
restore such data (and it is non-trivial to add such functionality
to the upcall interface).  That means that when a flow miss occurs,
there will still be packet drops.  With this patch, when a flow is
found then traffic which has an associated xfrm extension will
properly flow.

Signed-off-by: Aaron Conole &lt;aconole@redhat.com&gt;
Acked-by: Eelco Chaudron &lt;echaudro@redhat.com&gt;
Link: https://patch.msgid.link/20241101204732.183840-1-aconole@redhat.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>netdev_features: convert NETIF_F_NETNS_LOCAL to dev-&gt;netns_local</title>
<updated>2024-09-03T09:36:43Z</updated>
<author>
<name>Alexander Lobakin</name>
<email>aleksander.lobakin@intel.com</email>
</author>
<published>2024-08-29T12:33:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=05c1280a2bcfca187fe7fa90bb240602cf54af0a'/>
<id>urn:sha1:05c1280a2bcfca187fe7fa90bb240602cf54af0a</id>
<content type='text'>
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.

Signed-off-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>netdev_features: convert NETIF_F_LLTX to dev-&gt;lltx</title>
<updated>2024-09-03T09:36:43Z</updated>
<author>
<name>Alexander Lobakin</name>
<email>aleksander.lobakin@intel.com</email>
</author>
<published>2024-08-29T12:33:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=00d066a4d4edbe559ba6c35153da71d4b2b8a383'/>
<id>urn:sha1:00d066a4d4edbe559ba6c35153da71d4b2b8a383</id>
<content type='text'>
NETIF_F_LLTX can't be changed via Ethtool and is not a feature,
rather an attribute, very similar to IFF_NO_QUEUE (and hot).
Free one netdev_features_t bit and make it a "hot" private flag.

Signed-off-by: Alexander Lobakin &lt;aleksander.lobakin@intel.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
<entry>
<title>net: openvswitch: Use ERR_CAST() to return</title>
<updated>2024-08-30T18:11:45Z</updated>
<author>
<name>Yan Zhen</name>
<email>yanzhen@vivo.com</email>
</author>
<published>2024-08-29T09:55:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b26b64493343659cce8bbffa358bf39e4f68bdec'/>
<id>urn:sha1:b26b64493343659cce8bbffa358bf39e4f68bdec</id>
<content type='text'>
Using ERR_CAST() is more reasonable and safer, When it is necessary
to convert the type of an error pointer and return it.

Signed-off-by: Yan Zhen &lt;yanzhen@vivo.com&gt;
Acked-by: Eelco Chaudron &lt;echaudro@redhat.com&gt;
Reviewed-by: Aaron Conole &lt;aconole@redhat.com&gt;
Link: https://patch.msgid.link/20240829095509.3151987-1-yanzhen@vivo.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tc: adjust network header after 2nd vlan push</title>
<updated>2024-08-27T09:37:42Z</updated>
<author>
<name>Boris Sukholitko</name>
<email>boris.sukholitko@broadcom.com</email>
</author>
<published>2024-08-22T10:35:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=938863727076f684abb39d1d0f9dce1924e9028e'/>
<id>urn:sha1:938863727076f684abb39d1d0f9dce1924e9028e</id>
<content type='text'>
&lt;tldr&gt;
skb network header of the single-tagged vlan packet continues to point the
vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This
causes problem at the dissector which expects double-tagged packet network
header to point to the inner vlan.

The fix is to adjust network header in tcf_act_vlan.c but requires
refactoring of skb_vlan_push function.
&lt;/tldr&gt;

Consider the following shell script snippet configuring TC rules on the
veth interface:

ip link add veth0 type veth peer veth1
ip link set veth0 up
ip link set veth1 up

tc qdisc add dev veth0 clsact

tc filter add dev veth0 ingress pref 10 chain 0 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
tc filter add dev veth0 ingress pref 20 chain 0 flower \
	num_of_vlans 1 action vlan push id 100 \
	protocol 0x8100 action goto chain 5
tc filter add dev veth0 ingress pref 30 chain 5 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"

Sending double-tagged vlan packet with the IP payload inside:

cat &lt;&lt;ENDS | text2pcap - - | tcpreplay -i veth1 -
0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64   ..........."...d
0010  81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11   ......E..&amp;......
0020  18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12   ................
0030  e1 c7 00 00 00 00 00 00 00 00 00 00               ............
ENDS

will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
the dmesg.

OTOH, sending single-tagged vlan packet:

cat &lt;&lt;ENDS | text2pcap - - | tcpreplay -i veth1 -
0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14   ..........."....
0010  08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00   ..E..*..........
0020  00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00   ................
0030  00 00 00 00 00 00 00 00 00 00 00 00               ............
ENDS

will match rule 20, will push the second vlan tag but will *not* match
rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
pushed by the kernel.

Lets look at  __skb_flow_dissect working on the double-tagged vlan packet.
Here is the relevant code from around net/core/flow_dissector.c:1277
copy-pasted here for convenience:

	if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &amp;&amp;
	    skb &amp;&amp; skb_vlan_tag_present(skb)) {
		proto = skb-&gt;protocol;
	} else {
		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
					    data, hlen, &amp;_vlan);
		if (!vlan) {
			fdret = FLOW_DISSECT_RET_OUT_BAD;
			break;
		}

		proto = vlan-&gt;h_vlan_encapsulated_proto;
		nhoff += sizeof(*vlan);
	}

The "else" clause above gets the protocol of the encapsulated packet from
the skb data at the network header location. printk debugging has showed
that in the good double-tagged packet case proto is
htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet
case proto is garbage leading to the failure to match tc filter 30.

proto is being set from the skb header pointed by nhoff parameter which is
defined at the beginning of __skb_flow_dissect
(net/core/flow_dissector.c:1055 in the current version):

		nhoff = skb_network_offset(skb);

Therefore the culprit seems to be that the skb network offset is different
between double-tagged packet received from the interface and single-tagged
packet having its vlan tag pushed by TC.

Lets look at the interesting points of the lifetime of the single/double
tagged packets as they traverse our packet flow.

Both of them will start at __netif_receive_skb_core where the first vlan
tag will be stripped:

	if (eth_type_vlan(skb-&gt;protocol)) {
		skb = skb_vlan_untag(skb);
		if (unlikely(!skb))
			goto out;
	}

At this stage in double-tagged case skb-&gt;data points to the second vlan tag
while in single-tagged case skb-&gt;data points to the network (eg. IP)
header.

Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
code at tcf_vlan_act (interesting points are in square brackets):

	if (skb_at_tc_ingress(skb))
[1]		skb_push_rcsum(skb, skb-&gt;mac_len);

	....

	case TCA_VLAN_ACT_PUSH:
		err = skb_vlan_push(skb, p-&gt;tcfv_push_proto, p-&gt;tcfv_push_vid |
				    (p-&gt;tcfv_push_prio &lt;&lt; VLAN_PRIO_SHIFT),
				    0);
		if (err)
			goto drop;
		break;

	....

out:
	if (skb_at_tc_ingress(skb))
[3]		skb_pull_rcsum(skb, skb-&gt;mac_len);

And skb_vlan_push (net/core/skbuff.c:6204) function does:

		err = __vlan_insert_tag(skb, skb-&gt;vlan_proto,
					skb_vlan_tag_get(skb));
		if (err)
			return err;

		skb-&gt;protocol = skb-&gt;vlan_proto;
[2]		skb-&gt;mac_len += VLAN_HLEN;

in the case of pushing the second tag. Lets look at what happens with
skb-&gt;data of the single-tagged packet at each of the above points:

1. As a result of the skb_push_rcsum, skb-&gt;data is moved back to the start
   of the packet.

2. First VLAN tag is moved from the skb into packet buffer, skb-&gt;mac_len is
   incremented, skb-&gt;data still points to the start of the packet.

3. As a result of the skb_pull_rcsum, skb-&gt;data is moved forward by the
   modified skb-&gt;mac_len, thus pointing to the network header again.

Then __skb_flow_dissect will get confused by having double-tagged vlan
packet with the skb-&gt;data at the network header.

The solution for the bug is to preserve "skb-&gt;data at second vlan header"
semantics in the skb_vlan_push function. We do this by manipulating
skb-&gt;network_header rather than skb-&gt;mac_len. skb_vlan_push callers are
updated to do skb_reset_mac_len.

Signed-off-by: Boris Sukholitko &lt;boris.sukholitko@broadcom.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;

</content>
</entry>
</feed>
