aboutsummaryrefslogtreecommitdiffstats
path: root/include/uapi (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-07-17ethtool: rss: initial RSS_SET (indirection table handling)Jakub Kicinski1-0/+1
Add initial support for RSS_SET, for now only operations on the indirection table are supported. Unlike the ioctl don't check if at least one parameter is being changed. This is how other ethtool-nl ops behave, so pick the ethtool-nl consistency vs copying ioctl behavior. There are two special cases here: 1) resetting the table to defaults; 2) support for tables of different size. For (1) I use an empty Netlink attribute (array of size 0). (2) may require some background. AFAICT a lot of modern devices allow allocating RSS tables of different sizes. mlx5 can upsize its tables, bnxt has some "table size calculation", and Intel folks asked about RSS table sizing in context of resource allocation in the past. The ethtool IOCTL API has a concept of table size, but right now the user is expected to provide a table exactly the size the device requests. Some drivers may change the table size at runtime (in response to queue count changes) but the user is not in control of this. What's not great is that all RSS contexts share the same table size. For example a device with 128 queues enabled, 16 RSS contexts 8 queues in each will likely have 256 entry tables for each of the 16 contexts, while 32 would be more than enough given each context only has 8 queues. To address this the Netlink API should avoid enforcing table size at the uAPI level, and should allow the user to express the min table size they expect. To fully solve (2) we will need more driver plumbing but at the uAPI level this patch allows the user to specify a table size smaller than what the device advertises. The device table size must be a multiple of the user requested table size. We then replicate the user-provided table to fill the full device size table. This addresses the "allow the user to express the min table size" objective, while not enforcing any fixed size. From Netlink perspective .get_rxfh_indir_size() is now de facto the "max" table size supported by the device. We may choose to support table replication in ethtool, too, when we actually plumb this thru the device APIs. Initially I was considering moving full pattern generation to the kernel (which queues to use, at which frequency and what min sequence length). I don't think this complexity would buy us much and most if not all devices have pow-2 table sizes, which simplifies the replication a lot. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-12/+0
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14tcp: add LINUX_MIB_BEYOND_WINDOWEric Dumazet1-0/+1
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW Incremented when an incoming packet is received beyond the receiver window. nstat -az | grep TcpExtBeyondWindow Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14Add support to set NAPI threaded for individual NAPISamiullah Khawaja1-0/+1
A net device has a threaded sysctl that can be used to enable threaded NAPI polling on all of the NAPI contexts under that device. Allow enabling threaded NAPI polling at individual NAPI level using netlink. Extend the netlink operation `napi-set` and allow setting the threaded attribute of a NAPI. This will enable the threaded polling on a NAPI context. Add a test in `nl_netdev.py` that verifies various cases of threaded NAPI being set at NAPI and at device level. Tested ./tools/testing/selftests/net/nl_netdev.py TAP version 13 1..7 ok 1 nl_netdev.empty_check ok 2 nl_netdev.lo_check ok 3 nl_netdev.page_pool_check ok 4 nl_netdev.napi_list_check ok 5 nl_netdev.dev_set_threaded ok 6 nl_netdev.napi_set_threaded ok 7 nl_netdev.nsim_rxq_reset_down # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14Revert "netfilter: nf_tables: Add notifications for hook changes"Phil Sutter2-12/+0
This reverts commit 465b9ee0ee7bc268d7f261356afd6c4262e48d82. Such notifications fit better into core or nfnetlink_hook code, following the NFNL_MSG_HOOK_GET message format. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-10ethtool: rss: report which fields are configured for hashingJakub Kicinski1-0/+34
Implement ETHTOOL_GRXFH over Netlink. The number of flow types is reasonable (around 20) so report all of them at once for simplicity. Do not maintain the flow ID mapping with ioctl at the uAPI level. This gives us a chance to clean up the confusion that come from RxNFC vs RxFH (flow direction vs hashing) in the ioctl. Try to align with the names used in ethtool CLI, they seem to have stood the test of time just fine. One annoyance is that we still call L4 ports the weird names, but I guess they also apply to IPSec (where they cover the SPI) so it is what it is. $ ynl --family ethtool --dump rss-get { "header": { "dev-index": 1, "dev-name": "enp1s0" }, "hfunc": 1, "hkey": b"...", "indir": [0, 1, ...], "flow-hash": { "ether": {"l2da"}, "ah-esp4": {"ip-src", "ip-dst"}, "ah-esp6": {"ip-src", "ip-dst"}, "ah4": {"ip-src", "ip-dst"}, "ah6": {"ip-src", "ip-dst"}, "esp4": {"ip-src", "ip-dst"}, "esp6": {"ip-src", "ip-dst"}, "ip4": {"ip-src", "ip-dst"}, "ip6": {"ip-src", "ip-dst"}, "sctp4": {"ip-src", "ip-dst"}, "sctp6": {"ip-src", "ip-dst"}, "udp4": {"ip-src", "ip-dst"}, "udp6": {"ip-src", "ip-dst"} "tcp4": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, "tcp6": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, }, } Link: https://patch.msgid.link/20250708220640.2738464-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10ethtool: mark ETHER_FLOW as usable for Rx hashJakub Kicinski1-2/+2
Looks like some drivers (ena, enetc, fbnic.. there's probably more) consider ETHER_FLOW to be legitimate target for flow hashing. I'm not sure how intentional that is from the uAPI perspective vs just an effect of ethtool IOCTL doing minimal input validation. But Netlink will do strict validation, so we need to decide whether we allow this use case or not. I don't see a strong reason against it, and rejecting it would potentially regress a number of drivers. So update the comments and flow_type_hashable(). Link: https://patch.msgid.link/20250708220640.2738464-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge branch 'virtio_udp_tunnel_08_07_2025' of ↵Jakub Kicinski4-0/+54
https://github.com/pabeni/linux-devel Paolo Abeni says: ==================== virtio: introduce GSO over UDP tunnel Some virtualized deployments use UDP tunnel pervasively and are impacted negatively by the lack of GSO support for such kind of traffic in the virtual NIC driver. The virtio_net specification recently introduced support for GSO over UDP tunnel, this series updates the virtio implementation to support such a feature. Currently the kernel virtio support limits the feature space to 64, while the virtio specification allows for a larger number of features. Specifically the GSO-over-UDP-tunnel-related virtio features use bits 65-69. The first four patches in this series rework the virtio and vhost feature support to cope with up to 128 bits. The limit is set by a define and could be easily raised in future, as needed. This implementation choice is aimed at keeping the code churn as limited as possible. For the same reason, only the virtio_net driver is reworked to leverage the extended feature space; all other virtio/vhost drivers are unaffected, but could be upgraded to support the extended features space in a later time. The last four patches bring in the actual GSO over UDP tunnel support. As per specification, some additional fields are introduced into the virtio net header to support the new offload. The presence of such fields depends on the negotiated features. New helpers are introduced to convert the UDP-tunneled skb metadata to an extended virtio net header and vice versa. Such helpers are used by the tun and virtio_net driver to cope with the newly supported offloads. Tested with basic stream transfer with all the possible permutations of host kernel/qemu/guest kernel with/without GSO over UDP tunnel support. ==================== Link: https://patch.msgid.link/cover.1751874094.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+6
Cross-merge networking fixes after downstream PR (net-6.16-rc6). No conflicts. Adjacent changes: Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml 0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible") b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+4
Pull KVM fixes from Paolo Bonzini: "Many patches, pretty much all of them small, that accumulated while I was on vacation. ARM: - Remove the last leftovers of the ill-fated FPSIMD host state mapping at EL2 stage-1 - Fix unexpected advertisement to the guest of unimplemented S2 base granule sizes - Gracefully fail initialising pKVM if the interrupt controller isn't GICv3 - Also gracefully fail initialising pKVM if the carveout allocation fails - Fix the computing of the minimum MMIO range required for the host on stage-2 fault - Fix the generation of the GICv3 Maintenance Interrupt in nested mode x86: - Reject SEV{-ES} intra-host migration if one or more vCPUs are actively being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM - Use a pre-allocated, per-vCPU buffer for handling de-sparsification of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue - Allow out-of-range/invalid Xen event channel ports when configuring IRQ routing, to avoid dictating a specific ioctl() ordering to userspace - Conditionally reschedule when setting memory attributes to avoid soft lockups when userspace converts huge swaths of memory to/from private - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest - Add a missing field in struct sev_data_snp_launch_start that resulted in the guest-visible workarounds field being filled at the wrong offset - Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid VM-Fail on INVVPID - Advertise supported TDX TDVMCALLs to userspace - Pass SetupEventNotifyInterrupt arguments to userspace - Fix TSC frequency underflow" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: avoid underflow when scaling TSC frequency KVM: arm64: Remove kvm_arch_vcpu_run_map_fp() KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes KVM: arm64: Don't free hyp pages with pKVM on GICv2 KVM: arm64: Fix error path in init_hyp_mode() KVM: arm64: Adjust range correctly during host stage-2 faults KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi() KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush KVM: SVM: Add missing member in SNP_LAUNCH_START command structure Documentation: KVM: Fix unexpected unindent warnings KVM: selftests: Add back the missing check of MONITOR/MWAIT availability KVM: Allow CPU to reschedule while setting per-page memory attributes KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table. KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
2025-07-10net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockoptJason Xing1-0/+1
This patch provides a setsockopt method to let applications leverage to adjust how many descs to be handled at most in one send syscall. It mitigates the situation where the default value (32) that is too small leads to higher frequency of triggering send syscall. Considering the prosperity/complexity the applications have, there is no absolutely ideal suggestion fitting all cases. So keep 32 as its default value like before. The patch does the following things: - Add XDP_MAX_TX_SKB_BUDGET socket option. - Set max_tx_budget to 32 by default in the initialization phase as a per-socket granular control. - Set the range of max_tx_budget as [32, xs->tx->nentries]. The idea behind this comes out of real workloads in production. We use a user-level stack with xsk support to accelerate sending packets and minimize triggering syscalls. When the packets are aggregated, it's not hard to hit the upper bound (namely, 32). The moment user-space stack fetches the -EAGAIN error number passed from sendto(), it will loop to try again until all the expected descs from tx ring are sent out to the driver. Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of sendto() and higher throughput/PPS. Here is what I did in production, along with some numbers as follows: For one application I saw lately, I suggested using 128 as max_tx_budget because I saw two limitations without changing any default configuration: 1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind this was I counted how many descs are transmitted to the driver at one time of sendto() based on [1] patch and then I calculated the possibility of hitting the upper bound. Finally I chose 128 as a suitable value because 1) it covers most of the cases, 2) a higher number would not bring evident results. After twisting the parameters, a stable improvement of around 4% for both PPS and throughput and less resources consumption were found to be observed by strace -c -p xxx: 1) %time was decreased by 7.8% 2) error counter was decreased from 18367 to 572 [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08af_unix: Introduce SO_INQ.Kuniyuki Iwashima1-0/+3
We have an application that uses almost the same code for TCP and AF_UNIX (SOCK_STREAM). TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an alternative. Let's introduce the generic version of TCP_INQ. If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that contains the exact value of ioctl(SIOCINQ). The cmsg is also included when msg->msg_get_inq is non-zero to make sockets io_uring-friendly. Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to override setsockopt() for SOL_SOCKET. By having the flag in struct unix_sock, instead of struct sock, we can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq. Note also that supporting custom getsockopt() for SOL_SOCKET will need preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08tun: enable gso over UDP tunnel support.Paolo Abeni1-0/+9
Add new tun features to represent the newly introduced virtio GSO over UDP tunnel offload. Allows detection and selection of such features via the existing TUNSETOFFLOAD ioctl and compute the expected virtio header size and tunnel header offset using the current netdev features, so that we can plug almost seamless the newly introduced virtio helpers to serialize the extended virtio header. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- v6 -> v7: - rebased v4 -> v5: - encapsulate the guest feature guessing in a tun helper - dropped irrelevant check on xdp buff headroom - do not remove unrelated black line - avoid line len > 80 char v3 -> v4: - virtio tnl-related fields are at fixed offset, cleanup the code accordingly. - use netdev features instead of flags bit to check for the configured offload - drop packet in case of enabled features/configured hdr size mismatch v2 -> v3: - cleaned-up uAPI comments - use explicit struct layout instead of raw buf.
2025-07-08net: implement virtio helpers to handle UDP GSO tunneling.Paolo Abeni1-0/+33
The virtio specification are introducing support for GSO over UDP tunnel. This patch brings in the needed defines and the additional virtio hdr parsing/building helpers. The UDP tunnel support uses additional fields in the virtio hdr, and such fields location can change depending on other negotiated features - specifically VIRTIO_NET_F_HASH_REPORT. Try to be as conservative as possible with the new field validation. Existing implementation for plain GSO offloads allow for invalid/ self-contradictory values of such fields. With GSO over UDP tunnel we can be more strict, with no need to deal with legacy implementation. Since the checksum-related field validation is asymmetric in the driver and in the device, introduce a separate helper to implement the new checks (to be used only on the driver side). Note that while the feature space exceeds the 64-bit boundaries, the guest offload space is fixed by the specification of the VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET command to a 64-bit size. Prior to the UDP tunnel GSO support, each guest offload bit corresponded to the feature bit with the same value and vice versa. Due to the limited 'guest offload' space, relevant features in the high 64 bits are 'mapped' to free bits in the lower range. That is simpler than defining a new command (and associated features) to exchange an extended guest offloads set. As a consequence, the uAPIs also specify the mapped guest offload value corresponding to the UDP tunnel GSO features. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> -- v4 -> v5: - avoid lines above 80 chars v3 -> v4: - fixed offset for UDP GSO tunnel, update accordingly the helpers - tried to clarified vlan_hlen semantic - virtio_net_chk_data_valid() -> virtio_net_handle_csum_offload() v2 -> v3: - add definitions for possible vnet hdr layouts with tunnel support v1 -> v2: - 'relay' -> 'rely' typo - less unclear comment WRT enforced inner GSO checks - inner header fields are allowed only with 'modern' virtio, thus are always le - clarified in the commit message the need for 'mapped features' defines - assume little_endian is true when UDP GSO is enabled. - fix inner proto type value
2025-07-08vhost-net: allow configuring extended featuresPaolo Abeni2-0/+12
Use the extended feature type for 'acked_features' and implement two new ioctls operation allowing the user-space to set/query an unbounded amount of features. The actual number of processed features is limited by VIRTIO_FEATURES_MAX and attempts to set features above such limit fail with EOPNOTSUPP. Note that: the legacy ioctls implicitly truncate the negotiated features to the lower 64 bits range and the 'acked_backend_features' field don't need conversion, as the only negotiated feature there is in the low 64 bit range. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again (2)Thomas Weißschuh1-2/+2
BITS_PER_LONG does not exist in UAPI headers, so can't be used by the UAPI __GENMASK(). Instead __BITS_PER_LONG needs to be used. When __GENMASK() was introduced in commit 3c7a8e190bc5 ("uapi: introduce uapi-friendly macros for GENMASK"), the code was fine. A broken revert in 1e7933a575ed ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") introduced the incorrect usage of BITS_PER_LONG. That was fixed in commit 11fcf368506d ("uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again"). But a broken sync of the kernel headers with the tools/ headers in commit fc92099902fb ("tools headers: Synchronize linux/bits.h with the kernel sources") undid the fix. Reapply the fix and while at it also fix the tools header. Fixes: fc92099902fb ("tools headers: Synchronize linux/bits.h with the kernel sources") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2025-07-08net/handshake: Add new parameter 'HANDSHAKE_A_ACCEPT_KEYRING'Hannes Reinecke1-0/+1
Add a new netlink parameter 'HANDSHAKE_A_ACCEPT_KEYRING' to provide the serial number of the keyring to use. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250701144657.104401-1-hare@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: add gateway routing supportJeremy Kerr1-0/+8
This change allows for gateway routing, where a route table entry may reference a routable endpoint (by network and EID), instead of routing directly to a netdevice. We add support for a RTM_GATEWAY attribute for netlink route updates, with an attribute format of: struct mctp_fq_addr { unsigned int net; mctp_eid_t eid; } - we need the net here to uniquely identify the target EID, as we no longer have the device reference directly (which would provide the net id in the case of direct routes). This makes route lookups recursive, as a route lookup that returns a gateway route must be resolved into a direct route (ie, to a device) eventually. We provide a limit to the route lookups, to prevent infinite loop routing. The route lookup populates a new 'nexthop' field in the dst structure, which now specifies the key for the neighbour table lookup on device output, rather than using the packet destination address directly. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: bonding: add broadcast_neighbor netlink optionTonghao Zhang1-0/+1
User can config or display the bonding broadcast_neighbor option via iproute2/netlink. Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/76b90700ba5b98027dfb51a2f3c5cfea0440a21b.1751031306.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-07net: openvswitch: allow providing upcall pid for the 'execute' commandIlya Maximets1-0/+6
When a packet enters OVS datapath and there is no flow to handle it, packet goes to userspace through a MISS upcall. With per-CPU upcall dispatch mechanism, we're using the current CPU id to select the Netlink PID on which to send this packet. This allows us to send packets from the same traffic flow through the same handler. The handler will process the packet, install required flow into the kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE. While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a recirculation action that will pass the (likely modified) packet through the flow lookup again. And if the flow is not found, the packet will be sent to userspace again through another MISS upcall. However, the handler thread in userspace is likely running on a different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled in the syscall context of that thread. So, when the time comes to send the packet through another upcall, the per-CPU dispatch will choose a different Netlink PID, and this packet will end up processed by a different handler thread on a different CPU. The process continues as long as there are new recirculations, each time the packet goes to a different handler thread before it is sent out of the OVS datapath to the destination port. In real setups the number of recirculations can go up to 4 or 5, sometimes more. There is always a chance to re-order packets while processing upcalls, because userspace will first install the flow and then re-inject the original packet. So, there is a race window when the flow is already installed and the second packet can match it and be forwarded to the destination before the first packet is re-injected. But the fact that packets are going through multiple upcalls handled by different userspace threads makes the reordering noticeably more likely, because we not only have a race between the kernel and a userspace handler (which is hard to avoid), but also between multiple userspace handlers. For example, let's assume that 10 packets got enqueued through a MISS upcall for handler-1, it will start processing them, will install the flow into the kernel and start re-injecting packets back, from where they will go through another MISS to handler-2. Handler-2 will install the flow into the kernel and start re-injecting the packets, while handler-1 continues to re-inject the last of the 10 packets, they will hit the flow installed by handler-2 and be forwarded without going to the handler-2, while handler-2 still re-injects the first of these 10 packets. Given multiple recirculations and misses, these 10 packets may end up completely mixed up on the output from the datapath. Let's allow userspace to specify on which Netlink PID the packets should be upcalled while processing OVS_PACKET_CMD_EXECUTE. This makes it possible to ensure that all the packets are processed by the same handler thread in the userspace even with them being upcalled multiple times in the process. Packets will remain in order since they will be enqueued to the same socket and re-injected in the same order. This doesn't eliminate re-ordering as stated above, since we still have a race between kernel and the userspace thread, but it allows to eliminate races between multiple userspace threads. Userspace knows the PID of the socket on which the original upcall is received, so there is no need to send it up from the kernel. Solution requires storing the value somewhere for the duration of the packet processing. There are two potential places for this: our skb extension or the per-CPU storage. It's not clear which is better, so just following currently used scheme of storing this kind of things along the skb. We still have a decent amount of space in the cb. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20250702155043.2331772-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni1-6/+26
Cross-merge networking fixes after downstream PR (net-6.16-rc5). No conflicts. No adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-02devlink: Extend devlink rate API with traffic classes bandwidth managementCarolina Jubran1-0/+9
Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30neighbor: Add NTF_EXT_VALIDATED flag for externally validated entriesIdo Schimmel1-0/+5
tl;dr ===== Add a new neighbor flag ("extern_valid") that can be used to indicate to the kernel that a neighbor entry was learned and determined to be valid externally. The kernel will not try to remove or invalidate such an entry, leaving these decisions to the user space control plane. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches (VTEPs). VTEPs that are connected to the same ES are called ES peers. When a neighbor entry is learned on a VTEP, it is distributed to both ES peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers use the neighbor entry when routing traffic towards the multi-homed host and remote VTEPs use it for ARP/NS suppression. Motivation ========== If the ES link between a host and the VTEP on which the neighbor entry was locally learned goes down, the EVPN MAC/IP advertisement route will be withdrawn and the neighbor entries will be removed from both ES peers and remote VTEPs. Routing towards the multi-homed host and ARP/NS suppression can fail until another ES peer locally learns the neighbor entry and distributes it via an EVPN MAC/IP advertisement route. "draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these intermittent failures by having the ES peers install the neighbor entries as before, but also injecting EVPN MAC/IP advertisement routes with a proxy indication. When the previously mentioned ES link goes down and the original EVPN MAC/IP advertisement route is withdrawn, the ES peers will not withdraw their neighbor entries, but instead start aging timers for the proxy indication. If an ES peer locally learns the neighbor entry (i.e., it becomes "reachable"), it will restart its aging timer for the entry and emit an EVPN MAC/IP advertisement route without a proxy indication. An ES peer will stop its aging timer for the proxy indication if it observes the removal of the proxy indication from at least one of the ES peers advertising the entry. In the event that the aging timer for the proxy indication expired, an ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer expired on all ES peers and they all withdrew their proxy advertisements, the neighbor entry will be completely removed from the EVPN fabric. Implementation ============== In the above scheme, when the control plane (e.g., FRR) advertises a neighbor entry with a proxy indication, it expects the corresponding entry in the data plane (i.e., the kernel) to remain valid and not be removed due to garbage collection or loss of carrier. The control plane also expects the kernel to notify it if the entry was learned locally (i.e., became "reachable") so that it will remove the proxy indication from the EVPN MAC/IP advertisement route. That is why these entries cannot be programmed with dummy states such as "permanent" or "noarp". Instead, add a new neighbor flag ("extern_valid") which indicates that the entry was learned and determined to be valid externally and should not be removed or invalidated by the kernel. The kernel can probe the entry and notify user space when it becomes "reachable" (it is initially installed as "stale"). However, if the kernel does not receive a confirmation, have it return the entry to the "stale" state instead of the "failed" state. In other words, an entry marked with the "extern_valid" flag behaves like any other dynamically learned entry other than the fact that the kernel cannot remove or invalidate it. One can argue that the "extern_valid" flag should not prevent garbage collection and that instead a neighbor entry should be programmed with both the "extern_valid" and "extern_learn" flags. There are two reasons for not doing that: 1. Unclear why a control plane would like to program an entry that the kernel cannot invalidate but can completely remove. 2. The "extern_learn" flag is used by FRR for neighbor entries learned on remote VTEPs (for ARP/NS suppression) whereas here we are concerned with local entries. This distinction is currently irrelevant for the kernel, but might be relevant in the future. Given that the flag only makes sense when the neighbor has a valid state, reject attempts to add a neighbor with an invalid state and with this flag set. For example: # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid Error: Cannot create externally validated neighbor with an invalid state. # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid Error: Cannot mark neighbor as externally validated with an invalid state. The above means that a neighbor cannot be created with the "extern_valid" flag and flags such as "use" or "managed" as they result in a neighbor being created with an invalid state ("none") and immediately getting probed: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use Error: Cannot create externally validated neighbor with an invalid state. However, these flags can be used together with "extern_valid" after the neighbor was created with a valid state: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use One consequence of preventing the kernel from invalidating a neighbor entry is that by default it will only try to determine reachability using unicast probes. This can be changed using the "mcast_resolicit" sysctl: # sysctl net.ipv4.neigh.br0/10.mcast_resolicit 0 # tcpdump -nn -e -i br0.10 -Q out arp & # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 iproute2 patches can be found here [2]. [1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03 [2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27dpll: add reference-sync netlink attributeArkadiusz Kubalewski1-0/+1
Add new netlink attribute to allow user space configuration of reference sync pin pairs, where both pins are used to provide one clock signal consisting of both: base frequency and sync signal. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Milena Olech <milena.olech@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20250626135219.1769350-2-arkadiusz.kubalewski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linuxLinus Torvalds1-6/+26
Pull block fixes from Jens Axboe: - Fixes for ublk: - fix C++ narrowing warnings in the uapi header - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring - ublk_get_data() error handling fix - sanity check more arguments in ublk_ctrl_add_dev() - selftest addition - NVMe pull request via Christoph: - reset delayed remove_work after reconnect - fix atomic write size validation - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window * tag 'block-6.16-20250626' of git://git.kernel.dk/linux: block: fix false warning in bdev_count_inflight_rw() ublk: sanity check add_dev input for underflow nvme: fix atomic write size validation nvme: refactor the atomic write unit detection nvme: reset delayed remove_work after reconnect ublk: setup ublk_io correctly in case of ublk_get_data() failure ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header ublk: fix narrowing warnings in UAPI header selftests: ublk: don't take same backing file for more than one ublk devices ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-5/+31
Cross-merge networking fixes after downstream PR (net-6.16-rc4). Conflicts: Documentation/netlink/specs/mptcp_pm.yaml 9e6dd4c256d0 ("netlink: specs: mptcp: replace underscores with dashes in names") ec362192aa9e ("netlink: specs: fix up indentation errors") https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au Adjacent changes: Documentation/netlink/specs/fou.yaml 791a9ed0a40d ("netlink: specs: fou: replace underscores with dashes in names") 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge tag 'net-6.16-rc4' of ↵Linus Torvalds2-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth and wireless. Current release - regressions: - bridge: fix use-after-free during router port configuration Current release - new code bugs: - eth: wangxun: fix the creation of page_pool Previous releases - regressions: - netpoll: initialize UDP checksum field before checksumming - wifi: mac80211: finish link init before RCU publish - bluetooth: fix use-after-free in vhci_flush() - eth: - ionic: fix DMA mapping test - bnxt: properly flush XDP redirect lists Previous releases - always broken: - netlink: specs: enforce strict naming of properties - unix: don't leave consecutive consumed OOB skbs. - vsock: fix linux/vm_sockets.h userspace compilation errors - selftests: fix TCP packet checksum" * tag 'net-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits) net: libwx: fix the creation of page_pool net: selftests: fix TCP packet checksum atm: Release atm_dev_mutex after removing procfs in atm_dev_deregister(). netlink: specs: enforce strict naming of properties netlink: specs: tc: replace underscores with dashes in names netlink: specs: rt-link: replace underscores with dashes in names netlink: specs: mptcp: replace underscores with dashes in names netlink: specs: ovs_flow: replace underscores with dashes in names netlink: specs: devlink: replace underscores with dashes in names netlink: specs: dpll: replace underscores with dashes in names netlink: specs: ethtool: replace underscores with dashes in names netlink: specs: fou: replace underscores with dashes in names netlink: specs: nfsd: replace underscores with dashes in names net: enetc: Correct endianness handling in _enetc_rd_reg64 atm: idt77252: Add missing `dma_map_error()` bnxt: properly flush XDP redirect lists vsock/uapi: fix linux/vm_sockets.h userspace compilation errors wifi: mac80211: finish link init before RCU publish wifi: iwlwifi: mvm: assume '1' as the default mac_config_cmd version selftest: af_unix: Add tests for -ECONNRESET. ...
2025-06-25netlink: specs: mptcp: replace underscores with dashes in namesJakub Kicinski1-3/+3
We're trying to add a strict regexp for the name format in the spec. Underscores will not be allowed, dashes should be used instead. This makes no difference to C (codegen, if used, replaces special chars in names) but it gives more uniform naming in Python. Fixes: bc8aeb2045e2 ("Documentation: netlink: add a YAML spec for mptcp") Reviewed-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250624211002.3475021-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25uapi: net_dropmon: drop unused is_drop_point_hw macroRubenKelevra1-7/+0
Commit 4ea7e38696c7 ("dropmon: add ability to detect when hardware drops rx packets") introduced is_drop_point_hw, but the symbol was never referenced anywhere in the kernel tree and is currently not used by dropwatch. I could not find, to the best of my abilities, a current out-of-tree user of this macro. The definition also contains a syntax error in its for-loop, so any project that tried to compile against it would fail. Removing the macro therefore eliminates dead code without breaking existing users. Signed-off-by: RubenKelevra <rubenkelevra@gmail.com> Link: https://patch.msgid.link/20250624165711.1188691-1-rubenkelevra@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25net: ethtool: rss: add notificationsJakub Kicinski1-0/+1
In preparation for RSS_SET handling in ethnl introduce Netlink notifications for RSS. Only cover modifications, not creation and not removal of a context, because the latter may deserve a different notification type. We should cross that bridge when we add the support for context add / remove via Netlink. Link: https://patch.msgid.link/20250623231720.3124717-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25netlink: specs: add the multicast group name to specJakub Kicinski2-2/+2
Add the multicast group's name to the YAML spec. Without it YNL doesn't know how to subscribe to notifications. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250623231720.3124717-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-24ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI headerCaleb Sander Mateos1-2/+22
UBLK_F_SUPPORT_ZERO_COPY has a very old comment describing the initial idea for how zero-copy would be implemented. The actual implementation added in commit 1f6540e2aabb ("ublk: zc register/unregister bvec") uses io_uring registered buffers rather than shared memory mapping. Remove the inaccurate remarks about mapping ublk request memory into the ublk server's address space and requiring 4K block size. Replace them with a description of the current zero-copy mechanism. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250621171015.354932-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24ublk: fix narrowing warnings in UAPI headerCaleb Sander Mateos1-4/+4
When a C++ file compiled with -Wc++11-narrowing includes the UAPI header linux/ublk_cmd.h, ublk_sqe_addr_to_auto_buf_reg()'s assignments of u64 values to u8, u16, and u32 fields result in compiler warnings. Add explicit casts to the intended types to avoid these warnings. Drop the unnecessary bitmasks. Reported-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250621162842.337452-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24vsock/uapi: fix linux/vm_sockets.h userspace compilation errorsStefano Garzarella1-0/+4
If a userspace application just include <linux/vm_sockets.h> will fail to build with the following errors: /usr/include/linux/vm_sockets.h:182:39: error: invalid application of ‘sizeof’ to incomplete type ‘struct sockaddr’ 182 | unsigned char svm_zero[sizeof(struct sockaddr) - | ^~~~~~ /usr/include/linux/vm_sockets.h:183:39: error: ‘sa_family_t’ undeclared here (not in a function) 183 | sizeof(sa_family_t) - | Include <sys/socket.h> for userspace (guarded by ifndef __KERNEL__) where `struct sockaddr` and `sa_family_t` are defined. We already do something similar in <linux/mptcp.h> and <linux/if.h>. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250623100053.40979-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-24wifi: cfg80211: Report per-radio RTS threshold to userspaceRoopni Devanathan1-0/+2
In case of multi-radio wiphys, with per-radio RTS threshold brought into use, RTS threshold for each radio in a wiphy can be recorded in wiphy parameter - wiphy_radio_cfg, as an array. Add a new attribute - NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD in nested parameter - NL80211_ATTR_WIPHY_RADIOS. When a request for getting RTS threshold for a particular radio is received, parse the radio id and get the required data. Add this data to the newly added nested attribute NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD. Add support to report this data to userspace. Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com> Link: https://patch.msgid.link/20250615082312.619639-4-quic_rdevanat@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211/mac80211: Add support to get radio indexRoopni Devanathan1-0/+10
Currently, per-radio attributes are set on per-phy basis, i.e., all the radios present in a wiphy will take attributes values sent from user. But each radio in a wiphy can get different values from userspace based on its requirement. To extend support to set per-radio attributes, add support to get radio index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX, to get user specified radio index for which attributes should be changed. Pass this to individual drivers, so that the drivers can use this radio index to change per-radio attributes when necessary. Currently, per-radio attributes identified are: NL80211_ATTR_WIPHY_TX_POWER_LEVEL NL80211_ATTR_WIPHY_ANTENNA_TX NL80211_ATTR_WIPHY_ANTENNA_RX NL80211_ATTR_WIPHY_RETRY_SHORT NL80211_ATTR_WIPHY_RETRY_LONG NL80211_ATTR_WIPHY_FRAG_THRESHOLD NL80211_ATTR_WIPHY_RTS_THRESHOLD NL80211_ATTR_WIPHY_COVERAGE_CLASS NL80211_ATTR_TXQ_LIMIT NL80211_ATTR_TXQ_MEMORY_LIMIT NL80211_ATTR_TXQ_QUANTUM By default, the radio index is set to -1. This means the attribute should be treated as a global configuration. If the user has not specified any index, then the radio index passed to individual drivers would be -1. This would indicate that the attribute applies to all radios in that wiphy. Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com> Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+22
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation - A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry - Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process - Add yet another fix for the dreaded arch_timer_edge_cases selftest RISC-V: - Fix the size parameter check in SBI SFENCE calls - Don't treat SBI HFENCE calls as NOPs x86 TDX: - Complete API for handling complex TDVMCALLs in userspace. This was delayed because the spec lacked a way for userspace to deny supporting these calls; the new exit code is now approved" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: TDX: Exit to userspace for GetTdVmCallInfo KVM: TDX: Handle TDG.VP.VMCALL<GetQuote> KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs KVM: arm64: VHE: Centralize ISBs when returning to host KVM: arm64: Remove cpacr_clear_set() KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd() KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync() KVM: arm64: Reorganise CPTR trap manipulation KVM: arm64: VHE: Synchronize CPTR trap deactivation KVM: arm64: VHE: Synchronize restore of host debug registers KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases KVM: arm64: Explicitly treat routing entry type changes as changes KVM: arm64: nv: Fix tracking of shadow list registers RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
2025-06-21Merge tag 'perf-tools-fixes-for-v6.16-1-2025-06-20' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix some file descriptor leaks that stand out with recent changes to 'perf list' - Fix prctl include to fix building 'perf bench futex' hash with musl libc - Restrict 'perf test' uniquifying entry to machines with 'uncore_imc' PMUs - Document new output fields (op, cache, mem, dtlb, snoop) used with 'perf mem' - Synchronize kernel header copies * tag 'perf-tools-fixes-for-v6.16-1-2025-06-20' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: tools headers x86 cpufeatures: Sync with the kernel sources perf bench futex: Fix prctl include in musl libc perf test: Directory file descriptor leak perf evsel: Missed close() when probing hybrid core PMUs tools headers: Synchronize linux/bits.h with the kernel sources tools arch amd ibs: Sync ibs.h with the kernel sources tools arch x86: Sync the msr-index.h copy with the kernel sources tools headers: Syncronize linux/build_bug.h with the kernel sources tools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench' tools headers UAPI: Sync linux/kvm.h with the kernel sources tools headers UAPI: Sync the drm/drm.h with the kernel sources perf beauty: Update copy of linux/socket.h with the kernel sources tools headers UAPI: Sync kvm header with the kernel sources tools headers x86 svm: Sync svm headers with the kernel sources tools headers UAPI: Sync KVM's vmx.h header with the kernel sources tools kvm headers arm64: Update KVM header from the kernel sources tools headers UAPI: Sync linux/prctl.h with the kernel sources to pick FUTEX knob perf mem: Document new output fields (op, cache, mem, dtlb, snoop) tools headers: Update the fs headers with the kernel sources perf test: Restrict uniquifying test to machines with 'uncore_imc'
2025-06-20KVM: TDX: Exit to userspace for SetupEventNotifyInterruptPaolo Bonzini1-0/+4
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Exit to userspace for GetTdVmCallInfoBinbin Wu1-0/+5
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API. GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by VMM to support TDX guests. For GetTdVmCallInfo - When leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs. - When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s) supported by itself when the TDVMCALLs don't need support from userspace after returning from userspace and before entering guest. Currently, no such TDVMCALLs implemented, KVM just sets the values returned from userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>Binbin Wu1-0/+17
Handle TDVMCALL for GetQuote to generate a TD-Quote. GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area. KVM only checks the GPA from the TDX guest has the shared-bit set and drops the shared-bit before exiting to userspace to avoid bleeding the shared-bit into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU) and userspace VMM queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully or not. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether the Quote generation is completed or not. When completed, the generated Quote is returned via the same buffer. Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20wifi: cfg80211: Add support for link reconfiguration negotiation offload to ↵Kavita Kavita1-1/+5
driver In the case of SME-in-driver, the driver can internally choose to update the links based on the AP MLD recommendation and do link reconfiguration negotiation with AP MLD. (e.g., After the driver processing the BSS Transition Management request frame received from the AP MLD with Neighbor Report containing Multi-Link element with recommended links information chooses to do link reconfiguration negotiation with AP MLD). To support this, extend cfg80211_mlo_reconf_add_done() and NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for driver-initiated link reconfiguration requests. For removed links, the driver indicates links information using the NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as supplicant initiated cases. For the driver-initiated case, cfg80211 will receive link reconfiguration result asynchronously from driver so holding BSSes of the accepted add links is needed in the event path. Also, no need of unhold call for the rejected add link BSSes since there was no hold call happened previously. Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event, it needs to process the information about newly added links and install per-link group keys (e.g., GTK/IGTK/BIGTK etc.). In case of the SME-in-driver, using a vendor interface etc. to notify the supplicant to initiate a link reconfiguration request and then supplicant sending command to the cfg80211 can lead to race conditions. The correct design to avoid this is that the driver indicates the cfg80211 directly with the results of the link reconfiguration negotiation. Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com> Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20wifi: cfg80211: Improve the documentation for NL80211_CMD_ASSOC_MLO_RECONFKavita Kavita1-1/+5
The existing documentation for the NL80211_CMD_ASSOC_MLO_RECONF does not clearly explain handling of link reconfiguration request results from the driver. Add documentation to explain that the command is used as an event to notify userspace about added links information, and that the existing NL80211_CMD_LINKS_REMOVED command is used to notify userspace about removed links information. Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com> Link: https://patch.msgid.link/20250604105757.2542-2-quic_kkavita@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-8/+4
Cross-merge networking fixes after downstream PR (net-6.16-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19Merge tag 'net-6.16-rc3' of ↵Linus Torvalds2-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless. The ath12k fix to avoid FW crashes requires adding support for a number of new FW commands so it's quite large in terms of LoC. The rest is relatively small. Current release - fix to a fix: - ptp: fix breakage after ptp_vclock_in_use() rework Current release - regressions: - openvswitch: allocate struct ovs_pcpu_storage dynamically, static allocation may exhaust module loader limit on smaller systems Previous releases - regressions: - tcp: fix tcp_packet_delayed() for peers with no selective ACK support Previous releases - always broken: - wifi: ath12k: don't activate more links than firmware supports - tcp: make sure sockets open via passive TFO have valid NAPI ID - eth: bnxt_en: update MRU and RSS table of RSS contexts on queue reset, prevent Rx queues from silently hanging after queue reset - NFC: uart: set tty->disc_data only in success path" * tag 'net-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits) net: airoha: Differentiate hwfd buffer size for QDMA0 and QDMA1 net: airoha: Compute number of descriptors according to reserved memory size tools: ynl: fix mixing ops and notifications on one socket net: atm: fix /proc/net/atm/lec handling net: atm: add lec_mutex mlxbf_gige: return EPROBE_DEFER if PHY IRQ is not available net: airoha: Always check return value from airoha_ppe_foe_get_entry() NFC: nci: uart: Set tty->disc_data only in success path calipso: Fix null-ptr-deref in calipso_req_{set,del}attr(). MAINTAINERS: Remove Shannon Nelson from MAINTAINERS file net: lan743x: fix potential out-of-bounds write in lan743x_ptp_io_event_clock_get() eth: fbnic: avoid double free when failing to DMA-map FW msg tcp: fix passive TFO socket having invalid NAPI ID selftests: net: add test for passive TFO socket NAPI ID selftests: net: add passive TFO test binary selftests: netdevsim: improve lib.sh include in peer.sh tipc: fix null-ptr-deref when acquiring remote ip of ethernet bearer Octeontx2-pf: Fix Backpresure configuration net: ftgmac100: select FIXED_PHY net: ethtool: remove duplicate defines for family info ...
2025-06-18net: ethtool: Add PSE port priority support featureKory Maincent (Dent Project)1-0/+2
This patch expands the status information provided by ethtool for PSE c33 with current port priority and max port priority. It also adds a call to pse_ethtool_set_prio() to configure the PSE port priority. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-8-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18net: pse-pd: Add support for budget evaluation strategiesKory Maincent (Dent Project)1-0/+18
This patch introduces the ability to configure the PSE PI budget evaluation strategies. Budget evaluation strategies is utilized by PSE controllers to determine which ports to turn off first in scenarios such as power budget exceedance. The pis_prio_max value is used to define the maximum priority level supported by the controller. Both the current priority and the maximum priority are exposed to the user through the pse_ethtool_get_status call. This patch add support for two mode of budget evaluation strategies. 1. Static Method: This method involves distributing power based on PD classification. It’s straightforward and stable, the PSE core keeping track of the budget and subtracting the power requested by each PD’s class. Advantages: Every PD gets its promised power at any time, which guarantees reliability. Disadvantages: PD classification steps are large, meaning devices request much more power than they actually need. As a result, the power supply may only operate at, say, 50% capacity, which is inefficient and wastes money. Priority max value is matching the number of PSE PIs within the PSE. 2. Dynamic Method: To address the inefficiencies of the static method, vendors like Microchip have introduced dynamic power budgeting, as seen in the PD692x0 firmware. This method monitors the current consumption per port and subtracts it from the available power budget. When the budget is exceeded, lower-priority ports are shut down. Advantages: This method optimizes resource utilization, saving costs. Disadvantages: Low-priority devices may experience instability. Priority max value is set by the PSE controller driver. For now, budget evaluation methods are not configurable and cannot be mixed. They are hardcoded in the PSE driver itself, as no current PSE controller supports both methods. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-7-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18net: ethtool: Add support for new power domains index descriptionKory Maincent (Dent Project)1-0/+1
Report the index of the newly introduced PSE power domain to the user, enabling improved management of the power budget for PSE devices. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-5-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18net: pse-pd: Add support for reporting eventsKory Maincent (Dent Project)1-0/+19
Add support for devm_pse_irq_helper() to register PSE interrupts and report events such as over-current or over-temperature conditions. This follows a similar approach to the regulator API but also sends notifications using a dedicated PSE ethtool netlink socket. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-2-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18net: ethtool: remove duplicate defines for family infoJakub Kicinski2-6/+2
Commit under fixes switched to uAPI generation from the YAML spec. A number of custom defines were left behind, mostly for commands very hard to express in YAML spec. Among what was left behind was the name and version of the generic netlink family. Problem is that the codegen always outputs those values so we ended up with a duplicated, differently named set of defines. Provide naming info in YAML and remove the incorrect defines. Fixes: 8d0580c6ebdd ("ethtool: regenerate uapi header from the spec") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250617202240.811179-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>