path: root/Documentation/networking
Age | Commit message | Author | Lines
13 daysMerge tag 'nf-next-26-04-10' of ↵Jakub Kicinski-0/+37
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1-3) IPVS updates from Julian Anastasov to enhance visibility into IPVS internal state by exposing hash size, load factor etc., and to allow userspace to tune the load factor used for resizing hash tables. 4) reject empty/not nul terminated device names from xt_physdev. This isn't a bug fix; existing code doesn't require a c-string. But clean this up anyway because conceptually the interface name definitely should be a c-string. 5) Switch nfnetlink to skb_mac_header helpers that didn't exist back when this code was written. This gives us additional debug checks but is not intended to change functionality. 6) Let the xt ttl/hoplimit match reject unknown operator modes. This is a cleanup, the evaluation function simply returns false when the mode is out of range. From Marino Dzalto. 7) xt_socket match should enable defrag after all other checks. This bug is harmless, historically defrag could not be disabled either except by rmmod. 8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera. 9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old xtables 32bit compat code, from Gustavo A. R. Silva. 10) nftables fwd expression should drop packets when their ttl/hl has expired. This is a deferred bug fix; it's not deemed important enough for -rc8. 11) Add additional checks before assuming the mac header is an ethernet header, from Zhengchuan Liang.
* tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: require Ethernet MAC header before using eth_hdr() netfilter: nft_fwd_netdev: check ttl/hl before forwarding netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings netfilter: conntrack: remove UDP-Lite conntrack support netfilter: xt_socket: enable defrag after all other checks netfilter: xt_HL: add pr_fmt and checkentry validation netfilter: nfnetlink: prefer skb_mac_header helpers netfilter: x_physdev: reject empty or not-nul terminated device names ipvs: add conn_lfactor and svc_lfactor sysctl vars ipvs: add ip_vs_status info ipvs: show the current conn_tab size to users ==================== Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10docs: net: bridge: document stp_mode attributeAndy Roulin-0/+22
Add documentation for the IFLA_BR_STP_MODE bridge attribute in the "User space STP helper" section of the bridge documentation. Reference the BR_STP_MODE_* values via kernel-doc and describe the use case for network namespace environments. Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Link: https://patch.msgid.link/20260405205224.3163000-3-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10ipvs: add conn_lfactor and svc_lfactor sysctl varsJulian Anastasov-0/+37
Allow the default load factor for the connection and service tables to be configured. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-04-09net: Implement netdev_nl_queue_create_doitDaniel Borkmann-0/+6
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ynl --family netdev --output-json --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08devlink: Document resource scope filteringOr Har-Toov-0/+35
Document the scope parameter for devlink resource show, which allows filtering the dump to device-level or port-level resources only. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260407194107.148063-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08devlink: Document port-level resources and full dumpOr Har-Toov-0/+35
Document the port-level resource support and the option to dump all resources, including both device-level and port-level entries. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260407194107.148063-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08net: dsa: remove struct platform_dataVladimir Oltean-5/+0
This is not used anywhere in the kernel. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20260406212158.721806-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-26docs/mlx5: Fix typo subfuctionRyohei Kinugawa-2/+2
Fix two typos: - 'Subfunctons' -> 'Subfunctions' - 'subfuction' -> 'subfunction' Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: Ryohei Kinugawa <ryohei.kinugawa@gmail.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20260324053416.70166-1-ryohei.kinugawa@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-18net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECSHaiyang Zhang-0/+11
Add two parameters for drivers supporting Rx CQE coalescing / descriptor writeback. ETHTOOL_A_COALESCE_RX_CQE_FRAMES: Maximum number of frames that can be coalesced into a CQE or writeback. ETHTOOL_A_COALESCE_RX_CQE_NSECS: Max time in nanoseconds after the first packet arrival in a coalesced CQE or writeback to be sent. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/20260317191826.1346111-2-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
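How the two limits interact can be sketched as follows. This is a minimal model, not driver code: it assumes a CQE is completed either when it holds the maximum number of frames or when the time limit, counted from the first packet in that CQE, expires. The function name and the arrival-time list are hypothetical.

```python
def coalesce(arrivals_ns, max_frames, max_nsecs):
    """Return (completion_time_ns, frame_count) for each simulated CQE.

    Assumed model: a CQE is written back when it holds max_frames frames
    or when max_nsecs have elapsed since its first frame arrived.
    """
    completions = []
    first = None   # arrival time of the first frame in the open CQE
    count = 0      # frames currently held in the open CQE
    for t in arrivals_ns:
        # The open CQE timed out before this frame arrived.
        if count and t >= first + max_nsecs:
            completions.append((first + max_nsecs, count))
            count = 0
        if count == 0:
            first = t
        count += 1
        # The frame limit was reached: complete immediately.
        if count == max_frames:
            completions.append((t, count))
            count = 0
    if count:
        # Whatever is left is flushed by the timer.
        completions.append((first + max_nsecs, count))
    return completions


print(coalesce([0, 10, 20, 30, 500, 2000], max_frames=3, max_nsecs=100))
```

In this model a larger frame limit trades fewer completions for latency, and the time limit bounds that latency for sparse traffic.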
2026-03-14documentation: networking: add shared devlink documentationJiri Pirko-0/+98
Document shared devlink instances for multiple PFs on the same chip. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260312100407.551173-13-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-14tcp: implement RFC 7323 window retraction receiver requirementsSimon Baatz-0/+1
By default, the Linux TCP implementation does not shrink the advertised window (RFC 7323 calls this "window retraction") with the following exceptions: - When an incoming segment cannot be added due to the receive buffer running out of memory. Since commit 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze") a zero window will be advertised in this case. It turns out that reaching the required memory pressure is easy when window scaling is in use. In the simplest case, sending a sufficient number of segments smaller than the scale factor to a receiver that does not read data is enough. - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink") addressed the "eating memory" problem by introducing a sysctl knob that allows shrinking the window before running out of memory. However, RFC 7323 does not only state that shrinking the window is necessary in some cases, it also formulates requirements for TCP implementations when doing so (Section 2.4). This commit addresses the receiver-side requirements: After retracting the window, the peer may have a snd_nxt that lies within a previously advertised window but is now beyond the retracted window. This means that all incoming segments (including pure ACKs) will be rejected until the application happens to read enough data to let the peer's snd_nxt be in window again (which may be never). To comply with RFC 7323, the receiver MUST honor any segment that would have been in window for any ACK sent by the receiver and, when window scaling is in effect, SHOULD track the maximum window sequence number it has advertised. This patch tracks that maximum window sequence number rcv_mwnd_seq throughout the connection and uses it in tcp_sequence() when deciding whether a segment is acceptable. rcv_mwnd_seq is updated together with rcv_wup and rcv_wnd in tcp_select_window(). If we count tcp_sequence() as fast path, it is read in the fast path. 
Therefore, rcv_mwnd_seq is put into rcv_wnd's cacheline group. The logic for handling received data in tcp_data_queue() is already sufficient and does not need to be updated. Signed-off-by: Simon Baatz <gmbnomis@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260309-tcp_rfc7323_retract_wnd_rfc-v3-1-4c7f96b1ec69@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
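The acceptance rule described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the kernel's tcp_sequence(): sequence numbers are compared modulo 2**32, rcv_wup and rcv_mwnd_seq are taken from the commit message, and the helper names advertise() and acceptable() are hypothetical.

```python
MASK = 0xFFFFFFFF


def before(a, b):
    """True if sequence number a precedes b in modulo-2**32 space."""
    return ((a - b) & MASK) >= 0x80000000


def after(a, b):
    """True if sequence number a follows b in modulo-2**32 space."""
    return before(b, a)


def advertise(rcv_nxt, window, rcv_mwnd_seq):
    """Return the highest right window edge advertised so far."""
    edge = (rcv_nxt + window) & MASK
    return edge if after(edge, rcv_mwnd_seq) else rcv_mwnd_seq


def acceptable(seq, end_seq, rcv_wup, right_edge):
    """Simplified check: the segment must not lie entirely before rcv_wup
    and must not start beyond the given right edge."""
    return not before(end_seq, rcv_wup) and not after(seq, right_edge)


rcv_nxt = 1000
mwnd = advertise(rcv_nxt, 4000, rcv_nxt)   # window advertised up to 5000
mwnd = advertise(rcv_nxt, 1000, mwnd)      # window retracted to 2000
current_edge = rcv_nxt + 1000

# A pure ACK from a peer whose snd_nxt is 3000, inside the old window.
print(acceptable(3000, 3000, rcv_nxt, current_edge))  # rejected
print(acceptable(3000, 3000, rcv_nxt, mwnd))          # honored
```

Checking against the maximum advertised edge rather than the current one is what lets the peer's segments through after a retraction.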
2026-03-12docs: octeontx2: fix typo in documentationShravyaPanchagiri-1/+1
Fix spelling mistake "Crate" to "Create" in the documentation. Signed-off-by: ShravyaPanchagiri <shravy112@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260311030450.8461-1-shravy112@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-11tcp: add sysctl_tcp_shrink_window to netns_ipv4_sysctl.rstEric Dumazet-0/+1
Add missing entry for sysctl_tcp_shrink_window. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260310073855.564927-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-10inet: add ip_local_port_step_width sysctl to improve port usage distributionFernando Fernandez Mancera-0/+17
With the current port selection algorithm, ports after a reserved port range or long time used port are used more often than others [1]. This causes an uneven port usage distribution. This combines with cloud environments blocking connections between the application server and the database server if there was a previous connection with the same source port, leading to connectivity problems between applications on cloud environments. The real issue here is that these firewalls cannot cope with standards-compliant port reuse. This is a workaround for such situations and an improvement on the distribution of ports selected. The proposed solution is to implement a variant of RFC 6056 Algorithm 5. The step size is selected randomly on every connect() call ensuring it is a coprime with respect to the size of the range of ports we want to scan. This way, we can ensure that all ports within the range are scanned before returning an error. To enable this algorithm, the user must configure the new sysctl option "net.ipv4.ip_local_port_step_width". In addition, on graphs generated we can observe that the distribution of source ports is more even with the proposed approach. [2] [1] https://0xffsoftware.com/port_graph_current_alg.html [2] https://0xffsoftware.com/port_graph_random_step_alg.html Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
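Why a coprime step visits every port before giving up can be shown with a small sketch. It illustrates the general idea only: the function names are hypothetical, the port range is an arbitrary example, and how the sysctl value bounds the step in the kernel is not modeled here.

```python
import math
import random


def pick_step(range_size, rng):
    """Pick a random step that is coprime with range_size."""
    if range_size == 1:
        return 1
    while True:
        step = rng.randrange(1, range_size)
        if math.gcd(step, range_size) == 1:
            return step


def scan_order(low, high, rng):
    """Yield every port in [low, high] exactly once, in stepped order."""
    n = high - low + 1
    offset = rng.randrange(n)      # random starting point
    step = pick_step(n, rng)       # coprime step guarantees a full cycle
    for _ in range(n):
        yield low + offset
        offset = (offset + step) % n


order = list(scan_order(32768, 32777, random.Random()))
print(order)
print(sorted(order) == list(range(32768, 32778)))
```

Because gcd(step, n) == 1, the sequence offset, offset + step, offset + 2*step, ... (mod n) only repeats after n values, so no port in the range is skipped.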
2026-03-10net/smc: Add documentation for limit_smc_hs and hs_ctrlKyoji Ogasawara-0/+27
Document the missing SMC sysctl parameters limit_smc_hs and hs_ctrl. Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20260309124541.22723-3-sawara04.o@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-10net/smc: fix indentation in smcr_buf_type sectionKyoji Ogasawara-8/+8
The smcr_buf_type section used inconsistent indentation compared with the rest of this document. Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20260309124541.22723-2-sawara04.o@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04net-sysfs: use rps_tag_ptr and remove metadata from rps_sock_flow_tableEric Dumazet-4/+9
Instead of storing the @mask at the beginning of rps_sock_flow_table, use 5 low order bits of the rps_tag_ptr to store the log of the size. This removes a potential cache line miss to fetch @mask. More importantly, we can switch to vmalloc_huge() without wasting memory. Tested with: numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries" Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
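The pointer-tagging idea can be sketched with plain integers. This is an illustration only: it assumes the table address is aligned to at least 32 bytes so that its 5 low bits are free, and the helper names are hypothetical rather than the kernel's.

```python
LOG_BITS = 5
LOG_MASK = (1 << LOG_BITS) - 1


def make_tag_ptr(addr, entries):
    """Pack log2(entries) into the 5 low-order bits of an aligned address."""
    log = entries.bit_length() - 1
    if entries <= 0 or entries != 1 << log:
        raise ValueError("entries must be a power of two")
    if addr & LOG_MASK:
        raise ValueError("address must have its 5 low bits clear")
    if log > LOG_MASK:
        raise ValueError("log2(entries) does not fit in 5 bits")
    return addr | log


def table_addr(tag_ptr):
    """Recover the table address by clearing the tag bits."""
    return tag_ptr & ~LOG_MASK


def table_mask(tag_ptr):
    """Recover the index mask from the stored log of the size."""
    return (1 << (tag_ptr & LOG_MASK)) - 1


tp = make_tag_ptr(0x7F0000001000, 4194304)
print(hex(table_addr(tp)), hex(table_mask(tp)))
```

Both the address and the mask come from a single word, so a lookup does not need to read a separate header in front of the table.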
2026-02-27net/handshake: Fixed grammar mistakeLeon Kral-1/+1
The word "a" was used instead of "an" which is grammatically incorrect. Fixed by changing from "a" to "an". This improves readability of the documentation. Signed-off-by: Leon Kral <leon.j.kral@protonmail.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Link: https://patch.msgid.link/20260227001151.41610-1-leon.j.kral@protonmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-26docs: ethtool: clarify the bit-by-bit bitset format descriptionYohei Kojima-4/+8
Clarify the bit-by-bit bitset format's behavior around mandatory attributes and bit identification. More specifically, the following changes are made: * Rephrase a misleading sentence which implies name and index are mutually exclusive * Describe that ETHTOOL_A_BITSET_BITS nest is mandatory * Describe that a request fails if inconsistent identifiers are given Signed-off-by: Yohei Kojima <yk@y-koj.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ef90a56965ca66e57aa177929ce3e10c5ca815fa.1772031974.git.yk@y-koj.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-26docs: net: document neigh gc_interval sysctlGabriel Goller-0/+7
Add entry for the neigh/default/gc_interval sysctl. This sysctl is unused since kernel v2.6.8. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Link: https://patch.msgid.link/20260225095822.44050-1-g.goller@proxmox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-24icmp: increase net.ipv4.icmp_msgs_{per_sec,burst}Eric Dumazet-3/+3
These sysctls were added in 4cdf507d5452 ("icmp: add a global rate limitation") and their default values might be too small. Some network tools send probes to closed UDP ports from many hosts to estimate proportion of packet drops on a particular target. This patch sets both sysctls to 10000. Note the per-peer rate-limit (as described in RFC 4443 2.4 (f)) intent is still enforced. This also increases security, see b38e7819cae9 ("icmp: randomize the global rate limiter") for reference. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223161742.929830-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
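The effect of the two values can be pictured as a token bucket. The sketch below is a generic token bucket, assumed here as a model for the global limiter; it does not reproduce the kernel implementation (no jiffies, no randomization, no per-peer limit), and the class name is hypothetical.

```python
class GlobalLimiter:
    """Generic token bucket: refills at msgs_per_sec, capped at burst."""

    def __init__(self, msgs_per_sec, burst):
        self.rate = msgs_per_sec
        self.burst = burst
        self.credit = burst   # start with a full bucket
        self.stamp = 0.0      # time of the last refill, in seconds

    def allow(self, now):
        """Return True if one message may be sent at time `now`."""
        refill = (now - self.stamp) * self.rate
        self.credit = min(self.burst, self.credit + refill)
        self.stamp = now
        if self.credit >= 1:
            self.credit -= 1
            return True
        return False


lim = GlobalLimiter(msgs_per_sec=10000, burst=10000)
sent = sum(lim.allow(0.0) for _ in range(15000))
print(sent)
```

With both values at 10000, a probe burst of up to 10000 messages passes at once, and the bucket refills fully within one second.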
2026-02-24docs: net: document neigh gc_stale_time sysctlGabriel Goller-0/+11
Add missing documentation for a neighbor table garbage collector sysctl parameter in ip-sysctl.rst: neigh/default/gc_stale_time: controls how long an unused neighbor entry is kept before becoming eligible for garbage collection (default: 60 seconds) Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Link: https://patch.msgid.link/20260223101257.47563-1-g.goller@proxmox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()Eric Dumazet-3/+4
The following part was needed before the blamed commit, because inet_getpeer_v6() second argument was the prefix. /* Give more bandwidth to wider prefixes. */ if (rt->rt6i_dst.plen < 128) tmo >>= ((128 - rt->rt6i_dst.plen)>>5); Now that inet_getpeer_v6() retrieves hosts, we need to remove the @tmo adjustment, or wider prefixes like /24 allow 8x more ICMP to be sent for a given ratelimit. As we had this issue for a while, this patch changes net.ipv6.icmp.ratelimit default value from 1000ms to 100ms to avoid potential regressions. Also add a READ_ONCE() when reading net->ipv6.sysctl.icmpv6_time. Fixes: fd0273d7939f ("ipv6: Remove external dependency on rt6i_dst and rt6i_src") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260216142832.3834174-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
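The 8x figure follows directly from the shift in the removed code. A small worked example, transcribing only that expression (the function name is hypothetical and tmo is treated as an abstract time value):

```python
def legacy_tmo(tmo, plen):
    """Apply the removed adjustment: shorter timeout for wider prefixes."""
    if plen < 128:
        tmo >>= (128 - plen) >> 5
    return tmo


for plen in (128, 64, 24):
    tmo = legacy_tmo(1000, plen)
    print(plen, tmo, 1000 / tmo)
```

For a /24 the shift is (128 - 24) >> 5 = 3, so the timeout is divided by 2**3 = 8 and eight times as many ICMP messages pass for the same ratelimit setting.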
2026-02-11Merge tag 'net-next-7.0' of ↵Linus Torvalds-229/+180
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helpers and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that it is now functionally complete, and allow user space to enable it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, decreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now correctly interpret such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. 
- Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. - Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performance with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when underlying low-level interface comes back online. Driver API: - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers: - Add Huawei hinic3 PF ethernet driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifying the RX ring count retrieval. Also remove the legacy fallback path. 
- Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM 
capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature" * tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits) bnge/bng_re: Add a new HSI net: macb: Fix tx/rx malfunction after phy link down and up af_unix: Fix memleak of newsk in unix_stream_connect(). 
net: ti: icssg-prueth: Add optional dependency on HSR net: dsa: add basic initial driver for MxL862xx switches net: mdio: add unlocked mdiodev C45 bus accessors net: dsa: add tag format for MxL862xx switches dt-bindings: net: dsa: add MaxLinear MxL862xx selftests: drivers: net: hw: Modify toeplitz.c to poll for packets octeontx2-pf: Unregister devlink on probe failure net: renesas: rswitch: fix forwarding offload statemachine ionic: Rate limit unknown xcvr type messages tcp: inet6_csk_xmit() optimization tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() ipv6: use np->final in inet6_sk_rebuild_header() ipv6: add daddr/final storage in struct ipv6_pinfo net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup() ...
2026-02-03tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSISTChia-Yu Chang-1/+3
Detect spurious retransmission of a previously sent ACK carrying the AccECN option after the second retransmission. Since this might be caused by the middlebox dropping ACK with options it does not recognize, disable the sending of the AccECN option in all subsequent ACKs. This patch follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field (accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was sent with duplicate SACK info. Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl: (TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and persistently sends AccECN option once it fits into TCP option space. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03tcp: try to avoid safer when ACKs are thinnedIlpo Järvinen-0/+1
Add newly acked pkts EWMA. When ACK thinning occurs, select between safer and unsafe cep delta in AccECN processing based on it. If the packets ACKed per ACK tend to be large, don't conservatively assume ACE field overflow. This patch uses the existing 2-byte holes in the rx group for new u16 variables without creating more holes. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; /* 2744 12 */ /* XXX 4 bytes hole, try to pack */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */ [...] /* size: 3264, cachelines: 51, members: 177 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; /* 2744 12 */ u16 pkts_acked_ewma; /* 2756 2 */ /* XXX 2 bytes hole, try to pack */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */ [...] /* size: 3264, cachelines: 51, members: 178 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
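An exponentially weighted moving average of this kind can be kept in a small integer field. The sketch below is a generic fixed-point EWMA, not the patch's code: the 1/8 weight, the scaling, the threshold and all function names are assumptions chosen for illustration.

```python
SHIFT = 3  # assumed weight of 1/8 for each new sample


def ewma_update(scaled_avg, sample):
    """Update a fixed-point EWMA; scaled_avg stores the average << SHIFT."""
    return scaled_avg - (scaled_avg >> SHIFT) + sample


def ewma_value(scaled_avg):
    """Return the integer average held in the fixed-point state."""
    return scaled_avg >> SHIFT


def acks_look_thinned(scaled_avg, threshold):
    """Hypothetical decision: many packets per ACK suggests ACK thinning."""
    return ewma_value(scaled_avg) >= threshold


state = 0
for _ in range(200):
    state = ewma_update(state, 16)   # every ACK covers 16 packets
print(ewma_value(state), acks_look_thinned(state, threshold=4))
```

Keeping the state scaled by 2**SHIFT avoids the rounding stall of a plain integer average while still fitting a u16 for small per-ACK packet counts.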
2026-02-02docs: networking: mention that RSS table should be 4x the queue countJakub Kicinski-4/+8
Spell out the recommendation that the RSS table should be 4x the queue count to avoid traffic imbalance. Include minor rephrasing and removal of the explicit 128 entry example since a 128 entry table is inadequate on modern machines. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131225454.1225151-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
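The reasoning behind the 4x recommendation can be checked with a little arithmetic. The sketch assumes the indirection table is filled round-robin with equal weights, which the commit message does not state; the function name and the 48-queue example are hypothetical.

```python
from collections import Counter


def imbalance(table_size, queues):
    """Ratio between the most and least loaded queue, counted in table
    entries, for an equal-weight round-robin fill."""
    if table_size < queues:
        raise ValueError("table has fewer entries than queues")
    hits = Counter(i % queues for i in range(table_size))
    return max(hits.values()) / min(hits.values())


print(imbalance(128, 48))   # some queues get 3 entries, others 2
print(imbalance(256, 48))   # some queues get 6 entries, others 5
```

When the table holds at least 4 entries per queue, the busiest queue has at most one entry more than a queue with 4 or more, so the ratio cannot exceed 5/4. A 128-entry table on a 48-queue device falls short of that bound.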
2026-01-28net: ethernet: neterion: s2io: remove unused driverEthan Nelson-Moore-197/+0
The s2io driver supports Exar (formerly Neterion and S2io) PCI-X 10 Gigabit Ethernet cards. Hardware supporting PCI-X has not been manufactured in years. On x86, it was quickly replaced by PCIe. While it stuck around longer on POWER hardware, the last POWER hardware to support it was POWER7, which is not supported by ppc64le Linux distributions. The last supported mainstream ppc64 Linux distribution was RHEL 7; while it is still supported under ELS, ELS is only available for x86 and IBM Z. It is possible to use many PCI-X cards in standard PCI slots (which are still available on new motherboards), but it does not make sense to do so for 10 Gigabit Ethernet because the maximum bandwidth of standard PCI is only 1067 Mbps. It is therefore highly unlikely that this driver is still being used. Remove the driver, and move the former maintainer to the CREDITS file (restoring credit for the vxge driver, which was removed in commit f05643a0f60b ("eth: remove neterion/vxge")). Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Link: https://patch.msgid.link/20260126031352.22997-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25Documentation: net: Fix typos in netdevices.rstDimitri Daskalakis-2/+2
Fixes two minor typos. Specifically, on -> or and Devices -> Device. Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Link: https://patch.msgid.link/20260122225723.2368698-1-dimitri.daskalakis1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23Documentation: use a source-read extension for the index link boilerplateJani Nikula-84/+0
The root document usually has a special :ref:`genindex` link to the generated index. This is also the case for Documentation/index.rst. The other index.rst files deeper in the directory hierarchy usually don't. For SPHINXDIRS builds, the root document isn't Documentation/index.rst, but some other index.rst in the hierarchy. Currently they have a ".. only::" block to add the index link when doing SPHINXDIRS html builds. This is obviously very tedious and repetitive. The link is also added to all index.rst files in the hierarchy for SPHINXDIRS builds, not just the root document. Put the boilerplate in a sphinx-includes/subproject-index.rst file, and include it at the end of the root document for subproject builds in an ad-hoc source-read extension defined in conf.py. For now, keep having the boilerplate in translations, because this approach currently doesn't cover translated index link headers. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Tested-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> [jc: did s/doctree/kern_doc_dir/ ] Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20260123143149.2024303-1-jani.nikula@intel.com>
2026-01-20net: remove legacy way to get/set HW timestamp configVadim Fedorenko-4/+3
With all drivers converted to use ndo_hwstamp callbacks the legacy way can be removed, marking ioctl interface as deprecated. Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20260116062121.1230184-1-vadim.fedorenko@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20  Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux  Jakub Kicinski  -0/+20
Pavel Begunkov says:

====================
Add support for providers with large rx buffer

Many modern NICs support configurable receive buffer lengths, and zcrx and memory providers can use buffers larger than 4K to improve performance. When paired with hw-gro, larger rx buffer sizes can drastically reduce the number of buffers traversing the stack and save a lot of processing time. It also allows giving users larger contiguous chunks of data.

Single stream benchmarks showed up to ~30% CPU util improvement. E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22

packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

This series adds net infrastructure for memory providers configuring the size and implements it for bnxt. It's an opt-in feature for drivers: they should advertise support for the parameter in the qops and must check if the hardware supports the given size. It's limited to memory providers as it drastically simplifies implementation. It doesn't affect the fast path zcrx uAPI, and the user exposed parameter is defined in zcrx terms, which allows it to be flexible and adjusted in the future.

A liburing example can be found at [2]

full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len

* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
  io_uring/zcrx: document area chunking parameter
  selftests: iou-zcrx: test large chunk sizes
  eth: bnxt: support qcfg provided rx page size
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  eth: bnxt: store rx buffer size per queue
  net: pass queue rx page size from memory provider
  net: add bare bone queue configs
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: memzero mp params when closing a queue
====================

Link: https://patch.msgid.link/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
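The claim that larger rx buffers reduce the number of buffers traversing the stack can be checked with simple arithmetic. The following is a minimal sketch only: the 64 KiB payload is a hypothetical aggregate size chosen for illustration, and `buffers_needed` is an illustrative helper, not part of any kernel or liburing API.

```python
def buffers_needed(payload_bytes: int, buf_len: int) -> int:
    """Return how many fixed-size rx buffers are needed to hold the payload."""
    if payload_bytes < 0 or buf_len <= 0:
        raise ValueError("payload must be >= 0 and buffer length must be > 0")
    # Integer ceiling division: a partially filled buffer still counts as one.
    return -(-payload_bytes // buf_len)


# Hypothetical 64 KiB aggregate (assumption for illustration only).
PAYLOAD = 64 * 1024

for buf_len in (4 * 1024, 32 * 1024):
    print(f"buf_len={buf_len}: {buffers_needed(PAYLOAD, buf_len)} buffers")
```

Under these assumptions the same payload occupies eight times fewer buffers with 32K buffers than with 4K ones, which is the effect the cover letter attributes the CPU savings to.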
2026-01-17  docs: tls: Enhance TLS resync async process documentation  Shahar Shitrit  -0/+30
Expand the tls-offload.rst documentation to provide a more detailed explanation of the asynchronous resync process, including the role of struct tls_offload_resync_async in managing resync requests on the kernel side. Also, add documentation for helper functions tls_offload_rx_resync_async_request_start/ _end/ _cancel. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1768298883-1602599-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-15  net: phy: remove unused fixup unregistering functions  Heiner Kallweit  -21/+1
No user of PHY fixups unregisters these. IOW: The fixup unregistering functions are unused and can be removed. Remove also documentation for these functions. Whilst at it, remove also mentioning of phy_register_fixup() from the Documentation, as this function has been static since ea47e70e476f ("net: phy: remove fixup-related definitions from phy.h which are not used outside phylib"). Fixup unregistering functions were added with f38e7a32ee4f ("phy: add phy fixup unregister functions") in 2016, and last user was removed with 6782d06a47ad ("net: usb: lan78xx: Remove KSZ9031 PHY fixup") in 2024. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/ff8ac321-435c-48d0-b376-fbca80c0c22e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13  Documentation: networking: Document the phy_port infrastructure  Maxime Chevallier  -0/+112
This documentation aims at describing the main goal of the phy_port infrastructure. Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-15-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-14  io_uring/zcrx: document area chunking parameter  Pavel Begunkov  -0/+20
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint telling the kernel what buffer size it should use. Document the API and limitations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2025-12-01  Documentation: net: dsa: mention simple HSR offload helpers  Vladimir Oltean  -0/+8
Keep the documentation up to date. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251130131657.65080-16-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-01  Documentation: net: dsa: mention availability of RedBox  Vladimir Oltean  -5/+4
Since commit 5055cccfc2d1 ("net: hsr: Provide RedBox support (HSR-SAN)"), RedBox is available (including for offload in DSA). Update the DSA documentation that states it isn't. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251130131657.65080-15-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  tcp: remove icsk->icsk_retransmit_timer  Eric Dumazet  -1/+0
Now that sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  tcp: introduce icsk->icsk_keepalive_timer  Eric Dumazet  -0/+1
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20  net/mlx5: implement swp_l4_csum_mode via devlink params  Daniel Zahka  -0/+14
swp_l4_csum_mode controls how L4 transmit checksums are computed when using Software Parser (SWP) hints for header locations. Supported values:

1. default: device will choose between full_csum or l4_only. Driver will discover the device's choice during initialization.
2. full_csum: calculate L4 checksum with the pseudo-header.
3. l4_only: calculate L4 checksum without the pseudo-header. Only available when swp_l4_csum_mode_l4_only is set in mlx5_ifc_nv_sw_offload_cap_bits.

Note that 'default' might be returned from the device and passed to userspace, and it might also be set during a devlink_param::reset_default() call, but attempts to set a value of default directly with param-set will be rejected. The l4_only setting is a dependency for PSP initialization in mlx5e_psp_init().

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-5-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
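The acceptance rules described above for param-set can be modelled in a few lines. This is a sketch of the rules as stated in the commit message, not the mlx5 driver's code: `validate_param_set` and its `l4_only_cap` argument are hypothetical names introduced for illustration.

```python
# The three values named in the commit message.
VALID_MODES = ("default", "full_csum", "l4_only")


def validate_param_set(value: str, l4_only_cap: bool) -> str:
    """Model of the param-set rules from the commit message (illustrative only).

    l4_only_cap stands in for the swp_l4_csum_mode_l4_only capability bit.
    """
    if value not in VALID_MODES:
        raise ValueError(f"unknown swp_l4_csum_mode value: {value!r}")
    if value == "default":
        # 'default' may be reported by the device or applied via a reset,
        # but setting it directly is rejected.
        raise ValueError("'default' cannot be set directly with param-set")
    if value == "l4_only" and not l4_only_cap:
        raise ValueError("l4_only requires the l4_only capability bit")
    return value


print(validate_param_set("full_csum", l4_only_cap=False))
print(validate_param_set("l4_only", l4_only_cap=True))
```

The model keeps the asymmetry the message points out: 'default' is a legal value to read back, but not a legal value to write.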
2025-11-20  devlink: support default values for param-get and param-set  Daniel Zahka  -0/+10
Support querying and resetting to default param values. Introduce two new devlink netlink attrs: DEVLINK_ATTR_PARAM_VALUE_DEFAULT and DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an optional parameter value inside of the param_value nested attribute. The latter is used in param-set requests from userspace to indicate that the driver should reset the param to its default value. To implement this, two new functions are added to the devlink driver api: devlink_param::get_default() and devlink_param::reset_default(). These callbacks allow drivers to implement default param actions for runtime and permanent cmodes. For driverinit params, the core latches the last value set by a driver via devl_param_driverinit_value_set(), and uses that as the default value for a param. Because default parameter values are optional, it would be impossible to discern whether or not a param of type bool has default value of false or not provided if the default value is encoded using a netlink flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an associated default value, the default value is encoded using a u8 type. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
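The reasoning about bool defaults can be made concrete with a small model. This sketch uses plain dictionaries as a stand-in for netlink attributes; the function names and the "default" key are illustrative assumptions and do not correspond to real devlink or netlink library calls.

```python
from typing import Optional


def encode_default_as_flag(default: Optional[bool]) -> dict:
    """Flag-style encoding: the attribute is either present (true) or absent."""
    attrs = {}
    if default:
        attrs["default"] = None  # a flag attribute carries no payload
    return attrs


def encode_default_as_u8(default: Optional[bool]) -> dict:
    """u8-style encoding: presence means 'a default exists', payload is its value."""
    attrs = {}
    if default is not None:
        attrs["default"] = 1 if default else 0
    return attrs


# With a flag, "default is false" and "no default provided" look identical.
print(encode_default_as_flag(False), encode_default_as_flag(None))
# With a u8, the two cases can be told apart.
print(encode_default_as_u8(False), encode_default_as_u8(None))
```

An optional value needs three states (true, false, not provided), and a presence-only flag can express just two of them, which is why the u8 encoding is used for defaults.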
2025-11-20  tcp: add net.ipv4.tcp_rcvbuf_low_rtt  Eric Dumazet  -0/+11
This is a follow up of commit aa251c84636c ("tcp: fix too slow tcp_rcvbuf_grow() action") which brought back the issue that I tried to fix in commit 65c5287892e9 ("tcp: fix sk_rcvbuf overshoot").

We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d0d ("tcp: increase tcp_rmem[2] to 32 MB").

Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can force NIC driver to not recycle pages from their page pool, and also can cause cache evictions for DDIO enabled cpus/NIC, as receivers are usually slower than senders.

Add net.ipv4.tcp_rcvbuf_low_rtt sysctl, set by default to 1000 usec (1 ms). If RTT is smaller than the sysctl value, use the RTT/tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf inflation.

Tested: Pair of hosts with a 200Gbit IDPF NIC. Using netperf/netserver. Client initiates 8 TCP bulk flows, asking netserver to use CPU #10 only.

super_netperf 8 -H server -T,10 -l 30

On server, use perf -e tcp:tcp_rcvbuf_grow while test is running.
Before: sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1 perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230 1153.051201: tcp:tcp_rcvbuf_grow: time=398 rtt_us=382 copied=6905856 inq=180224 space=6115328 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1153.138752: tcp:tcp_rcvbuf_grow: time=446 rtt_us=413 copied=5529600 inq=180224 space=4505600 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil 1153.361484: tcp:tcp_rcvbuf_grow: time=415 rtt_us=380 copied=7061504 inq=204800 space=6725632 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1153.457642: tcp:tcp_rcvbuf_grow: time=483 rtt_us=421 copied=5885952 inq=720896 space=4407296 ooo=0 scaling_ratio=240 rcvbuf=23763511 rcv_ssthresh=22223271 window_clamp=22278291 rcv_wnd=21430272 famil 1153.466002: tcp:tcp_rcvbuf_grow: time=308 rtt_us=281 copied=3244032 inq=180224 space=2883584 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41713664 famil 1153.747792: tcp:tcp_rcvbuf_grow: time=394 rtt_us=332 copied=4460544 inq=585728 space=3063808 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41373696 famil 1154.260747: tcp:tcp_rcvbuf_grow: time=652 rtt_us=226 copied=10977280 inq=737280 space=9486336 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29197743 window_clamp=29217691 rcv_wnd=28368896 fami 1154.375019: tcp:tcp_rcvbuf_grow: time=461 rtt_us=443 copied=7573504 inq=507904 space=6856704 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25288704 famil 1154.463072: tcp:tcp_rcvbuf_grow: time=494 rtt_us=408 copied=7983104 inq=200704 space=7065600 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25579520 famil 1154.474658: tcp:tcp_rcvbuf_grow: time=507 rtt_us=459 
copied=5586944 inq=540672 space=4718592 ooo=0 scaling_ratio=240 rcvbuf=17852266 rcv_ssthresh=16692999 window_clamp=16736499 rcv_wnd=16056320 famil 1154.584657: tcp:tcp_rcvbuf_grow: time=494 rtt_us=427 copied=8126464 inq=204800 space=7782400 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1154.702117: tcp:tcp_rcvbuf_grow: time=480 rtt_us=406 copied=5734400 inq=180224 space=5349376 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil 1155.941595: tcp:tcp_rcvbuf_grow: time=717 rtt_us=670 copied=11042816 inq=3784704 space=7159808 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=14614528 fam 1156.384735: tcp:tcp_rcvbuf_grow: time=529 rtt_us=473 copied=9011200 inq=180224 space=7258112 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=18018304 famil 1157.821676: tcp:tcp_rcvbuf_grow: time=529 rtt_us=272 copied=8224768 inq=602112 space=6545408 ooo=0 scaling_ratio=240 rcvbuf=67000000 rcv_ssthresh=62793576 window_clamp=62812500 rcv_wnd=62115840 famil 1158.906379: tcp:tcp_rcvbuf_grow: time=710 rtt_us=445 copied=11845632 inq=540672 space=10240000 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29205935 window_clamp=29217691 rcv_wnd=28536832 fam 1164.600160: tcp:tcp_rcvbuf_grow: time=841 rtt_us=430 copied=12976128 inq=1290240 space=11304960 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29212591 window_clamp=29217691 rcv_wnd=27856896 fa 1165.163572: tcp:tcp_rcvbuf_grow: time=845 rtt_us=800 copied=12632064 inq=540672 space=7921664 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25912795 window_clamp=25937095 rcv_wnd=25260032 fami 1165.653464: tcp:tcp_rcvbuf_grow: time=388 rtt_us=309 copied=4493312 inq=180224 space=3874816 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41995899 window_clamp=42050919 rcv_wnd=41713664 famil 1166.651211: tcp:tcp_rcvbuf_grow: time=556 
rtt_us=553 copied=6328320 inq=540672 space=5554176 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=20946944 famil After: sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1000 perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230 1457.053149: tcp:tcp_rcvbuf_grow: time=128 rtt_us=24 copied=1441792 inq=40960 space=1269760 ooo=0 scaling_ratio=240 rcvbuf=2960741 rcv_ssthresh=2605474 window_clamp=2775694 rcv_wnd=2568192 family=AF_I 1458.000778: tcp:tcp_rcvbuf_grow: time=128 rtt_us=31 copied=1441792 inq=24576 space=1400832 ooo=0 scaling_ratio=240 rcvbuf=3060163 rcv_ssthresh=2810042 window_clamp=2868902 rcv_wnd=2674688 family=AF_I 1458.088059: tcp:tcp_rcvbuf_grow: time=190 rtt_us=110 copied=3227648 inq=385024 space=2781184 ooo=0 scaling_ratio=240 rcvbuf=6728240 rcv_ssthresh=6252705 window_clamp=6307725 rcv_wnd=5799936 family=AF 1458.148549: tcp:tcp_rcvbuf_grow: time=232 rtt_us=129 copied=3956736 inq=237568 space=2842624 ooo=0 scaling_ratio=240 rcvbuf=6731333 rcv_ssthresh=6252705 window_clamp=6310624 rcv_wnd=5918720 family=AF 1458.466861: tcp:tcp_rcvbuf_grow: time=193 rtt_us=83 copied=2949120 inq=180224 space=2457600 ooo=0 scaling_ratio=240 rcvbuf=5751438 rcv_ssthresh=5357689 window_clamp=5391973 rcv_wnd=5054464 family=AF_ 1458.775476: tcp:tcp_rcvbuf_grow: time=257 rtt_us=127 copied=4304896 inq=352256 space=3346432 ooo=0 scaling_ratio=240 rcvbuf=8067131 rcv_ssthresh=7523275 window_clamp=7562935 rcv_wnd=7061504 family=AF 1458.776631: tcp:tcp_rcvbuf_grow: time=200 rtt_us=96 copied=3260416 inq=143360 space=2768896 ooo=0 scaling_ratio=240 rcvbuf=6397256 rcv_ssthresh=5938567 window_clamp=5997427 rcv_wnd=5828608 family=AF_ 1459.707973: tcp:tcp_rcvbuf_grow: time=215 rtt_us=96 copied=2506752 inq=163840 space=1388544 ooo=0 scaling_ratio=240 rcvbuf=3068867 rcv_ssthresh=2768282 window_clamp=2877062 rcv_wnd=2555904 family=AF_ 1460.246494: tcp:tcp_rcvbuf_grow: time=231 rtt_us=80 copied=3756032 inq=204800 space=3117056 ooo=0 
scaling_ratio=240 rcvbuf=7288091 rcv_ssthresh=6773725 window_clamp=6832585 rcv_wnd=6471680 family=AF_ 1460.714596: tcp:tcp_rcvbuf_grow: time=270 rtt_us=110 copied=4714496 inq=311296 space=3719168 ooo=0 scaling_ratio=240 rcvbuf=8957739 rcv_ssthresh=8339020 window_clamp=8397880 rcv_wnd=7933952 family=AF 1462.029977: tcp:tcp_rcvbuf_grow: time=101 rtt_us=19 copied=1105920 inq=40960 space=1036288 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=1986560 family=AF_I 1462.802385: tcp:tcp_rcvbuf_grow: time=89 rtt_us=45 copied=1069056 inq=0 space=1064960 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=2035712 family=AF_INET6 1462.918648: tcp:tcp_rcvbuf_grow: time=105 rtt_us=33 copied=1441792 inq=180224 space=1069056 ooo=0 scaling_ratio=240 rcvbuf=2383282 rcv_ssthresh=2091684 window_clamp=2234326 rcv_wnd=1896448 family=AF_ 1463.222533: tcp:tcp_rcvbuf_grow: time=273 rtt_us=144 copied=4603904 inq=385024 space=3469312 ooo=0 scaling_ratio=240 rcvbuf=8422564 rcv_ssthresh=7891053 window_clamp=7896153 rcv_wnd=7409664 family=AF 1466.519312: tcp:tcp_rcvbuf_grow: time=130 rtt_us=23 copied=1343488 inq=0 space=1261568 ooo=0 scaling_ratio=240 rcvbuf=2780158 rcv_ssthresh=2493778 window_clamp=2606398 rcv_wnd=2494464 family=AF_INET6 1466.681003: tcp:tcp_rcvbuf_grow: time=128 rtt_us=21 copied=1441792 inq=12288 space=1343488 ooo=0 scaling_ratio=240 rcvbuf=2932027 rcv_ssthresh=2578555 window_clamp=2748775 rcv_wnd=2568192 family=AF_I 1470.689959: tcp:tcp_rcvbuf_grow: time=255 rtt_us=122 copied=3932160 inq=204800 space=3551232 ooo=0 scaling_ratio=240 rcvbuf=8182038 rcv_ssthresh=7647384 window_clamp=7670660 rcv_wnd=7442432 family=AF 1471.754154: tcp:tcp_rcvbuf_grow: time=188 rtt_us=95 copied=2138112 inq=577536 space=1429504 ooo=0 scaling_ratio=240 rcvbuf=3113650 rcv_ssthresh=2806426 window_clamp=2919046 rcv_wnd=2248704 family=AF_ 1476.813542: tcp:tcp_rcvbuf_grow: time=269 rtt_us=99 copied=3088384 inq=180224 
space=2564096 ooo=0 scaling_ratio=240 rcvbuf=6219470 rcv_ssthresh=5771893 window_clamp=5830753 rcv_wnd=5509120 family=AF_ 1477.738309: tcp:tcp_rcvbuf_grow: time=166 rtt_us=54 copied=1777664 inq=180224 space=1417216 ooo=0 scaling_ratio=240 rcvbuf=3117118 rcv_ssthresh=2874958 window_clamp=2922298 rcv_wnd=2613248 family=AF_ We can see sk_rcvbuf values are much smaller, and that rtt_us (estimation of rtt from a receiver point of view) is kept small, instead of being bloated. No difference in throughput. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20251119084813.3684576-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
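The commit message describes using the RTT/tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf inflation. A minimal sketch of that idea follows, assuming a plain linear scaling of a proposed growth amount. The function name `scaled_growth` and the exact arithmetic are illustrative assumptions and are not taken from the kernel's tcp_rcvbuf_grow() implementation.

```python
def scaled_growth(growth_bytes: int, rtt_us: int, low_rtt_us: int = 1000) -> int:
    """Scale a proposed receive-buffer growth by rtt/low_rtt for small-RTT flows.

    Illustrative model only: flows whose RTT is at or above the threshold keep
    the full growth, flows below it get a proportionally smaller growth.
    """
    if low_rtt_us <= 0 or rtt_us >= low_rtt_us:
        return growth_bytes
    return growth_bytes * rtt_us // low_rtt_us


# Hypothetical 1 MB growth request at different RTTs with the 1000 usec default.
for rtt in (25, 100, 500, 2000):
    print(f"rtt_us={rtt}: growth={scaled_growth(1_000_000, rtt)}")

# With the threshold set to 1 usec (the "Before" setting), nothing is scaled.
print(scaled_growth(1_000_000, 382, low_rtt_us=1))
```

In this model a threshold of 1 usec effectively disables the damping, which matches how the "Before" run was configured, while the 1000 usec default damps growth for the sub-millisecond RTTs seen in the traces.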
2025-11-20  tcp: tcp_moderate_rcvbuf is only used in rx path  Eric Dumazet  -1/+1
sysctl_tcp_moderate_rcvbuf is only used from tcp_rcvbuf_grow(). Move it to netns_ipv4_read_rx group. Remove various CACHELINE_ASSERT_GROUP_SIZE() from netns_ipv4_struct_check(), as they have no real benefit but cause pain for all changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251119084813.3684576-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-18  Merge tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next  Jakub Kicinski  -61/+78
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2025-11-18

1) Relax a lock contention bottleneck to improve IPsec crypto offload performance. From Jianbo Liu.
2) Deprecate pfkey, the interface will be removed in 2027.
3) Update xfrm documentation and move it to ipsec maintenance. From Bagas Sanjaya.

* tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  MAINTAINERS: Add entry for XFRM documentation
  net: Move XFRM documentation into its own subdirectory
  Documentation: xfrm_sync: Number the fifth section
  Documentation: xfrm_sysctl: Trim trailing colon in section heading
  Documentation: xfrm_sync: Trim excess section heading characters
  Documentation: xfrm_sync: Properly reindent list text
  Documentation: xfrm_device: Separate hardware offload sublists
  Documentation: xfrm_device: Use numbered list for offloading steps
  Documentation: xfrm_device: Wrap iproute2 snippets in literal code block
  pfkey: Deprecate pfkey
  xfrm: Skip redundant replay recheck for the hardware offload path
  xfrm: Refactor xfrm_input lock to reduce contention with RSS
====================

Link: https://patch.msgid.link/20251118092610.2223552-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-17  tcp: reduce tcp_comp_sack_slack_ns default value to 10 usec  Eric Dumazet  -1/+2
net.ipv4.tcp_comp_sack_slack_ns current default value is too high. When a flow has many drops (1 % or more), and small RTT, adding 100 usec before sending SACK stalls the sender relying on getting SACK fast enough to keep the pipe busy. Decrease the default to 10 usec. This is orthogonal to Congestion Control heuristics to determine if drops are caused by congestion or not. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251114135141.3810964-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-12  net: Move XFRM documentation into its own subdirectory  Bagas Sanjaya  -7/+17
XFRM docs currently reside in the Documentation/networking directory, yet they form a distinct group of their own. Move them into an xfrm subdirectory. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-11-12  Documentation: xfrm_sync: Number the fifth section  Bagas Sanjaya  -2/+2
Number the fifth section ("Exception to threshold settings") to be consistent with the rest of the sections. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-11-12  Documentation: xfrm_sysctl: Trim trailing colon in section heading  Bagas Sanjaya  -2/+2
The sole section heading ("/proc/sys/net/core/xfrm_* Variables") has a trailing colon. Trim it. Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-11-12  Documentation: xfrm_sync: Trim excess section heading characters  Bagas Sanjaya  -5/+5
The first section ("Message Structure") has an excess underline, while the second and third ones ("TLVS reflect the different parameters" and "Default configurations for the parameters") have trailing colons. Trim them. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>