| Age | Commit message (Collapse) | Author | Lines |
|
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1-3) IPVS updates from Julian Anastasov to enhance visibility into
IPVS internal state by exposing hash size, load factor etc and
allows userspace to tune the load factor used for resizing hash
tables.
4) reject empty/not nul terminated device names from xt_physdev.
This isn't a bug fix; existing code doesn't require a c-string.
But clean this up anyway because conceptually the interface name
definitely should be a c-string.
5) Switch nfnetlink to skb_mac_header helpers that didn't exist back
when this code was written. This gives us additional debug checks
but is not intended to change functionality.
6) Let the xt ttl/hoplimit match reject unknown operator modes.
This is a cleanup, the evaluation function simply returns false when
the mode is out of range. From Marino Dzalto.
7) xt_socket match should enable defrag after all other checks. This
bug is harmless, historically defrag could not be disabled either
except by rmmod.
8) remove UDP-Lite conntrack support, from Fernando Fernandez Mancera.
9) Avoid a couple -Wflex-array-member-not-at-end warnings in the old
xtables 32bit compat code, from Gustavo A. R. Silva.
10) nftables fwd expression should drop packets when their ttl/hl has
expired. This is a bug fix deferred, its not deemed important
enough for -rc8.
11) Add additional checks before assuming the mac header is an ethernet
header, from Zhengchuan Liang.
* tag 'nf-next-26-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: require Ethernet MAC header before using eth_hdr()
netfilter: nft_fwd_netdev: check ttl/hl before forwarding
netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
netfilter: conntrack: remove UDP-Lite conntrack support
netfilter: xt_socket: enable defrag after all other checks
netfilter: xt_HL: add pr_fmt and checkentry validation
netfilter: nfnetlink: prefer skb_mac_header helpers
netfilter: x_physdev: reject empty or not-nul terminated device names
ipvs: add conn_lfactor and svc_lfactor sysctl vars
ipvs: add ip_vs_status info
ipvs: show the current conn_tab size to users
====================
Link: https://patch.msgid.link/20260410112352.23599-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add documentation for the IFLA_BR_STP_MODE bridge attribute in the
"User space STP helper" section of the bridge documentation. Reference
the BR_STP_MODE_* values via kernel-doc and describe the use case for
network namespace environments.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Link: https://patch.msgid.link/20260405205224.3163000-3-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Allow the default load factor for the connection and service tables
to be configured.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.
Example with ynl client:
# ynl --family netdev --output-json --do queue-create \
--json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
{'id': 1}
Note that the netdevice locking order is always from the virtual to
the physical device.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Document the scope parameter for devlink resource show, which allows
filtering the dump to device-level or port-level resources only.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Document the port-level resource support and the option to dump all
resources, including both device-level and port-level entries.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is not used anywhere in the kernel.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260406212158.721806-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix two typos:
- 'Subfunctons' -> 'Subfunctions'
- 'subfuction' -> 'subfunction'
Reviewed-by: Joe Damato <joe@dama.to>
Signed-off-by: Ryohei Kinugawa <ryohei.kinugawa@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260324053416.70166-1-ryohei.kinugawa@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add two parameters for drivers supporting Rx CQE coalescing /
descriptor writeback.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Max time in nanoseconds after the first packet arrival in a
coalesced CQE or writeback to be sent.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260317191826.1346111-2-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Document shared devlink instances for multiple PFs on the same chip.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260312100407.551173-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
By default, the Linux TCP implementation does not shrink the
advertised window (RFC 7323 calls this "window retraction") with the
following exceptions:
- When an incoming segment cannot be added due to the receive buffer
running out of memory. Since commit 8c670bdfa58e ("tcp: correct
handling of extreme memory squeeze") a zero window will be
advertised in this case. It turns out that reaching the required
memory pressure is easy when window scaling is in use. In the
simplest case, sending a sufficient number of segments smaller than
the scale factor to a receiver that does not read data is enough.
- Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
allowing the tcp window to shrink") addressed the "eating memory"
problem by introducing a sysctl knob that allows shrinking the
window before running out of memory.
However, RFC 7323 does not only state that shrinking the window is
necessary in some cases, it also formulates requirements for TCP
implementations when doing so (Section 2.4).
This commit addresses the receiver-side requirements: After retracting
the window, the peer may have a snd_nxt that lies within a previously
advertised window but is now beyond the retracted window. This means
that all incoming segments (including pure ACKs) will be rejected
until the application happens to read enough data to let the peer's
snd_nxt be in window again (which may be never).
To comply with RFC 7323, the receiver MUST honor any segment that
would have been in window for any ACK sent by the receiver and, when
window scaling is in effect, SHOULD track the maximum window sequence
number it has advertised. This patch tracks that maximum window
sequence number rcv_mwnd_seq throughout the connection and uses it in
tcp_sequence() when deciding whether a segment is acceptable.
rcv_mwnd_seq is updated together with rcv_wup and rcv_wnd in
tcp_select_window(). If we count tcp_sequence() as fast path, it is
read in the fast path. Therefore, rcv_mwnd_seq is put into rcv_wnd's
cacheline group.
The logic for handling received data in tcp_data_queue() is already
sufficient and does not need to be updated.
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260309-tcp_rfc7323_retract_wnd_rfc-v3-1-4c7f96b1ec69@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix spelling mistake "Crate" to "Create" in the documentation.
Signed-off-by: ShravyaPanchagiri <shravy112@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260311030450.8461-1-shravy112@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add missing entry for sysctl_tcp_shrink_window.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260310073855.564927-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the current port selection algorithm, ports after a reserved port
range or long time used port are used more often than others [1]. This
causes an uneven port usage distribution. This combines with cloud
environments blocking connections between the application server and the
database server if there was a previous connection with the same source
port, leading to connectivity problems between applications on cloud
environments.
The real issue here is that these firewalls cannot cope with
standards-compliant port reuse. This is a workaround for such situations
and an improvement on the distribution of ports selected.
The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
The step size is selected randomly on every connect() call ensuring it
is a coprime with respect to the size of the range of ports we want to
scan. This way, we can ensure that all ports within the range are
scanned before returning an error. To enable this algorithm, the user
must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
In addition, on graphs generated we can observe that the distribution of
source ports is more even with the proposed approach. [2]
[1] https://0xffsoftware.com/port_graph_current_alg.html
[2] https://0xffsoftware.com/port_graph_random_step_alg.html
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Document missing SMC sysctl parameters limit_smc_hs and hs_ctrl
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: D. Wythe<alibuda@linux.alibaba.com>
Link: https://patch.msgid.link/20260309124541.22723-3-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
smcr_buf_type section used inconsistent indentation compared
with the rest of this document.
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: D. Wythe<alibuda@linux.alibaba.com>
Link: https://patch.msgid.link/20260309124541.22723-2-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of storing the @mask at the beginning of rps_sock_flow_table,
use 5 low order bits of the rps_tag_ptr to store the log of the size.
This removes a potential cache line miss to fetch @mask.
More importantly, we can switch to vmalloc_huge() without wasting memory.
Tested with:
numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The word "a" was used instead of "an" which is grammatically incorrect.
Fixed by changing from "a" to "an". This improves readability of the
documentation.
Signed-off-by: Leon Kral <leon.j.kral@protonmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://patch.msgid.link/20260227001151.41610-1-leon.j.kral@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Clarify the bit-by-bit bitset format's behavior around mandatory
attributes and bit identification. More specifically, the following
changes are made:
* Rephrase a misleading sentence which implies name and index are
mutually exclusive
* Describe that ETHTOOL_A_BITSET_BITS nest is mandatory
* Describe that a request fails if inconsistent identifiers are given
Signed-off-by: Yohei Kojima <yk@y-koj.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ef90a56965ca66e57aa177929ce3e10c5ca815fa.1772031974.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add entry for the neigh/default/gc_interval sysctl. This sysctl is
unused since kernel v2.6.8.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20260225095822.44050-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These sysctls were added in 4cdf507d5452 ("icmp: add a global rate
limitation") and their default values might be too small.
Some network tools send probes to closed UDP ports from many hosts
to estimate proportion of packet drops on a particular target.
This patch sets both sysctls to 10000.
Note the per-peer rate-limit (as described in RFC 4443 2.4 (f))
intent is still enforced.
This also increases security, see b38e7819cae9
("icmp: randomize the global rate limiter") for reference.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223161742.929830-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add missing documentation for a neighbor table garbage collector sysctl
parameter in ip-sysctl.rst:
neigh/default/gc_stale_time: controls how long an unused neighbor entry
is kept before becoming eligible for garbage collection (default: 60
seconds)
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20260223101257.47563-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Following part was needed before the blamed commit, because
inet_getpeer_v6() second argument was the prefix.
/* Give more bandwidth to wider prefixes. */
if (rt->rt6i_dst.plen < 128)
tmo >>= ((128 - rt->rt6i_dst.plen)>>5);
Now inet_getpeer_v6() retrieves hosts, we need to remove
@tmo adjustement or wider prefixes likes /24 allow 8x
more ICMP to be sent for a given ratelimit.
As we had this issue for a while, this patch changes net.ipv6.icmp.ratelimit
default value from 1000ms to 100ms to avoid potential regressions.
Also add a READ_ONCE() when reading net->ipv6.sysctl.icmpv6_time.
Fixes: fd0273d7939f ("ipv6: Remove external dependency on rt6i_dst and rt6i_src")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260216142832.3834174-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- A significant effort all around the stack to guide the compiler to
make the right choice when inlining code, to avoid unneeded calls
for small helper and stack canary overhead in the fast-path.
This generates better and faster code with very small or no text
size increases, as in many cases the call generated more code than
the actual inlined helper.
- Extend AccECN implementation so that is now functionally complete,
also allow the user-space enabling it on a per network namespace
basis.
- Add support for memory providers with large (above 4K) rx buffer.
Paired with hw-gro, larger rx buffer sizes reduce the number of
buffers traversing the stack, dincreasing single stream CPU usage
by up to ~30%.
- Do not add HBH header to Big TCP GSO packets. This simplifies the
RX path, the TX path and the NIC drivers, and is possible because
user-space taps can now interpret correctly such packets without
the HBH hint.
- Allow IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified,
aligning IPv6 to IPv4 behavior.
- Multi-queue aware sch_cake. This makes it possible to scale the
rate shaper of sch_cake across multiple CPUs, while still enforcing
a single global rate on the interface.
- Add support for the nbcon (new buffer console) infrastructure to
netconsole, enabling lock-free, priority-based console operations
that are safer in crash scenarios.
- Improve the TCP ipv6 output path to cache the flow information,
saving cpu cycles, reducing cache line misses and stack use.
- Improve netfilter packet tracker to resolve clashes for most
protocols, avoiding unneeded drops on rare occasions.
- Add IP6IP6 tunneling acceleration to the flowtable infrastructure.
- Reduce tcp socket size by one cache line.
- Notify neighbour changes atomically, avoiding inconsistencies
between the notification sequence and the actual states sequence.
- Add vsock namespace support, allowing complete isolation of vsocks
across different network namespaces.
- Improve xsk generic performances with cache-alignment-oriented
optimizations.
- Support netconsole automatic target recovery, allowing netconsole
to reestablish targets when underlying low-level interface comes
back online.
Driver API:
- Support for switching the working mode (automatic vs manual) of a
DPLL device via netlink.
- Introduce PHY ports representation to expose multiple front-facing
media ports over a single MAC.
- Introduce "rx-polarity" and "tx-polarity" device tree properties,
to generalize polarity inversion requirements for differential
signaling.
- Add helper to create, prepare and enable managed clocks.
Device drivers:
- Add Huawei hinic3 PF etherner driver.
- Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
controller.
- Add ethernet driver for MaxLinear MxL862xx switches
- Remove parallel-port Ethernet driver.
- Convert existing driver timestamp configuration reporting to
hwtstamp_get and remove legacy ioctl().
- Convert existing drivers to .get_rx_ring_count(), simplifing the RX
ring count retrieval. Also remove the legacy fallback path.
- Ethernet high-speed NICs:
- Broadcom (bnxt, bng):
- bnxt: add FW interface update to support FEC stats histogram
and NVRAM defragmentation
- bng: add TSO and H/W GRO support
- nVidia/Mellanox (mlx5):
- improve latency of channel restart operations, reducing the
used H/W resources
- add TSO support for UDP over GRE over VLAN
- add flow counters support for hardware steering (HWS) rules
- use a static memory area to store headers for H/W GRO,
leading to 12% RX tput improvement
- Intel (100G, ice, idpf):
- ice: reorganizes layout of Tx and Rx rings for cacheline
locality and utilizes __cacheline_group* macros on the new
layouts
- ice: introduces Synchronous Ethernet (SyncE) support
- Meta (fbnic):
- adds debugfs for firmware mailbox and tx/rx rings vectors
- Ethernet virtual:
- geneve: introduce GRO/GSO support for double UDP encapsulation
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- some code refactoring and cleanups
- RealTek (r8169):
- add support for RTL8127ATF (10G Fiber SFP)
- add dash and LTR support
- Airoha:
- AN8811HB 2.5 Gbps phy support
- Freescale (fec):
- add XDP zero-copy support
- Thunderbolt:
- add get link setting support to allow bonding
- Renesas:
- add support for RZ/G3L GBETH SoC
- Ethernet switches:
- Maxlinear:
- support R(G)MII slow rate configuration
- add support for Intel GSW150
- Motorcomm (yt921x):
- add DCB/QoS support
- TI:
- icssm-prueth: support bridging (STP/RSTP) via the switchdev
framework
- Ethernet PHYs:
- Realtek:
- enable SGMII and 2500Base-X in-band auto-negotiation
- simplify and reunify C22/C45 drivers
- Micrel: convert bindings to DT schema
- CAN:
- move skb headroom content into skb extensions, making CAN
metadata access more robust
- CAN drivers:
- rcar_canfd:
- add support for FD-only mode
- add support for the RZ/T2H SoC
- sja1000: cleanup the CAN state handling
- WiFi:
- implement EPPKE/802.1X over auth frames support
- split up drop reasons better, removing generic RX_DROP
- additional FTM capabilities: 6 GHz support, supported number of
spatial streams and supported number of LTF repetitions
- better mac80211 iterators to enumerate resources
- initial UHR (Wi-Fi 8) support for cfg80211/mac80211
- WiFi drivers:
- Qualcomm/Atheros:
- ath11k: support for Channel Frequency Response measurement
- ath12k: a significant driver refactor to support multi-wiphy
devices and and pave the way for future device support in the
same driver (rather than splitting to ath13k)
- ath12k: support for the QCC2072 chipset
- Intel:
- iwlwifi: partial Neighbor Awareness Networking (NAN) support
- iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
- RealTek (rtw89):
- preparations for RTL8922DE support
- Bluetooth:
- implement setsockopt(BT_PHY) to set the connection packet type/PHY
- set link_policy on incoming ACL connections
- Bluetooth drivers:
- btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
- btqca: add WCN6855 firmware priority selection feature"
* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
bnge/bng_re: Add a new HSI
net: macb: Fix tx/rx malfunction after phy link down and up
af_unix: Fix memleak of newsk in unix_stream_connect().
net: ti: icssg-prueth: Add optional dependency on HSR
net: dsa: add basic initial driver for MxL862xx switches
net: mdio: add unlocked mdiodev C45 bus accessors
net: dsa: add tag format for MxL862xx switches
dt-bindings: net: dsa: add MaxLinear MxL862xx
selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
octeontx2-pf: Unregister devlink on probe failure
net: renesas: rswitch: fix forwarding offload statemachine
ionic: Rate limit unknown xcvr type messages
tcp: inet6_csk_xmit() optimization
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
ipv6: use np->final in inet6_sk_rebuild_header()
ipv6: add daddr/final storage in struct ipv6_pinfo
net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
...
|
|
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.
Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta in AccECN processing based
on it. If the packets ACKed per ACK tends to be large, don't
conservatively assume ACE field overflow.
This patch uses the existing 2-byte holes in the rx group for new
u16 variables withtout creating more holes. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ecn_bytes[3]; /* 2744 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */
[...]
/* size: 3264, cachelines: 51, members: 177 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ecn_bytes[3]; /* 2744 12 */
u16 pkts_acked_ewma; /* 2756 2 */
/* XXX 2 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */
[...]
/* size: 3264, cachelines: 51, members: 178 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Spell out the recommendation that the RSS table should be
4x the queue count to avoid traffic imbalance. Include minor
rephrasing and removal of the explicit 128 entry example
since a 128 entry table is inadequate on modern machines.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131225454.1225151-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The s2io driver supports Exar (formerly Neterion and S2io) PCI-X 10
Gigabit Ethernet cards. Hardware supporting PCI-X has not been
manufactured in years. On x86, it was quickly replaced by PCIe. While
it stuck around longer on POWER hardware, the last POWER hardware to
support it was POWER7, which is not supported by ppc64le Linux
distributions. The last supported mainstream ppc64 Linux distribution
was RHEL 7; while it is still supported under ELS, ELS is only
available for x86 and IBM Z. It is possible to use many PCI-X cards in
standard PCI slots (which are still available on new motherboards), but
it does not make sense to do so for 10 Gigabit Ethernet because the
maximum bandwidth of standard PCI is only 1067 Mbps. It is therefore
highly unlikely that this driver is still being used. Remove the
driver, and move the former maintainer to the CREDITS file (restoring
credit for the vxge driver, which was removed in commit f05643a0f60b
("eth: remove neterion/vxge").
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260126031352.22997-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fixes two minor typos. Specifically, on -> or and Devices -> Device.
Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260122225723.2368698-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The root document usually has a special :ref:`genindex` link to the
generated index. This is also the case for Documentation/index.rst. The
other index.rst files deeper in the directory hierarchy usually don't.
For SPHINXDIRS builds, the root document isn't Documentation/index.rst,
but some other index.rst in the hierarchy. Currently they have a
".. only::" block to add the index link when doing SPHINXDIRS html
builds.
This is obviously very tedious and repetitive. The link is also added to
all index.rst files in the hierarchy for SPHINXDIRS builds, not just the
root document.
Put the boilerplate in a sphinx-includes/subproject-index.rst file, and
include it at the end of the root document for subproject builds in an
ad-hoc source-read extension defined in conf.py.
For now, keep having the boilerplate in translations, because this
approach currently doesn't cover translated index link headers.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Tested-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
[jc: did s/doctree/kern_doc_dir/ ]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260123143149.2024303-1-jani.nikula@intel.com>
|
|
With all drivers converted to use ndo_hwstamp callbacks the legacy way
can be removed, marking ioctl interface as deprecated.
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260116062121.1230184-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pavel Begunkov says:
====================
Add support for providers with large rx buffer
Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance.
When paired with hw-gro larger rx buffer sizes can drastically reduce
the number of buffers traversing the stack and save a lot of processing
time. It also allows to give to users larger contiguous chunks of data.
Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:
packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 1.53 0.00 27.78 2.72 1.31 66.45 0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.69 0.00 8.26 31.65 1.83 57.00 0.57
This series adds net infrastructure for memory providers configuring
the size and implements it for bnxt. It's an opt-in feature for drivers,
they should advertise support for the parameter in the qops and must check
if the hardware supports the given size. It's limited to memory providers
as it drastically simplifies implementation. It doesn't affect the fast
path zcrx uAPI, and the user exposed parameter is defined in zcrx terms,
which allows it to be flexible and adjusted in the future.
A liburing example can be found at [2]
full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len
* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
io_uring/zcrx: document area chunking parameter
selftests: iou-zcrx: test large chunk sizes
eth: bnxt: support qcfg provided rx page size
eth: bnxt: adjust the fill level of agg queues with larger buffers
eth: bnxt: store rx buffer size per queue
net: pass queue rx page size from memory provider
net: add bare bone queue configs
net: reduce indent of struct netdev_queue_mgmt_ops members
net: memzero mp params when closing a queue
====================
Link: https://patch.msgid.link/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Expand the tls-offload.rst documentation to provide a more detailed
explanation of the asynchronous resync process, including the role
of struct tls_offload_resync_async in managing resync requests on
the kernel side.
Also, add documentation for helper functions
tls_offload_rx_resync_async_request_start/ _end/ _cancel.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1768298883-1602599-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No user of PHY fixups unregisters these. IOW: The fixup unregistering
functions are unused and can be removed. Remove also documentation
for these functions. Whilst at it, remove also mentioning of
phy_register_fixup() from the Documentation, as this function has been
static since ea47e70e476f ("net: phy: remove fixup-related definitions
from phy.h which are not used outside phylib").
Fixup unregistering functions were added with f38e7a32ee4f
("phy: add phy fixup unregister functions") in 2016, and last user
was removed with 6782d06a47ad ("net: usb: lan78xx: Remove KSZ9031 PHY
fixup") in 2024.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/ff8ac321-435c-48d0-b376-fbca80c0c22e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This documentation aims at describing the main goal of the phy_port
infrastructure.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260108080041.553250-15-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint specifying
the kernel what buffer size it should use. Document the API and
limitations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
|
|
Keep the documentation up to date.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251130131657.65080-16-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit 5055cccfc2d1 ("net: hsr: Provide RedBox support
(HSR-SAN)"), RedBox is available (including for offload in DSA).
Update the DSA documentation that states it isn't.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251130131657.65080-15-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now sk->sk_timer is no longer used by TCP keepalive, we can use
its storage for TCP and MPTCP retransmit timers for better
cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk->sk_timer has been used for TCP keepalives.
Keepalive timers are not in fast path, we want to use sk->sk_timer
storage for retransmit timers, for better cache locality.
Create icsk->icsk_keepalive_timer and change keepalive
code to no longer use sk->sk_timer.
Added space is reclaimed in the following patch.
This includes changes to MPTCP, which was also using sk_timer.
Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer
for inet_sk_diag_fill() sake.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
swp_l4_csum_mode controls how L4 transmit checksums are computed when
using Software Parser (SWP) hints for header locations.
Supported values:
1. default: device will choose between full_csum or l4_only. Driver
will discover the device's choice during initialization.
2. full_csum: calculate L4 checksum with the pseudo-header.
3. l4_only: calculate L4 checksum without the pseudo-header. Only
available when swp_l4_csum_mode_l4_only is set in
mlx5_ifc_nv_sw_offload_cap_bits.
Note that 'default' might be returned from the device and passed to
userspace, and it might also be set during a
devlink_param::reset_default() call, but attempts to set a value of
default directly with param-set will be rejected.
The l4_only setting is a dependency for PSP initialization in
mlx5e_psp_init().
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251119025038.651131-5-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Support querying and resetting to default param values.
Introduce two new devlink netlink attrs:
DEVLINK_ATTR_PARAM_VALUE_DEFAULT and
DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an
optional parameter value inside of the param_value nested
attribute. The latter is used in param-set requests from userspace to
indicate that the driver should reset the param to its default value.
To implement this, two new functions are added to the devlink driver
api: devlink_param::get_default() and
devlink_param::reset_default(). These callbacks allow drivers to
implement default param actions for runtime and permanent cmodes. For
driverinit params, the core latches the last value set by a driver via
devl_param_driverinit_value_set(), and uses that as the default value
for a param.
Because default parameter values are optional, it would be impossible
to discern whether or not a param of type bool has default value of
false or not provided if the default value is encoded using a netlink
flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an
associated default value, the default value is encoded using a u8
type.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a follow up of commit aa251c84636c ("tcp: fix too slow
tcp_rcvbuf_grow() action") which brought again the issue that I tried
to fix in commit 65c5287892e9 ("tcp: fix sk_rcvbuf overshoot")
We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d0d
("tcp: increase tcp_rmem[2] to 32 MB")
Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf
too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can
force NIC driver to not recycle pages from their page pool, and also
can cause cache evictions for DDIO enabled cpus/NIC, as receivers
are usually slower than senders.
Add net.ipv4.tcp_rcvbuf_low_rtt sysctl, set by default to 1000 usec (1 ms)
If RTT if smaller than the sysctl value, use the RTT/tcp_rcvbuf_low_rtt
ratio to control sk_rcvbuf inflation.
Tested:
Pair of hosts with a 200Gbit IDPF NIC. Using netperf/netserver
Client initiates 8 TCP bulk flows, asking netserver to use CPU #10 only.
super_netperf 8 -H server -T,10 -l 30
On server, use perf -e tcp:tcp_rcvbuf_grow while test is running.
Before:
sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
1153.051201: tcp:tcp_rcvbuf_grow: time=398 rtt_us=382 copied=6905856 inq=180224 space=6115328 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
1153.138752: tcp:tcp_rcvbuf_grow: time=446 rtt_us=413 copied=5529600 inq=180224 space=4505600 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
1153.361484: tcp:tcp_rcvbuf_grow: time=415 rtt_us=380 copied=7061504 inq=204800 space=6725632 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
1153.457642: tcp:tcp_rcvbuf_grow: time=483 rtt_us=421 copied=5885952 inq=720896 space=4407296 ooo=0 scaling_ratio=240 rcvbuf=23763511 rcv_ssthresh=22223271 window_clamp=22278291 rcv_wnd=21430272 famil
1153.466002: tcp:tcp_rcvbuf_grow: time=308 rtt_us=281 copied=3244032 inq=180224 space=2883584 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41713664 famil
1153.747792: tcp:tcp_rcvbuf_grow: time=394 rtt_us=332 copied=4460544 inq=585728 space=3063808 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41373696 famil
1154.260747: tcp:tcp_rcvbuf_grow: time=652 rtt_us=226 copied=10977280 inq=737280 space=9486336 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29197743 window_clamp=29217691 rcv_wnd=28368896 fami
1154.375019: tcp:tcp_rcvbuf_grow: time=461 rtt_us=443 copied=7573504 inq=507904 space=6856704 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25288704 famil
1154.463072: tcp:tcp_rcvbuf_grow: time=494 rtt_us=408 copied=7983104 inq=200704 space=7065600 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25579520 famil
1154.474658: tcp:tcp_rcvbuf_grow: time=507 rtt_us=459 copied=5586944 inq=540672 space=4718592 ooo=0 scaling_ratio=240 rcvbuf=17852266 rcv_ssthresh=16692999 window_clamp=16736499 rcv_wnd=16056320 famil
1154.584657: tcp:tcp_rcvbuf_grow: time=494 rtt_us=427 copied=8126464 inq=204800 space=7782400 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
1154.702117: tcp:tcp_rcvbuf_grow: time=480 rtt_us=406 copied=5734400 inq=180224 space=5349376 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
1155.941595: tcp:tcp_rcvbuf_grow: time=717 rtt_us=670 copied=11042816 inq=3784704 space=7159808 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=14614528 fam
1156.384735: tcp:tcp_rcvbuf_grow: time=529 rtt_us=473 copied=9011200 inq=180224 space=7258112 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=18018304 famil
1157.821676: tcp:tcp_rcvbuf_grow: time=529 rtt_us=272 copied=8224768 inq=602112 space=6545408 ooo=0 scaling_ratio=240 rcvbuf=67000000 rcv_ssthresh=62793576 window_clamp=62812500 rcv_wnd=62115840 famil
1158.906379: tcp:tcp_rcvbuf_grow: time=710 rtt_us=445 copied=11845632 inq=540672 space=10240000 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29205935 window_clamp=29217691 rcv_wnd=28536832 fam
1164.600160: tcp:tcp_rcvbuf_grow: time=841 rtt_us=430 copied=12976128 inq=1290240 space=11304960 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29212591 window_clamp=29217691 rcv_wnd=27856896 fa
1165.163572: tcp:tcp_rcvbuf_grow: time=845 rtt_us=800 copied=12632064 inq=540672 space=7921664 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25912795 window_clamp=25937095 rcv_wnd=25260032 fami
1165.653464: tcp:tcp_rcvbuf_grow: time=388 rtt_us=309 copied=4493312 inq=180224 space=3874816 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41995899 window_clamp=42050919 rcv_wnd=41713664 famil
1166.651211: tcp:tcp_rcvbuf_grow: time=556 rtt_us=553 copied=6328320 inq=540672 space=5554176 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=20946944 famil
After:
sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1000
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
1457.053149: tcp:tcp_rcvbuf_grow: time=128 rtt_us=24 copied=1441792 inq=40960 space=1269760 ooo=0 scaling_ratio=240 rcvbuf=2960741 rcv_ssthresh=2605474 window_clamp=2775694 rcv_wnd=2568192 family=AF_I
1458.000778: tcp:tcp_rcvbuf_grow: time=128 rtt_us=31 copied=1441792 inq=24576 space=1400832 ooo=0 scaling_ratio=240 rcvbuf=3060163 rcv_ssthresh=2810042 window_clamp=2868902 rcv_wnd=2674688 family=AF_I
1458.088059: tcp:tcp_rcvbuf_grow: time=190 rtt_us=110 copied=3227648 inq=385024 space=2781184 ooo=0 scaling_ratio=240 rcvbuf=6728240 rcv_ssthresh=6252705 window_clamp=6307725 rcv_wnd=5799936 family=AF
1458.148549: tcp:tcp_rcvbuf_grow: time=232 rtt_us=129 copied=3956736 inq=237568 space=2842624 ooo=0 scaling_ratio=240 rcvbuf=6731333 rcv_ssthresh=6252705 window_clamp=6310624 rcv_wnd=5918720 family=AF
1458.466861: tcp:tcp_rcvbuf_grow: time=193 rtt_us=83 copied=2949120 inq=180224 space=2457600 ooo=0 scaling_ratio=240 rcvbuf=5751438 rcv_ssthresh=5357689 window_clamp=5391973 rcv_wnd=5054464 family=AF_
1458.775476: tcp:tcp_rcvbuf_grow: time=257 rtt_us=127 copied=4304896 inq=352256 space=3346432 ooo=0 scaling_ratio=240 rcvbuf=8067131 rcv_ssthresh=7523275 window_clamp=7562935 rcv_wnd=7061504 family=AF
1458.776631: tcp:tcp_rcvbuf_grow: time=200 rtt_us=96 copied=3260416 inq=143360 space=2768896 ooo=0 scaling_ratio=240 rcvbuf=6397256 rcv_ssthresh=5938567 window_clamp=5997427 rcv_wnd=5828608 family=AF_
1459.707973: tcp:tcp_rcvbuf_grow: time=215 rtt_us=96 copied=2506752 inq=163840 space=1388544 ooo=0 scaling_ratio=240 rcvbuf=3068867 rcv_ssthresh=2768282 window_clamp=2877062 rcv_wnd=2555904 family=AF_
1460.246494: tcp:tcp_rcvbuf_grow: time=231 rtt_us=80 copied=3756032 inq=204800 space=3117056 ooo=0 scaling_ratio=240 rcvbuf=7288091 rcv_ssthresh=6773725 window_clamp=6832585 rcv_wnd=6471680 family=AF_
1460.714596: tcp:tcp_rcvbuf_grow: time=270 rtt_us=110 copied=4714496 inq=311296 space=3719168 ooo=0 scaling_ratio=240 rcvbuf=8957739 rcv_ssthresh=8339020 window_clamp=8397880 rcv_wnd=7933952 family=AF
1462.029977: tcp:tcp_rcvbuf_grow: time=101 rtt_us=19 copied=1105920 inq=40960 space=1036288 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=1986560 family=AF_I
1462.802385: tcp:tcp_rcvbuf_grow: time=89 rtt_us=45 copied=1069056 inq=0 space=1064960 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=2035712 family=AF_INET6
1462.918648: tcp:tcp_rcvbuf_grow: time=105 rtt_us=33 copied=1441792 inq=180224 space=1069056 ooo=0 scaling_ratio=240 rcvbuf=2383282 rcv_ssthresh=2091684 window_clamp=2234326 rcv_wnd=1896448 family=AF_
1463.222533: tcp:tcp_rcvbuf_grow: time=273 rtt_us=144 copied=4603904 inq=385024 space=3469312 ooo=0 scaling_ratio=240 rcvbuf=8422564 rcv_ssthresh=7891053 window_clamp=7896153 rcv_wnd=7409664 family=AF
1466.519312: tcp:tcp_rcvbuf_grow: time=130 rtt_us=23 copied=1343488 inq=0 space=1261568 ooo=0 scaling_ratio=240 rcvbuf=2780158 rcv_ssthresh=2493778 window_clamp=2606398 rcv_wnd=2494464 family=AF_INET6
1466.681003: tcp:tcp_rcvbuf_grow: time=128 rtt_us=21 copied=1441792 inq=12288 space=1343488 ooo=0 scaling_ratio=240 rcvbuf=2932027 rcv_ssthresh=2578555 window_clamp=2748775 rcv_wnd=2568192 family=AF_I
1470.689959: tcp:tcp_rcvbuf_grow: time=255 rtt_us=122 copied=3932160 inq=204800 space=3551232 ooo=0 scaling_ratio=240 rcvbuf=8182038 rcv_ssthresh=7647384 window_clamp=7670660 rcv_wnd=7442432 family=AF
1471.754154: tcp:tcp_rcvbuf_grow: time=188 rtt_us=95 copied=2138112 inq=577536 space=1429504 ooo=0 scaling_ratio=240 rcvbuf=3113650 rcv_ssthresh=2806426 window_clamp=2919046 rcv_wnd=2248704 family=AF_
1476.813542: tcp:tcp_rcvbuf_grow: time=269 rtt_us=99 copied=3088384 inq=180224 space=2564096 ooo=0 scaling_ratio=240 rcvbuf=6219470 rcv_ssthresh=5771893 window_clamp=5830753 rcv_wnd=5509120 family=AF_
1477.738309: tcp:tcp_rcvbuf_grow: time=166 rtt_us=54 copied=1777664 inq=180224 space=1417216 ooo=0 scaling_ratio=240 rcvbuf=3117118 rcv_ssthresh=2874958 window_clamp=2922298 rcv_wnd=2613248 family=AF_
We can see sk_rcvbuf values are much smaller, and that rtt_us (estimation of rtt
from a receiver point of view) is kept small, instead of being bloated.
No difference in throughput.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251119084813.3684576-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sysctl_tcp_moderate_rcvbuf is only used from tcp_rcvbuf_grow().
Move it to netns_ipv4_read_rx group.
Remove various CACHELINE_ASSERT_GROUP_SIZE() from netns_ipv4_struct_check(),
as they have no real benefit but cause pain for all changes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251119084813.3684576-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-11-18
1) Relax a lock contention bottleneck to improve IPsec crypto
offload performance. From Jianbo Liu.
2) Deprecate pfkey, the interface will be removed in 2027.
3) Update xfrm documentation and move it to ipsec maintainance.
From Bagas Sanjaya.
* tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
MAINTAINERS: Add entry for XFRM documentation
net: Move XFRM documentation into its own subdirectory
Documentation: xfrm_sync: Number the fifth section
Documentation: xfrm_sysctl: Trim trailing colon in section heading
Documentation: xfrm_sync: Trim excess section heading characters
Documentation: xfrm_sync: Properly reindent list text
Documentation: xfrm_device: Separate hardware offload sublists
Documentation: xfrm_device: Use numbered list for offloading steps
Documentation: xfrm_device: Wrap iproute2 snippets in literal code block
pfkey: Deprecate pfkey
xfrm: Skip redundant replay recheck for the hardware offload path
xfrm: Refactor xfrm_input lock to reduce contention with RSS
====================
Link: https://patch.msgid.link/20251118092610.2223552-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net.ipv4.tcp_comp_sack_slack_ns current default value is too high.
When a flow has many drops (1 % or more), and small RTT, adding 100 usec
before sending SACK stalls the sender relying on getting SACK
fast enough to keep the pipe busy.
Decrease the default to 10 usec.
This is orthogonal to Congestion Control heuristics to determine
if drops are caused by congestion or not.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251114135141.3810964-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
XFRM docs are currently reside in Documentation/networking directory,
yet these are distinctive as a group of their own. Move them into xfrm
subdirectory.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Number the fifth section ("Exception to threshold settings") to be
consistent with the rest of sections.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The sole section heading ("/proc/sys/net/core/xfrm_* Variables") has
trailing colon. Trim it.
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The first section "Message Structure" has excess underline, while the
second and third one ("TLVS reflect the different parameters" and
"Default configurations for the parameters") have trailing colon. Trim
them.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|