summaryrefslogtreecommitdiffstats
path: root/net/sched
AgeCommit message (Collapse)AuthorLines
11 daysnet/sched: teql: fix NULL pointer dereference in iptunnel_xmit on TEQL slave ↵Weiming Shi-0/+1
xmit teql_master_xmit() calls netdev_start_xmit(skb, slave) to transmit through slave devices, but does not update skb->dev to the slave device beforehand. When a gretap tunnel is a TEQL slave, the transmit path reaches iptunnel_xmit() which saves dev = skb->dev (still pointing to teql0 master) and later calls iptunnel_xmit_stats(dev, pkt_len). This function does: get_cpu_ptr(dev->tstats) Since teql_master_setup() does not set dev->pcpu_stat_type to NETDEV_PCPU_STAT_TSTATS, the core network stack never allocates tstats for teql0, so dev->tstats is NULL. get_cpu_ptr(NULL) computes NULL + __per_cpu_offset[cpu], resulting in a page fault. BUG: unable to handle page fault for address: ffff8880e6659018 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 68bc067 P4D 68bc067 PUD 0 Oops: Oops: 0002 [#1] SMP KASAN PTI RIP: 0010:iptunnel_xmit (./include/net/ip_tunnels.h:664 net/ipv4/ip_tunnel_core.c:89) Call Trace: <TASK> ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847) __gre_xmit (net/ipv4/ip_gre.c:478) gre_tap_xmit (net/ipv4/ip_gre.c:779) teql_master_xmit (net/sched/sch_teql.c:319) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) neigh_direct_output (net/core/neighbour.c:1660) ip_finish_output2 (net/ipv4/ip_output.c:237) __ip_finish_output.part.0 (net/ipv4/ip_output.c:315) ip_mc_output (net/ipv4/ip_output.c:369) ip_send_skb (net/ipv4/ip_output.c:1508) udp_send_skb (net/ipv4/udp.c:1195) udp_sendmsg (net/ipv4/udp.c:1485) inet_sendmsg (net/ipv4/af_inet.c:859) __sys_sendto (net/socket.c:2206) Fix this by setting skb->dev = slave before calling netdev_start_xmit(), so that tunnel xmit functions see the correct slave device with properly allocated tstats. Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260304044216.3517851-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/sched: act_ife: Fix metalist update behaviorJamal Hadi Salim-49/+44
Whenever an ife action replace changes the metalist, instead of replacing the old data on the metalist, the current ife code is appending the new metadata. Aside from being innapropriate behavior, this may lead to an unbounded addition of metadata to the metalist which might cause an out of bounds error when running the encode op: [ 138.423369][ C1] ================================================================== [ 138.424317][ C1] BUG: KASAN: slab-out-of-bounds in ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.424906][ C1] Write of size 4 at addr ffff8880077f4ffe by task ife_out_out_bou/255 [ 138.425778][ C1] CPU: 1 UID: 0 PID: 255 Comm: ife_out_out_bou Not tainted 7.0.0-rc1-00169-gfbdfa8da05b6 #624 PREEMPT(full) [ 138.425795][ C1] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 138.425800][ C1] Call Trace: [ 138.425804][ C1] <IRQ> [ 138.425808][ C1] dump_stack_lvl (lib/dump_stack.c:122) [ 138.425828][ C1] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 138.425839][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425844][ C1] ? __virt_addr_valid (./arch/x86/include/asm/preempt.h:95 (discriminator 1) ./include/linux/rcupdate.h:975 (discriminator 1) ./include/linux/mmzone.h:2207 (discriminator 1) arch/x86/mm/physaddr.c:54 (discriminator 1)) [ 138.425853][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425859][ C1] kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597) [ 138.425868][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425878][ C1] kasan_check_range (mm/kasan/generic.c:186 (discriminator 1) mm/kasan/generic.c:200 (discriminator 1)) [ 138.425884][ C1] __asan_memset (mm/kasan/shadow.c:84 (discriminator 2)) [ 138.425889][ C1] ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425893][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:171) [ 138.425898][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425903][ C1] ife_encode_meta_u16 (net/sched/act_ife.c:57) [ 138.425910][ C1] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 138.425916][ C1] ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 3)) [ 138.425921][ C1] ? __pfx_ife_encode_meta_u16 (net/sched/act_ife.c:45) [ 138.425927][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425931][ C1] tcf_ife_act (net/sched/act_ife.c:847 net/sched/act_ife.c:879) To solve this issue, fix the replace behavior by adding the metalist to the ife rcu data structure. Fixes: aa9fd9a325d51 ("sched: act: ife: update parameters via rcu handling") Reported-by: Ruitong Liu <cnitlrt@gmail.com> Tested-by: Ruitong Liu <cnitlrt@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260304140603.76500-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet_sched: sch_fq: clear q->band_pkt_count[] in fq_reset()Eric Dumazet-0/+1
When/if a NIC resets, queues are deactivated by dev_deactivate_many(), then reactivated when the reset operation completes. fq_reset() removes all the skbs from various queues. If we do not clear q->band_pkt_count[], these counters keep growing and can eventually reach sch->limit, preventing new packets to be queued. Many thanks to Praveen for discovering the root cause. Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling") Diagnosed-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260304015640.961780-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-27net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocksVictor Nogueira-0/+13
As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to tuch again such packet." act_ct was never meant to be used in the egress path, however some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-27net/sched: sch_cake: fixup cake_mq rate adjustment for diffserv configJonas Köppeler-27/+23
cake_mq's rate adjustment during the sync periods did not adjust the rates for every tin in a diffserv config. This lead to inconsistencies of rates between the tins. Fix this by setting the rates for all tins during synchronization. Fixes: 1bddd758bac2 ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-2-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-27net/sched: sch_cake: avoid sync overhead when unlimitedJonas Köppeler-1/+2
Skip inter-instance sync when no rate limit is configured, as it serves no purpose and only adds overhead. Fixes: 1bddd758bac2 ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-1-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-27net/sched: act_gate: snapshot parameters with RCU on replacePaul Moses-79/+186
The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-26net/sched: ets: fix divide by zero in the offload pathDavide Caratti-4/+8
Offloading ETS requires computing each class' WRR weight: this is done by averaging over the sums of quanta as 'q_sum' and 'q_psum'. Using unsigned int, the same integer size as the individual DRR quanta, can overflow and even cause division by zero, like it happened in the following splat: Oops: divide error: 0000 [#1] SMP PTI CPU: 13 UID: 0 PID: 487 Comm: tc Tainted: G E 6.19.0-virtme #45 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Call Trace: <TASK> ets_qdisc_change+0x870/0xf40 [sch_ets] qdisc_create+0x12b/0x540 tc_modify_qdisc+0x6d7/0xbd0 rtnetlink_rcv_msg+0x168/0x6b0 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x1d6/0x2b0 netlink_sendmsg+0x22e/0x470 ____sys_sendmsg+0x38a/0x3c0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x111/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f440b81c77e Code: 4d 89 d8 e8 d4 bc 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff951e4c10 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000481820 RCX: 00007f440b81c77e RDX: 0000000000000000 RSI: 00007fff951e4cd0 RDI: 0000000000000003 RBP: 00007fff951e4c20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff951f4fa8 R13: 00000000699ddede R14: 00007f440bb01000 R15: 0000000000486980 </TASK> Modules linked in: sch_ets(E) netdevsim(E) ---[ end trace 0000000000000000 ]--- RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this using 64-bit integers for 'q_sum' and 'q_psum'. Cc: stable@vger.kernel.org Fixes: d35eb52bd2ac ("net: sch_ets: Make the ETS qdisc offloadable") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/28504887df314588c7255e9911769c36f751edee.1771964872.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook-2/+1
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds-16/+8
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds-7/+7
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-63/+63
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-97/+93
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-17net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash()Ruitong Liu-1/+5
Commit 38a6f0865796 ("net: sched: support hash selecting tx queue") added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is computed as: mapping_mod = queue_mapping_max - queue_mapping + 1; The range size can be 65536 when the requested range covers all possible u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX). That value cannot be represented in a u16 and previously wrapped to 0, so tcf_skbedit_hash() could trigger a divide-by-zero: queue_mapping += skb_get_hash(skb) % params->mapping_mod; Compute mapping_mod in a wider type and reject ranges larger than U16_MAX to prevent params->mapping_mod from becoming 0 and avoid the crash. Fixes: 38a6f0865796 ("net: sched: support hash selecting tx queue") Cc: stable@vger.kernel.org # 6.12+ Signed-off-by: Ruitong Liu <cnitlrt@gmail.com> Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-11Merge tag 'net-next-7.0' of ↵Linus Torvalds-178/+429
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helper and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that is now functionally complete, also allow the user-space enabling it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, dincreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now interpret correctly such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. - Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. - Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performances with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when underlying low-level interface comes back online. Driver API: - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers: - Add Huawei hinic3 PF etherner driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifing the RX ring count retrieval. Also remove the legacy fallback path. - Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature" * tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits) bnge/bng_re: Add a new HSI net: macb: Fix tx/rx malfunction after phy link down and up af_unix: Fix memleak of newsk in unix_stream_connect(). net: ti: icssg-prueth: Add optional dependency on HSR net: dsa: add basic initial driver for MxL862xx switches net: mdio: add unlocked mdiodev C45 bus accessors net: dsa: add tag format for MxL862xx switches dt-bindings: net: dsa: add MaxLinear MxL862xx selftests: drivers: net: hw: Modify toeplitz.c to poll for packets octeontx2-pf: Unregister devlink on probe failure net: renesas: rswitch: fix forwarding offload statemachine ionic: Rate limit unknown xcvr type messages tcp: inet6_csk_xmit() optimization tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() ipv6: use np->final in inet6_sk_rebuild_header() ipv6: add daddr/final storage in struct ipv6_pinfo net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup() ...
2026-02-10Merge tag 'bpf-next-7.0' of ↵Linus Torvalds-7/+13
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...
2026-02-06net/ipv6: Introduce payload_len helpersAlice Mikityanska-1/+1
The next commits will transition away from using the hop-by-hop extension header to encode packet length for BIG TCP. Add wrappers around ip6->payload_len that return the actual value if it's non-zero, and calculate it from skb->len if payload_len is set to zero (and a symmetrical setter). The new helpers are used wherever the surrounding code supports the hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h). No behavioral change in this commit. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06net_sched: sch_fq: rework fq_gc() to avoid stack canaryEric Dumazet-13/+11
Using kmem_cache_free_bulk() in fq_gc() was not optimal. 1) It needs an array. 2) It is only saving cpu cycles for large batches. The automatic array forces a stack canary, which is expensive. In practice fq_gc was finding zero, one or two flows at most per round. Remove the array, use kmem_cache_free(). This makes fq_enqueue() smaller and faster. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-79 (-79) Function old new delta fq_enqueue 1629 1550 -79 Total: Before=24886583, After=24886504, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260204190034.76277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-7/+6
Cross-merge networking fixes after downstream PR (net-6.19-rc9). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c 3125fc1701694 ("net: spacemit: k1-emac: fix jumbo frame support") f66086798f91f ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueueDavide Caratti-6/+4
Currently we are registering one dynamic lockdep key for each allocated qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects packets to another device while the root lock is acquired [1]. Since dynamic keys are a limited resource, we can save them at least for qdiscs that are not meant to acquire the root lock in the traffic path, or to carry traffic at all, like: - clsact - ingress - noqueue Don't register dynamic keys for the above schedulers, so that we hit MAX_LOCKDEP_KEYS later in our tests. [1] https://github.com/multipath-tcp/mptcp_net-next/issues/451 Changes in v2: - change ordering of spin_lock_init() vs. lockdep_register_key() (Jakub Kicinski) Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04net_sched: sch_fq: tweak unlikely() hints in fq_dequeue()Eric Dumazet-2/+2
After 076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc") we need to remove one unlikely() because q->internal holds all the fast path packets. skb = fq_peek(&q->internal); if (unlikely(skb)) { q->internal.qlen--; Calling INET_ECN_set_ce() is very unlikely. These changes allow fq_dequeue_skb() to be (auto)inlined, thus making fq_dequeue() faster. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 2/2 grow/shrink: 0/1 up/down: 283/-269 (14) Function old new delta INET_ECN_set_ce - 267 +267 __pfx_INET_ECN_set_ce - 16 +16 __pfx_fq_dequeue_skb 16 - -16 fq_dequeue_skb 103 - -103 fq_dequeue 1685 1535 -150 Total: Before=24886569, After=24886583, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260203214716.880853-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29net/sched: cls_u32: use skb_header_pointer_careful()Eric Dumazet-7/+6
skb_header_pointer() does not fully validate negative @offset values. Use skb_header_pointer_careful() instead. GangMin Kim provided a report and a repro fooling u32_classify(): BUG: KASAN: slab-out-of-bounds in u32_classify+0x1180/0x11b0 net/sched/cls_u32.c:221 Fixes: fbc2e7d9cf49 ("cls_u32: use skb_header_pointer() to dereference data safely") Reported-by: GangMin Kim <km.kim1503@gmail.com> Closes: https://lore.kernel.org/netdev/CANn89iJkyUZ=mAzLzC4GdcAgLuPnUoivdLaOs6B9rq5_erj76w@mail.gmail.com/T/ Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260128141539.3404400-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-3/+10
Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM") fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c 31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue") c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c 8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") 914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22net/sched: act_ife: avoid possible NULL derefEric Dumazet-2/+4
tcf_ife_encode() must make sure ife_encode() does not return NULL. syzbot reported: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:ife_tlv_meta_encode+0x41/0xa0 net/ife/ife.c:166 CPU: 3 UID: 0 PID: 8990 Comm: syz.0.696 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> ife_encode_meta_u32+0x153/0x180 net/sched/act_ife.c:101 tcf_ife_encode net/sched/act_ife.c:841 [inline] tcf_ife_act+0x1022/0x1de0 net/sched/act_ife.c:877 tc_act include/net/tc_wrapper.h:130 [inline] tcf_action_exec+0x1c0/0xa20 net/sched/act_api.c:1152 tcf_exts_exec include/net/pkt_cls.h:349 [inline] mall_classify+0x1a0/0x2a0 net/sched/cls_matchall.c:42 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x7f2/0x1380 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xe0/0x510 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x45/0x250 net/core/dev.c:4147 __dev_xmit_skb net/core/dev.c:4262 [inline] __dev_queue_xmit+0x2998/0x46c0 net/core/dev.c:4798 Fixes: 295a6e06d21e ("net/sched: act_ife: Change to use ife module") Reported-by: syzbot+5cf914f193dffde3bd3c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6970d61d.050a0220.706b.0010.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yotam Gigi <yotam.gi@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260121133724.3400020-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20netfilter: nf_conntrack: don't rely on implicit includesFlorian Westphal-0/+3
several netfilter compilation units rely on implicit includes coming from nf_conntrack_proto_gre.h. Clean this up and add the required dependencies where needed. nf_conntrack.h requires net_generic() helper. Place various gre/ppp/vlan includes to where they are needed. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-19net/sched: qfq: Use cl_is_active to determine whether class is active in ↵Jamal Hadi Salim-1/+1
qfq_rm_from_ag This is more of a preventive patch to make the code more consistent and to prevent possible exploits that employ child qlen manipulations on qfq. use cl_is_active instead of relying on the child qdisc's qlen to determine class activation. Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net/sched: Enforce that teql can only be used as root qdiscJamal Hadi Salim-0/+5
Design intent of teql is that it is only supposed to be used as root qdisc. We need to check for that constraint. Although not important, I will describe the scenario that unearthed this issue for the curious. GangMin Kim <km.kim1503@gmail.com> managed to concot a scenario as follows: ROOT qdisc 1:0 (QFQ) ├── class 1:1 (weight=15, lmax=16384) netem with delay 6.4s └── class 1:2 (weight=1, lmax=1514) teql GangMin sends a packet which is enqueued to 1:1 (netem). Any invocation of dequeue by QFQ from this class will not return a packet until after 6.4s. In the meantime, a second packet is sent and it lands on 1:2. teql's enqueue will return success and this will activate class 1:2. Main issue is that teql only updates the parent visible qlen (sch->q.qlen) at dequeue. Since QFQ will only call dequeue if peek succeeds (and teql's peek always returns NULL), dequeue will never be called and thus the qlen will remain as 0. With that in mind, when GangMin updates 1:2's lmax value, the qfq_change_class calls qfq_deact_rm_from_agg. Since the child qdisc's qlen was not incremented, qfq fails to deactivate the class, but still frees its pointers from the aggregate. So when the first packet is rescheduled after 6.4 seconds (netem's delay), a dangling pointer is accessed causing GangMin's causing a UAF. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: GangMin Kim <km.kim1503@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17net/sched: cake: avoid separate allocation of struct cake_sched_configToke Høiland-Jørgensen-23/+6
Paolo pointed out that we can avoid separately allocating struct cake_sched_config even in the non-mq case, by embedding it into struct cake_sched_data. This reduces the complexity of the logic that swaps the pointers and frees the old value, at the cost of adding 56 bytes to the latter. Since cake_sched_data is already almost 17k bytes, this seems like a reasonable tradeoff. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Fixes: bc0ce2bad36c ("net/sched: sch_cake: Factor out config variables into separate struct") Link: https://patch.msgid.link/20260113143157.2581680-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-2/+4
Cross-merge networking fixes after downstream PR (net-6.19-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5Alexei Starovoitov-5/+20
Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13net/sched: sch_qfq: do not free existing class in qfq_change_class()Eric Dumazet-2/+4
Fixes qfq_change_class() error case. cl->qdisc and cl should only be freed if a new class and qdisc were allocated, or we risk various UAF. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: syzbot+07f3f38f723c335f106d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6965351d.050a0220.eaf7.00c5.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260112175656.17605-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13net/sched: sch_cake: share shaper state across sub-instances of cake_mqJonas Köppeler-0/+51
This commit adds shared shaper state across the cake instances beneath a cake_mq qdisc. It works by periodically tracking the number of active instances, and scaling the configured rate by the number of active queues. The scan is lockless and simply reads the qlen and the last_active state variable of each of the instances configured beneath the parent cake_mq instance. Locking is not required since the values are only updated by the owning instance, and eventual consistency is sufficient for the purpose of estimating the number of active queues. The interval for scanning the number of active queues is set to 200 us. We found this to be a good tradeoff between overhead and response time. For a detailed analysis of this aspect see the Netdevconf talk: https://netdevconf.info/0x19/docs/netdev-0x19-paper16-talk-paper.pdf Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-5-8d613fece5d8@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13net/sched: sch_cake: Share config across cake_mq sub-qdiscsToke Høiland-Jørgensen-40/+133
This adds support for configuring the cake_mq instance directly, sharing the config across the cake sub-qdiscs. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-4-8d613fece5d8@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13net/sched: sch_cake: Add cake_mq qdisc for using cake on mq devicesToke Høiland-Jørgensen-1/+78
Add a cake_mq qdisc which installs cake instances on each hardware queue on a multi-queue device. This is just a copy of sch_mq that installs cake instead of the default qdisc on each queue. Subsequent commits will add sharing of the config between cake instances, as well as a multi-queue aware shaper algorithm. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-3-8d613fece5d8@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13net/sched: sch_cake: Factor out config variables into separate structToke Høiland-Jørgensen-112/+133
Factor out all the user-configurable variables into a separate struct and embed it into struct cake_sched_data. This is done in preparation for sharing the configuration across multiple instances of cake in an mq setup. No functional change is intended with this patch. Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-2-8d613fece5d8@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13net/sched: Export mq functions for reuseToke Høiland-Jørgensen-22/+49
To enable the cake_mq qdisc to reuse code from the mq qdisc, export a bunch of functions from sch_mq. Split common functionality out from some functions so it can be composed with other code, and export other functions wholesale. To discourage wanton reuse, put the symbols into a new NET_SCHED_INTERNAL namespace, and a sch_priv.h header file. No functional change intended. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-1-8d613fece5d8@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-12bpf: net_sched: Use the correct destructor kfunc typeSami Tolvanen-1/+7
With CONFIG_CFI enabled, the kernel strictly enforces that indirect function calls use a function pointer type that matches the target function. As bpf_kfree_skb() signature differs from the btf_dtor_kfunc_t pointer type used for the destructor calls in bpf_obj_free_fields(), add a stub function with the correct type to fix the type mismatch. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260110082548.113748-8-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-08net/sched: sch_qfq: Fix NULL deref when deactivating inactive aggregate in ↵Xiang Mei-1/+1
qfq_reset `qfq_class->leaf_qdisc->q.qlen > 0` does not imply that the class itself is active. Two qfq_class objects may point to the same leaf_qdisc. This happens when: 1. one QFQ qdisc is attached to the dev as the root qdisc, and 2. another QFQ qdisc is temporarily referenced (e.g., via qdisc_get() / qdisc_put()) and is pending to be destroyed, as in function tc_new_tfilter. When packets are enqueued through the root QFQ qdisc, the shared leaf_qdisc->q.qlen increases. At the same time, the second QFQ qdisc triggers qdisc_put and qdisc_destroy: the qdisc enters qfq_reset() with its own q->q.qlen == 0, but its class's leaf qdisc->q.qlen > 0. Therefore, the qfq_reset would wrongly deactivate an inactive aggregate and trigger a null-deref in qfq_deactivate_agg: [ 0.903172] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 0.903571] #PF: supervisor write access in kernel mode [ 0.903860] #PF: error_code(0x0002) - not-present page [ 0.904177] PGD 10299b067 P4D 10299b067 PUD 10299c067 PMD 0 [ 0.904502] Oops: Oops: 0002 [#1] SMP NOPTI [ 0.904737] CPU: 0 UID: 0 PID: 135 Comm: exploit Not tainted 6.19.0-rc3+ #2 NONE [ 0.905157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 0.905754] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:992 (discriminator 2) include/linux/list.h:1006 (discriminator 2) net/sched/sch_qfq.c:1367 (discriminator 2) net/sched/sch_qfq.c:1393 (discriminator 2)) [ 0.906046] Code: 0f 84 4d 01 00 00 48 89 70 18 8b 4b 10 48 c7 c2 ff ff ff ff 48 8b 78 08 48 d3 e2 48 21 f2 48 2b 13 48 8b 30 48 d3 ea 8b 4b 18 0 Code starting with the faulting instruction =========================================== 0: 0f 84 4d 01 00 00 je 0x153 6: 48 89 70 18 mov %rsi,0x18(%rax) a: 8b 4b 10 mov 0x10(%rbx),%ecx d: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx 14: 48 8b 78 08 mov 0x8(%rax),%rdi 18: 48 d3 e2 shl %cl,%rdx 1b: 48 21 f2 and %rsi,%rdx 1e: 48 2b 13 sub (%rbx),%rdx 21: 48 8b 30 mov (%rax),%rsi 24: 48 d3 ea shr %cl,%rdx 27: 8b 4b 18 mov 0x18(%rbx),%ecx ... [ 0.907095] RSP: 0018:ffffc900004a39a0 EFLAGS: 00010246 [ 0.907368] RAX: ffff8881043a0880 RBX: ffff888102953340 RCX: 0000000000000000 [ 0.907723] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 0.908100] RBP: ffff888102952180 R08: 0000000000000000 R09: 0000000000000000 [ 0.908451] R10: ffff8881043a0000 R11: 0000000000000000 R12: ffff888102952000 [ 0.908804] R13: ffff888102952180 R14: ffff8881043a0ad8 R15: ffff8881043a0880 [ 0.909179] FS: 000000002a1a0380(0000) GS:ffff888196d8d000(0000) knlGS:0000000000000000 [ 0.909572] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.909857] CR2: 0000000000000000 CR3: 0000000102993002 CR4: 0000000000772ef0 [ 0.910247] PKRU: 55555554 [ 0.910391] Call Trace: [ 0.910527] <TASK> [ 0.910638] qfq_reset_qdisc (net/sched/sch_qfq.c:357 net/sched/sch_qfq.c:1485) [ 0.910826] qdisc_reset (include/linux/skbuff.h:2195 include/linux/skbuff.h:2501 include/linux/skbuff.h:3424 include/linux/skbuff.h:3430 net/sched/sch_generic.c:1036) [ 0.911040] __qdisc_destroy (net/sched/sch_generic.c:1076) [ 0.911236] tc_new_tfilter (net/sched/cls_api.c:2447) [ 0.911447] rtnetlink_rcv_msg (net/core/rtnetlink.c:6958) [ 0.911663] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6861) [ 0.911894] netlink_rcv_skb (net/netlink/af_netlink.c:2550) [ 0.912100] netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) [ 0.912296] ? __alloc_skb (net/core/skbuff.c:706) [ 0.912484] netlink_sendmsg (net/netlink/af_netlink.c:1894) [ 0.912682] sock_write_iter (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1) net/socket.c:1195 (discriminator 1)) [ 0.912880] vfs_write (fs/read_write.c:593 fs/read_write.c:686) [ 0.913077] ksys_write (fs/read_write.c:738) [ 0.913252] do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) [ 0.913438] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) [ 0.913687] RIP: 0033:0x424c34 [ 0.913844] Code: 89 02 48 c7 c0 ff ff ff ff eb bd 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d 2d 44 09 00 00 74 13 b8 01 00 00 00 0f 05 9 Code starting with the faulting instruction =========================================== 0: 89 02 mov %eax,(%rdx) 2: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax 9: eb bd jmp 0xffffffffffffffc8 b: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) 12: 00 00 00 15: 90 nop 16: f3 0f 1e fa endbr64 1a: 80 3d 2d 44 09 00 00 cmpb $0x0,0x9442d(%rip) # 0x9444e 21: 74 13 je 0x36 23: b8 01 00 00 00 mov $0x1,%eax 28: 0f 05 syscall 2a: 09 .byte 0x9 [ 0.914807] RSP: 002b:00007ffea1938b78 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 0.915197] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000424c34 [ 0.915556] RDX: 000000000000003c RSI: 000000002af378c0 RDI: 0000000000000003 [ 0.915912] RBP: 00007ffea1938bc0 R08: 00000000004b8820 R09: 0000000000000000 [ 0.916297] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffea1938d28 [ 0.916652] R13: 00007ffea1938d38 R14: 00000000004b3828 R15: 0000000000000001 [ 0.917039] </TASK> [ 0.917158] Modules linked in: [ 0.917316] CR2: 0000000000000000 [ 0.917484] ---[ end trace 0000000000000000 ]--- [ 0.917717] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:992 (discriminator 2) include/linux/list.h:1006 (discriminator 2) net/sched/sch_qfq.c:1367 (discriminator 2) net/sched/sch_qfq.c:1393 (discriminator 2)) [ 0.917978] Code: 0f 84 4d 01 00 00 48 89 70 18 8b 4b 10 48 c7 c2 ff ff ff ff 48 8b 78 08 48 d3 e2 48 21 f2 48 2b 13 48 8b 30 48 d3 ea 8b 4b 18 0 Code starting with the faulting instruction =========================================== 0: 0f 84 4d 01 00 00 je 0x153 6: 48 89 70 18 mov %rsi,0x18(%rax) a: 8b 4b 10 mov 0x10(%rbx),%ecx d: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx 14: 48 8b 78 08 mov 0x8(%rax),%rdi 18: 48 d3 e2 shl %cl,%rdx 1b: 48 21 f2 and %rsi,%rdx 1e: 48 2b 13 sub (%rbx),%rdx 21: 48 8b 30 mov (%rax),%rsi 24: 48 d3 ea shr %cl,%rdx 27: 8b 4b 18 mov 0x18(%rbx),%ecx ... [ 0.918902] RSP: 0018:ffffc900004a39a0 EFLAGS: 00010246 [ 0.919198] RAX: ffff8881043a0880 RBX: ffff888102953340 RCX: 0000000000000000 [ 0.919559] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 0.919908] RBP: ffff888102952180 R08: 0000000000000000 R09: 0000000000000000 [ 0.920289] R10: ffff8881043a0000 R11: 0000000000000000 R12: ffff888102952000 [ 0.920648] R13: ffff888102952180 R14: ffff8881043a0ad8 R15: ffff8881043a0880 [ 0.921014] FS: 000000002a1a0380(0000) GS:ffff888196d8d000(0000) knlGS:0000000000000000 [ 0.921424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.921710] CR2: 0000000000000000 CR3: 0000000102993002 CR4: 0000000000772ef0 [ 0.922097] PKRU: 55555554 [ 0.922240] Kernel panic - not syncing: Fatal exception [ 0.922590] Kernel Offset: disabled Fixes: 0545a3037773 ("pkt_sched: QFQ - quick fair queue scheduler") Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260106034100.1780779-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06net/sched: act_api: avoid dereferencing ERR_PTR in tcf_idrinfo_destroyShivani Gupta-0/+2
syzbot reported a crash in tc_act_in_hw() during netns teardown where tcf_idrinfo_destroy() passed an ERR_PTR(-EBUSY) value as a tc_action pointer, leading to an invalid dereference. Guard against ERR_PTR entries when iterating the action IDR so teardown does not call tc_act_in_hw() on an error pointer. Fixes: 84a7d6797e6a ("net/sched: acp_api: no longer acquire RTNL in tc_action_net_exit()") Link: https://syzkaller.appspot.com/bug?extid=8f1c492ffa4644ff3826 Reported-by: syzbot+8f1c492ffa4644ff3826@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8f1c492ffa4644ff3826 Signed-off-by: Shivani Gupta <shivani07g@gmail.com> Link: https://patch.msgid.link/20260105005905.243423-1-shivani07g@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-05net/sched: act_mirred: Fix leak when redirecting to self on egressJamal Hadi Salim-12/+12
Whenever a mirred redirect to self on egress happens, mirred allocates a new skb (skb_to_send). The loop to self check was done after that allocation, but was not freeing the newly allocated skb, causing a leak. Fix this by moving the if-statement to before the allocation of the new skb. The issue was found by running the accompanying tdc test in 2/2 with config kmemleak enabled. After a few minutes the kmemleak thread ran and reported the leak coming from mirred. Fixes: 1d856251a009 ("net/sched: act_mirred: fix loop detection") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260101135608.253079-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-02bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncsPuranjay Mohan-6/+6
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the flag itself. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-18net/sched: act_mirred: fix loop detectionJamal Hadi Salim-0/+9
Fix a loop scenario of ethx:egress->ethx:egress Example setup to reproduce: tc qdisc add dev ethx root handle 1: drr tc filter add dev ethx parent 1: protocol ip prio 1 matchall \ action mirred egress redirect dev ethx Now ping out of ethx and you get a deadlock: [ 116.892898][ T307] ============================================ [ 116.893182][ T307] WARNING: possible recursive locking detected [ 116.893418][ T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted [ 116.893682][ T307] -------------------------------------------- [ 116.893926][ T307] ping/307 is trying to acquire lock: [ 116.894133][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.894517][ T307] [ 116.894517][ T307] but task is already holding lock: [ 116.894836][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.895252][ T307] [ 116.895252][ T307] other info that might help us debug this: [ 116.895608][ T307] Possible unsafe locking scenario: [ 116.895608][ T307] [ 116.895901][ T307] CPU0 [ 116.896057][ T307] ---- [ 116.896200][ T307] lock(&sch->root_lock_key); [ 116.896392][ T307] lock(&sch->root_lock_key); [ 116.896605][ T307] [ 116.896605][ T307] *** DEADLOCK *** [ 116.896605][ T307] [ 116.896864][ T307] May be due to missing lock nesting notation [ 116.896864][ T307] [ 116.897123][ T307] 6 locks held by ping/307: [ 116.897302][ T307] #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0 [ 116.897808][ T307] #1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600 [ 116.898138][ T307] #2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0 [ 116.898459][ T307] #3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50 [ 116.898782][ T307] #4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.899132][ T307] #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50 [ 116.899442][ T307] [ 116.899442][ T307] stack backtrace: [ 116.899667][ T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary) [ 116.899672][ T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 116.899675][ T307] Call Trace: [ 116.899678][ T307] <TASK> [ 116.899680][ T307] dump_stack_lvl+0x6f/0xb0 [ 116.899688][ T307] print_deadlock_bug.cold+0xc0/0xdc [ 116.899695][ T307] __lock_acquire+0x11f7/0x1be0 [ 116.899704][ T307] lock_acquire+0x162/0x300 [ 116.899707][ T307] ? __dev_queue_xmit+0x2210/0x3b50 [ 116.899713][ T307] ? srso_alias_return_thunk+0x5/0xfbef5 [ 116.899717][ T307] ? stack_trace_save+0x93/0xd0 [ 116.899723][ T307] _raw_spin_lock+0x30/0x40 [ 116.899728][ T307] ? __dev_queue_xmit+0x2210/0x3b50 [ 116.899731][ T307] __dev_queue_xmit+0x2210/0x3b50 Fixes: 178ca30889a1 ("Revert "net/sched: Fix mirred deadlock on device recursion"") Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251210162255.1057663-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-11net/sched: ets: Remove drr class from the active list if it changes to strictVictor Nogueira-0/+4
Whenever a user issues an ets qdisc change command, transforming a drr class into a strict one, the ets code isn't checking whether that class was in the active list and removing it. This means that, if a user changes a strict class (which was in the active list) back to a drr one, that class will be added twice to the active list [1]. Doing so with the following commands: tc qdisc add dev lo root handle 1: ets bands 2 strict 1 tc qdisc add dev lo parent 1:2 handle 20: \ tbf rate 8bit burst 100b latency 1s tc filter add dev lo parent 1: basic classid 1:2 ping -c1 -W0.01 -s 56 127.0.0.1 tc qdisc change dev lo root handle 1: ets bands 2 strict 2 tc qdisc change dev lo root handle 1: ets bands 2 strict 1 ping -c1 -W0.01 -s 56 127.0.0.1 Will trigger the following splat with list debug turned on: [ 59.279014][ T365] ------------[ cut here ]------------ [ 59.279452][ T365] list_add double add: new=ffff88801d60e350, prev=ffff88801d60e350, next=ffff88801d60e2c0. [ 59.280153][ T365] WARNING: CPU: 3 PID: 365 at lib/list_debug.c:35 __list_add_valid_or_report+0x17f/0x220 [ 59.280860][ T365] Modules linked in: [ 59.281165][ T365] CPU: 3 UID: 0 PID: 365 Comm: tc Not tainted 6.18.0-rc7-00105-g7e9f13163c13-dirty #239 PREEMPT(voluntary) [ 59.281977][ T365] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 59.282391][ T365] RIP: 0010:__list_add_valid_or_report+0x17f/0x220 [ 59.282842][ T365] Code: 89 c6 e8 d4 b7 0d ff 90 0f 0b 90 90 31 c0 e9 31 ff ff ff 90 48 c7 c7 e0 a0 22 9f 48 89 f2 48 89 c1 4c 89 c6 e8 b2 b7 0d ff 90 <0f> 0b 90 90 31 c0 e9 0f ff ff ff 48 89 f7 48 89 44 24 10 4c 89 44 ... [ 59.288812][ T365] Call Trace: [ 59.289056][ T365] <TASK> [ 59.289224][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.289546][ T365] ets_qdisc_change+0xd2b/0x1e80 [ 59.289891][ T365] ? __lock_acquire+0x7e7/0x1be0 [ 59.290223][ T365] ? __pfx_ets_qdisc_change+0x10/0x10 [ 59.290546][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.290898][ T365] ? __mutex_trylock_common+0xda/0x240 [ 59.291228][ T365] ? __pfx___mutex_trylock_common+0x10/0x10 [ 59.291655][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.291993][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.292313][ T365] ? trace_contention_end+0xc8/0x110 [ 59.292656][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.293022][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.293351][ T365] tc_modify_qdisc+0x63a/0x1cf0 Fix this by always checking and removing an ets class from the active list when changing it to strict. [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/sch_ets.c?id=ce052b9402e461a9aded599f5b47e76bc727f7de#n663 Fixes: cd9b50adc6bb9 ("net/sched: ets: fix crash when flipping from 'strict' to 'quantum'") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20251208190125.1868423-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-04net/sched: ets: Always remove class from active list before deleting in ↵Jamal Hadi Salim-1/+1
ets_qdisc_change zdi-disclosures@trendmicro.com says: The vulnerability is a race condition between `ets_qdisc_dequeue` and `ets_qdisc_change`. It leads to UAF on `struct Qdisc` object. Attacker requires the capability to create new user and network namespace in order to trigger the bug. See my additional commentary at the end of the analysis. Analysis: static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { ... // (1) this lock is preventing .change handler (`ets_qdisc_change`) //to race with .dequeue handler (`ets_qdisc_dequeue`) sch_tree_lock(sch); for (i = nbands; i < oldbands; i++) { if (i >= q->nstrict && q->classes[i].qdisc->q.qlen) list_del_init(&q->classes[i].alist); qdisc_purge_queue(q->classes[i].qdisc); } WRITE_ONCE(q->nbands, nbands); for (i = nstrict; i < q->nstrict; i++) { if (q->classes[i].qdisc->q.qlen) { // (2) the class is added to the q->active list_add_tail(&q->classes[i].alist, &q->active); q->classes[i].deficit = quanta[i]; } } WRITE_ONCE(q->nstrict, nstrict); memcpy(q->prio2band, priomap, sizeof(priomap)); for (i = 0; i < q->nbands; i++) WRITE_ONCE(q->classes[i].quantum, quanta[i]); for (i = oldbands; i < q->nbands; i++) { q->classes[i].qdisc = queues[i]; if (q->classes[i].qdisc != &noop_qdisc) qdisc_hash_add(q->classes[i].qdisc, true); } // (3) the qdisc is unlocked, now dequeue can be called in parallel // to the rest of .change handler sch_tree_unlock(sch); ets_offload_change(sch); for (i = q->nbands; i < oldbands; i++) { // (4) we're reducing the refcount for our class's qdisc and // freeing it qdisc_put(q->classes[i].qdisc); // (5) If we call .dequeue between (4) and (5), we will have // a strong UAF and we can control RIP q->classes[i].qdisc = NULL; WRITE_ONCE(q->classes[i].quantum, 0); q->classes[i].deficit = 0; gnet_stats_basic_sync_init(&q->classes[i].bstats); memset(&q->classes[i].qstats, 0, sizeof(q->classes[i].qstats)); } return 0; } Comment: This happens because some of the classes have their qdiscs assigned to NULL, but remain in the active list. This commit fixes this issue by always removing the class from the active list before deleting and freeing its associated qdisc Reproducer Steps (trimmed version of what was sent by zdi-disclosures@trendmicro.com) ``` DEV="${DEV:-lo}" ROOT_HANDLE="${ROOT_HANDLE:-1:}" BAND2_HANDLE="${BAND2_HANDLE:-20:}" # child under 1:2 PING_BYTES="${PING_BYTES:-48}" PING_COUNT="${PING_COUNT:-200000}" PING_DST="${PING_DST:-127.0.0.1}" SLOW_TBF_RATE="${SLOW_TBF_RATE:-8bit}" SLOW_TBF_BURST="${SLOW_TBF_BURST:-100b}" SLOW_TBF_LAT="${SLOW_TBF_LAT:-1s}" cleanup() { tc qdisc del dev "$DEV" root 2>/dev/null } trap cleanup EXIT ip link set "$DEV" up tc qdisc del dev "$DEV" root 2>/dev/null || true tc qdisc add dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc qdisc add dev "$DEV" parent 1:2 handle "$BAND2_HANDLE" \ tbf rate "$SLOW_TBF_RATE" burst "$SLOW_TBF_BURST" latency "$SLOW_TBF_LAT" tc filter add dev "$DEV" parent 1: protocol all prio 1 u32 match u32 0 0 flowid 1:2 tc -s qdisc ls dev $DEV ping -I "$DEV" -f -c "$PING_COUNT" -s "$PING_BYTES" -W 0.001 "$PING_DST" \ >/dev/null 2>&1 & tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 0 tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc -s qdisc ls dev $DEV tc qdisc del dev "$DEV" parent 1:2 || true tc -s qdisc ls dev $DEV tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 1 strict 1 ``` KASAN report ``` ================================================================== BUG: KASAN: slab-use-after-free in ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 Read of size 8 at addr ffff8880502fc018 by task ping/12308 > CPU: 0 UID: 0 PID: 12308 Comm: ping Not tainted 6.18.0-rc4-dirty #1 PREEMPT(full) Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> __dump_stack kernel/lib/dump_stack.c:94 dump_stack_lvl+0x100/0x190 kernel/lib/dump_stack.c:120 print_address_description kernel/mm/kasan/report.c:378 print_report+0x156/0x4c9 kernel/mm/kasan/report.c:482 kasan_report+0xdf/0x110 kernel/mm/kasan/report.c:595 ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 dequeue_skb kernel/net/sched/sch_generic.c:294 qdisc_restart kernel/net/sched/sch_generic.c:399 __qdisc_run+0x1c9/0x1b00 kernel/net/sched/sch_generic.c:417 __dev_xmit_skb kernel/net/core/dev.c:4221 __dev_queue_xmit+0x2848/0x4410 kernel/net/core/dev.c:4729 dev_queue_xmit kernel/./include/linux/netdevice.h:3365 [...] Allocated by task 17115: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 poison_kmalloc_redzone kernel/mm/kasan/common.c:400 __kasan_kmalloc+0xaa/0xb0 kernel/mm/kasan/common.c:417 kasan_kmalloc kernel/./include/linux/kasan.h:262 __do_kmalloc_node kernel/mm/slub.c:5642 __kmalloc_node_noprof+0x34e/0x990 kernel/mm/slub.c:5648 kmalloc_node_noprof kernel/./include/linux/slab.h:987 qdisc_alloc+0xb8/0xc30 kernel/net/sched/sch_generic.c:950 qdisc_create_dflt+0x93/0x490 kernel/net/sched/sch_generic.c:1012 ets_class_graft+0x4fd/0x800 kernel/net/sched/sch_ets.c:261 qdisc_graft+0x3e4/0x1780 kernel/net/sched/sch_api.c:1196 [...] Freed by task 9905: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 __kasan_save_free_info+0x3b/0x70 kernel/mm/kasan/generic.c:587 kasan_save_free_info kernel/mm/kasan/kasan.h:406 poison_slab_object kernel/mm/kasan/common.c:252 __kasan_slab_free+0x5f/0x80 kernel/mm/kasan/common.c:284 kasan_slab_free kernel/./include/linux/kasan.h:234 slab_free_hook kernel/mm/slub.c:2539 slab_free kernel/mm/slub.c:6630 kfree+0x144/0x700 kernel/mm/slub.c:6837 rcu_do_batch kernel/kernel/rcu/tree.c:2605 rcu_core+0x7c0/0x1500 kernel/kernel/rcu/tree.c:2861 handle_softirqs+0x1ea/0x8a0 kernel/kernel/softirq.c:622 __do_softirq kernel/kernel/softirq.c:656 [...] Commentary: 1. Maher Azzouzi working with Trend Micro Zero Day Initiative was reported as the person who found the issue. I requested to get a proper email to add to the reported-by tag but got no response. For this reason i will credit the person i exchanged emails with i.e zdi-disclosures@trendmicro.com 2. Neither i nor Victor who did a much more thorough testing was able to reproduce a UAF with the PoC or other approaches we tried. We were both able to reproduce a null ptr deref. After exchange with zdi-disclosures@trendmicro.com they sent a small change to be made to the code to add an extra delay which was able to simulate the UAF. i.e, this: qdisc_put(q->classes[i].qdisc); mdelay(90); q->classes[i].qdisc = NULL; I was informed by Thomas Gleixner(tglx@linutronix.de) that adding delays was acceptable approach for demonstrating the bug, quote: "Adding such delays is common exploit validation practice" The equivalent delay could happen "by virt scheduling the vCPU out, SMIs, NMIs, PREEMPT_RT enabled kernel" 3. I asked the OP to test and report back but got no response and after a few days gave up and proceeded to submit this fix. Fixes: de6d25924c2a ("net/sched: sch_ets: don't peek at classes beyond 'nbands'") Reported-by: zdi-disclosures@trendmicro.com Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20251128151919.576920-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-26/+32
Merge in late fixes in preparation for the net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-02net/sched: sch_cake: Fix incorrect qlen reduction in cake_dropXiang Mei-26/+32
In cake_drop(), qdisc_tree_reduce_backlog() is used to update the qlen and backlog of the qdisc hierarchy. Its caller, cake_enqueue(), assumes that the parent qdisc will enqueue the current packet. However, this assumption breaks when cake_enqueue() returns NET_XMIT_CN: the parent qdisc stops enqueuing current packet, leaving the tree qlen/backlog accounting inconsistent. This mismatch can lead to a NULL dereference (e.g., when the parent Qdisc is qfq_qdisc). This patch computes the qlen/backlog delta in a more robust way by observing the difference before and after the series of cake_drop() calls, and then compensates the qdisc tree accounting if cake_enqueue() returns NET_XMIT_CN. To ensure correct compensation when ACK thinning is enabled, a new variable is introduced to keep qlen unchanged. Fixes: 15de71d06a40 ("net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit") Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20251128001415.377823-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-3/+18
Conflicts: net/xdp/xsk.c 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number") 8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb") 30ed05adca4a ("xsk: use a smaller new lock for shared pool case") https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26Merge tag 'linux-can-fixes-for-6.18-20251126' of ↵Jakub Kicinski-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-11-26 this is a pull request of 8 patches for net/main. Seungjin Bae provides a patch for the kvaser_usb driver to fix a potential infinite loop in the USB data stream command parser. Thomas Mühlbacher's patch for the sja1000 driver IRQ handler's max loop handling, that might lead to unhandled interrupts. 3 patches by me for the gs_usb driver fix handling of failed transmit URBs and add checking of the actual length of received URBs before accessing the data. The next patch is by me and is a port of Thomas Mühlbacher's patch (fix IRQ handler's max loop handling, that might lead to unhandled interrupts.) to the sun4i_can driver. Biju Das provides a patch for the rcar_canfd driver to fix the CAN-FD mode setting. The last patch is by Shaurya Rane for the em_canid filter to ensure that the complete CAN frame is present in the linear data buffer before accessing it. * tag 'linux-can-fixes-for-6.18-20251126' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: net/sched: em_canid: fix uninit-value in em_canid_match can: rcar_canfd: Fix CAN-FD mode as default can: sun4i_can: sun4i_can_interrupt(): fix max irq loop handling can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing data can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing header can: gs_usb: gs_usb_xmit_callback(): fix handling of failed transmitted URBs can: sja1000: fix max irq loop handling can: kvaser_usb: leaf: Fix potential infinite loop in command parsers ==================== Link: https://patch.msgid.link/20251126155713.217105-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26net/sched: em_canid: fix uninit-value in em_canid_matchShaurya Rane-0/+3
Use pskb_may_pull() to ensure a complete CAN frame is present in the linear data buffer before reading the CAN ID. A simple skb->len check is insufficient because it only verifies the total data length but does not guarantee the data is present in skb->data (it could be in fragments). pskb_may_pull() both validates the length and pulls fragmented data into the linear buffer if necessary, making it safe to directly access skb->data. Reported-by: syzbot+5d8269a1e099279152bc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5d8269a1e099279152bc Fixes: f057bbb6f9ed ("net: em_canid: Ematch rule to match CAN frames according to their identifiers") Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Link: https://patch.msgid.link/20251126085718.50808-1-ssranevjti@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-25net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codelEric Dumazet-3/+10
cake, codel and fq_codel can drop many packets from dequeue(). Use qdisc_dequeue_drop() so that the freeing can happen outside of the qdisc spinlock scope. Add TCQ_F_DEQUEUE_DROPS to sch->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-15-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>