summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2025-11-17sched: Increase sched_tick_remote timeoutPhil Auld-1/+1
Increase the sched_tick_remote WARN_ON timeout to remove false positives due to temporarily busy HK cpus. The suggestion was 30 seconds to catch really stuck remote tick processing but not trigger it too easily. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
2025-11-17sched/fair: Have SD_SERIALIZE affect newidle balancingPeter Zijlstra-1/+1
Also serialize the possiblty much more frequent newidle balancing for the 'expensive' domains that have SD_BALANCE set. Initial benchmarking by K Prateek and Tim showed no negative effect. Split out from the larger patch moving sched_balance_running around for ease of bisect and such. Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Seconded-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/df068896-82f9-458d-8fff-5a2f654e8ffd@amd.com Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com # Conflicts: # kernel/sched/fair.c
2025-11-17sched/fair: Skip sched_balance_running cmpxchg when balance is not dueTim Chen-26/+28
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing only one NUMA load balancing operation to run system-wide at a time. Currently, each sched group leader directly under NUMA domain attempts to acquire the global sched_balance_running flag via cmpxchg() before checking whether load balancing is due or whether it is the designated load balancer for that NUMA domain. On systems with a large number of cores, this causes significant cache contention on the shared sched_balance_running flag. This patch reduces unnecessary cmpxchg() operations by first checking that the balancer is the designated leader for a NUMA domain from should_we_balance(), and the balance interval has expired before trying to acquire sched_balance_running to load balance a NUMA domain. On a 2-socket Granite Rapids system with sub-NUMA clustering enabled, running an OLTP workload, 7.8% of total CPU cycles were previously spent in sched_balance_domain() contending on sched_balance_running before this change. : 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new) : 105 { : 106 return arch_cmpxchg(&v->counter, old, new); 0.00 : ffffffff81326e6c: xor %eax,%eax 0.00 : ffffffff81326e6e: mov $0x1,%ecx 0.00 : ffffffff81326e73: lock cmpxchg %ecx,0x2394195(%rip) # ffffffff836bb010 <sched_balance_running> : 110 sched_balance_domains(): : 12234 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) 99.39 : ffffffff81326e7b: test %eax,%eax 0.00 : ffffffff81326e7d: jne ffffffff81326e99 <sched_balance_domains+0x209> : 12238 if (time_after_eq(jiffies, sd->last_balance + interval)) { 0.00 : ffffffff81326e7f: mov 0x14e2b3a(%rip),%rax # ffffffff828099c0 <jiffies_64> 0.00 : ffffffff81326e86: sub 0x48(%r14),%rax 0.00 : ffffffff81326e8a: cmp %rdx,%rax After applying this fix, sched_balance_domain() is gone from the profile and there is a 5% throughput improvement. [peterz: made it so that redo retains the 'lock' and split out the CPU_NEWLY_IDLE change to a separate patch] Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
2025-11-17Merge back earlier material related to system sleep for 6.19Rafael J. Wysocki-30/+122
2025-11-17sched_ext: Use kvfree_rcu() to release per-cpu ksyncs objectZqiang-8/+1
The free_kick_syncs_rcu() rcu-callback only invoke kvfree() to release per-cpu ksyncs object, this can use kvfree_rcu() replace call_rcu() to release per-cpu ksyncs object in the free_kick_syncs(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-17sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_workZqiang-1/+1
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in the per-cpu irq_work/* task context and there is no rcu-read critical section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to initialize the per-cpu rq->scx.kick_cpus_irq_work in the init_sched_ext_class(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-16mm: make INVALID_PHYS_ADDR a generic macroAnshuman Khandual-2/+0
INVALID_PHYS_ADDR has very similar definitions across the code base. Hence just move that inside header <liux/mm.h> for more generic usage. Also drop the now redundant ones which are no longer required. Link: https://lkml.kernel.org/r/20251021025638.2420216-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16treewide: include linux/pgalloc.h instead of asm/pgalloc.hHarry Yoo-2/+2
For now, including <asm/pgalloc.h> instead of <linux/pgalloc.h> is technically fine unless the .c file calls p*d_populate_kernel() helper functions. But it is a better practice to always include <linux/pgalloc.h>. Include <linux/pgalloc.h> instead of <asm/pgalloc.h> outside arch/. Link: https://lkml.kernel.org/r/20251024113047.119058-3-harry.yoo@oracle.com Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16relay: update relay to use mmap_prepareLorenzo Stoakes-16/+17
It is relatively trivial to update this code to use the f_op->mmap_prepare hook in favour of the deprecated f_op->mmap hook, so do so. Link: https://lkml.kernel.org/r/7c9e82cdddf8b573ea3edb8cdb697363e3ccb5d7.1760959442.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Dave Young <dyoung@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Sumanth Korikkar <sumanthk@linux.ibm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16memcg: net: track network throttling due to memcg memory pressureShakeel Butt-0/+1
The kernel can throttle network sockets if the memory cgroup associated with the corresponding socket is under memory pressure. The throttling actions include clamping the transmit window, failing to expand receive or send buffers, aggressively prune out-of-order receive queue, FIN deferred to a retransmitted packet and more. Let's add memcg metric to track such throttling actions. At the moment memcg memory pressure is defined through vmpressure and in future it may be defined using PSI or we may add more flexible way for the users to define memory pressure, maybe through ebpf. However the potential throttling actions will remain the same, so this newly introduced metric will continue to track throttling actions irrespective of how memcg memory pressure is defined. Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16mm: consistently use current->mm in mm_get_unmapped_area()Ryan Roberts-2/+2
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither take an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and have the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16Merge tag 'mm-hotfixes-stable-2025-11-16-10-40' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 5 are cc:stable, 4 are against mm/ All are singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm, swap: fix potential UAF issue for VMA readahead selftests/user_events: fix type cast for write_index packed member in perf_test lib/test_kho: check if KHO is enabled mm/huge_memory: fix folio split check for anon folios in swapcache MAINTAINERS: update David Hildenbrand's email address crash: fix crashkernel resource shrink mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
2025-11-16convert bpfAl Viro-10/+5
object creation goes through the normal VFS paths or approximation thereof (user_path_create()/done_path_create() in case of bpf_obj_do_pin(), open-coded simple_{start,done}_creating() in bpf_iter_link_pin_kernel() at mount time), removals go entirely through the normal VFS paths (and ->unlink() is simple_unlink() there). Enough to have bpf_dentry_finalize() use d_make_persistent() instead of dget() and we are done. Convert bpf_iter_link_pin_kernel() to simple_{start,done}_creating(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-15crash: fix crashkernel resource shrinkSourabh Jain-1/+1
When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem | grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI <snip...> Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 <snip...> This happens because __crash_shrink_memory()/kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory. Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-15Merge tag 'timers-urgent-2025-11-15' of ↵Linus Torvalds-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a memory leak in the posix timer creation logic" * tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-timers: Plug potential memory leak in do_timer_create()
2025-11-14bpf: don't skip other information if xlated_prog_insns is skippedAltgelt, Max (Nextron)-11/+11
If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()Puranjay Mohan-19/+35
Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit d869d56ca848 ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+Alexei Starovoitov-111/+243
Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds-40/+72
Pull bpf fixes from Alexei Starovoitov: - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks. - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks. - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman) - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen) - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet) - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test widen_imprecise_scalars() with different stack depth bpf: account for current allocated stack depth in widen_imprecise_scalars() bpf: Add bpf_prog_run_data_pointers() selftests/bpf: Add mptcp test with sockmap mptcp: Fix proto fallback detection with BPF mptcp: Disallow MPTCP subflows from sockmap selftests/bpf: Add stacktrace ips test for raw_tp selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()" bpf: add _impl suffix for bpf_stream_vprintk() kfunc bpf:add _impl suffix for bpf_task_work_schedule* kfuncs selftests/bpf: Add tests for livepatch + bpf trampoline ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct() ftrace: Fix BPF fexit with livepatch
2025-11-14bpf: Handle return value of ftrace_set_filter_ip in register_fentryMenglong Dong-1/+3
The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14bpf: Add missing checks to avoid verbose verifier logEduard Zingerman-4/+6
There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docsTejun Heo-4/+23
With the buddy lockup detector, smp_processor_id() returns the detecting CPU, not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the locked CPU number from watchdog_hardlockup_check() as a parameter instead. Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and scx_rcu_cpu_stall() documenting their return value semantics. Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-14bpf: Prevent nesting overflow in bpf_try_get_buffersSahil Chandna-0/+3
bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/ Fixes: 4223bf833c849 ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14time: Fix a few typos in time[r] related code commentsJianyun Gao-2/+2
Signed-off-by: Jianyun Gao <jianyungao89@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250927093411.1509275-1-jianyungao89@gmail.com
2025-11-14tracing: Convert function graph set_flags() to use a switch() statementSteven Rostedt-5/+7
Currently the set_flags() of the function graph tracer has a bunch of: if (bit == FLAG1) { [..] } if (bit == FLAG2) { [..] } To clean it up a bit, convert it over to a switch statement. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192319.117123664@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Have function graph tracer option sleep-time be per instanceSteven Rostedt-23/+60
Currently the option to have function graph tracer to ignore time spent when a task is sleeping is global when the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.950255167@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Move graph-time out of function graph optionsSteven Rostedt-14/+23
The option "graph-time" affects the function profiler when it is using the function graph infrastructure. It has nothing to do with the function graph tracer itself. The option only affects the global function profiler and does nothing to the function graph tracer. Move it out of the function graph tracer options and make it a global option that is only available at the top level instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.781711154@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Have function graph tracer option funcgraph-irqs be per instanceSteven Rostedt-10/+31
Currently the option to trace interrupts in the function graph tracer is global when the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.613867934@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14time: tick-oneshot: Add missing Return and parameter descriptions to kernel-docSunday Adelodun-1/+19
Several functions in kernel/time/tick-oneshot.c are missing parameter and return value descriptions in their kernel-doc comments. This causes warnings during doc generation. Update the kernel-doc blocks to include detailed @param and Return: descriptions for better clarity and to fix kernel-doc warnings. No functional code changes are made. Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251106113938.34693-3-adelodunolaoluwa@yahoo.com
2025-11-14bpf: account for current allocated stack depth in widen_imprecise_scalars()Eduard Zingerman-2/+4
The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, differnt param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for a second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14PM: suspend: Make pm_test delay interruptible by wakeup eventsRiwen Lu-1/+5
Modify the suspend_test() function to allow the test delay to be interrupted by wakeup events. This improves the responsiveness of the system during suspend testing when wakeup events occur, allowing the suspend process to proceed without waiting for the full test delay to complete when wakeup events are detected. Additionally, using msleep() instead of mdelay() avoids potential soft lockup "CPU stuck" issues when long test delays are configured. Co-developed-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: Riwen Lu <luriwen@kylinos.cn> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251113012638.1362013-1-luriwen@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-14posix-timers: Plug potential memory leak in do_timer_create()Eslam Khafagy-6/+6
When posix timer creation is set to allocate a given timer ID and the access to the user space value faults, the function terminates without freeing the already allocated posix timer structure. Move the allocation after the user space access to cure that. [ tglx: Massaged change log ] Fixes: ec2d0c04624b3 ("posix-timers: Provide a mechanism to allocate a given timer ID") Reported-by: syzbot+9c47ad18f978d4394986@syzkaller.appspotmail.com Suggested-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Eslam Khafagy <eslam.medhat1993@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251114122739.994326-1-eslam.medhat1993@gmail.com Closes: https://lore.kernel.org/all/69155df4.a70a0220.3124cb.0017.GAE@google.com/T/
2025-11-14hrtimer: Store time as ktime_t in restart blockThomas Weißschuh-4/+4
The hrtimer core uses ktime_t to represent times, use that also for the restart block. CPU timers internally use nanoseconds instead of ktime_t but use the same restart block, so use the correct accessors for those. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251110-restart-block-expiration-v1-3-5d39cc93df4f@linutronix.de
2025-11-14futex: Store time as ktime_t in restart blockThomas Weißschuh-5/+4
The futex core uses ktime_t to represent times, use that also for the restart block. This allows the simplification of the accessors. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20251110-restart-block-expiration-v1-2-5d39cc93df4f@linutronix.de
2025-11-14nstree: fix kernel-doc comments for internal functionsKriish Sharma-2/+3
Documentation build reported: Warning: kernel/nstree.c:325 function parameter 'ns_tree' not described in '__ns_tree_adjoined_rcu' Warning: kernel/nstree.c:325 expecting prototype for ns_tree_adjoined_rcu(). Prototype was for __ns_tree_adjoined_rcu() instead Warning: kernel/nstree.c:353 expecting prototype for ns_tree_gen_id(). Prototype was for __ns_tree_gen_id() instead The kernel-doc comments for `__ns_tree_adjoined_rcu()` and `__ns_tree_gen_id()` had mismatched function names and a missing parameter description. This patch updates the function names in the kernel-doc headers and adds the missing `@ns_tree` parameter description for `__ns_tree_adjoined_rcu()`. Fixes: 885fc8ac0a4d ("nstree: make iterator generic") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511061542.0LO7xKs8-lkp@intel.com Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Link: https://patch.msgid.link/20251111112533.2254432-1-kriish.sharma2006@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14nsproxy: fix free_nsproxy() and simplify create_new_namespaces()Christian Brauner-16/+20
Make it possible to handle NULL being passed to the reference count helpers instead of forcing the caller to handle this. Afterwards we can nicely allow a cleanup guard to handle nsproxy freeing. Active reference count handling is not done in nsproxy_free() but rather in free_nsproxy() as nsproxy_free() is also called from setns() failure paths where a new nsproxy has been prepared but has not been marked as active via switch_task_namespaces(). Link: https://lore.kernel.org/690bfb9e.050a0220.2e3c35.0013.GAE@google.com Link: https://patch.msgid.link/20251111-sakralbau-guthaben-7dcc277d337f@brauner Fixes: 3c9820d5c64a ("ns: add active reference count") Reported-by: syzbot+0b2e79f91ff6579bfa5b@syzkaller.appspotmail.com Reported-by: syzbot+0a8655a80e189278487e@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14padata: remove __padata_list_init()Tetsuo Handa-8/+4
syzbot is reporting possibility of deadlock due to sharing lock_class_key between padata_init_squeues() and padata_init_reorder_list(). This is a false positive, for these callers initialize different object. Unshare lock_class_key by embedding __padata_list_init() into these callers. Reported-by: syzbot+bd936ccd4339cea66e6b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bd936ccd4339cea66e6b Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-11-14syscore: Pass context data to callbacksThierry Reding-23/+69
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-13Merge tag 'pm-6.18-rc6' of ↵Linus Torvalds-9/+13
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix issues related to the handling of compressed hibernation images and a recent intel_pstate driver regression: - Fix issues related to using inadequate data types and incorrect use of atomic variables in the compressed hibernation images handling code that were introduced during the 6.9 development cycle (Mario Limonciello) - Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate to address a regression preventing users from enabling turbo frequencies post-boot (Srinivas Pandruvada)" * tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails
2025-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-62/+158
Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c 96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") 61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13trace/pid_list: optimize pid_list->lock contentionYongliang Gao-9/+22
When the system has many cores and task switching is frequent, setting set_ftrace_pid can cause frequent pid_list->lock contention and high system sys usage. For example, in a 288-core VM environment, we observed 267 CPUs experiencing contention on pid_list->lock, with stack traces showing: #4 [ffffa6226fb4bc70] native_queued_spin_lock_slowpath at ffffffff99cd4b7e #5 [ffffa6226fb4bc90] _raw_spin_lock_irqsave at ffffffff99cd3e36 #6 [ffffa6226fb4bca0] trace_pid_list_is_set at ffffffff99267554 #7 [ffffa6226fb4bcc0] trace_ignore_this_task at ffffffff9925c288 #8 [ffffa6226fb4bcd8] ftrace_filter_pid_sched_switch_probe at ffffffff99246efe #9 [ffffa6226fb4bcf0] __schedule at ffffffff99ccd161 Replaces the existing spinlock with a seqlock to allow concurrent readers, while maintaining write exclusivity. Link: https://patch.msgid.link/20251113000252.1058144-1-leonylgao@gmail.com Reviewed-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-13tracing: Have function graph tracer define options per instanceSteven Rostedt-7/+11
Currently the function graph tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones. Currently the global mask causes confusion: # cd /sys/kernel/tracing # mkdir instances/foo # echo function_graph > instances/foo/current_tracer # echo 1 > options/funcgraph-args # echo function_graph > current_tracer # cat trace [..] 2) | _raw_spin_lock_irq(lock=0xffff96b97dea16c0) { 2) 0.422 us | do_raw_spin_lock(lock=0xffff96b97dea16c0); 7) | rcu_sched_clock_irq(user=0) { 2) 1.478 us | } 7) 0.758 us | rcu_is_cpu_rrupt_from_idle(); 2) 0.647 us | enqueue_hrtimer(timer=0xffff96b97dea2058, base=0xffff96b97dea1740, mode=0); # cat instances/foo/options/funcgraph-args 1 # cat instances/foo/trace [..] 4) | __x64_sys_read() { 4) | ksys_read() { 4) 0.755 us | fdget_pos(); 4) | vfs_read() { 4) | rw_verify_area() { 4) | security_file_permission() { 4) | apparmor_file_permission() { 4) | common_file_perm() { 4) | aa_file_perm() { 4) | rcu_read_lock_held() { [..] The above shows that updating the "funcgraph-args" option at the top level instance also updates the "funcgraph-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.641030027@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-13Merge branch 'pm-sleep'Rafael J. Wysocki-9/+13
Merge fixes for issues related to the handling of compressed hibernation images that were introduced during the 6.9 development cycle. * pm-sleep: PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails
2025-11-13sched_ext: Fix possible deadlock in the deferred_irq_workfn()Zqiang-1/+1
For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in the per-cpu irq_work/* task context and not disable-irq, if the rq returned by container_of() is current CPU's rq, the following scenarios may occur: lock(&rq->__lock); <Interrupt> lock(&rq->__lock); This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn() is always invoked in hard-irq context. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-13bpf: Free special fields when update [lru_,]percpu_hash mapsLeon Hwang-2/+8
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-13Merge tag 'v6.18-rc5' into objtool/core, to pick up fixesIngo Molnar-95/+261
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-12bpf: Adjust return value for queue destruction in rqspinlockKumar Kartikeya Dwivedi-1/+1
Return -ETIMEDOUT whenever non-head waiters are signalled by head, and fix oversight in commit 7bd6e5ce5be6 ("rqspinlock: Disable queue destruction for deadlocks"). We no longer signal on deadlocks. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251111013827.1853484-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-12sched_ext: Update comments replacing breather with aborting mechanismAndrea Righi-4/+4
Commit 5ebec443fb96a ("sched_ext: Exit dispatch and move operations immediately when aborting") replaced the breather mechanism with the scx_aborting flag. Update comments removing references to the breather mechanism to avoid confusion. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched/ext: convert scx_tasks_lock to raw spinlockEmil Tsalapatis-8/+8
Update scx_task_locks so that it's safe to lock/unlock in a non-sleepable context in PREEMPT_RT kernels. scx_task_locks is (non-raw) spinlock used to protect the list of tasks under SCX. This list is updated during from finish_task_switch(), which cannot sleep. Regular spinlocks can be locked in such a context in non-RT kernels, but are sleepable under when CONFIG_PREEMPT_RT=y. Convert scx_task_locks into a raw spinlock, which is not sleepable even on RT kernels. Sample backtrace: <TASK> dump_stack_lvl+0x83/0xa0 __might_resched+0x14a/0x200 rt_spin_lock+0x61/0x1c0 ? sched_ext_dead+0x2d/0xf0 ? lock_release+0xc6/0x280 sched_ext_dead+0x2d/0xf0 ? srso_alias_return_thunk+0x5/0xfbef5 finish_task_switch.isra.0+0x254/0x360 __schedule+0x584/0x11d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? tick_nohz_idle_exit+0x7e/0x120 schedule_idle+0x23/0x40 cpu_startup_entry+0x29/0x30 start_secondary+0xf8/0x100 common_startup_64+0x13e/0x148 </TASK> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12release_task: kill unnecessary rcu_read_lock() around dec_rlimit_ucounts()Oleg Nesterov-3/+1
rcu_read_lock() was added to shut RCU-lockdep up when this code used __task_cred()->rcu_dereference(), but after the commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") it is no longer needed: task_ucounts()->task_cred_xxx() takes rcu_read_lock() itself. NOTE: task_ucounts() returns the pointer to another rcu-protected data, struct ucounts. So it should either be used when task->real_cred and thus task->real_cred->ucounts is stable (release_task, copy_process, copy_creds), or it should be called under rcu_read_lock(). In both cases it is pointless to take rcu_read_lock() to read the cred->ucounts pointer. Link: https://lkml.kernel.org/r/20251026143140.GA22463@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>