path: root/kernel
Age | Commit message | Author | Files | Lines
2025-09-05 | cgroup: WQ_PERCPU added to alloc_workqueue users | Marco Crivellari | 2 | -2/+2
Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they're needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to explicitly request per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
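A minimal sketch of what a converted caller looks like under this scheme (the workqueue names are illustrative, not taken from the patch):

	/* Before: per-CPU behavior was the implicit default. */
	wq = alloc_workqueue("my_wq", 0, 0);

	/* After: per-CPU behavior must be requested explicitly. */
	wq = alloc_workqueue("my_wq", WQ_PERCPU, 0);

	/* Unbound queues keep passing WQ_UNBOUND until it becomes the default. */
	uwq = alloc_workqueue("my_unbound_wq", WQ_UNBOUND, 0);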
2025-09-05 | cgroup: replace use of system_wq with system_percpu_wq | Marco Crivellari | 1 | -1/+1
Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name indicates that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work(), queue_delayed_work() and mod_delayed_work() will now use the new per-cpu wq: if a user still sticks to the old name, a warning will be printed along with a redirect to the new wq. This patch adds the new system_percpu_wq except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
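A hedged sketch of what the rename looks like at a call site (the work items are illustrative):

	static DECLARE_WORK(my_work, my_work_fn);
	static DECLARE_DELAYED_WORK(my_dwork, my_dwork_fn);

	/* Old: the per-CPU nature of system_wq is hidden behind the generic name. */
	queue_work(system_wq, &my_work);

	/* New: the per-CPU constraint is visible at the call site. */
	queue_work(system_percpu_wq, &my_work);
	queue_delayed_work(system_percpu_wq, &my_dwork, HZ);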
2025-09-04 | cgroup: Remove unused local variables from cgroup_procs_write_finish() | Tejun Heo | 1 | -3/+0
d8b269e009bb ("cgroup: Remove unused cgroup_subsys::post_attach") made $ss and $ssid unused but didn't drop them, leading to compilation warnings. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
2025-09-04 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 3 | -10/+16
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h c51613fa276f ("net: add sk->sk_drop_counters") 5d6b58c932ec ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 | sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning | Andrea Righi | 1 | -2/+5
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a NULL pointer dereference if the kfunc is called before a BPF scheduler is fully attached, for example, when invoked from a BPF timer or during ops.init(): [ 50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331 ... [ 50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0 ... [ 50.787661] Call Trace: [ 50.788398] <TASK> [ 50.789061] bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8 [ 50.792477] bpf_timer_cb+0x7e/0x140 [ 50.796003] hrtimer_run_softirq+0x91/0xe0 [ 50.796952] handle_softirqs+0xce/0x3c0 [ 50.799087] run_ksoftirqd+0x3e/0x70 [ 50.800197] smpboot_thread_fn+0x133/0x290 [ 50.802320] kthread+0x115/0x220 [ 50.804984] ret_from_fork+0x17a/0x1d0 [ 50.806920] ret_from_fork_asm+0x1a/0x30 [ 50.807799] </TASK> Fix this by only printing the warning once the scheduler is fully registered. Fixes: 5c48d88fe0049 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()") Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | change the calling conventions for vfs_parse_fs_string() | Al Viro | 1 | -2/+1
The absolute majority of callers pass a 4th argument equal to strlen() of the 3rd one. Drop the v_size argument and add vfs_parse_fs_qstr() for the cases that want an independent length. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
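Roughly, assuming the new signatures follow the description above (the qstr-variant prototype is an assumption inferred from the changelog, not the verbatim patch):

	/* Before: almost every caller passed strlen() of the value. */
	err = vfs_parse_fs_string(fc, "source", name, strlen(name));

	/* After: the length argument is gone. */
	err = vfs_parse_fs_string(fc, "source", name);

	/* Callers that need an independent length use the qstr variant. */
	struct qstr q = QSTR_LEN(name, len);
	err = vfs_parse_fs_qstr(fc, "source", &q);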
2025-09-04 | PM: sleep: Make pm_wakeup_clear() call more clear | Samuel Wu | 2 | -1/+1
Move pm_wakeup_clear() to the same location as other functions that do bookkeeping prior to suspend_prepare(). Since calling pm_wakeup_clear() is a prerequisite to setting up for suspend and enabling functionalities of suspend (like aborting during suspend), moving pm_wakeup_clear() higher up the call stack makes its intent more clear and obvious that it is called prior to suspend_prepare(). After this change, there is a slightly larger window when abort events can be registered, but otherwise suspend functionality is the same. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> Link: https://patch.msgid.link/20250821004237.2712312-2-wusamuel@google.com Reviewed-by: Saravana Kannan <saravanak@google.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-04 | workqueue: Provide a handshake for canceling BH workers | Sebastian Andrzej Siewior | 1 | -9/+41
While a BH work item is canceled, the core code spins until it determines that the item completed. On PREEMPT_RT the spinning relies on a lock in local_bh_disable() to avoid a live lock if the canceling thread has higher priority than the BH-worker and preempts it. This lock ensures that the BH-worker makes progress by PI-boosting it. This lock in local_bh_disable() is a central per-CPU BKL and is about to be removed. To provide the required synchronisation, add a per-pool lock. The lock is acquired by the bh_worker at the beginning and held while the individual callbacks are invoked. To enforce progress in case of interruption, __flush_work() needs to acquire the lock. This will flush all BH-work items assigned to that pool. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | cgroup: Remove unused cgroup_subsys::post_attach | Chuyi Zhou | 1 | -4/+0
The cgroup_subsys::post_attach callback was introduced in commit 5cf1cacb49ae ("cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback") and only cpuset would use this callback to wait for the mm migration to complete at the end of __cgroup_procs_write(). Since the previous patch defers the flush operation until returning to userspace, no one uses this callback now. Remove this callback from cgroup_subsys. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work | Chuyi Zhou | 1 | -5/+24
Now in cpuset_attach(), we need to synchronously wait for flush_workqueue to complete. The execution time of flushing cpuset_migrate_mm_wq depends on the amount of mm migration initiated by cpusets at that time. When the cpuset.mems of a cgroup occupying a large amount of memory is modified, it may trigger extensive mm migration, causing cpuset_attach() to block on flush_workqueue for an extended period. This could be dangerous because cpuset_attach() is within the critical section of cgroup_mutex, which may ultimately cause all cgroup-related operations in the system to be blocked. This patch attempts to defer the flush_workqueue() operation until returning to userspace using task_work, as originally proposed by Tejun [1], so that the flush happens after cgroup_mutex is dropped. That way we maintain the operation synchronicity while avoiding bothering anyone else. [1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883 Originally-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
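A generic sketch of the deferral pattern described above (the callback name and static callback_head are illustrative; the actual patch tracks its own per-attach state):

	static void cpuset_flush_on_ret2user(struct callback_head *cb)
	{
		/* Runs on return to userspace, after cgroup_mutex has been dropped. */
		flush_workqueue(cpuset_migrate_mm_wq);
	}

	static struct callback_head cpuset_flush_cb;

	/* In the attach path, instead of flushing while holding cgroup_mutex: */
	init_task_work(&cpuset_flush_cb, cpuset_flush_on_ret2user);
	task_work_add(current, &cpuset_flush_cb, TWA_RESUME);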
2025-09-04 | cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask | Chuyi Zhou | 1 | -1/+2
It is unnecessary to always wait for the flush operation of cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The flush_workqueue can be executed only when cpuset.mems is modified. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn() | Zqiang | 1 | -4/+0
The wq_watchdog_timer_fn() is executed in softirq context, which is already an RCU read-side critical section; this commit therefore removes rcu_read_lock/unlock() in wq_watchdog_timer_fn(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested() | Zqiang | 1 | -2/+0
The preempt_disable/enable() pair already forms an RCU read-side critical section; this commit therefore removes rcu_read_lock/unlock() in workqueue_congested(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 | bpf: add bpf_strcasecmp kfunc | Rong Tao | 1 | -20/+48
The bpf_strcasecmp() function performs the same comparison as bpf_strcmp(), except that it ignores the case of the characters. Signed-off-by: Rong Tao <rongtao@cestc.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/tencent_292BD3682A628581AA904996D8E59F4ACD06@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
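A rough usage sketch from a BPF program, assuming the kfunc mirrors bpf_strcmp()'s two-pointer signature (the extern declaration, attach point and strings are illustrative, not from the patch):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	extern int bpf_strcasecmp(const char *s1, const char *s2) __ksym __weak;

	char target[] = "Bash";

	SEC("syscall")
	int check_case_insensitive(void *ctx)
	{
		/* Returns 0 when the strings match ignoring case, like strcasecmp(3). */
		if (bpf_strcasecmp(target, "bash") == 0)
			bpf_printk("matched regardless of case");
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";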
2025-09-04 | audit: init ab->skb_list earlier in audit_buffer_alloc() | Eric Dumazet | 1 | -1/+2
syzbot found a bug in audit_buffer_alloc() if nlmsg_new() returns NULL. We need to initialize ab->skb_list before calling audit_buffer_free() which will use both the skb_list spinlock and list pointers. Fixes: eb59d494eebd ("audit: add record for multiple task security contexts") Reported-by: syzbot+bb185b018a51f8d91fd2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/68b93e3c.a00a0220.eb3d.0000.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@redhat.com> Cc: audit@vger.kernel.org Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-04 | time: Build generic update_vsyscall() only with generic time vDSO | Thomas Weißschuh | 1 | -1/+1
The generic vDSO can be used without the time-related functionality. In that case the generic update_vsyscall() from kernel/time/vsyscall.c should not be built. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250826-vdso-cleanups-v1-5-d9b65750e49f@linutronix.de
2025-09-03 | time: export timespec64_add_safe() symbol | Gatien Chevallier | 1 | -0/+1
Export the timespec64_add_safe() symbol so that this function can be used in modules that do time-related computations. Signed-off-by: Gatien Chevallier <gatien.chevallier@foss.st.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250901-relative_flex_pps-v4-1-b874971dfe85@foss.st.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
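A minimal module-side sketch of the kind of use this export enables (the helper and values are illustrative):

	#include <linux/time64.h>

	static struct timespec64 next_deadline(struct timespec64 now)
	{
		struct timespec64 delta = { .tv_sec = 1, .tv_nsec = 500 * NSEC_PER_MSEC };

		/* Saturates at TIME64_MAX instead of overflowing. */
		return timespec64_add_safe(now, delta);
	}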
2025-09-03 | sched_ext: deprecation warn for scx_bpf_cpu_rq() | Christian Loehle | 2 | -0/+10
scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe. For the common use-cases scx_bpf_locked_rq() and scx_bpf_cpu_curr() work, so add a deprecation warning to scx_bpf_cpu_rq() so it can eventually be removed. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 | sched_ext: Introduce scx_bpf_cpu_curr() | Christian Loehle | 1 | -0/+14
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
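A rough BPF-side sketch, assuming the kfunc takes a CPU number and returns the remote rq's current task (possibly NULL); the exact prototype is inferred from the changelog, and the helper is illustrative:

	extern struct task_struct *scx_bpf_cpu_curr(s32 cpu) __ksym __weak;

	static bool remote_cpu_is_idle(s32 cpu)
	{
		struct task_struct *curr;
		bool idle = false;

		bpf_rcu_read_lock();
		curr = scx_bpf_cpu_curr(cpu);
		if (curr)
			idle = curr->flags & PF_IDLE;
		bpf_rcu_read_unlock();

		return idle;
	}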
2025-09-03 | sched_ext: Introduce scx_bpf_locked_rq() | Christian Loehle | 1 | -0/+23
Most fields of the rq returned by scx_bpf_cpu_rq() assume that its rq lock is held; they become meaningless without the rq lock, too. Provide a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold the rq lock of that rq. Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as scx_bpf_cpu_rq() should have been, too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 | sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations | Tejun Heo | 3 | -56/+14
SCX hooks into CPU cgroup controller operations and read-locks scx_cgroup_rwsem to exclude them while enabling and disabling schedulers. While this works, it's unnecessarily complicated given that cgroup_[un]lock() are available and thus the cgroup operations can be locked out that way. Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop scx_cgroup_finish_attach() which is no longer necessary. Drop the now unnecessary rcu locking and css ref bumping in scx_cgroup_init() and scx_cgroup_exit(). As the scx_cgroup_set_weight/bandwidth() paths aren't protected by cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain the locking there. This is overall simpler and will also allow enable/disable paths to synchronize against cgroup changes independent of the CPU controller. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 | sched_ext: Put event_stats_cpu in struct scx_sched_pcpu | Tejun Heo | 2 | -16/+19
scx_sched.event_stats_cpu holds the percpu counters that are used to track stats. Introduce struct scx_sched_pcpu and move the counters inside. This will ease adding more per-CPU fields. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 | sched_ext: Move internal type and accessor definitions to ext_internal.h | Tejun Heo | 4 | -1057/+1062
There currently isn't a place to put SCX-internal types and accessors to be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h and move internal type and accessor definitions there. This trims ext.c a bit and makes future additions easier. Pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 | sched_ext: Keep bypass on between enable failure and scx_disable_workfn() | Tejun Heo | 1 | -1/+1
scx_enable() turns on the bypass mode while enable is in progress. If enabling fails, it turns off the bypass mode and then triggers scx_error(). scx_error() will trigger scx_disable_workfn() which will turn on the bypass mode again and unload the failed scheduler. This moves the system out of bypass mode between the enable error path and the disable path, which is unnecessary and can be brittle - e.g. the thread running scx_enable() may already be on the failed scheduler and can be switched out before it triggers scx_error() leading to a stall. The watchdog would eventually kick in, so the situation isn't critical but is still suboptimal. There is nothing to be gained by turning off the bypass mode between scx_enable() failure and scx_disable_workfn(). Keep bypass on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 | sched_ext: Make explicit scx_task_iter_relock() calls unnecessary | Tejun Heo | 1 | -20/+23
During tasks iteration, the locks can be dropped using scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards, scx_task_iter_relock() has to be called prior to other iteration operations, which is error-prone. This can be easily automated by tracking whether scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It already tracks whether the task's rq is locked after all. - Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is held. - Rename scx_task_iter->locked to scx_task_iter->locked_task to better distinguish it from ->list_locked. - Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which is automatically called by scx_task_iter_next() and scx_task_iter_stop(). - Drop explicit scx_task_iter_relock() calls. The resulting behavior should be equivalent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 | audit: fix out-of-bounds read in audit_compare_dname_path() | Stanislav Fort | 1 | -1/+1
When a watch on dir=/ is combined with an fsnotify event for a single-character name directly under / (e.g., creating /a), an out-of-bounds read can occur in audit_compare_dname_path(). The helper parent_len() returns 1 for "/". In audit_compare_dname_path(), when parentlen equals the full path length (1), the code sets p = path + 1 and pathlen = 1 - 1 = 0. The subsequent loop then dereferences p[pathlen - 1] (i.e., p[-1]), causing an out-of-bounds read. Fix this by adding a pathlen > 0 check to the while loop condition to prevent the out-of-bounds access. Cc: stable@vger.kernel.org Fixes: e92eebb0d611 ("audit: fix suffixed '/' filename matching") Reported-by: Stanislav Fort <disclosure@aisle.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com> [PM: subject tweak, sign-off email fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>
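The shape of the fix, per the description above (simplified sketch using the variable names from the changelog, not the verbatim diff):

	p = path + parentlen;
	pathlen -= parentlen;

	/* Strip trailing slashes; the pathlen > 0 guard prevents reading p[-1]
	 * when parentlen covers the whole path, e.g. a watch on dir=/. */
	while (pathlen > 0 && p[pathlen - 1] == '/')
		pathlen--;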
2025-09-03 | cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks() | Waiman Long | 1 | -0/+3
Commit 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") separates out the freeing of tmpmasks into a new free_tmpmasks() helper but removes the NULL pointer check in the process. Unfortunately a NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug() if cpuset v1 is active. This can cause a segmentation fault and crash the kernel. Fix that by adding the NULL pointer check to free_tmpmasks(). Fixes: 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com> Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
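The fix amounts to an early return on NULL, roughly as follows (a sketch; field names assumed from the existing struct tmpmasks in kernel/cgroup/cpuset.c):

	static inline void free_tmpmasks(struct tmpmasks *tmp)
	{
		if (!tmp)
			return;

		free_cpumask_var(tmp->new_cpus);
		free_cpumask_var(tmp->addmask);
		free_cpumask_var(tmp->delmask);
	}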
2025-09-03 | sched: Fix sched_numa_find_nth_cpu() if mask offline | Christian Loehle | 1 | -0/+2
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest' CPU in sched_domains_numa_masks and given cpus mask. However they might not intersect if all CPUs in the cpus mask are offline. bsearch will return NULL in that case, bail out instead of dereferencing a bogus pointer. The previous behaviour lead to this bug when using maxcpus=4 on an rk3399 (LLLLbb) (i.e. booting with all big CPUs offline): [ 1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000 [ 1.423635] Mem abort info: [ 1.423889] ESR = 0x0000000096000006 [ 1.424227] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.424715] SET = 0, FnV = 0 [ 1.424995] EA = 0, S1PTW = 0 [ 1.425279] FSC = 0x06: level 2 translation fault [ 1.425735] Data abort info: [ 1.425998] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 1.426499] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 1.426952] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000 [ 1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000 [ 1.429014] Internal error: Oops: 0000000096000006 [#1] SMP [ 1.429525] Modules linked in: [ 1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT [ 1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT) [ 1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488 [ 1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488 [ 1.432543] sp : ffffffc084e1b960 [ 1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0 [ 1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378 [ 1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff [ 1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7 [ 1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372 [ 1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860 [ 1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000 [ 1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68 [ 1.439332] Call trace: [ 1.439559] sched_numa_find_nth_cpu+0x2a0/0x488 (P) [ 1.440016] smp_call_function_any+0xc8/0xd0 [ 1.440416] armv8_pmu_init+0x58/0x27c [ 1.440770] armv8_cortex_a72_pmu_init+0x20/0x2c [ 1.441199] arm_pmu_device_probe+0x1e4/0x5e8 [ 1.441603] armv8_pmu_device_probe+0x1c/0x28 [ 1.442007] platform_probe+0x5c/0xac [ 1.442347] really_probe+0xbc/0x298 [ 1.442683] __driver_probe_device+0x78/0x12c [ 1.443087] driver_probe_device+0xdc/0x160 [ 1.443475] __driver_attach+0x94/0x19c [ 1.443833] bus_for_each_dev+0x74/0xd4 [ 1.444190] driver_attach+0x24/0x30 [ 1.444525] bus_add_driver+0xe4/0x208 [ 1.444874] driver_register+0x60/0x128 [ 1.445233] __platform_driver_register+0x24/0x30 [ 1.445662] armv8_pmu_driver_init+0x28/0x4c [ 1.446059] do_one_initcall+0x44/0x25c [ 1.446416] kernel_init_freeable+0x1dc/0x3bc [ 1.446820] kernel_init+0x20/0x1d8 [ 1.447151] ret_from_fork+0x10/0x20 [ 1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803) [ 1.448040] ---[ end trace 0000000000000000 ]--- [ 1.448483] note: swapper/0[1] exited with preempt_count 1 [ 1.449047] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b [ 1.449741] SMP: stopping secondary CPUs [ 1.450105] Kernel Offset: disabled [ 1.450419] CPU features: 0x000000,00080000,20002001,0400421b [ 1.450935] Memory Limit: none [ 1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in smp_call_function_any -> smp_call_function_single -> generic_exec_single we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is handled correctly. Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()") Cc: stable@vger.kernel.org Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
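The described fix, in sketch form (variable names assumed from the existing sched_numa_find_nth_cpu() implementation; not the literal diff):

	hop_masks = bsearch(&k, k.masks, sched_domains_numa_levels,
			    sizeof(k.masks[0]), hop_cmp);
	if (!hop_masks)
		goto unlock;	/* no intersection: fall through with ret == nr_cpu_ids */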
2025-09-03 | genirq/test: Ensure CPU 1 is online for hotplug test | Brian Norris | 1 | -0/+2
It's possible to run these tests on platforms that think they have a hotpluggable CPU1, but for whatever reason, CPU1 is not online and can't be brought online: # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210 Expected remove_cpu(1) == 0, but remove_cpu(1) == 1 (0x1) CPU1: failed to boot: -38 # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:214 Expected add_cpu(1) == 0, but add_cpu(1) == -38 (0xffffffffffffffda) Check that CPU1 is actually online before trying to run the test. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-7-briannorris@chromium.org
2025-09-03 | genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions | Brian Norris | 1 | -4/+0
Not all platforms use the generic IRQ migration code, even if they select GENERIC_IRQ_MIGRATION. (See, for example, powerpc / pseries_cpu_disable().) If such platforms don't perform managed shutdown the same way, the interrupt may not actually shut down, and these tests fail: [ 4.357022][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:211 [ 4.357022][ T101] Expected irqd_is_activated(data) to be false, but is true [ 4.358128][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212 [ 4.358128][ T101] Expected irqd_is_started(data) to be false, but is true [ 4.375558][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:216 [ 4.375558][ T101] Expected irqd_is_activated(data) to be false, but is true [ 4.376088][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:217 [ 4.376088][ T101] Expected irqd_is_started(data) to be false, but is true [ 4.377851][ T1] # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1 [ 4.377901][ T1] not ok 4 irq_cpuhotplug_test [ 4.378073][ T1] # irq_test_cases: pass:3 fail:1 skip:0 total:4 Rather than test that PowerPC performs migration the same way as the interrupt core, just drop the state checks. The point of the test was to ensure that the code kept |depth| balanced, which can still be tested for. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-6-briannorris@chromium.org
2025-09-03 | genirq/test: Depend on SPARSE_IRQ | Brian Norris | 1 | -0/+1
Some architectures have a static interrupt layout, with a limited number of interrupts. Without SPARSE_IRQ, the test may not be able to allocate any fake interrupts, and the test will fail. (This occurs on ARCH=m68k, for example.) Additionally, managed-affinity is only supported with CONFIG_SPARSE_IRQ=y, so irq_shutdown_depth_test() and irq_cpuhotplug_test() would fail without it. Add a 'SPARSE_IRQ' dependency to avoid these problems. Many architectures 'select SPARSE_IRQ', so this is easy to miss. Notably, this also excludes ARCH=um from running any of these tests, even though some of them might work. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-5-briannorris@chromium.org
2025-09-03 | genirq/test: Fail early if interrupt request fails | Brian Norris | 1 | -5/+5
Requesting an interrupt is part of the basic test setup. If it fails, most of the subsequent tests are likely to fail, and the output gets noisy. Use "assert" to fail early. Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-4-briannorris@chromium.org
2025-09-03 | genirq/test: Factor out fake-virq setup | Brian Norris | 1 | -25/+20
A few things need to be repeated in tests. Factor out the creation of fake interrupts. Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-3-briannorris@chromium.org
2025-09-03 | genirq/test: Select IRQ_DOMAIN | Brian Norris | 1 | -0/+1
These tests use irq_domain_alloc_descs() and so require CONFIG_IRQ_DOMAIN. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-2-briannorris@chromium.org Closes: https://lore.kernel.org/lkml/ded44edf-eeb7-420c-b8a8-d6543b955e6e@roeck-us.net/
2025-09-03 | genirq/test: Fix depth tests on architectures with NOREQUEST by default | David Gow | 1 | -0/+12
The new irq KUnit tests fail on some architectures (notably PowerPC and 32-bit ARM), as the request_irq() call fails due to the ARCH_IRQ_INIT_FLAGS containing IRQ_NOREQUEST, yielding the following errors: [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:88 [10:17:45] Expected ret == 0, but [10:17:45] ret == -22 (0xffffffffffffffea) [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:90 [10:17:45] Expected desc->depth == 0, but [10:17:45] desc->depth == 1 (0x1) [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:93 [10:17:45] Expected desc->depth == 1, but [10:17:45] desc->depth == 2 (0x2) By clearing IRQ_NOREQUEST from the interrupt descriptor, these tests now pass on ARM and PowerPC. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Brian Norris <briannorris@chromium.org> Link: https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com
2025-09-03 | genirq: Add support for warning on long-running interrupt handlers | Wladislav Wiebe | 1 | -1/+48
Introduce a mechanism to detect and warn about prolonged interrupt handlers. With a new command-line parameter (irqhandler.duration_warn_us=), users can configure the duration threshold in microseconds above which a warning of the following format is emitted: "[CPU14] long duration of IRQ[159:bad_irq_handler [long_irq]], took: 1330 us" The implementation uses local_clock() to measure the execution duration of the generic IRQ per-CPU event handler. Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/all/20250804093525.851-1-wladislav.wiebe@nokia.com
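Example of enabling it from the kernel command line, using the parameter named above to warn whenever a single handler invocation exceeds 1000 microseconds (the threshold value is arbitrary):

	irqhandler.duration_warn_us=1000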
2025-09-03 | vdso/vsyscall: Avoid slow division loop in auxiliary clock update | Thomas Weißschuh | 2 | -4/+10
The call to __iter_div_u64_rem() in vdso_time_update_aux() is a wrapper around subtraction. It cannot be used to divide large numbers, as that introduces long, computationally expensive delays. A regular u64 division is also not possible in the timekeeper update path as it can be too slow. Instead of splitting the ktime_t offset into second and subsecond components during the timekeeper update fast-path, do it together with the adjustment of tk->offs_aux in the slow-path. This is equivalent to the handling of offs_boot and monotonic_to_boot. Reuse the storage of monotonic_to_boot for the new field, as it is not used by auxiliary timekeepers. Fixes: 380b84e168e5 ("vdso/vsyscall: Update auxiliary clock data in the datapage") Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250825-vdso-auxclock-division-v1-1-a1d32a16a313@linutronix.de Closes: https://lore.kernel.org/lkml/aKwsNNWsHJg8IKzj@localhost/
2025-09-03 | perf: Fix the POLL_HUP delivery breakage | Kan Liang | 1 | -0/+1
The event_limit can be set by the PERF_EVENT_IOC_REFRESH to limit the number of events. When the event_limit reaches 0, the POLL_HUP signal should be sent. But it's missed. The corresponding counter should be stopped when the event_limit reaches 0. It was implemented in the ARCH-specific code. However, since the commit 9734e25fbf5a ("perf: Fix the throttle logic for a group"), all the ARCH-specific code has been moved to the generic code. The code to handle the event_limit was lost. Add the event->pmu->stop(event, 0); back. Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group") Closes: https://lore.kernel.org/lkml/aICYAqM5EQUlTqtX@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com/ Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Link: https://lkml.kernel.org/r/20250811182644.1305952-1-kan.liang@linux.intel.com
2025-09-03 | sched/fair: Get rid of throttled_lb_pair() | Aaron Lu | 1 | -31/+4
Now that throttled tasks are dequeued and can not stay on rq's cfs_tasks list, there is no need to take special care of these throttled tasks anymore in load balance. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-6-ziqianlu@bytedance.com
2025-09-03 | sched/fair: Task based throttle time accounting | Aaron Lu | 2 | -25/+32
With the task based throttle model, the previous way of checking cfs_rq's nr_queued to decide whether throttled time should be accounted doesn't work as expected: e.g. when a cfs_rq which has a single task is throttled, that task could later block in kernel mode instead of being dequeued onto the limbo list, and accounting this as throttled time is not accurate. Rework throttle time accounting for a cfs_rq as follows: - start accounting when the first task gets throttled in its hierarchy; - stop accounting on unthrottle. Note that there will be a time gap between when a cfs_rq is throttled and when a task in its hierarchy is actually throttled. This accounting mechanism only starts accounting in the latter case. Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
2025-09-03 | sched/fair: Switch to task based throttle model | Valentin Schneider | 3 | -167/+181
In the current throttle model, when a cfs_rq is throttled, its entity is dequeued from the cpu's rq, making tasks attached to it unable to run, thus achieving the throttle target. This has a drawback though: assume a task is a reader of percpu_rwsem and is waiting. When it gets woken, it can not run till its task group's next period comes, which can be a relatively long time. The waiting writer will have to wait longer due to this, and it also makes further readers build up and eventually trigger a task hang. To improve this situation, change the throttle model to task based, i.e. when a cfs_rq is throttled, record its throttled status but do not remove it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, when they get picked, add a task work to them so that when they return to user space, they can be dequeued there. In this way, throttled tasks will not hold any kernel resources. And on unthrottle, enqueue back those tasks so they can continue to run. A throttled cfs_rq's PELT clock is handled differently now: previously the cfs_rq's PELT clock was stopped once it entered the throttled state, but since tasks (in kernel mode) can now continue to run, change the behaviour to stop the PELT clock when the throttled cfs_rq has no tasks left. Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
2025-09-03 | sched/fair: Implement throttle task work and related helpers | Valentin Schneider | 1 | -0/+65
Implement throttle_cfs_rq_work() task work which gets executed on task's ret2user path where the task is dequeued and marked as throttled. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-3-ziqianlu@bytedance.com
2025-09-03 | sched/fair: Add related data structure for task based throttle | Valentin Schneider | 3 | -0/+19
Add related data structures for this new throttle functionality. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com
2025-09-03 | sched: Move SDTL_INIT() functions out-of-line | Peter Zijlstra | 1 | -0/+45
Since all these functions are address-taken in SDTL_INIT() and called indirectly, it doesn't really make sense for them to be inline. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03 | sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() | Peter Zijlstra | 1 | -18/+10
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during their testing starting from v6.16-rc1. Debug that followed pointed to the tl->mask() for the NODE domain being incorrectly resolved to that of the highest NUMA domain. tl->mask() for NODE is set to the sd_numa_mask() which depends on the global "sched_domains_curr_level" hack. "sched_domains_curr_level" is set to the "tl->numa_level" during tl traversal in build_sched_domains() calling sd_init() but was not reset before topology_span_sane(). Since "tl->numa_level" still reflected the old value from build_sched_domains(), topology_span_sane() for the NODE domain trips when the span of the last NUMA domain overlaps. Instead of replicating the "sched_domains_curr_level" hack, get rid of it entirely and instead, pass the entire "sched_domain_topology_level" object to tl->cpumask() function to prevent such mishap in the future. sd_numa_mask() now directly references "tl->numa_level" instead of relying on the global "sched_domains_curr_level" hack to index into sched_domains_numa_masks[]. The original warning was reproducible on the following NUMA topology reported by Leon: $ sudo numactl -H available: 5 nodes (0-4) node 0 cpus: 0 1 node 0 size: 2927 MB node 0 free: 1603 MB node 1 cpus: 2 3 node 1 size: 3023 MB node 1 free: 3008 MB node 2 cpus: 4 5 node 2 size: 3023 MB node 2 free: 3007 MB node 3 cpus: 6 7 node 3 size: 3023 MB node 3 free: 3002 MB node 4 cpus: 8 9 node 4 size: 3022 MB node 4 free: 2718 MB node distances: node 0 1 2 3 4 0: 10 39 38 37 36 1: 39 10 38 37 36 2: 38 38 10 37 36 3: 37 37 37 10 36 4: 36 36 36 36 10 The above topology can be mimicked using the following QEMU cmd that was used to reproduce the warning and test the fix: sudo qemu-system-x86_64 -enable-kvm -cpu host \ -m 20G -smp cpus=10,sockets=10 -machine q35 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -object memory-backend-ram,size=4G,id=m2 \ -object memory-backend-ram,size=4G,id=m3 \ -object memory-backend-ram,size=4G,id=m4 \ -numa node,cpus=0-1,memdev=m0,nodeid=0 \ -numa node,cpus=2-3,memdev=m1,nodeid=1 \ -numa node,cpus=4-5,memdev=m2,nodeid=2 \ -numa node,cpus=6-7,memdev=m3,nodeid=3 \ -numa node,cpus=8-9,memdev=m4,nodeid=4 \ -numa dist,src=0,dst=1,val=39 \ -numa dist,src=0,dst=2,val=38 \ -numa dist,src=0,dst=3,val=37 \ -numa dist,src=0,dst=4,val=36 \ -numa dist,src=1,dst=0,val=39 \ -numa dist,src=1,dst=2,val=38 \ -numa dist,src=1,dst=3,val=37 \ -numa dist,src=1,dst=4,val=36 \ -numa dist,src=2,dst=0,val=38 \ -numa dist,src=2,dst=1,val=38 \ -numa dist,src=2,dst=3,val=37 \ -numa dist,src=2,dst=4,val=36 \ -numa dist,src=3,dst=0,val=37 \ -numa dist,src=3,dst=1,val=37 \ -numa dist,src=3,dst=2,val=37 \ -numa dist,src=3,dst=4,val=36 \ -numa dist,src=4,dst=0,val=36 \ -numa dist,src=4,dst=1,val=36 \ -numa dist,src=4,dst=2,val=36 \ -numa dist,src=4,dst=3,val=36 \ ... 
[ prateek: Moved common functions to include/linux/sched/topology.h, reuse the common bits for s390 and ppc, commit message ] Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1] Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Valentin Schneider <vschneid@redhat.com> # x86 Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-09-03 | sched/deadline: Fix race in push_dl_task() | Harshit Agarwal | 1 | -24/+49
When a CPU chooses to call push_dl_task and picks a task to push to another CPU's runqueue, it will call the find_lock_later_rq method, which takes a double lock on both CPUs' runqueues. If one of the locks isn't readily available, it may lead to dropping the current runqueue lock and reacquiring both locks at once. During this window it is possible that the task has already migrated and is running on some other CPU. These cases are already handled. However, if the task has migrated, has already been executed, and another CPU is now trying to wake it up (ttwu) such that it is queued again on the runqueue (on_rq is 1), and the task was last run by the same CPU, then the current checks will pass even though the task was migrated out and is no longer in the pushable tasks list. Please go through the original rt change for more details on the issue. To fix this, after the lock is obtained inside find_lock_later_rq, ensure that the task is still at the head of the pushable tasks list. Also remove some checks that are no longer needed with the addition of this new check. However, the new check of the pushable tasks list only applies when find_lock_later_rq is called by push_dl_task. For the other caller, i.e. dl_task_offline_migration, the existing checks are used. Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
2025-09-02 | tracing: Fix tracing_marker may trigger page fault during preempt_disable | Luo Gengkun | 1 | -2/+2
Both tracing_mark_write and tracing_mark_raw_write call __copy_from_user_inatomic during preempt_disable. But in some cases, __copy_from_user_inatomic may trigger a page fault, and will subtly call schedule(). And if the task is migrated to another cpu, the following warning will be triggered: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) An example can illustrate this issue:

 process flow                              CPU
 ---------------------------------------------------------------------
 tracing_mark_raw_write():                 cpu:0
   ...
   ring_buffer_lock_reserve():             cpu:0
     ...
     cpu = raw_smp_processor_id()          cpu:0
     cpu_buffer = buffer->buffers[cpu]     cpu:0
     ...
   ...
   __copy_from_user_inatomic():            cpu:0
     ...
     # page fault
     do_mem_abort():                       cpu:0
       ...
       # Call schedule
       schedule()                          cpu:0
       ...
   # the task is scheduled to cpu1
   __buffer_unlock_commit():               cpu:1
     ...
     ring_buffer_unlock_commit():          cpu:1
       ...
       cpu = raw_smp_processor_id()        cpu:1
       cpu_buffer = buffer->buffers[cpu]   cpu:1

As shown above, the process acquires the cpu id twice and the return values are not the same. To fix this problem, use copy_from_user_nofault instead of __copy_from_user_inatomic, as the former performs 'access_ok' before copying. Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com Fixes: 656c7f0d2d2b ("tracing: Replace kmap with copy_from_user() in trace_marker writing") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
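The substance of the change, sketched (the buffer and size names are illustrative, not the actual trace_marker code):

	/* Old: may fault and end up in schedule() even though preemption is disabled. */
	len = __copy_from_user_inatomic(buf, ubuf, cnt);

	/* New: checks access_ok() and copies with page faults disabled,
	 * so it returns an error instead of sleeping. */
	if (copy_from_user_nofault(buf, ubuf, cnt))
		return -EFAULT;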
2025-09-02 | trace: Remove redundant __GFP_NOWARN | Qianfeng Rong | 1 | -1/+1
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250805023630.335719-1-rongqianfeng@vivo.com Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
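In practice the cleanup is a one-token change per call site, for example (the allocation shown is illustrative):

	-	ptr = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
	+	ptr = kmalloc(size, GFP_NOWAIT);	/* __GFP_NOWARN is already implied */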
2025-09-02 | bpf: Replace kvfree with kfree for kzalloc memory | Feng Yang | 1 | -2/+2
These pointers are allocated by kzalloc. Therefore, replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is the remaining unmodified part from [1]. Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1] Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
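The pattern being cleaned up, roughly (the object name is illustrative):

	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
	if (!obj)
		return -ENOMEM;
	...
	/* kzalloc() memory is never vmalloc-backed, so plain kfree() suffices
	 * and skips kvfree()'s is_vmalloc_addr() check. */
	kfree(obj);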
2025-09-02 | pidns: move is-ancestor logic to helper | Aleksa Sarai | 1 | -8/+14
This check will be needed in later patches, and there's no point open-coding it each time. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>