path: root/kernel
Age | Commit message | Author | Files | Lines
2024-10-16 | bpf: Add bpf_task_from_vpid() kfunc | Juntong Deng | 1 | -0/+20
bpf_task_from_pid() that currently exists looks up the struct task_struct corresponding to the pid in the root pid namespace (init_pid_ns). This patch adds bpf_task_from_vpid() which looks up the struct task_struct corresponding to vpid in the pid namespace of the current process. This is useful for getting information about other processes in the same pid namespace. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB5848E50DA58F79CDE65433C399442@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
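A minimal sketch of what such a kfunc could look like, assuming the existing find_task_by_pid_ns() and task_active_pid_ns() helpers and the bpf_task_acquire() kfunc (illustrative only, not the exact patch):

    __bpf_kfunc struct task_struct *bpf_task_from_vpid(s32 vpid)
    {
            struct task_struct *p;

            rcu_read_lock();
            /* look up in the current task's pid namespace, not init_pid_ns */
            p = find_task_by_pid_ns(vpid, task_active_pid_ns(current));
            if (p)
                    p = bpf_task_acquire(p);
            rcu_read_unlock();

            return p;
    }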
2024-10-16 | mm/bpf: Add bpf_get_kmem_cache() kfunc | Namhyung Kim | 2 | -0/+6
bpf_get_kmem_cache() gets slab cache information from a virtual address, like virt_to_cache(). If the address is a pointer to a slab object, it returns a valid kmem_cache pointer; otherwise NULL is returned. It doesn't grab a reference count of the kmem_cache, so the caller is responsible for managing the access. The returned pointer is marked as PTR_UNTRUSTED. The intended use case for now is to symbolize locks in slab objects from the lock contention tracepoints. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241010232505.1339892-3-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
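A hedged sketch of how such a lookup could be structured, assuming the mm-internal virt_to_slab() helper; the real kfunc may differ in naming and checks:

    __bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
    {
            struct slab *slab;

            if (!virt_addr_valid((void *)(long)addr))
                    return NULL;

            /* no reference is taken; the caller only gets an untrusted pointer */
            slab = virt_to_slab((void *)(long)addr);
            return slab ? slab->slab_cache : NULL;
    }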
2024-10-16 | printk: Improve memory usage logging during boot | Isaac J. Manjarres | 1 | -4/+22
When the initial printk ring buffer size is updated, setup_log_buf() allocates a new ring buffer, as well as a set of meta-data structures for the new ring buffer. The function also emits the new size of the ring buffer, but not the size of the meta-data structures. This makes it difficult to assess how changing the log buffer size impacts memory usage during boot. For instance, increasing the ring buffer size from 512 KB to 1 MB through the command line yields an increase of 2304 KB in reserved memory at boot, while the only obvious change is the 512 KB difference in the ring buffer sizes:

  log_buf_len=512K:
    printk: log_buf_len: 524288 bytes
    Memory: ... (... 733252K reserved ...)

  log_buf_len=1M:
    printk: log_buf_len: 1048576 bytes
    Memory: ... (... 735556K reserved ...)

This is because of how the size of the meta-data structures scales with the size of the ring buffer. Even when there aren't changes to the printk ring buffer size (i.e. the initial size == 1 << CONFIG_LOG_BUF_SHIFT), it is impossible to tell how much memory is consumed by the printk ring buffer during boot. Therefore, unconditionally log the sizes of the printk ring buffer and its meta-data structures, so that it's easier to understand how changing the log buffer size (either through the command line or by changing CONFIG_LOG_BUF_SHIFT) affects boot time memory usage. With the new logs, it is much easier to see exactly why the memory increased by 2304 KB:

  log_buf_len=512K:
    printk: log buffer data + meta data: 524288 + 1835008 = 2359296 bytes
    Memory: ... (... 733252K reserved ...)

  log_buf_len=1M:
    printk: log buffer data + meta data: 1048576 + 3670016 = 4718592 bytes
    Memory: ... (... 735556K reserved ...)

Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240930184826.3595221-1-isaacmanjarres@google.com [pmladek@suse.com: Updated the examples in the commit message, simplified comment for default buffer.] Signed-off-by: Petr Mladek <pmladek@suse.com>
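The combined message could be emitted along these lines (illustrative sketch; meta_data_size stands in for however the patch sums up the descriptor and info ring sizes, and is not a real variable name from the patch):

    /* illustrative: report the data ring and its meta-data rings together */
    pr_info("log buffer data + meta data: %u + %u = %u bytes\n",
            log_buf_len, meta_data_size, log_buf_len + meta_data_size);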
2024-10-15 | Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext | Linus Torvalds | 1 | -85/+101
Pull sched_ext fixes from Tejun Heo:
 - More issues reported in the enable/disable paths on large machines with many tasks due to scx_tasks_lock being held too long. Break up the task iterations
 - Remove ops.select_cpu() dependency in bypass mode so that a misbehaving implementation can't live-lock the machine by pushing all tasks to few CPUs in bypass mode
 - Other misc fixes
* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove unnecessary cpu_relax()
  sched_ext: Don't hold scx_tasks_lock for too long
  sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
  sched_ext: bypass mode shouldn't depend on ops.select_cpu()
  sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
  sched_ext: Start schedulers with consistent p->scx.slice values
  Revert "sched_ext: Use shorter slice while bypassing"
  sched_ext: use correct function name in pick_task_scx() warning message
  selftests: sched_ext: Add sched_ext as proper selftest target
2024-10-16 | timers: Add a warning to usleep_range_state() for wrong order of arguments | Anna-Maria Behnsen | 1 | -0/+3
The checkpatch script has a warning that triggers when the min and max arguments of usleep_range_state() are in reverse order. This check only covers callsites which use constants. Add this check into the code as a WARN_ON_ONCE() to also cover callsites not using constants, and fix the mis-usage by resetting the delta to 0. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-9-dc8b907cb62f@linutronix.de
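A sketch of where such a check could sit in usleep_range_state(), assuming the current hrtimer-range based implementation (illustrative, not the exact hunk):

    void __sched usleep_range_state(unsigned long min, unsigned long max,
                                    unsigned int state)
    {
            ktime_t exp = ktime_add_us(ktime_get(), min);
            u64 delta = (u64)(max - min) * NSEC_PER_USEC;

            /* reversed arguments: warn once and fall back to a zero-length range */
            if (WARN_ON_ONCE(max < min))
                    delta = 0;

            for (;;) {
                    __set_current_state(state);
                    /* do not return before the requested sleep time has elapsed */
                    if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
                            break;
            }
    }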
2024-10-16 | timers: Update function descriptions of sleep/delay related functions | Anna-Maria Behnsen | 1 | -6/+47
A lot of commonly used functions for inserting a sleep or delay lack a proper function description. Add function descriptions to all of them to have important information in a central place close to the code. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-5-dc8b907cb62f@linutronix.de
2024-10-16 | timers: Update schedule_[hr]timeout*() related function descriptions | Anna-Maria Behnsen | 1 | -25/+41
schedule_timeout*() functions do not have proper kernel-doc formatted function descriptions. schedule_hrtimeout() and schedule_hrtimeout_range() have an almost identical description. Add the missing function descriptions. Remove the copied function description and add a pointer to the existing description instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-3-dc8b907cb62f@linutronix.de
2024-10-16 | timers: Move *sleep*() and timeout functions into a separate file | Anna-Maria Behnsen | 4 | -313/+318
All schedule_timeout() and *sleep*() related functions are interfaces on top of timer list timers and hrtimers to add a sleep to the code. As they are built on top of the timer list timers and hrtimers, the [hr]timer interfaces are already used except when queuing the timer in schedule_timeout(). But there exists the appropriate interface add_timer() which does the same job with an extra check for an already pending timer. Split all those functions as they are into a separate file and use add_timer() instead of __mod_timer() in schedule_timeout(). While at it fix minor formatting issues and a multi line printk function call in schedule_timeout(). Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-2-dc8b907cb62f@linutronix.de
2024-10-16 | time: Remove '%' from numeric constant in kernel-doc comment | Wang Jinchao | 1 | -8/+8
Change %0 to 0 in kernel-doc comments. %0 is not valid. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241009022135.92400-2-wangjinchao@xfusion.com
2024-10-15 | Merge tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | 2 | -21/+29
Pull ring-buffer fixes from Steven Rostedt:
 - Fix ref counter of buffers assigned at boot up

   A tracing instance can be created from the kernel command line. If it maps to memory, it is considered permanent and should not be deleted, or bad things can happen. If it is not mapped to memory, then the user is fine to delete it via rmdir from the instances directory. But the ref counting assumed that 0 meant the instance was free to remove and that greater than zero meant it was not, which was not how the counts were actually set. When an instance is created, it should have a reference count of 1, and if it should not be removed, the count must be greater than 1. The boot up code set normal instances with a ref count of 0, which could get removed if something accessed them and then released them. And memory mapped instances had a ref count of 1, which meant they could be deleted, and bad things would happen. Keep normal instances' ref count at 1, and set memory mapped instances' ref count to 2.

 - Protect sub buffer size (order) updates from other modifications

   When a ring buffer is changing the size of its sub-buffers, no other operations should be performed on the ring buffer. That includes reading it. But the locking only grabbed the buffer->mutex that keeps some operations from touching the ring buffer. It must also hold the cpu_buffer->reader_lock when updates happen, as other paths use that lock to do some operations on the ring buffer.

* tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix reader locking when changing the sub buffer order
  ring-buffer: Fix refcount setting of boot mapped buffers
2024-10-15 | bpf: Fix truncation bug in coerce_reg_to_size_sx() | Dimitar Kanaliev | 1 | -4/+4
coerce_reg_to_size_sx() updates the register state after a sign-extension operation. However, there's a bug in the assignment order of the unsigned min/max values, leading to incorrect truncation:

  0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar()
  1: (57) r0 &= 1 ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
  2: (07) r0 += 254 ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
  3: (bf) r0 = (s8)r0 ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))

In the current implementation, the unsigned 32-bit min/max values (u32_min_value and u32_max_value) are assigned directly from the 64-bit signed min/max values (s64_min and s64_max):

  reg->umin_value = reg->u32_min_value = s64_min;
  reg->umax_value = reg->u32_max_value = s64_max;

Due to the chained assignment, this is equivalent to:

  reg->u32_min_value = s64_min;  // Unintended truncation
  reg->umin_value = reg->u32_min_value;
  reg->u32_max_value = s64_max;  // Unintended truncation
  reg->umax_value = reg->u32_max_value;

Fixes: 1f9a1ea821ff ("bpf: Support new sign-extension load insns") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reported-by: Zac Ecob <zacecob@protonmail.com> Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
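A hedged sketch of the corrected ordering (not necessarily the exact diff): write the 64-bit bounds from the sign-extended values first, then derive the 32-bit bounds, so the 32-bit truncation can no longer clobber the 64-bit min/max:

    reg->umin_value = s64_min;
    reg->umax_value = s64_max;
    reg->u32_min_value = (u32)s64_min;
    reg->u32_max_value = (u32)s64_max;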
2024-10-15 | vdso: Remove timekeeper argument of __arch_update_vsyscall() | Thomas Weißschuh | 1 | -1/+1
No implementation of this hook uses the passed in timekeeper anymore. This avoids including a non-VDSO header while building the VDSO, which can lead to compilation errors. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/all/20241010-vdso-generic-arch_update_vsyscall-v1-1-7fe5a3ea4382@linutronix.de
2024-10-15 | ring-buffer: Fix reader locking when changing the sub buffer order | Petr Pavlu | 1 | -18/+26
The function ring_buffer_subbuf_order_set() updates each ring_buffer_per_cpu and installs new sub buffers that match the requested page order. This operation may be invoked concurrently with readers that rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or the ring_buffer_per_cpu.pages and reader_page pointers. However, no exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying the mentioned data while a reader also operates on them can then result in incorrect memory access and various crashes. Fix the problem by taking the reader_lock when updating a specific ring_buffer_per_cpu in ring_buffer_subbuf_order_set(). Link: https://lore.kernel.org/linux-trace-kernel/20240715145141.5528-1-petr.pavlu@suse.com/ Link: https://lore.kernel.org/linux-trace-kernel/20241010195849.2f77cc3f@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20241011112850.17212b25@gandalf.local.home/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241015112440.26987-1-petr.pavlu@suse.com Fixes: 8e7b58c27b3c ("ring-buffer: Just update the subbuffers when changing their allocation order") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-14 | bpf: Add kmem_cache iterator | Namhyung Kim | 2 | -0/+176
The new "kmem_cache" iterator will traverse the list of slab caches and call attached BPF programs for each entry. It should check the argument (ctx.s) if it's NULL before using it. Now the iteration grabs the slab_mutex only if it traverse the list and releases the mutex when it runs the BPF program. The kmem_cache entry is protected by a refcount during the execution. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab Link: https://lore.kernel.org/r/20241010232505.1339892-2-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-14 | posix-clock: Fix missing timespec64 check in pc_clock_settime() | Jinjie Ruan | 1 | -0/+3
As Andrew pointed out, it makes sense for the PTP core to check the timespec64 struct's tv_sec and tv_nsec range before calling ptp->info->settime64(). As the clock_settime() man page says, if tp.tv_sec is negative or tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL; this also applies to dynamic clocks, which handle the PTP clock, and the condition is consistent with timespec64_valid(). As Thomas suggested, timespec64_valid() only checks that the timespec is valid, but does not ensure that the time is in a valid range, so check it ahead of time using timespec64_valid_strict() in pc_clock_settime() and return -EINVAL if it is not valid. There are some drivers that use tp->tv_sec and tp->tv_nsec directly to write registers without validity checks and assume that the higher layer has checked them, which is dangerous and will benefit from this, such as hclge_ptp_settime(), igb_ptp_settime_i210() and _rcar_gen4_ptp_settime(); some drivers can now remove their own checks. Cc: stable@vger.kernel.org Fixes: 0606f422b453 ("posix clocks: Introduce dynamic clocks") Acked-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
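The added check can be pictured as a guard at the top of pc_clock_settime(), before the clock descriptor is looked up and the driver's clock_settime op is reached (illustrative placement, abbreviated body):

    static int pc_clock_settime(clockid_t id, const struct timespec64 *ts)
    {
            struct posix_clock_desc cd;
            int err;

            /* reject negative tv_sec or out-of-range tv_nsec up front */
            if (!timespec64_valid_strict(ts))
                    return -EINVAL;

            err = get_clock_desc(id, &cd);
            if (err)
                    return err;

            /* ... unchanged: permission check and cd.clk->ops.clock_settime() ... */
            put_clock_desc(&cd);
            return err;
    }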
2024-10-14 | cgroup: Fix potential overflow issue when checking max_depth | Xiu Jianfeng | 1 | -2/+2
cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However, because cgroup->max_depth is of int type and has the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and the level will overflow after it reaches INT_MAX. Fix it by starting the level from 0 and using '>=' instead. It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a hierarchy of depth INT_MAX, but it should still be avoided logically. Fixes: 1a926e0bbab8 ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
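A hedged sketch of the fixed walk (structured after cgroup_check_hierarchy_limits(); the real function's control flow may differ): count from 0 and compare with '>=' so the counter can never be pushed past INT_MAX:

    static bool cgroup_check_hierarchy_limits(struct cgroup *parent)
    {
            struct cgroup *cgroup;
            int level = 0;          /* previously started at 1 */

            lockdep_assert_held(&cgroup_mutex);

            for (cgroup = parent; cgroup; cgroup = cgroup_parent(cgroup)) {
                    if (cgroup->nr_descendants >= cgroup->max_descendants)
                            return false;

                    /* previously 'level > cgroup->max_depth' */
                    if (level >= cgroup->max_depth)
                            return false;

                    level++;
            }

            return true;
    }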
2024-10-14 | sched_ext: Remove unnecessary cpu_relax() | David Vernet | 1 | -1/+0
As described in commit b07996c7abac ("sched_ext: Don't hold scx_tasks_lock for too long"), we're doing a cond_resched() every 32 calls to scx_task_iter_next() to avoid RCU and other stalls. That commit also added a cpu_relax() to the codepath where we drop and reacquire the lock, but as Waiman described in [0], cpu_relax() should only be necessary in busy loops to avoid pounding on a cacheline (or to allow a hypertwin to more fully utilize a core). Let's remove the unnecessary cpu_relax(). [0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/ Cc: Waiman Long <llong@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-14 | ring-buffer: Fix refcount setting of boot mapped buffers | Steven Rostedt | 1 | -3/+3
A ring buffer which has its buffer mapped at boot up to fixed memory should not be freed. Other buffers can be. The ref counting setup was wrong for both. It gave the non-mapped buffers a ref count of zero, and the boot mapped buffer a ref count of 1. But a normally allocated buffer should be 1, where it can be removed. Keep the ref count of a normal boot buffer at its setup ref count (do not decrement it), and increment the fixed memory boot mapped buffer's ref count. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-14 | kcsan: Remove redundant call of kallsyms_lookup_name() | Ran Xiaokai | 1 | -2/+1
There is no need to repeatedly call kallsyms_lookup_name(); we can reuse the return value of this function. Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Marco Elver <elver@google.com>
2024-10-14 | kcsan: Turn report_filterlist_lock into a raw_spinlock | Marco Elver | 1 | -38/+36
Ran Xiaokai reports that with a KCSAN-enabled PREEMPT_RT kernel, we can see splats like:

| BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
| in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
| preempt_count: 10002, expected: 0
| RCU nest depth: 0, expected: 0
| no locks held by swapper/1/0.
| irq event stamp: 156674
| hardirqs last enabled at (156673): [<ffffffff81130bd9>] do_idle+0x1f9/0x240
| hardirqs last disabled at (156674): [<ffffffff82254f84>] sysvec_apic_timer_interrupt+0x14/0xc0
| softirqs last enabled at (0): [<ffffffff81099f47>] copy_process+0xfc7/0x4b60
| softirqs last disabled at (0): [<0000000000000000>] 0x0
| Preemption disabled at:
| [<ffffffff814a3e2a>] paint_ptr+0x2a/0x90
| CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.11.0+ #3
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
| Call Trace:
|  <IRQ>
|  dump_stack_lvl+0x7e/0xc0
|  dump_stack+0x1d/0x30
|  __might_resched+0x1a2/0x270
|  rt_spin_lock+0x68/0x170
|  kcsan_skip_report_debugfs+0x43/0xe0
|  print_report+0xb5/0x590
|  kcsan_report_known_origin+0x1b1/0x1d0
|  kcsan_setup_watchpoint+0x348/0x650
|  __tsan_unaligned_write1+0x16d/0x1d0
|  hrtimer_interrupt+0x3d6/0x430
|  __sysvec_apic_timer_interrupt+0xe8/0x3a0
|  sysvec_apic_timer_interrupt+0x97/0xc0
|  </IRQ>

On a detected data race, KCSAN's reporting logic checks if it should filter the report. That list is protected by the report_filterlist_lock *non-raw* spinlock which may sleep on RT kernels. Since KCSAN may report data races in any context, convert it to a raw_spinlock. This requires being careful about when to allocate memory for the filter list itself which can be done via KCSAN's debugfs interface. Concurrent modification of the filter list via debugfs should be rare: the chosen strategy is to optimistically pre-allocate memory before the critical section and discard if unused. Link: https://lore.kernel.org/all/20240925143154.2322926-1-ranxiaokai627@163.com/ Reported-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Tested-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Marco Elver <elver@google.com>
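The shape of the conversion, as a sketch (filterlist_contains() is a made-up stand-in for the existing lookup logic, not a function from the patch):

    static DEFINE_RAW_SPINLOCK(report_filterlist_lock);

    bool kcsan_skip_report_debugfs(unsigned long func_addr)
    {
            unsigned long flags;
            bool skip;

            /* raw_spinlock: safe to take from the report path in any context */
            raw_spin_lock_irqsave(&report_filterlist_lock, flags);
            skip = filterlist_contains(func_addr);
            raw_spin_unlock_irqrestore(&report_filterlist_lock, flags);

            return skip;
    }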
2024-10-14 | sched: Split scheduler and execution contexts | Peter Zijlstra | 7 | -80/+114
Let's define the "scheduling context" as all the scheduler state in task_struct for the task chosen to run, which we'll call the donor task, and the "execution context" as all state required to actually run the task. Currently both are intertwined in task_struct. We want to logically split these such that we can use the scheduling context of the donor task selected to be scheduled, but use the execution context of a different task to actually be run. To this purpose, introduce rq->donor field to point to the task_struct chosen from the runqueue by the scheduler, and will be used for scheduler state, and preserve rq->curr to indicate the execution context of the task that will actually be run. This patch introduces the donor field as a union with curr, so it doesn't cause the contexts to be split yet, but adds the logic to handle everything separately. [add additional comments and update more sched_class code to use rq::proxy] [jstultz: Rebased and resolved minor collisions, reworked to use accessors, tweaked update_curr_common to use rq_proxy fixing rt scheduling issues] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com
2024-10-14 | sched: Split out __schedule() deactivate task logic into a helper | John Stultz | 1 | -27/+40
As we're going to re-use the deactivation logic, split it into a helper. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-7-jstultz@google.com
2024-10-14 | sched: Consolidate pick_*_task to task_is_pushable helper | Connor O'Brien | 3 | -19/+12
This patch consolidates the rt and deadline pick_*_task functions into a task_is_pushable() helper. This patch was broken out from a larger chain migration patch originally by Connor O'Brien. [jstultz: split out from larger chain migration patch, renamed helper function] Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com
2024-10-14 | sched: Add move_queued_task_locked helper | Connor O'Brien | 4 | -22/+19
Switch logic that deactivates, sets the task cpu, and reactivates a task on a different rq to use a helper that will be later extended to push entire blocked task chains. This patch was broken out from a larger chain migration patch originally by Connor O'Brien. [jstultz: split out from larger chain migration patch] Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com
2024-10-14 | locking/mutex: Expose __mutex_owner() | Juri Lelli | 2 | -25/+27
Implementing proxy execution requires that scheduler code be able to identify the current owner of a mutex. Expose __mutex_owner() for this purpose (alone!). Includes a null mutex check, so that users of the function can be simplified. [Removed the EXPORT_SYMBOL] [jstultz: Reworked per Peter's suggestions] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-4-jstultz@google.com
2024-10-14 | locking/mutex: Make mutex::wait_lock irq safe | Juri Lelli | 2 | -18/+21
With the proxy-execution series, we traverse the task->mutex->task blocked_on/owner chain in the scheduler core. We do this while holding the rq::lock to keep the structures in place while taking and releasing the alternating lock types. Since the mutex::wait_lock is one of the locks we will take in this way under the rq::lock in the scheduler core, we need to make sure that its usage elsewhere is irq safe. [rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-3-jstultz@google.com
2024-10-14 | locking/mutex: Remove wakeups from under mutex::wait_lock | Peter Zijlstra | 9 | -39/+96
In preparation to nest mutex::wait_lock under rq::lock we need to remove wakeups from under it. Do this by utilizing wake_qs to defer the wakeup until after the lock is dropped. [Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and 08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait mutexes")] [jstultz: rebased to mainline, added extra wake_up_q & init to avoid hangs, similar to Connor's rework of this patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-2-jstultz@google.com
2024-10-14 | sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads | Mathieu Desnoyers | 3 | -24/+48
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay. These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive. However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data. Introduce the following changes to improve per-mm-cid cache locality: - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID, - Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus _not_ a viable approach for this), - Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access), - Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. This is used for expanding the concurrency ID allocation within the upper bound defined by: min(mm->nr_cpus_allowed, mm->mm_users) When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead. Spreading allocation to use all the cid values within the range [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ] improves cache locality while preserving mm_cid compactness within the expected user limits, - In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask, - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask. * Benchmarks Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. Each thread run one after the other (no threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core. 
Testing configurations:

  8-core/1-L3:       Use 8 cores within a single L3
  24-core/24-L3:     Use 24 cores, 1 core per L3
  192-core/24-L3:    Use 192 cores (all cores in the system)
  384-thread/24-L3:  Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:
  CPU(s):                 384
  On-line CPU(s) list:    0-383
  Vendor ID:              AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
  Thread(s) per core:     2
  Core(s) per socket:     96
  Socket(s):              2
  Caches (sum of all):
    L1d:                  6 MiB (192 instances)
    L1i:                  6 MiB (192 instances)
    L2:                   192 MiB (192 instances)
    L3:                   768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is calculated as: (cache-local mm_cid) / (mm_cid).

Intermittent workload delay: 200ms

                     per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                       (ns)     (ns)          (ns)
  8-core/1-L3          1374    19289          1336                 14.4x
  24-core/24-L3        2423    26721          1594                 16.7x
  192-core/24-L3       2291    15826          2153                  7.3x
  384-thread/24-L3     1874    13234          1907                  6.9x

Intermittent workload delay: 10ms

                     per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                       (ns)     (ns)          (ns)
  8-core/1-L3           662      756           686                  1.1x
  24-core/24-L3        1378     3648          1035                  3.5x
  192-core/24-L3       1439    10833          1482                  7.3x
  384-thread/24-L3     1503    10570          1556                  6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs" patch series with a simpler and more general approach. ] [ This patch applies on top of v6.12-rc1. ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
2024-10-14 | sched: idle: Optimize the generic idle loop by removing needless memory barrier | Zhongqiu Han | 1 | -1/+0
The memory barrier rmb() in the generic idle loop do_idle() function is not needed: it doesn't order any load instructions, so remove it, as a needless rmb() can cause a performance impact. The rmb() was introduced by the tglx/history.git commit f2f1b44c75c4 ("[PATCH] Remove RCU abuse in cpu_idle()") to order the loads between cpu_idle_map and pm_idle. It pairs with wmb() in function cpu_idle_wait(). And then with the removal of cpu_idle_state in function cpu_idle() and wmb() in function cpu_idle_wait() in commit 783e391b7b5b ("x86: Simplify cpu_idle_wait"), rmb() no longer has a reason to exist. After that, commit d16699123434 ("idle: Implement generic idle function") implemented a generic idle function cpu_idle_loop() which resembles the functionality found in arch/, and it retained the rmb() in the generic idle loop in file kernel/cpu/idle.c. At last, commit cf37b6b48428 ("sched/idle: Move cpu/idle.c to sched/idle.c") moved cpu/idle.c to sched/idle.c, and commit c1de45ca831a ("sched/idle: Add support for tasks that inject idle") renamed function cpu_idle_loop() to do_idle(). History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241009093745.9504-1-quic_zhonhan@quicinc.com
2024-10-14 | Merge branch 'tip/sched/urgent' | Peter Zijlstra | 13 | -70/+128
Sync with sched/urgent to avoid conflicts. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-10-14 | sched/fair: Fix external p->on_rq users | Peter Zijlstra | 6 | -6/+32
Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement delayed dequeue") KVM's preemption notifiers have started mis-classifying preemption vs blocking. Notably p->on_rq is no longer sufficient to determine if a task is runnable or blocked -- the aforementioned commit introduces tasks that remain on the runqueue even though they will not run again, and should be considered blocked for many cases. Add the task_is_runnable() helper to classify things and audit all external users of the p->on_rq state. Also add a few comments. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
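The helper can be pictured roughly as follows (sketch, not necessarily the exact definition added by the patch):

    static inline bool task_is_runnable(struct task_struct *p)
    {
            /* delayed-dequeue tasks stay on the runqueue but are effectively blocked */
            return p->on_rq && !p->se.sched_delayed;
    }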
2024-10-14 | sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug | Johannes Weiner | 2 | -21/+39
Since sched_delayed tasks remain queued even after blocking, the load balancer can migrate them between runqueues while PSI considers them to be asleep. As a result, it misreads the migration requeue followed by a wakeup as a double queue: psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4 First, call psi_enqueue() after p->sched_class->enqueue_task(). A wakeup will clear p->se.sched_delayed while a migration will not, so psi can use that flag to tell them apart. Then teach psi to migrate any "sleep" state when delayed-dequeue tasks are being migrated. Delayed-dequeue tasks can be revived by ttwu_runnable(), which will call down with a new ENQUEUE_DELAYED. Instead of further complicating the wakeup conditional in enqueue_task(), identify migration contexts instead and default to wakeup handling for all other cases. It's not just the warning in dmesg, the task state corruption causes a permanent CPU pressure indication, which messes with workload/machine health monitoring. Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/ Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20241010193712.GC181795@cmpxchg.org
2024-10-11 | Merge tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux | Linus Torvalds | 1 | -1/+7
Pull RCU fix from Neeraj Upadhyay:
 "Fix rcuog kthread wakeup invocation from softirq context on a CPU which has been marked offline. This can happen when new callbacks are enqueued from a softirq on an offline CPU before it calls rcutree_report_cpu_dead(). When this happens on a NOCB configuration, the rcuog wake-up is deferred through an IPI to an online CPU. This is done to avoid a call into the scheduler which can risk arming the RT-bandwidth after hrtimers have been migrated out and disabled. However, doing an IPI call from softirq is not allowed. Fix this by forcing the deferred rcuog wakeup through the NOCB timer when the CPU is offline"
* tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
  rcu/nocb: Fix rcuog wake-up from offline softirq
2024-10-11 | audit: change context data from secid to lsm_prop | Casey Schaufler | 3 | -21/+13
Change the LSM data stored in the audit transactions from a secid to an LSM prop. This is done in struct audit_context and struct audit_aux_data_pids. Several cases of scaffolding can be removed. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | audit: use an lsm_prop in audit_names | Casey Schaufler | 2 | -16/+6
Replace the osid field in the audit_names structure with a lsm_prop structure. This accommodates the use of an lsm_prop in security_audit_rule_match() and security_inode_getsecid(). Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | lsm: use lsm_prop in security_inode_getsecid | Casey Schaufler | 1 | -1/+5
Change the security_inode_getsecid() interface to fill in a lsm_prop structure instead of a u32 secid. This allows for its callers to gather data from all registered LSMs. Data is provided for IMA and audit. Change the name to security_inode_getlsmprop(). Cc: linux-integrity@vger.kernel.org Cc: selinux@vger.kernel.org Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | lsm: use lsm_prop in security_current_getsecid | Casey Schaufler | 3 | -16/+20
Change the security_current_getsecid_subj() and security_task_getsecid_obj() interfaces to fill in a lsm_prop structure instead of a u32 secid. Audit interfaces will need to collect all possible security data for possible reporting. Cc: linux-integrity@vger.kernel.org Cc: audit@vger.kernel.org Cc: selinux@vger.kernel.org Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | audit: update shutdown LSM data | Casey Schaufler | 1 | -6/+8
The audit process LSM information is changed from a secid audit_sig_sid to an lsm_prop in audit_sig_lsm. Update the users of this data appropriately. Calls to security_secid_to_secctx() are changed to use security_lsmprop_to_secctx() instead. security_current_getsecid_subj() is scaffolded. It will be updated in a subsequent patch. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | lsm: use lsm_prop in security_ipc_getsecid | Casey Schaufler | 1 | -2/+1
There may be more than one LSM that provides IPC data for auditing. Change security_ipc_getsecid() to fill in a lsm_prop structure instead of the u32 secid. Change the name to security_ipc_getlsmprop() to reflect the change. Cc: audit@vger.kernel.org Cc: linux-security-module@vger.kernel.org Cc: selinux@vger.kernel.org Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | audit: maintain an lsm_prop in audit_context | Casey Schaufler | 2 | -12/+10
Replace the secid value stored in struct audit_context with a struct lsm_prop. Change the code that uses this value to accommodate the change. security_audit_rule_match() expects a lsm_prop, so existing scaffolding can be removed. A call to security_secid_to_secctx() is changed to security_lsmprop_to_secctx(). The call to security_ipc_getsecid() is scaffolded. A new function lsmprop_is_set() is introduced to identify whether an lsm_prop contains a non-zero value. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject line tweak, fix lsmprop_is_set() typo] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | lsm: use lsm_prop in security_audit_rule_match | Casey Schaufler | 2 | -8/+21
Change the secid parameter of security_audit_rule_match to a lsm_prop structure pointer. Pass the entry from the lsm_prop structure for the appropriate slot to the LSM hook. Change the users of security_audit_rule_match to use the lsm_prop instead of a u32. The scaffolding function lsmprop_init() fills the structure with the value of the old secid, ensuring that it is available to the appropriate module hook. The sources of the secid, security_task_getsecid() and security_inode_getsecid(), will be converted to use the lsm_prop structure later in the series. At that point the use of lsmprop_init() is dropped. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 | locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns() | Paul E. McKenney | 1 | -2/+2
Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock() to check for excessive CSD-lock wait times. This works, but does not guarantee monotonic timestamps on x86 due to the sched_clock() function's use of the rdtsc instruction, which does not guarantee ordering. This means that, given successive calls to sched_clock(), the second might return an earlier time than the first, that is, time might seem to go backwards. This can (and does!) result in false-positive CSD-lock wait complaints claiming almost 2^64 nanoseconds of delay. Therefore, switch from sched_clock() to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps via the rdtsc_ordered() function, which as the name implies, does guarantee ordered timestamps, at least in the absence of calls from NMI handlers, which are not involved in this code path. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
2024-10-11 | sched/core: Dequeue PSI signals for blocked tasks that are delayed | Peter Zijlstra | 1 | -1/+3
psi_dequeue() for a blocked task expects psi_sched_switch() to clear the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flags; however, psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect if the task is blocked or still runnable, which is no longer true with DELAY_DEQUEUE since a blocking task can be left queued on the runqueue. This can lead to PSI splats similar to: psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4 when the task is requeued, since the TSK_RUNNING flag was not cleared when the task was blocked. Explicitly communicate that the task was blocked to psi_sched_switch() even if it was delayed and is still on the runqueue. [ prateek: Broke off the relevant part from [1], commit message ] Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/ Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/ Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/ [1]
2024-10-11 | sched: Fix delayed_dequeue vs switched_from_fair() | Peter Zijlstra | 5 | -34/+30
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") and its follow up fixes try to deal with a rather unfortunate situation where is task is enqueued in a new class, even though it shouldn't have been. Mostly because the existing ->switched_to/from() hooks are in the wrong place for this case. This all led to Paul being able to trigger failures at something like once per 10k CPU hours of RCU torture. For now, do the ugly thing and move the code to the right place by ignoring the switch hooks. Note: Clean up the whole sched_class::switch*_{to,from}() thing. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
2024-10-11 | sched/core: Disable page allocation in task_tick_mm_cid() | Waiman Long | 2 | -3/+16
With KASAN and PREEMPT_RT enabled, calling task_work_add() in task_tick_mm_cid() may cause the following splat.

[ 63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe
[ 63.696416] preempt_count: 10001, expected: 0
[ 63.696416] RCU nest depth: 1, expected: 1

This problem is caused by the following call trace.

  sched_tick() [ acquire rq->__lock ]
   -> task_tick_mm_cid()
    -> task_work_add()
     -> __kasan_record_aux_stack()
      -> kasan_save_stack()
       -> stack_depot_save_flags()
        -> alloc_pages_mpol_noprof()
         -> __alloc_pages_noprof()
          -> get_page_from_freelist()
           -> rmqueue()
            -> rmqueue_pcplist()
             -> __rmqueue_pcplist()
              -> rmqueue_bulk()
               -> rt_spin_lock()

The rq lock is a raw_spinlock_t. We can't sleep while holding it. IOW, we can't call alloc_pages() in stack_depot_save_flags(). The task_tick_mm_cid() function with its task_work_add() call was introduced by commit 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") in v6.4 kernel. Fortunately, there is a kasan_record_aux_stack_noalloc() variant that calls stack_depot_save_flags() while not allowing it to allocate new pages. To allow task_tick_mm_cid() to use task_work without page allocation, a new TWAF_NO_ALLOC flag is added to enable calling kasan_record_aux_stack_noalloc() instead of kasan_record_aux_stack() if set. The task_tick_mm_cid() function is modified to add this new flag. The possible downside is the missing stack trace in a KASAN report due to new page allocation required when task_work_add_noallloc() is called which should be rare. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241010014432.194742-1-longman@redhat.com
2024-10-11 | sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl() | Phil Auld | 1 | -1/+1
The deadline server code moved one of the start_hrtick_dl() calls but dropped the dl specific hrtick_enabled check. This causes hrticks to get armed even when sched_feat(HRTICK_DL) is false. Fix it. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20241004123729.460668-1-pauld@redhat.com
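The fix amounts to restoring the dl-specific guard at the moved callsite, roughly as below (illustrative; the argument is whatever the existing callsite already passes to start_hrtick_dl()):

    if (hrtick_enabled_dl(rq))
            start_hrtick_dl(rq, dl_se);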
2024-10-10 | bpf: Fix unpopulated path_size when uprobe_multi fields unset | Tyrone Wu | 1 | -19/+17
Previously when retrieving `bpf_link_info.uprobe_multi` with `path` and `path_size` fields unset, the `path_size` field is not populated (remains 0). This behavior was inconsistent with how other input/output string buffer fields work, as the field should be populated in cases when: - both buffer and length are set (currently works as expected) - both buffer and length are unset (not working as expected) This patch now fills the `path_size` field when `path` and `path_size` are unset. Fixes: e56fdbfb06e2 ("bpf: Add link_info support for uprobe multi link") Signed-off-by: Tyrone Wu <wudevelops@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241011000803.681190-1-wudevelops@gmail.com
2024-10-10 | ftrace: Make ftrace_regs abstract from direct use | Steven Rostedt | 1 | -1/+1
ftrace_regs was created to hold registers that store information to save function parameters, return value and stack. Since it is a subset of pt_regs, it should only be used by its accessor functions. But because pt_regs can easily be taken from ftrace_regs (on most archs), it is tempting to use it directly. But when running on other architectures, it may fail to build or worse, build but crash the kernel! Instead, make struct ftrace_regs an empty structure and have the architectures define __arch_ftrace_regs and all the accessor functions will typecast to it to get to the actual fields. This will help avoid usage of ftrace_regs directly. Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/ Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-10 | fgraph: No need to invoke the function call_filter_check_discard() | Steven Rostedt | 1 | -3/+1
The function call_filter_check_discard() has been removed in the commit 49e4154f4b16 ("tracing: Remove TRACE_EVENT_FL_FILTERED logic"), from another topic branch. But when merged together with commit 21e92806d39c6 ("function_graph: Support recording and printing the function return address") which added another call to call_filter_check_discard(), it caused the build to fail. Since the function call_filter_check_discard() is useless, it can simply be removed regardless of being merged with commit 49e4154f4b16 or not. Link: https://lore.kernel.org/all/20241010134649.43ed357c@canb.auug.org.au/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Donglin Peng <dolinux.peng@gmail.com> Link: https://lore.kernel.org/20241010194020.46192b21@gandalf.local.home Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 21e92806d39c6 ("function_graph: Support recording and printing the function return address") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-10 | sched_ext: Don't hold scx_tasks_lock for too long | Tejun Heo | 1 | -2/+17
While enabling and disabling a BPF scheduler, every task is iterated a couple times by walking scx_tasks. Except for one, all iterations keep holding scx_tasks_lock. On multi-socket systems under heavy rq lock contention and a high number of threads, this can lead to RCU and other stalls. The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs) running `stress-ng --workload 150 --workload-threads 10` with >400k idle threads and RCU stall period reduced to 5s:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
  rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
  rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
  Sending NMI from CPU 80 to CPUs 91:
  NMI backtrace for cpu 91
  CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
  Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
  Sched_ext: simple (disabling+all)
  RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
  Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
  RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
  RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
  RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
  RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
  R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
  R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
  FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
  Call Trace:
   <NMI>
   </NMI>
   <TASK>
   do_raw_spin_lock+0x9c/0xb0
   task_rq_lock+0x50/0x190
   scx_task_iter_next_locked+0x157/0x170
   scx_ops_disable_workfn+0x2c2/0xbf0
   kthread_worker_fn+0x108/0x2a0
   kthread+0xeb/0x110
   ret_from_fork+0x36/0x40
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Sending NMI from CPU 80 to CPUs 186:
  NMI backtrace for cpu 186
  CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471

scx_task_iter can safely drop locks while iterating. Make scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid stalls. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>