summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2025-07-10printk: ringbuffer: Explain why the KUnit test ignores failed writesPetr Mladek-0/+13
The KUnit test ignores prb_reserve() failures on purpose. It tries to push the ringbuffer beyond limits. Note that it is a know problem that writes might fail in this situation. printk() tries to prevent this problem by: + allocating big enough data buffer, see log_buf_add_cpu(). + allocating enough descriptors by using small enough average record, see PRB_AVGBITS. + storing the record with disabled interrupts, see vprintk_store(). Also the amount of printk() messages is always somehow bound in practice. And they are serialized when they are printed from many CPUs on purpose, for example, when printing backtraces. Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20250702095157.110916-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-07-10PM: sleep: add kernel parameter to disable asynchronous suspend/resumeTudor Ambarus-0/+9
On some platforms, device dependencies are not properly represented by device links, which can cause issues when asynchronous power management is enabled. While it is possible to disable this via sysfs, doing so at runtime can race with the first system suspend event. This patch introduces a kernel command-line parameter, "pm_async", which can be set to "off" to globally disable asynchronous suspend and resume operations from early boot. It effectively provides a way to set the initial value of the existing pm_async sysfs knob at boot time. This offers a robust method to fall back to synchronous (sequential) operation, which can stabilize platforms with problematic dependencies and also serve as a useful debugging tool. The default behavior remains unchanged (asynchronous enabled). To disable it, boot the kernel with the "pm_async=off" parameter. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20250709-pm-async-off-v3-1-cb69a6fc8d04@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-09kthread: update comment for __to_kthreadJiazi Li-6/+5
With commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") and commit 753550eb0ce1 ("fork: Explicitly set PF_KTHREAD"), umh task no longer have struct kthread and PF_KTHREAD flag. Update the comment to describe what the current rules are to detect is something is a kthread. Link: https://lkml.kernel.org/r/20250620100801.23185-1-jqqlijiazi@gmail.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com> Suggested-by Eric W . Biederman <ebiederm@xmission.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: clean up ifdef logic around stack allocationPasha Tatashin-11/+11
There is an unneeded OR in the ifdef functions that are used to allocate and free kernel stacks based on direct map or vmap. Adding dynamic stack support would complicate this logic even further. Therefore, clean up by changing the order so OR is no longer needed. Link: https://lkml.kernel.org/r/20250618-fork-fixes-v4-1-2e05a2e1f5fc@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09uprobes: revert ref_ctr_offset in uprobe_unregister error pathJiri Olsa-2/+2
There's error path that could lead to inactive uprobe: 1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1 2) uprobe_unregister fails - int3 stays in place, but ref_ctr is changed to 0 (it's not restored to 1 in the fail path) uprobe is leaked 3) another uprobe_register comes and re-uses the leaked uprobe and succeds - but int3 is already in place, so ref_ctr update is skipped and it stays 0 - uprobe CAN NOT be triggered now 4) uprobe_unregister fails because ref_ctr value is unexpected Fix this by reverting the updated ref_ctr value back to 1 in step 2), which is the case when uprobe_unregister fails (int3 stays in place), but we have already updated refctr. The new scenario will go as follows: 1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1 2) uprobe_unregister fails - int3 stays in place and ref_ctr is reverted to 1.. uprobe is leaked 3) another uprobe_register comes and re-uses the leaked uprobe and succeds - but int3 is already in place, so ref_ctr update is skipped and it stays 1 - uprobe CAN be triggered now 4) uprobe_unregister succeeds Link: https://lkml.kernel.org/r/20250514101809.2010193-1-jolsa@kernel.org Fixes: 1cc33161a83d ("uprobes: Support SDT markers having reference count (semaphore)") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09exit: fix misleading comment in forget_original_parent()Fushuai Wang-6/+1
The commit 482a3767e508 ("exit: reparent: call forget_original_parent() under tasklist_lock") moved the comment from exit_notify() to forget_original_parent(). However, the forget_original_parent() only handles (A), while (B) is handled in kill_orphaned_pgrp(). So remove the unrelated part. Link: https://lkml.kernel.org/r/20250615030930.58051-1-wangfushuai@baidu.com Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: wangfushuai <wangfushuai@baidu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kcov: fix typo in comment of kcov_fault_in_areaWei Nanxin-1/+1
change '__santizer_cov_trace_pc()' to '__sanitizer_cov_trace_pc()' Link: https://lkml.kernel.org/r/20250615123237.110144-1-n9winx@163.com Signed-off-by: Wei Nanxin <n9winx@163.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Macro Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: support a counter tracking if data is too big to writeJason Xing-8/+10
It really doesn't matter if the user/admin knows what the last too big value is. Record how many times this case is triggered would be helpful. Solve the existing issue where relay_reset() doesn't restore the value. Store the counter in the per-cpu buffer structure instead of the global buffer structure. It also solves the racy condition which is likely to happen when a few of per-cpu buffers encounter the too big data case and then access the global field last_toobig without lock protection. Remove the printk in relay_close() since kernel module can directly call relay_stats() as they want. Link: https://lkml.kernel.org/r/20250612061201.34272-6-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09blktrace: use rbuf->stats.full as a drop indicator in relayfsJason Xing-20/+2
Replace internal subbuf_start in blktrace with the default policy in relayfs. Remove dropped field from struct blktrace. Correspondingly, call the common helper in relay. By incrementing full_count to keep track of how many times we encountered a full buffer issue, user space will know how many events were lost. Link: https://lkml.kernel.org/r/20250612061201.34272-5-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: introduce getting relayfs statistics functionJason Xing-0/+30
In this version, only support getting the counter for buffer full and implement the framework of how it works. Users can pass certain flag to fetch what field/statistics they expect to know. Each time it only returns one result. So do not pass multiple flags. Link: https://lkml.kernel.org/r/20250612061201.34272-4-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: support a counter tracking if per-cpu buffers is fullJason Xing-1/+7
When using relay mechanism, we often encounter the case where new data are lost or old unconsumed data are overwritten because of slow reader. Add 'full' field in per-cpu buffer structure to detect if the above case is happening. Relay has two modes: 1) non-overwrite mode, 2) overwrite mode. So buffer being full here respectively means: 1) relayfs doesn't intend to accept new data and then simply drop them, or 2) relayfs is going to start over again and overwrite old unread data with new data. Note: this counter doesn't need any explicit lock to protect from being modified by different threads for the better performance consideration. Writers calling __relay_write/relay_write should consider how to use the lock and ensure it performs under the lock protection, thus it's not necessary to add a new small lock here. Link: https://lkml.kernel.org/r/20250612061201.34272-3-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09relayfs: abolish prev_paddingJason Xing-7/+9
Patch series "relayfs: misc changes", v5. The series mostly focuses on the error counters which helps every user debug their own kernel module. This patch (of 5): prev_padding represents the unused space of certain subbuffer. If the content of a call of relay_write() exceeds the limit of the remainder of this subbuffer, it will skip storing in the rest space and record the start point as buf->prev_padding in relay_switch_subbuf(). Since the buf is a per-cpu big buffer, the point of prev_padding as a global value for the whole buffer instead of a single subbuffer (whose padding info is stored in buf->padding[]) seems meaningless from the real use cases, so we don't bother to record it any more. Link: https://lkml.kernel.org/r/20250612061201.34272-1-kerneljasonxing@gmail.com Link: https://lkml.kernel.org/r/20250612061201.34272-2-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kernel: relay: use __GFP_ZERO in relay_alloc_bufElijah Wright-2/+1
Passing the __GFP_ZERO flag to alloc_page should result in less overhead th= an using memset() Link: https://lkml.kernel.org/r/20250610225639.314970-3-git@elijahs.space Signed-off-by: Elijah Wright <git@elijahs.space> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: define a local GFP_VMAP_STACKLinus Walleij-6/+7
The current allocation of VMAP stack memory is using (THREADINFO_GFP & ~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL | __GFP_ZERO): <linux/thread_info.h>: define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO) <linux/gfp_types.h>: define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) This is an unfortunate side-effect of independent changes blurring the picture: commit 19809c2da28aee5860ad9a2eff760730a0710df0 changed (THREADINFO_GFP | __GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit. commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667 then added stack caching and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached stacks need to be accounted separately. However that code, when it eventually accounts the memory does this: ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0) so the memory is charged as a GFP_KERNEL allocation. Define a unique GFP_VMAP_STACK to use GFP_KERNEL | __GFP_ZERO and move the comment there. Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reported-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks codePasha Tatashin-31/+29
There are two data types: "struct vm_struct" and "struct vm_stack" that have the same local variable names: vm_stack, or vm, or s, which makes the code confusing to read. Change the code so the naming is consistent: struct vm_struct is always called vm_area struct vm_stack is always called vm_stack One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like a semantic change but it is not: vm_area->addr points to the vm_stack. This was done to improve readability. [linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments] Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: filter zone device pages returned from folio_walk_start()Alistair Popple-1/+1
Previously dax pages were skipped by the pagewalk code as pud_special() or vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are refcounted normally that is no longer the case, so the pagewalk code will start returning them. Most callers already explicitly filter for DAX or zone device pages so don't need updating. However some don't, so add checks to those callers. Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: use folio_expected_ref_count() helper for reference countingShivank Garg-2/+1
Replace open-coded folio reference count calculations with the folio_expected_ref_count(). No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "sched/numa: add statistics of numa balance task"Chen Yu-11/+2
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. This commit introduces per-memcg/task NUMA balance statistics, but unfortunately it introduced a NULL pointer exception due to the following race condition: After a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the p->mm caused the problem. CPU0 CPU1 : ... task_numa_migrate task_numa_find_cpu task_numa_compare # a normal task p is chosen env->best_task = p # p exit: exit_signals(p); p->flags |= PF_EXITING exit_mm p->mm = NULL; migrate_swap_stop __migrate_swap_task((arg->src_task, arg->dst_cpu) count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose. Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09fgraph: Make pid_str size match the commentArtem Sadovnikov-1/+1
The comment above buffer mentions sign, 10 bytes width for number and null terminator, but buffer itself isn't large enough to hold that much data. This is a cosmetic change, since PID cannot be negative, other than -1. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/20250617152110.2530-1-a.sadovnikov@ispras.ru Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09tracing: ring_buffer: Rewind persistent ring buffer on rebootMasami Hiramatsu (Google)-3/+100
Rewind persistent ring buffer pages which have been read in the previous boot. Those pages are highly possible to be lost before writing it to the disk if the previous kernel crashed. In this case, the trace data is kept on the persistent ring buffer, but it can not be read because its commit size has been reset after read. This skips clearing the commit size of each sub-buffer and recover it after reboot. Note: If you read the previous boot data via trace_pipe, that is not accessible in that time. But reboot without clearing (or reusing) the read data, the read data is recovered again in the next boot. Thus, when you read the previous boot data, clear it by `echo > trace`. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/174899582116.955054.773265393511190051.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Allow to configure the number of per-task monitorNam Cao-4/+14
Now that there are 2 monitors for real-time applications, users may want to enable both of them simultaneously. Make the number of per-task monitor configurable. Default it to 2 for now. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add rtapp_sleep monitorNam Cao-1/+534
Add a monitor for checking that real-time tasks do not go to sleep in a manner that may cause undesirable latency. Also change RV depends on TRACING to RV select TRACING to avoid the following recursive dependency: error: recursive dependency detected! symbol TRACING is selected by PREEMPTIRQ_TRACEPOINTS symbol PREEMPTIRQ_TRACEPOINTS depends on TRACE_IRQFLAGS symbol TRACE_IRQFLAGS is selected by RV_MON_SLEEP symbol RV_MON_SLEEP depends on RV symbol RV depends on TRACING Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/75bc5bcc741d153aa279c95faf778dff35c5c8ad.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add rtapp_pagefault monitorNam Cao-0/+189
Userspace real-time applications may have design flaws that they raise page faults in real-time threads, and thus have unexpected latencies. Add an linear temporal logic monitor to detect this scenario. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/78fea8a2de6d058241d3c6502c1a92910772b0ed.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add rtapp container monitorNam Cao-0/+48
Add the container "rtapp" which is the monitor collection for detecting problems with real-time applications. The monitors will be added in the follow-up commits. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/fb18b87631d386271de00959d8d4826f23fcd1cd.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add support for LTL monitorsNam Cao-4/+55
While attempting to implement DA monitors for some complex specifications, deterministic automaton is found to be inappropriate as the specification language. The automaton is complicated, hard to understand, and error-prone. For these cases, linear temporal logic is more suitable as the specification language. Add support for linear temporal logic runtime verification monitor. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: rename CONFIG_DA_MON_EVENTS to CONFIG_RV_MON_EVENTSNam Cao-4/+4
CONFIG_DA_MON_EVENTS is not specific to deterministic automaton. It could be used for other monitor types. Therefore rename it to CONFIG_RV_MON_EVENTS. This prepares for the introduction of linear temporal logic monitor. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/507210517123d887c1d208aa2fd45ec69765d3f0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Let the reactors take care of buffersNam Cao-5/+13
Each RV monitor has one static buffer to send to the reactors. If multiple errors are detected simultaneously, the one buffer could be overwritten. Instead, leave it to the reactors to handle buffering. Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09panic: Add vpanic()Nam Cao-4/+12
vpanic() is useful for implementing runtime verification reactors. Add it. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09printk: Make vprintk_deferred() publicNam Cao-1/+0
vprintk_deferred() is useful for implementing runtime verification reactors. Make it public. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add #undef TRACE_INCLUDE_FILENam Cao-1/+2
Without "#undef TRACE_INCLUDE_FILE", there could be a build error due to TRACE_INCLUDE_FILE being redefined. Therefore add it. Also fix a typo while at it. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/f805e074581e927bb176c742c981fa7675b6ebe5.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09sched/fair: Always trigger resched at the end of a protected periodVincent Guittot-9/+1
Always trigger a resched after a protected period even if the entity is still eligible. It can happen that an entity remains eligible at the end of the protected period but must let an entity with a shorter slice to run in order to keep its lag shorter than slice. This is particulalry true with run to parity which tries to maximize the lag. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-7-vincent.guittot@linaro.org
2025-07-09sched/fair: Fix entity's lag with run to parityVincent Guittot-3/+13
When an entity is enqueued without preempting current, we must ensure that the slice protection is updated to take into account the slice duration of the newly enqueued task so that its lag will not exceed its slice (+ tick). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-6-vincent.guittot@linaro.org
2025-07-09sched/fair: Limit run to parity to the min slice of enqueued entitiesVincent Guittot-4/+6
Run to parity ensures that current will get a chance to run its full slice in one go but this can create large latency and/or lag for entities with shorter slice that have exhausted their previous slice and wait to run their next slice. Clamp the run to parity to the shortest slice of all enqueued entities. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-5-vincent.guittot@linaro.org
2025-07-09sched/fair: Remove spurious shorter slice preemptionVincent Guittot-30/+14
Even if the waking task can preempt current, it might not be the one selected by pick_task_fair. Check that the waking task will be selected if we cancel the slice protection before doing so. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-4-vincent.guittot@linaro.org
2025-07-09sched/fair: Fix NO_RUN_TO_PARITY caseVincent Guittot-11/+20
EEVDF expects the scheduler to allocate a time quantum to the selected entity and then pick a new entity for next quantum. Although this notion of time quantum is not strictly doable in our case, we can ensure a minimum runtime for each task most of the time and pick a new entity after a minimum time has elapsed. Reuse the slice protection of run to parity to ensure such runtime quantum. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
2025-07-09sched/fair: Use protect_slice() instead of direct comparisonVincent Guittot-1/+1
Replace the test by the relevant protect_slice() function. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dhaval Giani (AMD) <dhaval@gianis.ca> Link: https://lkml.kernel.org/r/20250708165630.1948751-2-vincent.guittot@linaro.org
2025-07-09sched/deadline: Less agressive dl_server handlingPeter Zijlstra-12/+22
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default bandwidth control") caused a significant dip in his favourite benchmark of the day. Simply disabling dl_server cured things. His workload hammers the 0->1, 1->0 transitions, and the dl_server_{start,stop}() overhead kills it -- fairly obviously a bad idea in hind sight and all that. Change things around to only disable the dl_server when there has not been a fair task around for a whole period. Since the default period is 1 second, this ensures the benchmark never trips this, overhead gone. Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
2025-07-09sched/psi: Optimize psi_group_change() cpu_clock() usagePeter Zijlstra-55/+66
Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") caused a regression for him on a high context switch rate benchmark (schbench) due to the now repeating cpu_clock() calls. In particular the problem is that get_recent_times() will extrapolate the current state to 'now'. But if an update uses a timestamp from before the start of the update, it is possible to get two reads with inconsistent results. It is effectively back-dating an update. (note that this all hard-relies on the clock being synchronized across CPUs -- if this is not the case, all bets are off). Combine this problem with the fact that there are per-group-per-cpu seqcounts, the commit in question pushed the clock read into the group iteration, causing tree-depth cpu_clock() calls. On architectures where cpu_clock() has appreciable overhead, this hurts. Instead move to a per-cpu seqcount, which allows us to have a single clock read for all group updates, increasing internal consistency and lowering update overhead. This comes at the cost of a longer update side (proportional to the tree depth) which can cause the read side to retry more often. Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>, Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
2025-07-09sched/fair: Bump sd->max_newidle_lb_cost when newidle balance failsChris Mason-3/+16
schbench (https://github.com/masoncl/schbench.git) is showing a regression from previous production kernels that bisected down to: sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc) The schbench command line was: schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0 This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker threads spread across the rest of the CPUs. Neither the worker threads or the message threads do any work, they just wake each other up and go back to sleep as soon as possible. The end result is the first 4 CPUs are pegged waking up those 1024 workers, and the rest of the CPUs are constantly banging in and out of idle. If I take a v6.9 Linus kernel and revert that one commit, performance goes from 3.4M RPS to 5.4M RPS. schedstat shows there are ~100x more new idle balance operations, and profiling shows the worker threads are spending ~20% of their CPU time on new idle balance. schedstats also shows that almost all of these new idle balance attemps are failing to find busy groups. The fix used here is to crank up the cost of the newidle balance whenever it fails. Since we don't want sd->max_newidle_lb_cost to grow out of control, this also changes update_newidle_cost() to use sysctl_sched_migration_cost as the upper limit on max_newidle_lb_cost. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com
2025-07-09perf/core: Fix WARN in perf_sigtrap()Tetsuo Handa-7/+7
Since exit_task_work() runs after perf_event_exit_task_context() updated ctx->task to TASK_TOMBSTONE, perf_sigtrap() from perf_pending_task() might observe event->ctx->task == TASK_TOMBSTONE. Swap the early exit tests in order not to hit WARN_ON_ONCE(). Closes: https://syzkaller.appspot.com/bug?extid=2fe61cb2a86066be6985 Reported-by: syzbot <syzbot+2fe61cb2a86066be6985@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/b1c224bd-97f9-462c-a3e3-125d5e19c983@I-love.SAKURA.ne.jp
2025-07-09vdso/vsyscall: Split up __arch_update_vsyscall() into __arch_update_vdso_clock()Thomas Weißschuh-1/+2
The upcoming auxiliary clocks need this hook, too. To separate the architecture hooks from the timekeeper internals, refactor the hook to only operate on a single vDSO clock. While at it, use a more robust #define for the hook override. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-3-df7d9f87b9b8@linutronix.de
2025-07-09vdso/vsyscall: Introduce a helper to fill clock configurationsThomas Weißschuh-14/+13
The logic to configure a 'struct vdso_clock' from a 'struct tk_read_base' is copied two times. Split it into a shared function to reduce the duplication, especially as another user will be added for auxiliary clocks. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-2-df7d9f87b9b8@linutronix.de
2025-07-09Merge v6.16-rc2 into timers/ptpThomas Gleixner-3/+10
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail to compile for compat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-08kernel: trace: preemptirq_delay_test: use offstack cpu maskArnd Bergmann-4/+9
A CPU mask on the stack is broken for large values of CONFIG_NR_CPUS: kernel/trace/preemptirq_delay_test.c: In function ‘preemptirq_delay_run’: kernel/trace/preemptirq_delay_test.c:143:1: error: the frame size of 8512 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Fall back to dynamic allocation here. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Song Chen <chensong_2000@189.cn> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250620111215.3365305-1-arnd@kernel.org Fixes: 4b9091e1c194 ("kernel: trace: preemptirq_delay_test: add cpu affinity") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08tracing: Use queue_rcu_work() to free filtersSteven Rostedt-8/+20
Freeing of filters requires to wait for both an RCU grace period as well as a RCU task trace wait period after they have been detached from their lists. The trace task period can be quite large so the freeing of the filters was moved to use the call_rcu*() routines. The problem with that is that the callback functions of call_rcu*() is done from a soft irq and can cause latencies if the callback takes a bit of time. The filters are freed per event in a system and the syscalls system contains an event per system call, which can be over 700 events. Freeing 700 filters in a bottom half is undesirable. Instead, move the freeing to use queue_rcu_work() which is done in task context. Link: https://lore.kernel.org/all/9a2f0cd0-1561-4206-8966-f93ccd25927f@paulmck-laptop/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250609131732.04fd303b@gandalf.local.home Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization") Suggested-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()Yury Norov-4/+1
The dedicated cpumask_next_wrap() is more verbose and effective than cpumask_next() followed by cpumask_first(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250605000651.45281-1-yury.norov@gmail.com Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-08module: Make sure relocations are applied to the per-CPU sectionSebastian Andrzej Siewior-2/+8
The per-CPU data section is handled differently than the other sections. The memory allocations requires a special __percpu pointer and then the section is copied into the view of each CPU. Therefore the SHF_ALLOC flag is removed to ensure move_module() skips it. Later, relocations are applied and apply_relocations() skips sections without SHF_ALLOC because they have not been copied. This also skips the per-CPU data section. The missing relocations result in a NULL pointer on x86-64 and very small values on x86-32. This results in a crash because it is not skipped like NULL pointer would and can't be dereferenced. Such an assignment happens during static per-CPU lock initialisation with lockdep enabled. Allow relocation processing for the per-CPU section even if SHF_ALLOC is missing. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-lkp@intel.com Fixes: 1a6100caae425 ("Don't relocate non-allocated regions in modules.") #v2.6.1-rc3 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Link: https://lore.kernel.org/r/20250610163328.URcsSUC1@linutronix.de Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Message-ID: <20250610163328.URcsSUC1@linutronix.de>
2025-07-08module: Avoid unnecessary return value initialization in move_module()Petr Pavlu-2/+1
All error conditions in move_module() set the return value by updating the ret variable. Therefore, it is not necessary to the initialize the variable when declaring it. Remove the unnecessary initialization. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Link: https://lore.kernel.org/r/20250618122730.51324-3-petr.pavlu@suse.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Message-ID: <20250618122730.51324-3-petr.pavlu@suse.com>
2025-07-08module: Fix memory deallocation on error path in move_module()Petr Pavlu-2/+2
The function move_module() uses the variable t to track how many memory types it has allocated and consequently how many should be freed if an error occurs. The variable is initially set to 0 and is updated when a call to module_memory_alloc() fails. However, move_module() can fail for other reasons as well, in which case t remains set to 0 and no memory is freed. Fix the problem by initializing t to MOD_MEM_NUM_TYPES. Additionally, make the deallocation loop more robust by not relying on the mod_mem_type_t enum having a signed integer as its underlying type. Fixes: c7ee8aebf6c0 ("module: add stop-grap sanity check on module memcpy()") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Link: https://lore.kernel.org/r/20250618122730.51324-2-petr.pavlu@suse.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Message-ID: <20250618122730.51324-2-petr.pavlu@suse.com>
2025-07-08rcu/nocb: Fix possible invalid rdp's->nocb_cb_kthread pointer accessZqiang-3/+2
In the preparation stage of CPU online, if the corresponding the rdp's->nocb_cb_kthread does not exist, will be created, there is a situation where the rdp's rcuop kthreads creation fails, and then de-offload this CPU's rdp, does not assign this CPU's rdp->nocb_cb_kthread pointer, but this rdp's->nocb_gp_rdp and rdp's->rdp_gp->nocb_gp_kthread is still valid. This will cause the subsequent re-offload operation of this offline CPU, which will pass the conditional check and the kthread_unpark() will access invalid rdp's->nocb_cb_kthread pointer. This commit therefore use rdp's->nocb_gp_kthread instead of rdp_gp's->nocb_gp_kthread for safety check. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>