| Age | Commit message (Collapse) | Author | Files | Lines |
|
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decision based on the task attribute.
The typical use case is as follows,
bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
@owner = @mm->owner; // mm_struct::owner is rcu trusted or null
if (!@owner)
goto out;
/* Do something based on the task attribute */
out:
bpf_rcu_read_unlock();
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Refactor pick_task_scx() adding a new argument to forcibly pick a
SCHED_EXT task, ignoring any higher-priority sched class activity.
This refactoring prepares the code for future scenarios, e.g., allowing
the ext dl_server to force a SCHED_EXT task selection.
No functional changes.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The CMA code, in addition to the reserved-memory regions in the device
tree, will also register a default CMA region if the device tree doesn't
provide any, with its size and position coming from either the kernel
command-line or configuration.
Let's register that one for use to create a heap for it.
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-4-04ce150ea3d9@kernel.org
|
|
In order to create a CMA dma-buf heap instance for each CMA heap region
in the system, we need to collect all of them during boot.
They are created from two main sources: the reserved-memory regions in
the device tree, and the default CMA region created from the
configuration or command line parameters, if no default region is
provided in the device tree.
Let's collect all the device-tree defined CMA regions flagged as
reusable.
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-3-04ce150ea3d9@kernel.org
|
|
The pm_vt_switch_required() function fails silently when memory
allocation fails, offering no indication to callers that the operation
was unsuccessful. This behavior prevents drivers from handling allocation
errors correctly or implementing retry mechanisms. By ensuring that
failures are reported back to the caller, drivers can make informed
decisions, improve robustness, and avoid unexpected behavior during
critical power management operations.
Change the function signature to return an integer error code and modify
the implementation to return -ENOMEM when kmalloc() fails. Update both
the function declaration and the inline stub in include/linux/pm.h to
maintain consistency across CONFIG_VT_CONSOLE_SLEEP configurations.
The function now returns:
- 0 on success (including when updating existing entries)
- -ENOMEM when memory allocation fails
This change improves error reporting without breaking existing callers,
as the current callers in drivers/video/fbdev/core/fbmem.c already
ignore the return value, making this a backward-compatible improvement.
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Link: https://patch.msgid.link/20251013193028.89570-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19
Pull in tip/sched/core to receive:
50653216e4ff ("sched: Add support to pick functions to take rf")
4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()")
which will enable clean integration of DL server support among other things.
This conflicts with the following from sched_ext/for-6.18-fixes:
a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
which adds maybe_queue_balance_callback() to balance_scx() which is removed
by 50653216e4ff. Resolve by moving the invocation to pick_task_scx() in the
equivalent location.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull sched_ext/for-6.18-fixes to sync trees to receive:
05e63305c85c ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads")
to avoid conflicts with planned cgroup sub-sched support.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING
instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests
whether there is already a pending request for queue_balance_callback() to
be invoked at the end of .balance().
Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.
Fix by freeing 'result' on must_write_set allocation failure.
Fixes: b3698c356ad9 ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
|
|
With pick_task() having an rf argument, it is possible to do the
lock-break there, get rid of the weird balance/pick_task hack.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
Some pick functions like the internal pick_next_task_fair() already take
rf but some others dont. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.
[peterz: - added RETRY_TASK handling
- removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
Shrikanth noted that sched_change pattern relies on using shared
flags.
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Teach the sched_change pattern how to do update_rq_clock(); this
allows for some simplifications / cleanups.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
In preparation to adding more rules to __task_rq_lock(), such that
__task_rq_unlock() will no longer be equivalent to rq_unlock(),
make sure every __task_rq_lock() is matched by a __task_rq_unlock()
and vice-versa.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
'Document' the locking context the various sched_class methods are
called under.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Hopefully saner naming.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.
Furthermore, there is no reason what so ever to not just take all the
required locks, so do just that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
For some reason migrate_disable_switch() was more complicated than it
needs to be, resulting in mind bending locking of dubious quality.
Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Move sched_class::prio_changed() into the change pattern.
And while there, extend it with sched_class::get_prio() in order to
fix the deadline sitation.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.
This changes the order of callbacks slightly:
OLD NEW
|
| switching_from()
dequeue_task(); | dequeue_task()
put_prev_task(); | put_prev_task()
| switched_from()
|
... change task ... | ... change task ...
|
switching_to(); | switching_to()
enqueue_task(); | enqueue_task()
set_next_task(); | set_next_task()
prev_class->switched_from() |
switched_to() | switched_to()
|
Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.
Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Prepare for the sched_class::switch*() methods getting folded into the
change pattern. As a result of that, the location of switched_from
will change slightly. SCHED_DEADLINE is affected by this change in
location:
OLD NEW
|
| switching_from()
dequeue_task(); | dequeue_task()
put_prev_task(); | put_prev_task()
| switched_from()
|
... change task ... | ... change task ...
|
switching_to(); | switching_to()
enqueue_task(); | enqueue_task()
set_next_task(); | set_next_task()
prev_class->switched_from() |
switched_to() | switched_to()
|
Notably, where switched_from() was called *after* the change to the
task, it will get called before it. Specifically, switched_from_dl()
uses dl_task(p) which uses p->prio; which is changed when switching
class (it might be the reason to switch class in case of PI).
When switched_from_dl() gets called, the task will have left the
deadline class and dl_task() must be false, while when doing
dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have
called a different dequeue method.
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Ensure the matched flags are in the low word while the unmatched flags
go into the second word.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPU 0-7 in the first
group, and only on CPU 8-11 in the second group. CPU 12-15 are always idle.
{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
* * * * * * * * * * * *
When looking for dst group for newly forked threads, in many times
update_sg_wakeup_stats() reports the second group has more idle CPUs
than the first group. The scheduler thinks the second group is less
busy. Then it selects least busy CPUs among CPU 8-11. Therefore CPU 8-11
can be crowded with newly forked threads, at the same time CPU 0-7
can be idle.
A task may not use all the CPUs in a schedule group due to CPU affinity.
Only update schedule group statistics for allowed CPUs.
Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Allow architecture specific sched domain NUMA distances that are
modified from actual NUMA node distances for the purpose of building
NUMA sched domains.
Keep actual NUMA distances separately if modified distances
are used for building sched domains. Such distances
are still needed as NUMA balancing benefits from finding the
NUMA nodes that are actually closer to a task numa_group.
Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.
No functional change and additional distance array
allocated if there're no arch specific NUMA distances
being defined.
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
|
|
Commit 16b269436b72 ("sched/deadline: Modify cpudl::free_cpus
to reflect rd->online") introduced the cpudl_set/clear_freecpu
functions to allow the cpu_dl::free_cpus mask to be manipulated
by the deadline scheduler class rq_on/offline callbacks so the
mask would also reflect this state.
Commit 9659e1eeee28 ("sched/deadline: Remove cpu_active_mask
from cpudl_find()") removed the check of the cpu_active_mask to
save some processing on the premise that the cpudl::free_cpus
mask already reflected the runqueue online state.
Unfortunately, there are cases where it is possible for the
cpudl_clear function to set the free_cpus bit for a CPU when the
deadline runqueue is offline. When this occurs while a CPU is
connected to the default root domain the flag may retain the bad
state after the CPU has been unplugged. Later, a different CPU
that is transitioning through the default root domain may push a
deadline task to the powered down CPU when cpudl_find sees its
free_cpus bit is set. If this happens the task will not have the
opportunity to run.
One example is outlined here:
https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com
Another occurs when the last deadline task is migrated from a
CPU that has an offlined runqueue. The dequeue_task member of
the deadline scheduler class will eventually call cpudl_clear
and set the free_cpus bit for the CPU.
This commit modifies the cpudl_clear function to be aware of the
online state of the deadline runqueue so that the free_cpus mask
can be updated appropriately.
It is no longer necessary to manage the mask outside of the
cpudl_set/clear functions so the cpudl_set/clear_freecpu
functions are removed. In addition, since the free_cpus mask is
now only updated under the cpudl lock the code was changed to
use the non-atomic __cpumask functions.
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
If a task yields, the scheduler may decide to pick it again. The task in
turn may decide to yield immediately or shortly after, leading to a tight
loop of yields.
If there's another runnable task as this point, the deadline will be
increased by the slice at each loop. This can cause the deadline to runaway
pretty quickly, and subsequent elevated run delays later on as the task
doesn't get picked again. The reason the scheduler can pick the same task
again and again despite its deadline increasing is because it may be the
only eligible task at that point.
Fix this by making the task forfeiting its remaining vruntime and pushing
the deadline one slice ahead. This implements yield behavior more
authentically.
We limit the forfeiting to eligible tasks. This is because core scheduling
prefers running ineligible tasks rather than force idling. As such, without
the condition, we can end up on a yield loop which makes the vruntime
increase rapidly, leading to anomalous run delays later down the line.
Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com
|
|
Commit 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is
not cache-line-aligned") introduced DMA_BOUNCE_UNALIGNED_KMALLOC feature
and permitted architecture specific code configure kmalloc slabs with
sizes smaller than the value of dma_get_cache_alignment().
When that feature is enabled, the physical address of some small
kmalloc()-ed buffers might be not aligned to the CPU cachelines, thus not
really suitable for typical DMA. To properly handle that case a SWIOTLB
buffer bouncing is used, so no CPU cache corruption occurs. When that
happens, there is no point reporting a false-positive DMA-API warning that
the buffer is not properly aligned, as this is not a client driver fault.
[m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin]
Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com
Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com
Fixes: 370645f41e6e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The builtin DSQ queue data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.
This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. But it
represents a few extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.
This commit adds a first_task pointer field which is updated
atomically when the DSQ is modified, and allows any thread to peek at
the head of the queue without holding the lock.
Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When there is only one function of the same name, old_sympos of 0 and 1
are logically identical. Match them in klp_find_func().
This is to avoid a corner case with different toolchain behavior.
In this specific issue, two versions of kpatch-build were used to
build livepatch for the same kernel. One assigns old_sympos == 0 for
unique local functions, the other assigns old_sympos == 1 for unique
local functions. Both versions work fine by themselves. (PS: This
behavior change was introduced in a downstream version of kpatch-build.
This change does not exist in upstream kpatch-build.)
However, during livepatch upgrade (with the replace flag set) from a
patch built with one version of kpatch-build to the same fix built with
the other version of kpatch-build, livepatching fails with errors like:
[ 14.218706] sysfs: cannot create duplicate filename 'xxx/somefunc,1'
...
[ 14.219466] Call Trace:
[ 14.219468] <TASK>
[ 14.219469] dump_stack_lvl+0x47/0x60
[ 14.219474] sysfs_warn_dup.cold+0x17/0x27
[ 14.219476] sysfs_create_dir_ns+0x95/0xb0
[ 14.219479] kobject_add_internal+0x9e/0x260
[ 14.219483] kobject_add+0x68/0x80
[ 14.219485] ? kstrdup+0x3c/0xa0
[ 14.219486] klp_enable_patch+0x320/0x830
[ 14.219488] patch_init+0x443/0x1000 [ccc_0_6]
[ 14.219491] ? 0xffffffffa05eb000
[ 14.219492] do_one_initcall+0x2e/0x190
[ 14.219494] do_init_module+0x67/0x270
[ 14.219496] init_module_from_file+0x75/0xa0
[ 14.219499] idempotent_init_module+0x15a/0x240
[ 14.219501] __x64_sys_finit_module+0x61/0xc0
[ 14.219503] do_syscall_64+0x5b/0x160
[ 14.219505] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 14.219507] RIP: 0033:0x7f545a4bd96d
...
[ 14.219516] kobject: kobject_add_internal failed for somefunc,1 with
-EEXIST, don't try to register things with the same name ...
This happens because klp_find_func() thinks somefunc with old_sympos==0
is not the same as somefunc with old_sympos==1, and klp_add_object_nops
adds another xxx/func,1 to the list of functions to patch.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
[pmladek@suse.com: Fixed some typos.]
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
We have many places which open-code what's now is bpf_rcu_lock_held()
macro, so replace all those places with a clean and short macro invocation.
For that, move bpf_rcu_lock_held() macro into include/linux/bpf.h.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20251014201403.4104511-1-andrii@kernel.org
|
|
bpf_async_cb structures.
The following kmemleak splat:
[ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black
[ 8.106521] Call Trace:
[ 8.106521] <TASK>
[ 8.106521] dump_stack_lvl+0x4b/0x70
[ 8.106521] kvfree_call_rcu+0xcb/0x3b0
[ 8.106521] ? hrtimer_cancel+0x21/0x40
[ 8.106521] bpf_obj_free_fields+0x193/0x200
[ 8.106521] htab_map_update_elem+0x29c/0x410
[ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b
[ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86
[ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0
happens due to the combination of features and fixes, but mainly due to
commit 6d78b4473cdb ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip
kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()->
kvfree_call_rcu()->kmemleak_ignore() complains with the above splat.
To fix this imbalance, replace bpf_map_kmalloc_node() with
kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to
make sure that the objects allocated with kmalloc_nolock() are freed
with kfree_nolock() rather than the implicit kfree() that kfree_rcu()
uses internally.
Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so
it will always fail in PREEMPT_RT. This is not an issue at the moment,
since bpf_timers are disabled in PREEMPT_RT. In the future
bpf_spin_lock will be replaced with state machine similar to
bpf_task_work.
Fixes: 6d78b4473cdb ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com
|
|
In preparation for introducing klp-build, add a new CONFIG_KLP_BUILD
option. The initial version will only be supported on x86-64.
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
Add a new klp diff subcommand which performs a binary diff between two
object files and extracts changed functions into a new object which can
then be linked into a livepatch module.
This builds on concepts from the longstanding out-of-tree kpatch [1]
project which began in 2012 and has been used for many years to generate
livepatch modules for production kernels. However, this is a complete
rewrite which incorporates hard-earned lessons from 12+ years of
maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for symbol/section/reloc
inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines script
(coming in a later patch) which injects #line directives into the
source .patch to preserve the original line numbers at compile time.
Note the end result of this subcommand is not yet functionally complete.
Livepatch needs some ELF magic which linkers don't like:
- Two relocation sections (.rela*, .klp.rela*) for the same text
section.
- Use of SHN_LIVEPATCH to mark livepatch symbols.
Unfortunately linkers tend to mangle such things. To work around that,
klp diff generates a linker-compliant intermediate binary which encodes
the relevant KLP section/reloc/symbol metadata.
After module linking, a klp post-link step (coming soon) will clean up
the mess and convert the linked .ko into a fully compliant livepatch
module.
Note this subcommand requires the diffed binaries to have been compiled
with -ffunction-sections and -fdata-sections, and processed with
'objtool --checksum'. Those constraints will be handled by a klp-build
script introduced in a later patch.
Without '-ffunction-sections -fdata-sections', reliable object diffing
would be infeasible due to toolchain limitations:
- For intra-file+intra-section references, the compiler might
occasionally generated hard-coded instruction offsets instead of
relocations.
- Section-symbol-based references can be ambiguous:
- Overlapping or zero-length symbols create ambiguity as to which
symbol is being referenced.
- A reference to the end of a symbol (e.g., checking array bounds)
can be misinterpreted as a reference to the next symbol, or vice
versa.
A potential future alternative to '-ffunction-sections -fdata-sections'
would be to introduce a toolchain option that forces symbol-based
(non-section) relocations.
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
Add the section number and reloc index to relocation error messages to
help find the faulty relocation.
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
If we load a BPF scheduler while another scheduler is already running,
alloc_kick_pseqs() would be called again, overwriting the previously
allocated arrays.
Fix by moving the alloc_kick_pseqs() call after the scx_enable_state()
check, ensuring that the arrays are only allocated when a scheduler can
actually be loaded.
Fixes: 14c1da3895a11 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Implement the missing cgroup_set_idle() callback that was marked as a
TODO. This allows BPF schedulers to be notified when a cgroup's idle
state changes, enabling them to adjust their scheduling behavior
accordingly.
The implementation follows the same pattern as other cgroup callbacks
like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the
BPF scheduler has implemented the callback and invokes it with the
appropriate parameters.
Fixes a spelling error in the cgroup_set_bandwidth() documentation.
tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage.
Signed-off-by: zhidao su <soolaugust@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The check for some lost idle pelt time should be always done when
pick_next_task_fair() fails to pick a task and not only when we call it
from the fair fast-path.
The case happens when the last running task on rq is a RT or DL task. When
the latter goes to sleep and the /Sum of util_sum of the rq is at the max
value, we don't account the lost of idle time whereas we should.
Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
IBM CI tool reported kernel warning[1] when running a CPU removal
operation through drmgr[2]. i.e "drmgr -c cpu -r -q 1"
WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
Call Trace:
[c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
[c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
[c00000000034f624] hrtimer_interrupt+0x124/0x300
[c00000000002a230] timer_interrupt+0x140/0x320
Git bisects to: commit 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck")
This happens since:
- dl_server hrtimer gets enqueued close to cpu offline, when
kthread_park enqueues a fair task.
- CPU goes offline and drmgr removes it from cpu_present_mask.
- hrtimer fires and warning is hit.
Fix it by stopping the dl_server before CPU is marked dead.
[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr
[sshegde: wrote the changelog and tested it]
Fixes: 4ae8d9aa9f9d ("sched/deadline: Fix dl_server getting stuck")
Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
|
|
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.
That causes perf to misreport the device major/minor numbers of the file
system of the file, and the generation of the file, and potentially other
inode details. There is an existing helper file_user_inode() for that
situation.
Use file_user_inode() instead of file_inode() to get the inode for MMAP2
events.
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record -e cycles:u -- /root/test/merged/cat /proc/self/maps
...
55b2c91d0000-55b2c926b000 r-xp 00018000 00:1a 3419 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.004 MB perf.data (5 samples) ]
#
# stat /root/test/merged/cat
File: /root/test/merged/cat
Size: 1127792 Blocks: 2208 IO Block: 4096 regular file
Device: 0,26 Inode: 3419 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-09-08 12:23:59.453309624 +0000
Modify: 2025-09-08 12:23:59.454309624 +0000
Change: 2025-09-08 12:23:59.454309624 +0000
Birth: 2025-09-08 12:23:59.453309624 +0000
Before:
Device reported 00:02 differs from stat output and /proc/self/maps
# perf script --show-mmap-events | grep /root/test/merged/cat
cat 377 [-01] 243.078558: PERF_RECORD_MMAP2 377/377: [0x55b2c91d0000(0x9b000) @ 0x18000 00:02 3419 2068525940]: r-xp /root/test/merged/cat
After:
Device reported 00:1a is the same as stat output and /proc/self/maps
# perf script --show-mmap-events | grep /root/test/merged/cat
cat 362 [-01] 127.755167: PERF_RECORD_MMAP2 362/362: [0x55ba6e781000(0x9b000) @ 0x18000 00:1a 3419 0]: r-xp /root/test/merged/cat
With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels. FMODE_BACKING that it depends on was added
in v6.5. This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
|
|
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.
Since commit def3ae83da02f ("fs: store real path instead of fake path in
backing file f_path"), file_path() no longer returns the user file path
when applied to a backing file. There is an existing helper
file_user_path() for that situation.
Use file_user_path() instead of file_path() to get the path for MMAP
and MMAP2 events.
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record -e intel_pt//u -- /root/test/merged/cat /proc/self/maps
...
55b0ba399000-55b0ba434000 r-xp 00018000 00:1a 3419 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.060 MB perf.data ]
#
Before:
File name is wrong (/cat), so decoding fails:
# perf script --no-itrace --show-mmap-events
cat 367 [016] 100.491492: PERF_RECORD_MMAP2 367/367: [0x55b0ba399000(0x9b000) @ 0x18000 00:02 3419 489959280]: r-xp /cat
...
# perf script --itrace=e | wc -l
Warning:
19 instruction trace errors
19
#
After:
File name is correct (/root/test/merged/cat), so decoding is ok:
# perf script --no-itrace --show-mmap-events
cat 364 [016] 72.153006: PERF_RECORD_MMAP2 364/364: [0x55ce4003d000(0x9b000) @ 0x18000 00:02 3419 3132534314]: r-xp /root/test/merged/cat
# perf script --itrace=e
# perf script --itrace=e | wc -l
0
#
Fixes: def3ae83da02f ("fs: store real path instead of fake path in backing file f_path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
|
|
It was reported that Intel PT address filters do not work in Docker
containers. That relates to the use of overlayfs.
overlayfs records the backing file in struct vm_area_struct vm_file,
instead of the user file that the user mmapped. In order for an address
filter to match, it must compare to the user file inode. There is an
existing helper file_user_inode() for that situation.
Use file_user_inode() instead of file_inode() to get the inode for address
filter matching.
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record --buildid-mmap -e intel_pt//u --filter 'filter * @ /root/test/merged/cat' -- /root/test/merged/cat /proc/self/maps
...
55d61d246000-55d61d2e1000 r-xp 00018000 00:1a 3418 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.015 MB perf.data ]
# perf buildid-cache --add /root/test/merged/cat
Before:
Address filter does not match so there are no control flow packets
# perf script --itrace=e
# perf script --itrace=b | wc -l
0
# perf script -D | grep 'TIP.PGE' | wc -l
0
#
After:
Address filter does match so there are control flow packets
# perf script --itrace=e
# perf script --itrace=b | wc -l
235
# perf script -D | grep 'TIP.PGE' | wc -l
57
#
With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels. FMODE_BACKING that it depends on was added
in v6.5. This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.
Closes: https://lore.kernel.org/linux-perf-users/aBCwoq7w8ohBRQCh@fremen.lan
Reported-by: Edd Barrett <edd@theunixzoo.co.uk>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
|
|
It's less confusing to optimize uprobe right after handlers execution
and before we do the check for changed ip register to avoid situations
where changed ip register would skip uprobe optimization.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change add the WQ_UNBOUND flag to pm_wq, to make explicit this
workqueue can be unbound and that it does not benefit from per-cpu work.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and
scx_bpf_dsq_insert_vtime() to return bool instead of void. With
sub-schedulers, there will be no reliable way to guarantee a task is still
owned by the sub-scheduler at insertion time (e.g., the task may have been
migrated to another scheduler). The bool return value will enable
sub-schedulers to detect and gracefully handle insertion failures.
For the root scheduler, insertion failures will continue to trigger scheduler
abort via scx_error(), so existing code doesn't need to check the return
value. Backward compatibility is maintained through compat wrappers.
Also update scx_bpf_dsq_move() documentation to clarify that it can return
false for sub-schedulers when @dsq_id points to a disallowed local DSQ.
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5
parameters. An upcoming change will add aux__prog parameter which will exceed
BPF's 5 argument limit.
Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and
__scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are
kept as compatibility wrappers. BPF programs use inline wrappers that detect
kernel API version via bpf_core_type_exists() and use the new struct-based
kfuncs when available, falling back to compat kfuncs otherwise. This allows
BPF programs to work with both old and new kernels.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|