Current perf doesn't explicitly schedule out all exclude_guest events
while the guest is running. That is not a problem for the current
emulated vPMU, because perf owns all the PMU counters: it can mask a
counter that is assigned to an exclude_guest event when a guest is
running (the Intel way), or set the corresponding HOSTONLY bit in the
eventsel (the AMD way), so that the counter doesn't count while a guest
is running.
However, neither way works with the introduced mediated vPMU. A guest
owns all the PMU counters when it's running, so the host must not mask
any counter: the counter may be in use by the guest, and the eventsel
may be overwritten.
Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume the counting when
exiting the guest.
It's possible that an exclude_guest event is created while a guest is
running. Such a new event should not be scheduled in either.
The ctx time is shared among different PMUs and cannot be stopped when
a guest is running, because it is required to calculate the time for
events from other PMUs, e.g., uncore events. Add timeguest to track the
guest run time. For an exclude_guest event, the elapsed time equals
the ctx time minus the guest time.
Cgroups have dedicated times. Use the same method to deduct the guest
time from the cgroup time as well.
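A minimal sketch of the resulting arithmetic (timeguest comes from this
changelog; the helper itself and the exact field layout are hypothetical):

/* Illustrative only: an exclude_guest event's readable time excludes
 * the accumulated guest run time. */
static u64 perf_event_time_excl_guest(struct perf_event_context *ctx)
{
	return ctx->time - ctx->timeguest;
}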
[sean: massage comments]
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com
|
|
Perf currently tracks two sets of timestamps, one for the normal ctx and
one for cgroups, using the same type of variables and similar code. A
following patch will introduce a third set of timestamps, to track the
guest time.
To avoid code duplication, add a new struct perf_time_ctx and factor
out a generic function, update_perf_time_ctx().
No functional change.
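A minimal sketch of the factored-out pieces, assuming the struct mirrors
the existing time/timestamp/timeoffset triple (field and parameter names
are assumptions):

struct perf_time_ctx {
	u64	time;	/* accumulated enabled time */
	u64	stamp;	/* timestamp of the last update */
	u64	offset;	/* time - stamp, for lock-free reads */
};

static void update_perf_time_ctx(struct perf_time_ctx *t, u64 now, bool adv)
{
	if (adv)
		t->time += now - t->stamp;
	t->stamp = now;
	t->offset = t->time - now;
}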
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com
|
|
Currently, exposing PMU capabilities to a KVM guest is done by emulating
guest PMCs via host perf events, i.e. by having KVM be "just" another user
of perf. As a result, the guest and host are effectively competing for
resources, and emulating guest accesses to vPMU resources requires
expensive actions (expensive relative to the native instruction). The
overhead and resource competition result in degraded guest performance
and ultimately very poor vPMU accuracy.
To address the issues with the perf-emulated vPMU, introduce a "mediated
vPMU", where the data plane (PMCs and enable/disable knobs) is exposed
directly to the guest, but the control plane (event selectors and access
to fixed counters) is managed by KVM (via MSR interceptions). To allow
host perf usage of the PMU to (partially) co-exist with KVM/guest usage
of the PMU, KVM and perf will coordinate a world switch between host
perf context and guest vPMU context near VM-Enter/VM-Exit.
Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM
to create and release a mediated PMU instance (per VM). Because host perf
context will be deactivated while the guest is running, mediated PMU usage
will be mutually exclusive with perf analysis of the guest, i.e. perf
events that do NOT exclude the guest will not behave as expected.
To avoid silent failure of !exclude_guest perf events, disallow creating a
mediated PMU if there are active !exclude_guest events, and on the perf
side, disallow creating new !exclude_guest perf events while there is
at least one active mediated PMU.
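A hypothetical KVM-side usage sketch (only the two exported names come
from this changelog; return types, the kvm->arch field, and the error
code are assumptions):

/* Create the mediated PMU instance when the VM enables its vPMU, and
 * release it on VM destruction.  perf_create_mediated_pmu() is assumed
 * to fail, e.g. with -EBUSY, if active !exclude_guest events exist. */
static int kvm_init_mediated_pmu(struct kvm *kvm)
{
	int ret = perf_create_mediated_pmu();

	if (!ret)
		kvm->arch.mediated_pmu_enabled = true;
	return ret;
}

static void kvm_destroy_mediated_pmu(struct kvm *kvm)
{
	if (kvm->arch.mediated_pmu_enabled)
		perf_release_mediated_pmu();
}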
Exempt PMU resources that do not support mediated PMU usage, i.e. that are
outside the scope/view of KVM's vPMU and will not be swapped out while the
guest is running.
Guard mediated PMU with a new kconfig to help readers identify code paths
that are unique to mediated PMU support, and to allow for adding arch-
specific hooks without stubs. KVM x86 is expected to be the only KVM
architecture to support a mediated PMU in the near future (e.g. arm64 is
trending toward a partitioned PMU implementation), and KVM x86 will select
PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs.
Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that
all paths are compile tested. Full KVM support is on its way...
[sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com
|
|
Move the freeing of any security state associated with a perf event from
_free_event() to __free_event(), i.e. invoke security_perf_event_free() in
the error paths for perf_event_alloc(). This will allow adding potential
error paths in perf_event_alloc() that can occur after allocating security
state.
Note, kfree() and thus security_perf_event_free() is a nop if
event->security is NULL, i.e. calling security_perf_event_free() even if
security_perf_event_alloc() fails or is never reached is functionally ok.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com
|
|
Only KVM knows the exact time when a guest is entering/exiting. Expose
two interfaces to KVM to switch the ownership of the PMU resources.
All the pinned events must be scheduled in first. Extend the
perf_event_sched_in() helper to support an extra flag, e.g., EVENT_GUEST.
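A sketch of the intended call shape around VM-Enter/VM-Exit (the two
interface names below are hypothetical; only their existence and
EVENT_GUEST come from this changelog):

/* Hypothetical names; locking and error handling elided. */
static void kvm_run_with_mediated_pmu(struct kvm_vcpu *vcpu)
{
	perf_guest_enter();	/* host events out, pinned ones first */
	vcpu_enter_guest(vcpu);	/* VM-Enter ... VM-Exit */
	perf_guest_exit();	/* host events back in (EVENT_GUEST) */
}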
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com
|
|
To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips PMUs without cgroup events. A bool cgroup was
introduced to indicate the case. That works, but it is hard to
extend to other cases, e.g. skipping non-mediated PMUs; it doesn't
make sense to keep adding bool variables.
Pass the event_type instead of the specific bool variable, and check
both the event_type and the related pmu_ctx variables to decide whether
to skip a PMU.
Event flags, e.g., EVENT_CGROUP, should be cleared from ctx->is_active.
Add EVENT_FLAGS to denote such event flags.
No functional change.
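For instance, the is_active update this implies (EVENT_FLAGS is from
this changelog; the surrounding statement is assumed):

/* Flags such as EVENT_CGROUP describe *how* to iterate, not *what*
 * is active, so strip them before recording the state. */
ctx->is_active |= (event_type & ~EVENT_FLAGS);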
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com
|
|
Commit 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") added an
assert to sched_change_end() verifying that a class demotion would result in a
reschedule.
As it turns out, rt_mutex_setprio() does not force a resched on class
demotion. Furthermore, this is only relevant to running tasks.
Change the warning into a reschedule and make sure to only do so for running
tasks.
Fixes: 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
|
|
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
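A sketch of the upward tracking (rq::next_class comes from this
changelog; the helper shape is assumed):

/* On wakeup, ratchet rq->next_class up towards the woken task's
 * class; schedule() re-sets it from the picked task. */
static void update_next_class(struct rq *rq, struct task_struct *p)
{
	if (sched_class_above(p->sched_class, rq->next_class))
		rq->next_class = p->sched_class;
}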
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
|
|
Cross-merge BPF and other fixes after downstream PR.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix BPF builds due to -fms-extensions: selftests (Alexei
  Starovoitov) and bpftool (Quentin Monnet)
- Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n
(Geert Uytterhoeven)
- Fix livepatch/BPF interaction and support reliable unwinding through
BPF stack frames (Josh Poimboeuf)
- Do not audit capability check in arm64 JIT (Ondrej Mosnacek)
- Fix truncated dmabuf BPF iterator reads (T.J. Mercier)
- Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu)
- Fix warnings in libbpf when built with -Wdiscarded-qualifiers under
C23 (Mikhail Gavrilov)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: add regression test for bpf_d_path()
bpf: Fix verifier assumptions of bpf_d_path's output buffer
selftests/bpf: Add test for truncated dmabuf_iter reads
bpf: Fix truncated dmabuf iterator reads
x86/unwind/orc: Support reliable unwinding through BPF stack frames
bpf: Add bpf_has_frame_pointer()
bpf, arm64: Do not audit capability check in do_jit()
libbpf: Fix -Wdiscarded-qualifiers under C23
bpftool: Fix build warnings due to MS extensions
net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT
selftests/bpf: Add -fms-extensions to bpf build flags
|
|
Smatch reported:
kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR'
In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to
err_free_gdsqs without initializing @ret. That can lead to returning
ERR_PTR(0), which violates the ERR_PTR() convention and confuses
callers.
Set @ret to -ENOMEM before jumping to the error path when
alloc_percpu() fails.
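A sketch of the fix (identifiers per the changelog; the allocation
context is assumed):

	sch->event_stats_cpu = alloc_percpu(struct scx_event_stats);
	if (!sch->event_stats_cpu) {
		ret = -ENOMEM;	/* was left uninitialized -> ERR_PTR(0) */
		goto err_free_gdsqs;
	}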
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/
Reported-by: Dan Carpenter <error27@gmail.com>
Fixes: c201ea1578d3 ("sched_ext: Move event_stats_cpu into scx_sched")
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The verifier currently limits direct offsets into a map to 512MiB
to avoid overflow during pointer arithmetic. However, this prevents
arena maps from using direct addressing instructions to access data
at the end of > 512MiB arena maps. This is necessary when moving
arena globals to the end of the arena instead of the front.
Refactor the verifier code to remove the offset calculation during
direct value access calculations. This is possible because the only
two map types that implement .map_direct_value_addr() are arrays and
arenas, and they both do their own internal checks to ensure the
offset is within bounds.
Adjust selftests that expect the old error. These tests still fail
because the verifier identifies the access as out of bounds for the
map, so change them to expect an "invalid access to map value pointer"
error instead.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com
|
|
NETFILTER_PKT records show both the source and destination
addresses, in addition to the associated networking protocol.
However, they lack port information, which is often
valuable for troubleshooting.
This patch adds both source and destination port numbers,
'sport' and 'dport' respectively, to TCP, UDP, UDP-Lite and
SCTP-related NETFILTER_PKT records.
$ TESTS="netfilter_pkt" make -e test &> /dev/null
$ ausearch -i -ts recent |grep NETFILTER_PKT
type=NETFILTER_PKT ... proto=icmp
type=NETFILTER_PKT ... proto=ipv6-icmp
type=NETFILTER_PKT ... proto=udp sport=46333 dport=42424
type=NETFILTER_PKT ... proto=udp sport=35953 dport=42424
type=NETFILTER_PKT ... proto=tcp sport=50314 dport=42424
type=NETFILTER_PKT ... proto=tcp sport=57346 dport=42424
Link: https://github.com/linux-audit/audit-kernel/issues/162
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
The netfilter code in net/netfilter/nft_log.c and net/netfilter/xt_AUDIT.c
has to be kept in sync. Both source files had duplicated versions of
the audit_ip4() and audit_ip6() functions, which can result in
inconsistency and/or duplicated work.
This patch adds a helper function in audit.c that both netfilter call
sites can share, aiming to improve maintainability and consistency.
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix memory leak when destroying helper kthread workers during
scheduler disable
- Fix bypass depth accounting on scx_enable() failure which could leave
the system permanently in bypass mode
- Fix missing preemption handling when moving tasks to local DSQs via
scx_bpf_dsq_move()
- Misc fixes including NULL check for put_prev_task(), flushing stdout
in selftests, and removing unused code
* tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unused code in the do_pick_task_scx()
selftests/sched_ext: flush stdout before test to avoid log spam
sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
sched_ext: Fix bypass depth leak on scx_enable() failure
sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
sched_ext: Fix the memleak for sch->helper objects
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- Fix a race condition in css_rstat_updated() where CMPXCHG without
LOCK prefix could cause lnode corruption when the flusher runs
concurrently on another CPU. The issue was introduced in 6.17 and
causes memcg stats to become corrupted in production.
* tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
|
|
Add infrastructure to redirect interrupt handler execution to a
different CPU when the current CPU is not part of the interrupt's CPU
affinity mask.
This is primarily aimed at (de)multiplexed interrupts, where the child
interrupt handler runs in the context of the parent interrupt handler,
and therefore CPU affinity control for the child interrupt is typically
not available.
With the new infrastructure, the child interrupt is allowed to freely
change its affinity setting, independently of the parent. If the
interrupt handler happens to be triggered on an "incompatible" CPU (a
CPU that's not part of the child interrupt's affinity mask), the handler
is redirected and runs in IRQ work context on a "compatible" CPU.
No functional change is being made to any existing irqchip driver, and
irqchip drivers must be explicitly modified to use the newly added
infrastructure to support interrupt redirection.
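Conceptually, the redirection boils down to the sketch below (the
work-item field and helper are hypothetical; the cpumask and IRQ-work
accessors are the generic kernel ones):

/* If this CPU is outside the child irq's affinity mask, punt the
 * handler to IRQ work on a CPU that is inside the mask. */
static void demux_child_irq(struct irq_desc *desc)
{
	const struct cpumask *aff = irq_data_get_affinity_mask(&desc->irq_data);

	if (cpumask_test_cpu(smp_processor_id(), aff))
		handle_irq_event(desc);
	else
		irq_work_queue_on(&desc->redirect_work, cpumask_any(aff));
}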
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-pci/878qpg4o4t.ffs@tglx/
Link: https://patch.msgid.link/20251128212055.1409093-2-rrendec@redhat.com
|
|
setup_percpu_irq() was always a bad kludge, and should have never
been there in the first place. Now that the last users are gone,
remove it for good.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251210082242.360936-7-maz@kernel.org
|
|
With the IRQ timing stuff gone, there is no need to specify a flag
when requesting a percpu interrupt. Not only was IRQF_TIMER the only
flag (a set of flags, actually) allowed, nobody ever passed it.
Get rid of __request_percpu_irq(), which only ever got 0 as flags, and
promote request_percpu_irq_affinity() as its replacement.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20251210082242.360936-3-maz@kernel.org
|
|
The IRQ timing tracking infrastructure was merged in 2019, but was never
plumbed in, is not selectable, and is therefore never used.
As Daniel agrees that there is little hope for this infrastructure to be
completed in the near term, drop it altogether.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/87zf7vex6h.wl-maz@kernel.org
Link: https://patch.msgid.link/20251210082242.360936-2-maz@kernel.org
|
|
New network transport protocols want NIC drivers to get hardware timestamps
of all incoming packets, and possibly all outgoing packets.
One example is the upcoming 'Swift congestion control', which is used by
the TCP transport and is the primary consumer of timecounter_cyc2time().
This means timecounter_cyc2time() can be called more than 100 million
times per second on a busy server.
Inlining timecounter_cyc2time() brings a 12% improvement on a UDP receive
stress test on a 100Gbit NIC.
Note that FDO, LTO, PGO are unable to magically help for this case,
presumably because NIC drivers are almost exclusively shipped as modules.
Add an unlikely() around the cc_cyc2ns_backwards() case, even if FDO (when
used) is able to take care of this optimization.
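A simplified sketch of the now-inlined hot path (assuming the existing
timecounter fields; details may differ from the actual header):

static inline u64 timecounter_cyc2time(const struct timecounter *tc,
				       u64 cycle_tstamp)
{
	u64 delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask;
	u64 nsec = tc->nsec, frac = tc->frac;

	/* A timestamp that appears to be behind cycle_last is rare. */
	if (unlikely(delta > tc->cc->mask / 2)) {
		delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask;
		nsec -= cc_cyc2ns_backwards(tc->cc, delta, tc->mask, frac);
	} else {
		nsec += cyclecounter_cyc2ns(tc->cc, delta, tc->mask, &frac);
	}
	return nsec;
}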
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/
Link: https://patch.msgid.link/20251129095740.3338476-1-edumazet@google.com
|
|
The kick_idle variable is no longer used; remove it, along with the
associated code in do_pick_task_scx().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit d5d399efff6577 ("printk/nbcon: Release nbcon consoles ownership
in atomic flush after each emitted record") prevented a stall of a CPU
which lost nbcon console ownership because another CPU entered
an emergency flush.
But there is still the problem that the CPU doing the emergency flush
might cause a stall on its own.
Let's go even further and restore IRQs in the atomic flush after
each emitted record.
It is not a complete solution. Interrupts and/or scheduling might
still be blocked when the emergency atomic flush is called with
IRQs and/or scheduling disabled. But it should remove the following
lockup:
mlx5_core 0000:03:00.0: Shutdown was called
kvm: exiting hardware virtualization
arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102]
smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057)
smp: csd: CSD lock (#1) unresponsive.
[...]
Call trace:
pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P)
nbcon_emit_next_record (kernel/printk/nbcon.c:1049)
__nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517)
__nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612)
nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629)
printk_kthreads_shutdown (kernel/printk/printk.c:?)
syscore_shutdown (drivers/base/syscore.c:120)
kernel_kexec (kernel/kexec_core.c:1045)
__arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722)
invoke_syscall (arch/arm64/kernel/syscall.c:50)
el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?)
do_el0_svc (arch/arm64/kernel/syscall.c:152)
el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820)
el0t_64_sync (arch/arm64/kernel/entry.S:600)
In this case, nbcon_atomic_flush_pending() is called from
printk_kthreads_shutdown() with IRQs and scheduling enabled.
Note that __nbcon_atomic_flush_pending_con() is also called directly from
nbcon_device_release(), where the disabled IRQs might break PREEMPT_RT
guarantees. But the atomic flush is called only in emergency or panic
situations where the latencies are irrelevant anyway.
An ultimate solution would be to touch the watchdogs, but that would
hide all problems. Let's do it later, if anyone reports a stall which
does not have a better solution.
Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
These callbacks are necessary to synchronize the ->write_thread() callback
against other operations using the same device.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251208-nbcon-device-cb-fix-v2-1-36be8d195123@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When spawning and killing threads in separate processes in parallel, the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case. This patch fixes that
aspect.
Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k depending on false-sharing (which I only minimally
sanitized). So this is at least 10% if you are unlucky.
The change also facilitated some cosmetic fixes, and has the
unintentional side effect of no longer issuing a spurious
idr_preload() around idr_replace().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251203092851.287617-3-mjguzik@gmail.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Till now, the runtime PM workqueue has been flagged as freezable, so it
does not process work items during system-wide PM transitions like
system suspend and resume. The original reason to do that was to
reduce the likelihood of runtime PM getting in the way of system-wide
PM processing, but now it is mostly an optimization because (1) runtime
suspend of devices is prevented by bumping up their runtime PM usage
counters in device_prepare() and (2) device drivers are expected to
disable runtime PM for the devices handled by them before they embark
on system-wide PM activities that may change the state of the hardware
or otherwise interfere with runtime PM. However, it prevents
asynchronous runtime resume of devices from working during system-wide
PM transitions, which is confusing because synchronous runtime resume
is not prevented at the same time, and it also sometimes turns out to
be problematic.
For example, it has been reported that blk_queue_enter() may deadlock
during a system suspend transition because of the pm_request_resume()
usage in it [1]. It may also deadlock during a system resume transition
in a similar way. That happens because the asynchronous runtime resume
of the given device is not processed due to the freezing of the runtime
PM workqueue. While it may be better to address this particular issue
in the block layer, the very presence of it means that similar problems
may be expected to occur elsewhere.
For this reason, remove the WQ_FREEZABLE flag from the runtime PM
workqueue and make device_suspend_late() use the generic variant of
pm_runtime_disable() that will carry out a runtime resume of the device
synchronously if there is pending resume work for it.
Also update the comment before the pm_runtime_disable() call in
device_suspend_late() to document the fact that runtime PM
should not be expected to work for the device until the end of
device_resume_early(), and update the related documentation.
This change may, even though it is not expected to, uncover some
latent issues related to queuing up asynchronous runtime resume
work items during system suspend or hibernation. However, they
should be limited to the interference between runtime resume and
system-wide PM callbacks in the cases when device drivers start
to handle system-wide PM before disabling runtime PM as described
above.
Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki
|
|
There are three layers of logic in the scheduler that
deal with 'has_blocked' (load) handling of the NOHZ code:
(1) nohz.has_blocked and
(2) rq->has_blocked_load deal with NOHZ idle balancing,
(3) and cfs_rq_has_blocked() is part of the layer
    that passes the SMP load-balancing signal to the
    NOHZ layers.
The 'has_blocked' and 'has_blocked_load' names are used
in a mixed fashion, sometimes within the same function.
Standardize on 'has_blocked_load' to make it all easy
to read and easy to grep.
No change in functionality.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
|
|
for wrapped-signed arithmetic
We have to be careful with vruntime comparisons and subtraction,
due to the possibility of wrapping, so we have macros like:
#define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
Which is used like this:
if (vruntime_gt(min_vruntime, se, rse))
se->min_vruntime = rse->min_vruntime;
Replace this with an easier to read pattern that uses the regular
arithmetics operators:
if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime))
se->min_vruntime = rse->min_vruntime;
Also replace vruntime subtractions with vruntime_op():
- delta = (s64)(sea->vruntime - seb->vruntime) +
- (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
+ delta = vruntime_op(sea->vruntime, "-", seb->vruntime) +
+ vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi);
In the vruntime_cmp() and vruntime_op() macros, use __builtin_strcmp(),
because __HAVE_ARCH_STRCMP might turn off the compiler optimizations
we rely on here to catch usage bugs.
No change in functionality.
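As an illustration of the technique (this sketch is not the exact
upstream macro): __builtin_strcmp() on string literals folds to a
constant, so the dead branches vanish and an unsupported operator
string trips BUILD_BUG():

#define vruntime_cmp(lhs, op, rhs)				\
({								\
	s64 __delta = (s64)((lhs) - (rhs));			\
	!__builtin_strcmp(op, "<")  ? __delta <  0 :		\
	!__builtin_strcmp(op, "<=") ? __delta <= 0 :		\
	!__builtin_strcmp(op, ">")  ? __delta >  0 :		\
	!__builtin_strcmp(op, ">=") ? __delta >= 0 :		\
	({ BUILD_BUG(); false; });				\
})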
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
functions
The ::avg_vruntime field is a misnomer: it says it's an
'average vruntime', but in reality it's the momentary sum
of the weighted vruntimes of all queued tasks, which is
at least a division away from being an average.
This is clear from comments about the math of fair scheduling:
* \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
This confusion is increased by the cfs_avg_vruntime() function,
which does perform the division and returns a true average.
The sum of all weighted vruntimes should be named thusly,
so rename the field to ::sum_w_vruntime. (As arguably
::sum_weighted_vruntime would be a bit of a mouthful.)
Understanding the scheduler is hard enough already, without
extra layers of obfuscated naming. ;-)
Also rename related helper functions:
sum_vruntime_add() => sum_w_vruntime_add()
sum_vruntime_sub() => sum_w_vruntime_sub()
sum_vruntime_update() => sum_w_vruntime_update()
With the notable exception of cfs_avg_vruntime(), which
was named accurately.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
|
|
The ::avg_load field is a long-standing misnomer: it says it's an
'average load', but in reality it's the momentary sum of the load
of all currently runnable tasks. We'd have to also perform a
division by nr_running (or use time-decay) to arrive at any sort
of average value.
This is clear from comments about the math of fair scheduling:
* \Sum w_i := cfs_rq->avg_load
The sum of all weights is ... the sum of all weights, not
the average of all weights.
To make it doubly confusing, there's also an ::avg_load
in the load-balancing struct sg_lb_stats, which *is* a
true average.
The second part of the field's name is a minor misnomer
as well: it says 'load', and it is indeed a load_weight
structure as it shares code with the load-balancer - but
it's only in an SMP load-balancing context where
load = weight, in the fair scheduling context the primary
purpose is the weighting of different nice levels.
So rename the field to ::sum_weight instead, which makes
the terminology of the EEVDF math match up with our
implementation of it:
* \Sum w_i := cfs_rq->sum_weight
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
|
|
- Fix vertical alignment
- Fix typos
- Fix capitalization
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-3-mingo@kernel.org
|
|
Join two identical #ifdef blocks:
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
Also mark nested #ifdef blocks in the usual fashion, to make
it more apparent where in a nested hierarchy of #ifdefs we
are at a glance.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
|
|
Add some checks to the sched_change pattern to validate assumptions
around changing classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
|
|
The task_tick_fair() function does:
- update the hierarchical runtimes
- drive NUMA-balancing
- update load-balance statistics
- drive force-idle preemption
All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
|
|
With fair switched to rcu_dereference_all() validation, having IRQ or
preemption disabled is sufficient, remove the rcu_read_lock()
clutter.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
|
|
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really
make sense to check for just the rcu one. Switch to the _all family of
verification which includes all 3 of the listed flavours.
Notably, this will enable us to remove some superfluous
rcu_read_lock() regions when we know they are inside preempt/IRQ
disabled regions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
rcu_dereference_sched_domain()
Remove check from the name for being surplus to requirements.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
|
|
Nine (and a half) instances of the same pattern is just silly, fold the lot.
Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divider() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
|
|
Commit af65320948b8 ("bpf: Bump iter seq size to support BTF
representation of large data structures") increased the fixed buffer
size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function
didn't get updated at the same time. Update them.
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251207091005.2829703-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
- Fix CPU hotplug callbacks to disable interrupts on UP kernels
* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
- Fix NULL pointer dereference crash in the Intel PMU driver
- Fix missing read event generation on task exit
- Fix AMD uncore driver init error handling
- Fix whitespace noise
* tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix NULL event dereference crash in handle_pmi_common()
perf/core: Fix missing read event generation on task exit
perf/x86/amd/uncore: Fix the return value of amd_uncore_df_event_init() on error
perf/uprobes: Remove <space><Tab> whitespace noise
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Fix error code in the irqchip/mchp-eic driver
- Fix setup_percpu_irq() affinity assumptions
- Remove the unused irq_domain_add_tree() function
* tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mchp-eic: Fix error code in mchp_eic_domain_alloc()
irqdomain: Delete irq_domain_add_tree()
genirq: Allow NULL affinity for setup_percpu_irq()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"There are no significant series in this small merge. Please see the
individual changelogs for details"
[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]
* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memfd_luo: add CONFIG_SHMEM dependency
mm: shmem: avoid build warning for CONFIG_SHMEM=n
ocfs2: fix memory leak in ocfs2_merge_rec_left()
ocfs2: invalidate inode if i_mode is zero after block read
ocfs2: avoid -Wflex-array-member-not-at-end warning
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
ocfs2: add ocfs2_emergency_state helper and apply to setattr
checkpatch: add uninitialized pointer with __free attribute check
args: fix documentation to reflect the correct numbers
ocfs2: fix kernel BUG in ocfs2_find_victim_chain
liveupdate: luo_core: fix redundant bound check in luo_ioctl()
ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
fs/fat: remove unnecessary wrapper fat_max_cache()
ocfs2: replace deprecated strcpy with strscpy
ocfs2: check tl_used after reading it from trancate log inode
liveupdate: luo_file: don't use invalid list iterator
|
|
Chris reported that the recent affinity management changes result in
overwriting the already initialized thread flags.
Use set_bit() to set the affinity bit instead of assigning the bit value to
the flags.
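In diff form, the shape of the fix (the flag name and struct member are
assumed from context, not quoted from the patch):
-	new->thread_flags = IRQTF_AFFINITY;
+	set_bit(IRQTF_AFFINITY, &new->thread_flags);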
Fixes: 801afdfbfcd9 ("genirq: Fix interrupt threads affinity vs. cpuset isolated partitions")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/87ecp0e4cf.ffs@tglx
Closes: https://lore.kernel.org/all/20251212014848.3509622-1-clm@meta.com
|
|
move_local_task_to_local_dsq() is used when moving a task from a non-local
DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ
without going through dispatch_enqueue() and was missing the post-enqueue
handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle
task is running.
The function is used by move_task_between_dsqs() which backs
scx_bpf_dsq_move() and may be called while the CPU is busy.
Add a local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the
dispatch path doesn't need post-enqueue handling, add an SCX_RQ_IN_BALANCE
early exit to keep consume_dispatch_q() behavior unchanged and avoid
triggering an unnecessary resched when scx_bpf_dsq_move() is used from the
dispatch path.
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Factor out local_dsq_post_enq() which performs post-enqueue handling for
local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if
the current CPU is idle. No functional change.
This will be used by the next patch to fix move_local_task_to_local_dsq().
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then
scx_bypass(false) on success to exit. If scx_enable() fails during task
initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error -
it jumps to err_disable while bypass is still active. scx_disable_workfn()
then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth
at 1 instead of 0. This causes the system to remain permanently in bypass mode
after a failed scx_enable().
Failures after task initialization is complete - e.g. scx_tryset_enable_state()
at the end - already call scx_bypass(false) before reaching the error path and
are not affected. This only affects a subset of failure modes.
Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and
having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This
is a temporary measure as the bypass depth will be moved into the sched
instance, which will make this tracking unnecessary.
Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The new memfd code fails to link without SHMEM:
aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios':
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode'
Add a Kconfig dependency to disallow that configuration.
Link: https://lkml.kernel.org/r/20251204100203.1034394-1-arnd@kernel.org
Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel test robot reported a Smatch warning:
kernel/liveupdate/luo_core.c:402 luo_ioctl() warn: unsigned 'nr' is
never less than zero.
This occurs because 'nr' is unsigned and LIVEUPDATE_CMD_BASE is currently
defined as 0, making the check (nr < LIVEUPDATE_CMD_BASE) always false.
Remove the explicit lower bound check. The logic remains correct because
'nr' is unsigned; if nr is less than LIVEUPDATE_CMD_BASE, the expression
(nr - LIVEUPDATE_CMD_BASE) will wrap around to a large positive value.
This will inevitably be larger than ARRAY_SIZE(luo_ioctl_ops) and be
caught by the upper bound check.
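A sketch of the resulting check (identifiers from the changelog; the
surrounding code and the return value are assumed):

	unsigned int nr = _IOC_NR(cmd);

	/* No lower-bound check needed: if nr were ever below
	 * LIVEUPDATE_CMD_BASE, the unsigned subtraction would wrap to a
	 * huge value and fail this test anyway. */
	if (nr - LIVEUPDATE_CMD_BASE >= ARRAY_SIZE(luo_ioctl_ops))
		return -ENOTTY;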
Link: https://lkml.kernel.org/r/20251130010919.1488230-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511280300.6pvBmXUS-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|