path: root/kernel
Age  Commit message  Author  Lines
2025-12-17  perf: Add a EVENT_GUEST flag  (Kan Liang)  -51/+179
Currently, perf doesn't explicitly schedule out all exclude_guest events while a guest is running. That is not a problem with the existing emulated vPMU, because perf owns all the PMU counters: it can mask a counter assigned to an exclude_guest event while a guest is running (the Intel way), or set the corresponding HOSTONLY bit in the eventsel (the AMD way), so the counter doesn't count while the guest runs. Neither approach works with the newly introduced mediated vPMU. A guest owns all the PMU counters while it is running: the host must not mask any counters, a counter may be in use by the guest, and the eventsel may be overwritten. Perf should explicitly schedule out all exclude_guest events to release the PMU resources when entering a guest, and resume counting when exiting the guest. It's also possible that an exclude_guest event is created while a guest is running; such a new event must not be scheduled in either. The ctx time is shared among different PMUs and cannot be stopped while a guest is running, because it is needed to calculate the time for events from other PMUs, e.g. uncore events. Add timeguest to track the guest run time. For an exclude_guest event, the elapsed time equals the ctx time minus the guest time. Cgroups have dedicated times; use the same method to deduct the guest time from the cgroup time as well. [sean: massage comments] Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com
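The time accounting described above boils down to a simple deduction; a minimal, self-contained sketch follows (the struct and field names here are illustrative, not the exact symbols the patch adds):

  #include <linux/types.h>

  /* Illustrative per-context clock state: total context time plus the
   * accumulated guest run time ("timeguest"). */
  struct time_ctx_sketch {
          u64 ctx_time;     /* time accumulated while the ctx was active  */
          u64 guest_time;   /* time accumulated while a guest was running */
  };

  /* exclude_guest events see ctx time minus guest time; everything else
   * keeps seeing the full ctx time. */
  static u64 event_time_sketch(const struct time_ctx_sketch *t, bool exclude_guest)
  {
          return exclude_guest ? t->ctx_time - t->guest_time : t->ctx_time;
  }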
2025-12-17  perf: Clean up perf ctx time  (Kan Liang)  -38/+32
Perf currently tracks two timestamps, one for the normal ctx and one for cgroups, using the same type of variables and very similar code. A following patch will introduce a third timestamp to track guest time. To avoid duplicating the code, add a new struct perf_time_ctx and factor out a generic function update_perf_time_ctx(). No functional change. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com
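A rough sketch of the factored-out helper, assuming a hypothetical perf_time_ctx layout (the real structure and its callers live in kernel/events/core.c):

  #include <linux/types.h>

  /* Hypothetical shape of the shared timekeeping state. */
  struct perf_time_ctx_sketch {
          u64 time;     /* accumulated time                 */
          u64 stamp;    /* timestamp of the last update     */
          u64 offset;   /* time - stamp, cached for readers */
  };

  /* One generic helper replaces the near-identical ctx and cgroup code;
   * a third user (guest time) can later reuse it unchanged. */
  static void update_perf_time_ctx_sketch(struct perf_time_ctx_sketch *t,
                                          u64 now, bool adv)
  {
          if (adv)
                  t->time += now - t->stamp;
          t->stamp  = now;
          t->offset = t->time - t->stamp;
  }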
2025-12-17  perf: Add APIs to create/release mediated guest vPMUs  (Kan Liang)  -0/+82
Currently, exposing PMU capabilities to a KVM guest is done by emulating guest PMCs via host perf events, i.e. by having KVM be "just" another user of perf. As a result, the guest and host are effectively competing for resources, and emulating guest accesses to vPMU resources requires expensive actions (expensive relative to the native instruction). The overhead and resource competition results in degraded guest performance and ultimately very poor vPMU accuracy. To address the issues with the perf-emulated vPMU, introduce a "mediated vPMU", where the data plane (PMCs and enable/disable knobs) is exposed directly to the guest, but the control plane (event selectors and access to fixed counters) is managed by KVM (via MSR interceptions). To allow host perf usage of the PMU to (partially) co-exist with KVM/guest usage of the PMU, KVM and perf will coordinate a world switch between host perf context and guest vPMU context near VM-Enter/VM-Exit. Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM to create and release a mediated PMU instance (per VM). Because host perf context will be deactivated while the guest is running, mediated PMU usage will be mutually exclusive with perf analysis of the guest, i.e. perf events that do NOT exclude the guest will not behave as expected. To avoid silent failure of !exclude_guest perf events, disallow creating a mediated PMU if there are active !exclude_guest events, and on the perf side, disallow creating new !exclude_guest perf events while there is at least one active mediated PMU. Exempt PMU resources that do not support mediated PMU usage, i.e. that are outside the scope/view of KVM's vPMU and will not be swapped out while the guest is running. Guard mediated PMU with a new kconfig to help readers identify code paths that are unique to mediated PMU support, and to allow for adding arch-specific hooks without stubs. KVM x86 is expected to be the only KVM architecture to support a mediated PMU in the near future (e.g. arm64 is trending toward a partitioned PMU implementation), and KVM x86 will select PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs. Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that all paths are compile tested. Full KVM support is on its way... [sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com
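The mutual exclusion could be implemented along the lines of the sketch below; the counter names are made up for illustration and locking/races are ignored for brevity:

  #include <linux/atomic.h>
  #include <linux/errno.h>
  #include <linux/types.h>

  static atomic_t nr_mediated_pmu_vms;      /* hypothetical counter */
  static atomic_t nr_include_guest_events;  /* hypothetical counter */

  /* KVM side: refuse to create a mediated PMU while any !exclude_guest
   * event is active. */
  static int mediated_pmu_create_sketch(void)
  {
          if (atomic_read(&nr_include_guest_events))
                  return -EBUSY;
          atomic_inc(&nr_mediated_pmu_vms);
          return 0;
  }

  /* perf side: refuse new !exclude_guest events while any mediated PMU
   * instance exists. */
  static int mediated_pmu_event_check_sketch(bool exclude_guest)
  {
          if (!exclude_guest && atomic_read(&nr_mediated_pmu_vms))
                  return -EOPNOTSUPP;
          return 0;
  }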
2025-12-17  perf: Move security_perf_event_free() call to __free_event()  (Sean Christopherson)  -2/+2
Move the freeing of any security state associated with a perf event from _free_event() to __free_event(), i.e. invoke security_perf_event_free() in the error paths for perf_event_alloc(). This will allow adding potential error paths in perf_event_alloc() that can occur after allocating security state. Note, kfree() and thus security_perf_event_free() is a nop if event->security is NULL, i.e. calling security_perf_event_free() even if security_perf_event_alloc() fails or is never reached is functionally ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com
2025-12-17  perf: Add generic exclude_guest support  (Kan Liang)  -7/+8
Only KVM knows the exact time when a guest is entering/exiting. Expose two interfaces to KVM to switch the ownership of the PMU resources. All the pinned events must be scheduled in first. Extend the perf_event_sched_in() helper to support an extra flag, e.g. EVENT_GUEST. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com
2025-12-17  perf: Skip pmu_ctx based on event_type  (Kan Liang)  -34/+40
To optimize the cgroup context switch, the perf_event_pmu_context iteration skips PMUs without cgroup events. A bool cgroup was introduced to indicate that case. It works, but this approach is hard to extend to other cases, e.g. skipping non-mediated PMUs, and it doesn't make sense to keep adding bool variables. Pass the event_type instead of a specific bool variable, and check both the event_type and the related pmu_ctx variables to decide whether to skip a PMU. Event flags, e.g. EVENT_CGROUP, should be cleared in ctx->is_active. Add EVENT_FLAGS to indicate such event flags. No functional change. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com
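A sketch of what such an event_type-driven skip can look like, with illustrative flag and field names; the point is that a new case only adds a flag bit and a per-pmu_ctx counter rather than yet another bool:

  #include <linux/types.h>

  /* Hypothetical event_type bits and per-PMU-context counters. */
  #define EVENT_CGROUP_SKETCH   0x10
  #define EVENT_GUEST_SKETCH    0x20

  struct pmu_ctx_sketch {
          unsigned int nr_cgroup_events;
          unsigned int nr_mediated_events;
  };

  /* Skip this PMU when the requested event_type only concerns events the
   * PMU context doesn't have. */
  static bool skip_pmu_ctx_sketch(const struct pmu_ctx_sketch *pc,
                                  unsigned int event_type)
  {
          if ((event_type & EVENT_CGROUP_SKETCH) && !pc->nr_cgroup_events)
                  return true;
          if ((event_type & EVENT_GUEST_SKETCH) && !pc->nr_mediated_events)
                  return true;
          return false;
  }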
2025-12-17  sched: Fix faulty assertion in sched_change_end()  (Peter Zijlstra)  -16/+17
Commit 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") added an assert to sched_change_end() verifying that a class demotion would result in a reschedule. As it turns out, rt_mutex_setprio() does not force a resched on class demotion. Furthermore, this is only relevant to running tasks. Change the warning into a reschedule and make sure to only do so for running tasks. Fixes: 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
2025-12-17  sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()  (Peter Zijlstra)  -60/+54
Change sched_class::wakeup_preempt() to also get called for cross-class wakeups, specifically those where the woken task is of a higher class than the previous highest class. In order to do this, track the current highest class of the runqueue in rq::next_class and have wakeup_preempt() track this upwards for each new wakeup. Additionally have schedule() re-set the value on pick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
2025-12-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.19-rc1  (Alexei Starovoitov)  -2655/+6890
Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-17  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)  -8/+66
Pull bpf fixes from Alexei Starovoitov: - Fix BPF builds due to -fms-extensions. selftests (Alexei Starovoitov), bpftool (Quentin Monnet). - Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n (Geert Uytterhoeven) - Fix livepatch/BPF interaction and support reliable unwinding through BPF stack frames (Josh Poimboeuf) - Do not audit capability check in arm64 JIT (Ondrej Mosnacek) - Fix truncated dmabuf BPF iterator reads (T.J. Mercier) - Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu) - Fix warnings in libbpf when built with -Wdiscarded-qualifiers under C23 (Mikhail Gavrilov) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: add regression test for bpf_d_path() bpf: Fix verifier assumptions of bpf_d_path's output buffer selftests/bpf: Add test for truncated dmabuf_iter reads bpf: Fix truncated dmabuf iterator reads x86/unwind/orc: Support reliable unwinding through BPF stack frames bpf: Add bpf_has_frame_pointer() bpf, arm64: Do not audit capability check in do_jit() libbpf: Fix -Wdiscarded-qualifiers under C23 bpftool: Fix build warnings due to MS extensions net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT selftests/bpf: Add -fms-extensions to bpf build flags
2025-12-16  sched_ext: fix uninitialized ret on alloc_percpu() failure  (Liang Jie)  -1/+3
Smatch reported: kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR' In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to err_free_gdsqs without initializing @ret. That can lead to returning ERR_PTR(0), which violates the ERR_PTR() convention and confuses callers. Set @ret to -ENOMEM before jumping to the error path when alloc_percpu() fails. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/ Reported-by: Dan Carpenter <error27@gmail.com> Fixes: c201ea1578d3 ("sched_ext: Move event_stats_cpu into scx_sched") Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
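A condensed sketch of the bug pattern and the fix (structure and labels abbreviated): without the assignment, the error path returns ERR_PTR(0), i.e. NULL, which an IS_ERR() check in the caller treats as success.

  #include <linux/err.h>
  #include <linux/percpu.h>
  #include <linux/slab.h>

  struct sched_sketch { u64 __percpu *pcpu_stats; };

  static struct sched_sketch *alloc_sched_sketch(void)
  {
          struct sched_sketch *sch;
          int ret;

          sch = kzalloc(sizeof(*sch), GFP_KERNEL);
          if (!sch)
                  return ERR_PTR(-ENOMEM);

          sch->pcpu_stats = alloc_percpu(u64);
          if (!sch->pcpu_stats) {
                  ret = -ENOMEM;          /* the assignment this fix adds */
                  goto err_free;
          }
          return sch;

  err_free:
          kfree(sch);
          return ERR_PTR(ret);            /* without the fix: ERR_PTR(0) == NULL */
  }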
2025-12-16  bpf/verifier: Do not limit maximum direct offset into arena map  (Emil Tsalapatis)  -5/+0
The verifier currently limits direct offsets into a map to 512MiB to avoid overflow during pointer arithmetic. However, this prevents arena maps from using direct addressing instructions to access data at the end of > 512MiB arena maps. This is necessary when moving arena globals to the end of the arena instead of the front. Refactor the verifier code to remove the offset calculation during direct value access calculations. This is possible because the only two map types that implement .map_direct_value_addr() are arrays and arenas, and they both do their own internal checks to ensure the offset is within bounds. Adjust selftests that expect the old error. These tests still fail because the verifier identifies the access as out of bounds for the map, so change them to expect an "invalid access to map value pointer" error instead. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com
2025-12-16  audit: include source and destination ports to NETFILTER_PKT  (Ricardo Robaina)  -4/+99
NETFILTER_PKT records show both source and destination addresses, in addition to the associated networking protocol. However, it lacks the ports information, which is often valuable for troubleshooting. This patch adds both source and destination port numbers, 'sport' and 'dport' respectively, to TCP, UDP, UDP-Lite and SCTP-related NETFILTER_PKT records. $ TESTS="netfilter_pkt" make -e test &> /dev/null $ ausearch -i -ts recent |grep NETFILTER_PKT type=NETFILTER_PKT ... proto=icmp type=NETFILTER_PKT ... proto=ipv6-icmp type=NETFILTER_PKT ... proto=udp sport=46333 dport=42424 type=NETFILTER_PKT ... proto=udp sport=35953 dport=42424 type=NETFILTER_PKT ... proto=tcp sport=50314 dport=42424 type=NETFILTER_PKT ... proto=tcp sport=57346 dport=42424 Link: https://github.com/linux-audit/audit-kernel/issues/162 Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-12-16  audit: add audit_log_nf_skb helper function  (Ricardo Robaina)  -0/+64
The netfilter code in net/netfilter/nft_log.c and net/netfilter/xt_AUDIT.c has to be kept in sync: both source files carried duplicated versions of the audit_ip4() and audit_ip6() functions, which can result in inconsistencies and/or duplicated work. This patch adds a helper function in audit.c that both netfilter call sites can use, aiming to improve maintainability and consistency. Suggested-by: Florian Westphal <fw@strlen.de> Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-12-16  Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds)  -24/+48
Pull sched_ext fixes from Tejun Heo: - Fix memory leak when destroying helper kthread workers during scheduler disable - Fix bypass depth accounting on scx_enable() failure which could leave the system permanently in bypass mode - Fix missing preemption handling when moving tasks to local DSQs via scx_bpf_dsq_move() - Misc fixes including NULL check for put_prev_task(), flushing stdout in selftests, and removing unused code * tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Remove unused code in the do_pick_task_scx() selftests/sched_ext: flush stdout before test to avoid log spam sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq() sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue() sched_ext: Fix bypass depth leak on scx_enable() failure sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next sched_ext: Fix the memleak for sch->helper objects
2025-12-16  Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)  -5/+8
Pull cgroup fix from Tejun Heo: - Fix a race condition in css_rstat_updated() where CMPXCHG without LOCK prefix could cause lnode corruption when the flusher runs concurrently on another CPU. The issue was introduced in 6.17 and causes memcg stats to become corrupted in production. * tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
2025-12-15  genirq: Add interrupt redirection infrastructure  (Radu Rendec)  -5/+118
Add infrastructure to redirect interrupt handler execution to a different CPU when the current CPU is not part of the interrupt's CPU affinity mask. This is primarily aimed at (de)multiplexed interrupts, where the child interrupt handler runs in the context of the parent interrupt handler, and therefore CPU affinity control for the child interrupt is typically not available. With the new infrastructure, the child interrupt is allowed to freely change its affinity setting, independently of the parent. If the interrupt handler happens to be triggered on an "incompatible" CPU (a CPU that's not part of the child interrupt's affinity mask), the handler is redirected and runs in IRQ work context on a "compatible" CPU. No functional change is being made to any existing irqchip driver, and irqchip drivers must be explicitly modified to use the newly added infrastructure to support interrupt redirection. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Radu Rendec <rrendec@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/linux-pci/878qpg4o4t.ffs@tglx/ Link: https://patch.msgid.link/20251128212055.1409093-2-rrendec@redhat.com
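Conceptually the redirection amounts to something like the following sketch (illustrative only; the real infrastructure in kernel/irq/ has its own data structures and entry points): if the handler fires on a CPU outside the child interrupt's affinity mask, bounce it to a compatible CPU via irq_work.

  #include <linux/cpumask.h>
  #include <linux/irq_work.h>
  #include <linux/smp.h>

  /* Hypothetical per-interrupt redirect state. */
  struct irq_redirect_sketch {
          struct irq_work work;            /* runs the handler remotely  */
          const struct cpumask *affinity;  /* child interrupt's affinity */
  };

  /* Returns true when the handler was bounced to a compatible CPU. */
  static bool maybe_redirect_sketch(struct irq_redirect_sketch *r)
  {
          unsigned int target;

          /* The current CPU is in the affinity mask: run inline as usual. */
          if (cpumask_test_cpu(smp_processor_id(), r->affinity))
                  return false;

          /* Otherwise run the handler in IRQ work context elsewhere. */
          target = cpumask_any_and(r->affinity, cpu_online_mask);
          if (target < nr_cpu_ids)
                  irq_work_queue_on(&r->work, target);
          return true;
  }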
2025-12-15  genirq: Remove setup_percpu_irq()  (Marc Zyngier)  -30/+0
setup_percpu_irq() was always a bad kludge, and should never have been there in the first place. Now that the last users are gone, remove it for good. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251210082242.360936-7-maz@kernel.org
2025-12-15  genirq: Remove __request_percpu_irq() helper  (Marc Zyngier)  -10/+5
With the IRQ timing infrastructure gone, there is no need to specify a flag when requesting a percpu interrupt. Not only was IRQF_TIMER the only flag (set of flags, actually) allowed, nobody ever passed it. Get rid of __request_percpu_irq(), which only ever got 0 as flags, and promote request_percpu_irq_affinity() as its replacement. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20251210082242.360936-3-maz@kernel.org
2025-12-15  genirq: Remove IRQ timing tracking infrastructure  (Marc Zyngier)  -1081/+0
The IRQ timing tracking infrastructure was merged in 2019, but was never plumbed in, is not selectable, and is therefore never used. As Daniel agrees that there is little hope for this infrastructure to be completed in the near term, drop it altogether. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/87zf7vex6h.wl-maz@kernel.org Link: https://patch.msgid.link/20251210082242.360936-2-maz@kernel.org
2025-12-15  time/timecounter: Inline timecounter_cyc2time()  (Eric Dumazet)  -35/+0
New network transport protocols want NIC drivers to get hardware timestamps of all incoming packets, and possibly all outgoing packets. One example is the upcoming 'Swift congestion control' which is used by TCP transport and is the primary need for timecounter_cyc2time(). This means timecounter_cyc2time() can be called more than 100 million times per second on a busy server. Inlining timecounter_cyc2time() brings a 12% improvement on a UDP receive stress test on a 100Gbit NIC. Note that FDO, LTO, PGO are unable to magically help for this case, presumably because NIC drivers are almost exclusively shipped as modules. Add an unlikely() around the cc_cyc2ns_backwards() case, even if FDO (when used) is able to take care of this optimization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/ Link: https://patch.msgid.link/20251129095740.3338476-1-edumazet@google.com
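The shape of the conversion being inlined, in a simplified standalone form rather than the exact kernel implementation: a cycle delta against the last snapshot is scaled to nanoseconds, with the rare "timestamp predates the snapshot" case wrapped in unlikely().

  #include <linux/compiler.h>
  #include <linux/types.h>

  struct timecounter_sketch {
          u64 cycle_last;   /* counter value at the last snapshot */
          u64 nsec;         /* nanoseconds at the last snapshot   */
          u64 mask;         /* counter width mask                 */
          u32 mult, shift;  /* cycles -> ns scaling               */
  };

  static inline u64 cyc2time_sketch(const struct timecounter_sketch *tc, u64 cycles)
  {
          u64 delta = (cycles - tc->cycle_last) & tc->mask;

          /* Timestamps taken just before the snapshot are the rare case. */
          if (unlikely(delta > tc->mask / 2))
                  return tc->nsec - (((tc->mask - delta + 1) * tc->mult) >> tc->shift);

          return tc->nsec + ((delta * tc->mult) >> tc->shift);
  }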
2025-12-15  sched_ext: Remove unused code in the do_pick_task_scx()  (Zqiang)  -6/+2
The kick_idle variable is no longer used; remove it, along with the associated code in do_pick_task_scx(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-15  printk/nbcon: Restore IRQ in atomic flush after each emitted record  (Petr Mladek)  -20/+18
The commit d5d399efff6577 ("printk/nbcon: Release nbcon consoles ownership in atomic flush after each emitted record") prevented stall of a CPU which lost nbcon console ownership because another CPU entered an emergency flush. But there is still the problem that the CPU doing the emergency flush might cause a stall on its own. Let's go even further and restore IRQ in the atomic flush after each emitted record. It is not a complete solution. The interrupts and/or scheduling might still be blocked when the emergency atomic flush was called with IRQs and/or scheduling disabled. But it should remove the following lockup: mlx5_core 0000:03:00.0: Shutdown was called kvm: exiting hardware virtualization arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102] smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057) smp: csd: CSD lock (#1) unresponsive. [...] Call trace: pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P) nbcon_emit_next_record (kernel/printk/nbcon.c:1049) __nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517) __nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612) nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629) printk_kthreads_shutdown (kernel/printk/printk.c:?) syscore_shutdown (drivers/base/syscore.c:120) kernel_kexec (kernel/kexec_core.c:1045) __arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722) invoke_syscall (arch/arm64/kernel/syscall.c:50) el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?) do_el0_svc (arch/arm64/kernel/syscall.c:152) el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820) el0t_64_sync (arch/arm64/kernel/entry.S:600) In this case, nbcon_atomic_flush_pending() is called from printk_kthreads_shutdown() with IRQs and scheduling enabled. Note that __nbcon_atomic_flush_pending_con() is directly called also from nbcon_device_release() where the disabled IRQs might break PREEMPT_RT guarantees. But the atomic flush is called only in emergency or panic situations where the latencies are irrelevant anyway. An ultimate solution would be a touching of watchdogs. But it would hide all problems. Let's do it later when anyone reports a stall which does not have a better solution. Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-12-15  printk: nbcon: Check for device_{lock,unlock} callbacks  (Marcos Paulo de Souza)  -2/+5
These callbacks are necessary to synchronize ->write_thread callback against other operations using the same device. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251208-nbcon-device-cb-fix-v2-1-36be8d195123@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-12-15  pid: only take pidmap_lock once on alloc  (Mateusz Guzik)  -46/+85
When spawning and killing threads in separate processes in parallel the primary bottleneck on the stock kernel is pidmap_lock, largely because of a back-to-back acquire in the common case. This aspect is fixed with the patch. Performance improvement varies between reboots. When benchmarking with 20 processes creating and killing threads in a loop, the unpatched baseline hovers around 465k ops/s, while patched is anything between ~510k ops/s and ~560k depending on false-sharing (which I only minimally sanitized). So this is at least 10% if you are unlucky. The change also facilitated some cosmetic fixes. It has an unintentional side effect of no longer issuing spurious idr_preload() around idr_replace(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251203092851.287617-3-mjguzik@gmail.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-15  PM: sleep: Do not flag runtime PM workqueue as freezable  (Rafael J. Wysocki)  -1/+1
Till now, the runtime PM workqueue has been flagged as freezable, so it does not process work items during system-wide PM transitions like system suspend and resume. The original reason to do that was to reduce the likelihood of runtime PM getting in the way of system-wide PM processing, but now it is mostly an optimization because (1) runtime suspend of devices is prevented by bumping up their runtime PM usage counters in device_prepare() and (2) device drivers are expected to disable runtime PM for the devices handled by them before they embark on system-wide PM activities that may change the state of the hardware or otherwise interfere with runtime PM. However, it prevents asynchronous runtime resume of devices from working during system-wide PM transitions, which is confusing because synchronous runtime resume is not prevented at the same time, and it also sometimes turns out to be problematic. For example, it has been reported that blk_queue_enter() may deadlock during a system suspend transition because of the pm_request_resume() usage in it [1]. It may also deadlock during a system resume transition in a similar way. That happens because the asynchronous runtime resume of the given device is not processed due to the freezing of the runtime PM workqueue. While it may be better to address this particular issue in the block layer, the very presence of it means that similar problems may be expected to occur elsewhere. For this reason, remove the WQ_FREEZABLE flag from the runtime PM workqueue and make device_suspend_late() use the generic variant of pm_runtime_disable() that will carry out runtime PM of the device synchronously if there is pending resume work for it. Also update the comment before the pm_runtime_disable() call in device_suspend_late(), to document the fact that the runtime PM should not be expected to work for the device until the end of device_resume_early(), and update the related documentation. This change may, even though it is not expected to, uncover some latent issues related to queuing up asynchronous runtime resume work items during system suspend or hibernation. However, they should be limited to the interference between runtime resume and system-wide PM callbacks in the cases when device drivers start to handle system-wide PM before disabling runtime PM as described above. Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki
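In workqueue terms the change itself is tiny; a sketch of the idea (the exact pm_wq setup lives in the PM core, so treat the names below as illustrative): dropping WQ_FREEZABLE means queued runtime-resume work keeps being processed across system-wide PM transitions.

  #include <linux/workqueue.h>

  static struct workqueue_struct *pm_wq_sketch;

  void pm_wq_init_sketch(void)
  {
          /* Before: alloc_workqueue("pm", WQ_FREEZABLE, 0) would stall
           * queued runtime-resume work during suspend/resume. */

          /* After: the workqueue keeps running, so asynchronous runtime
           * resume requests are still serviced during the transition. */
          pm_wq_sketch = alloc_workqueue("pm", 0, 0);
  }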
2025-12-15  sched/fair: Sort out 'blocked_load*' namespace noise  (Ingo Molnar)  -20/+20
There's three layers of logic in the scheduler that deal with 'has_blocked' (load) handling of the NOHZ code: (1) nohz.has_blocked, (2) rq->has_blocked_load, deal with NOHZ idle balancing, (3) and cfs_rq_has_blocked(), which is part of the layer that is passing the SMP load-balancing signal to the NOHZ layers. The 'has_blocked' and 'has_blocked_load' names are used in a mixed fashion, sometimes within the same function. Standardize on 'has_blocked_load' to make it all easy to read and easy to grep. No change in functionality. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
2025-12-15  sched/fair: Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed arithmetics  (Ingo Molnar)  -15/+51
We have to be careful with vruntime comparisons and subtraction, due to the possibility of wrapping, so we have macros like: #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; }) Which is used like this: if (vruntime_gt(min_vruntime, se, rse)) se->min_vruntime = rse->min_vruntime; Replace this with an easier to read pattern that uses the regular arithmetic operators: if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime)) se->min_vruntime = rse->min_vruntime; Also replace vruntime subtractions with vruntime_op(): - delta = (s64)(sea->vruntime - seb->vruntime) + - (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi); + delta = vruntime_op(sea->vruntime, "-", seb->vruntime) + + vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi); In the vruntime_cmp() and vruntime_op() macros, use __builtin_strcmp(), because __HAVE_ARCH_STRCMP might turn off the compiler optimizations we rely on here to catch usage bugs. No change in functionality. Signed-off-by: Ingo Molnar <mingo@kernel.org>
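One way such wrappers can be built on __builtin_strcmp() is sketched below (illustrative; the in-tree macros may differ in detail): the operator string is folded away at compile time and the comparison is performed on the wrap-safe signed difference.

  #include <linux/types.h>

  /* Wrap-safe comparison of two u64 vruntimes: take the signed delta,
   * then apply the operator named by the compile-time string. */
  #define vruntime_cmp_sketch(a, op, b) ({                            \
          s64 __delta = (s64)((a) - (b));                             \
          !__builtin_strcmp(op, "<")  ? __delta <  0 :                \
          !__builtin_strcmp(op, "<=") ? __delta <= 0 :                \
          !__builtin_strcmp(op, ">")  ? __delta >  0 :                \
          !__builtin_strcmp(op, ">=") ? __delta >= 0 :                \
                                        __delta == 0;                 \
  })

  /* The subtraction wrapper simply exposes the wrap-safe signed delta;
   * the operator string is there for readability at the call site. */
  #define vruntime_op_sketch(a, op, b)  ((s64)((a) - (b)))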
2025-12-15  sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions  (Ingo Molnar)  -14/+14
The ::avg_vruntime field is a misnomer: it says it's an 'average vruntime', but in reality it's the momentary sum of the weighted vruntimes of all queued tasks, which is at least a division away from being an average. This is clear from comments about the math of fair scheduling: * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime This confusion is increased by the cfs_avg_vruntime() function, which does perform the division and returns a true average. The sum of all weighted vruntimes should be named thusly, so rename the field to ::sum_w_vruntime. (As arguably ::sum_weighted_vruntime would be a bit of a mouthful.) Understanding the scheduler is hard enough already, without extra layers of obfuscated naming. ;-) Also rename related helper functions: sum_vruntime_add() => sum_w_vruntime_add() sum_vruntime_sub() => sum_w_vruntime_sub() sum_vruntime_update() => sum_w_vruntime_update() With the notable exception of cfs_avg_vruntime(), which was named accurately. Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
2025-12-15  sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight  (Ingo Molnar)  -9/+9
The ::avg_load field is a long-standing misnomer: it says it's an 'average load', but in reality it's the momentary sum of the load of all currently runnable tasks. We'd have to also perform a division by nr_running (or use time-decay) to arrive at any sort of average value. This is clear from comments about the math of fair scheduling: * \Sum w_i := cfs_rq->avg_load The sum of all weights is ... the sum of all weights, not the average of all weights. To make it doubly confusing, there's also an ::avg_load in the load-balancing struct sg_lb_stats, which *is* a true average. The second part of the field's name is a minor misnomer as well: it says 'load', and it is indeed a load_weight structure as it shares code with the load-balancer - but it's only in an SMP load-balancing context where load = weight, in the fair scheduling context the primary purpose is the weighting of different nice levels. So rename the field to ::sum_weight instead, which makes the terminology of the EEVDF math match up with our implementation of it: * \Sum w_i := cfs_rq->sum_weight Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
2025-12-15  sched/fair: Clean up comments in 'struct cfs_rq'  (Ingo Molnar)  -6/+6
- Fix vertical alignment - Fix typos - Fix capitalization Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-3-mingo@kernel.org
2025-12-15  sched/fair: Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks  (Ingo Molnar)  -6/+4
Join two identical #ifdef blocks: #ifdef CONFIG_FAIR_GROUP_SCHED ... #endif #ifdef CONFIG_FAIR_GROUP_SCHED ... #endif Also mark nested #ifdef blocks in the usual fashion, to make it more apparent where in a nested hierarchy of #ifdefs we are at a glance. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
2025-12-14  sched/core: Add assertions to QUEUE_CLASS  (Peter Zijlstra)  -0/+14
Add some checks to the sched_change pattern to validate assumptions around changing classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
2025-12-14  sched/fair: Limit hrtick work  (Peter Zijlstra)  -0/+6
The task_tick_fair() function does: - update the hierarchical runtimes - drive NUMA-balancing - update load-balance statistics - drive force-idle preemption All but the very first can be limited to the periodic tick. Let hrtick only update accounting and drive preemption, not load-balancing and other bits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
2025-12-14  sched/fair: Remove superfluous rcu_read_lock()  (Peter Zijlstra)  -8/+1
With fair switched to rcu_dereference_all() validation, having IRQ or preemption disabled is sufficient, remove the rcu_read_lock() clutter. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
2025-12-14  sched/fair: Switch to rcu_dereference_all()  (Peter Zijlstra)  -25/+25
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really make sense to check for just the rcu one. Switch to the _all family of verification which includes all 3 of the listed flavours. Notably, this will enable us to remove some superfluous rcu_read_lock() regions when we know they are inside preempt/IRQ disabled regions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14  sched/headers: Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain()  (Peter Zijlstra)  -3/+3
Remove 'check' from the name for being surplus to requirements. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14  sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()  (Peter Zijlstra)  -9/+18
While poking at this code recently I noted we do a pointless unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so we can balance) but then instantly grab the same rq->lock again in sched_balance_update_blocked_averages(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
2025-12-14  sched/fair: Fold the sched_avg update  (Peter Zijlstra)  -76/+32
Nine (and a half) instances of the same pattern is just silly, fold the lot. Notably, the half instance in enqueue_load_avg() is right after setting cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg). Since get_pelt_divider() >= PELT_MIN_DIVIDER, this ends up being a no-op change. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
2025-12-13  bpf: Fix bpf_seq_read docs for increased buffer size  (T.J. Mercier)  -1/+1
Commit af65320948b8 ("bpf: Bump iter seq size to support BTF representation of large data structures") increased the fixed buffer size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function didn't get updated at the same time. Update them. Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251207091005.2829703-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-14  Merge tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  -9/+16
Pull CPU hotplug fix from Ingo Molnar: - Fix CPU hotplug callbacks to disable interrupts on UP kernels * tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
2025-12-14  Merge tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  -14/+16
Pull perf event fixes from Ingo Molnar: - Fix NULL pointer dereference crash in the Intel PMU driver - Fix missing read event generation on task exit - Fix AMD uncore driver init error handling - Fix whitespace noise * tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix NULL event dereference crash in handle_pmi_common() perf/core: Fix missing read event generation on task exit perf/x86/amd/uncore: Fix the return value of amd_uncore_df_event_init() on error perf/uprobes: Remove <space><Tab> whitespace noise
2025-12-14  Merge tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  -0/+3
Pull irq fixes from Ingo Molnar: - Fix error code in the irqchip/mchp-eic driver - Fix setup_percpu_irq() affinity assumptions - Remove the unused irq_domain_add_tree() function * tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/mchp-eic: Fix error code in mchp_eic_domain_alloc() irqdomain: Delete irq_domain_add_tree() genirq: Allow NULL affinity for setup_percpu_irq()
2025-12-13  Merge tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)  -5/+7
Pull misc updates from Andrew Morton: "There are no significant series in this small merge. Please see the individual changelogs for details" [ Editor's note: it's mainly ocfs2 and a couple of random fixes ] * tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: memfd_luo: add CONFIG_SHMEM dependency mm: shmem: avoid build warning for CONFIG_SHMEM=n ocfs2: fix memory leak in ocfs2_merge_rec_left() ocfs2: invalidate inode if i_mode is zero after block read ocfs2: avoid -Wflex-array-member-not-at-end warning ocfs2: convert remaining read-only checks to ocfs2_emergency_state ocfs2: add ocfs2_emergency_state helper and apply to setattr checkpatch: add uninitialized pointer with __free attribute check args: fix documentation to reflect the correct numbers ocfs2: fix kernel BUG in ocfs2_find_victim_chain liveupdate: luo_core: fix redundant bound check in luo_ioctl() ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list fs/fat: remove unnecessary wrapper fat_max_cache() ocfs2: replace deprecated strcpy with strscpy ocfs2: check tl_used after reading it from trancate log inode liveupdate: luo_file: don't use invalid list iterator
2025-12-13  genirq: Don't overwrite interrupt thread flags on setup  (Thomas Gleixner)  -1/+1
Chris reported that the recent affinity management changes result in overwriting the already initialized thread flags. Use set_bit() to set the affinity bit instead of assigning the bit value to the flags. Fixes: 801afdfbfcd9 ("genirq: Fix interrupt threads affinity vs. cpuset isolated partitions") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/87ecp0e4cf.ffs@tglx Closes: https://lore.kernel.org/all/20251212014848.3509622-1-clm@meta.com
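In sketch form, with 'new' standing in for the struct irqaction being set up (treat the exact call site as illustrative): a plain assignment clobbers thread flags that were already initialized, while set_bit() only adds the new bit.

  /* Before: the plain assignment wipes out thread flags that were
   * already initialized earlier in setup. */
  new->thread_flags = IRQTF_AFFINITY;

  /* After: only the affinity bit is set, the other flags survive. */
  set_bit(IRQTF_AFFINITY, &new->thread_flags);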
2025-12-12  sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()  (Tejun Heo)  -0/+10
move_local_task_to_local_dsq() is used when moving a task from a non-local DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ without going through dispatch_enqueue() and was missing the post-enqueue handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle task is running. The function is used by move_task_between_dsqs() which backs scx_bpf_dsq_move() and may be called while the CPU is busy. Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE early exit to keep consume_dispatch_q() behavior unchanged and avoid triggering unnecessary resched when scx_bpf_dsq_move() is used from the dispatch path. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12  sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()  (Tejun Heo)  -15/+19
Factor out local_dsq_post_enq() which performs post-enqueue handling for local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if the current CPU is idle. No functional change. This will be used by the next patch to fix move_local_task_to_local_dsq(). Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-11  sched_ext: Fix bypass depth leak on scx_enable() failure  (Tejun Heo)  -0/+14
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then scx_bypass(false) on success to exit. If scx_enable() fails during task initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error - it jumps to err_disable while bypass is still active. scx_disable_workfn() then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth at 1 instead of 0. This causes the system to remain permanently in bypass mode after a failed scx_enable(). Failures after task initialization is complete - e.g. scx_tryset_enable_state() at the end - already call scx_bypass(false) before reaching the error path and are not affected. This only affects a subset of failure modes. Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This is a temporary measure as the bypass depth will be moved into the sched instance, which will make this tracking unnecessary. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-10  mm: memfd_luo: add CONFIG_SHMEM dependency  (Arnd Bergmann)  -0/+1
The new memfd code fails to link without SHMEM: aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios': memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache' memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks' memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode' Add a Kconfig dependency to disallow that configuration. Link: https://lkml.kernel.org/r/20251204100203.1034394-1-arnd@kernel.org Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-10  liveupdate: luo_core: fix redundant bound check in luo_ioctl()  (Pasha Tatashin)  -3/+1
The kernel test robot reported a Smatch warning: kernel/liveupdate/luo_core.c:402 luo_ioctl() warn: unsigned 'nr' is never less than zero. This occurs because 'nr' is unsigned and LIVEUPDATE_CMD_BASE is currently defined as 0, making the check (nr < LIVEUPDATE_CMD_BASE) always false. Remove the explicit lower bound check. The logic remains correct because 'nr' is unsigned; if nr is less than LIVEUPDATE_CMD_BASE, the expression (nr - LIVEUPDATE_CMD_BASE) will wrap around to a large positive value. This will inevitably be larger than ARRAY_SIZE(luo_ioctl_ops) and be caught by the upper bound check. Link: https://lkml.kernel.org/r/20251130010919.1488230-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511280300.6pvBmXUS-lkp@intel.com/ Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
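The remaining check relies on unsigned wrap-around, as in this simplified sketch (names from the changelog; the surrounding ioctl plumbing and the exact errno are assumptions):

  /* 'nr' is unsigned: when nr < LIVEUPDATE_CMD_BASE the subtraction wraps
   * to a huge value, so the single upper-bound test below rejects it. */
  unsigned int idx = nr - LIVEUPDATE_CMD_BASE;

  if (idx >= ARRAY_SIZE(luo_ioctl_ops))
          return -ENOTTY;  /* assumed errno; the real handler may differ */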