linux - Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Age	Commit message (Collapse)	Author	Lines
2026-01-16	Merge tag 'pm-6.19-rc6' of ↵	Linus Torvalds	-107/+192
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix an error path memory leak in the energy model management code, fix a kerneldoc comment in it, and fix and revamp the energy model YNL specification added recently along with the new energy model management netlink interface (that received feedback after being added): - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout) - Fix stale description of the cost field in struct em_perf_state to reflect the current code (Yaxiong Tian) - Fix and revamp the energy model YNL specification added recently along with the energy model netlink interface (Changwoo Min)" * tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: EM: Add dump to get-perf-domains in the EM YNL spec PM: EM: Change cpus' type from string to u64 array in the EM YNL spec PM: EM: Rename em.yaml to dev-energymodel.yaml PM: EM: Fix yamllint warnings in the EM YNL spec PM: EM: Fix memory leak in em_create_pd() error path PM: EM: Fix incorrect description of the cost field in struct em_perf_state
2026-01-16	genirq/chip: Change irq_chip_pm_put() return type to void	Rafael J. Wysocki	-11/+11
	The irq_chip_pm_put() return value is only used in __irq_do_set_handler() to trigger a WARN_ON() if it is negative, but doing so is not useful because irq_chip_pm_put() simply passes the pm_runtime_put() return value to its callers. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put() return value, change its return type to void, and drop the WARN_ON() around the irq_chip_pm_put() invocation from __irq_do_set_handler(). Also update the irq_chip_pm_put() kerneldoc comment to be more accurate. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki
2026-01-16	bpf: Preserve id of register in sync_linked_regs()	Puranjay Mohan	-1/+3
	sync_linked_regs() copies the id of known_reg to reg when propagating bounds of known_reg to reg using the off of known_reg, but when known_reg was linked to reg like: known_reg = reg ; both known_reg and reg get same id known_reg += 4 ; known_reg gets off = 4, and its id gets BPF_ADD_CONST now when a call to sync_linked_regs() happens, let's say with the following: if known_reg >= 10 goto pc+2 known_reg's new bounds are propagated to reg but now reg gets BPF_ADD_CONST from the copy. This means if another link to reg is created like: another_reg = reg ; another_reg should get the id of reg but assign_scalar_id_before_mov() sees BPF_ADD_CONST on reg and assigns a new id to it. As reg has a new id now, known_reg's link to reg is broken. If we find new bounds for known_reg, they will not be propagated to reg. This can be seen in the selftest added in the next commit: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 4: (a5) if r1 < 0xa goto pc+4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 5: (bf) r2 = r0 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R2=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) 6: (a5) if r1 < 0xe goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=14,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 7: (35) if r0 >= 0xa goto pc+1 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=9,var_off=(0x0; 0xf)) 8: (37) r0 /= 0 div by zero When 4 is verified, r1's bounds are propagated to r0 but r0 also gets BPF_ADD_CONST (bug). When 5 is verified, r0 gets a new id (2) and its link with r1 is broken. After 6 we know r1 has bounds [14, 259] and therefore r0 should have bounds [10, 255], therefore the branch at 7 is always taken. But because r0's id was changed to 2, r1's new bounds are not propagated to r0. The verifier still thinks r0 has bounds [6, 255] before 7 and execution can reach div by zero. Fix this by preserving id in sync_linked_regs() like off and subreg_def. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260115151143.1344724-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-16	Merge tag 'printk-for-6.19-rc6' of ↵	Linus Torvalds	-20/+18
	git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Prevent softlockup by restoring IRQs in atomic flush after each record * tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/nbcon: Restore IRQ in atomic flush after each emitted record
2026-01-16	dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready	Dan Williams	-11/+60
	Insert Soft Reserved memory into a dedicated soft_reserve_resource tree instead of the iomem_resource tree at boot. Delay publishing these ranges into the iomem hierarchy until ownership is resolved and the HMEM path is ready to consume them. Publishing Soft Reserved ranges into iomem too early conflicts with CXL hotplug and prevents region assembly when those ranges overlap CXL windows. Follow up patches will reinsert Soft Reserved ranges into iomem after CXL window publication is complete and HMEM is ready to claim the memory. This provides a cleaner handoff between EFI-defined memory ranges and CXL resource management without trimming or deleting resources later. In the meantime "Soft Reserved" resources will no longer appear in /proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff" Before: 100000000-1ffffffff : Soft Reserved 100000000-1ffffffff : dax1.0 100000000-1ffffffff : System RAM (kmem) After: 100000000-1ffffffff : dax1.0 100000000-1ffffffff : System RAM (kmem) The expectation is that this does not lead to a user visible regression because the dax1.0 device is created in both instances. Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> [Smita: incorporate feedback from x86 maintainer review] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com [djbw: cleanups and clarifications] Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-16	x86/uprobes: Fix XOL allocation failure for 32-bit tasks	Oleg Nesterov	-3/+7
	This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
2026-01-16	Merge branch 'pm-em'	Rafael J. Wysocki	-107/+192
	Merge fixes related to the energy model management for 6.19-rc6: - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout) - Fix stale description of the cost field in struct em_perf_state to reflect the current code (Yaxiong Tian) - Fix and revamp the energy model YNL specification added recently along with the energy model netlink interface (Changwoo Min) * pm-em: PM: EM: Add dump to get-perf-domains in the EM YNL spec PM: EM: Change cpus' type from string to u64 array in the EM YNL spec PM: EM: Rename em.yaml to dev-energymodel.yaml PM: EM: Fix yamllint warnings in the EM YNL spec PM: EM: Fix memory leak in em_create_pd() error path PM: EM: Fix incorrect description of the cost field in struct em_perf_state
2026-01-16	kernel: add SPDX-License-Identifier lines	Tim Bird	-4/+2
	Add SPDX-License-Identifier lines to some old kernel files. Signed-off-by: Tim Bird <tim.bird@sony.com> Acked-by: Karim Yaghmour <karim.yaghmour@opersys.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-15	kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c	Tim Bird	-8/+1
	Add an appropriate SPDX-License-Identifier line to the file, and remove the GNU boilerplate text. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-15	kernel: cgroup: Add SPDX-License-Identifier lines	Tim Bird	-8/+2
	Add GPL-2.0 SPDX license id lines to a few old files, replacing the reference to the COPYING file. The COPYING file at the time of creation of these files (2007 and 2005) was GPL-v2.0, with an additional clause indicating that only v2 applied. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-15	kernel: modules: Add SPDX license identifier to kmod.c	Tim Bird	-0/+1
	Add a GPL-2.0 license identifier line for this file. kmod.c was originally introduced in the kernel in February of 1998 by Linus Torvalds - who was familiar with kernel licensing at the time this was introduced. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-15	arm64/ftrace,bpf: Fix partial regs after bpf_prog_run	Jiri Olsa	-0/+1
	Mahe reported issue with bpf_override_return helper not working when executed from kprobe.multi bpf program on arm. The problem is that on arm we use alternate storage for pt_regs object that is passed to bpf_prog_run and if any register is changed (which is the case of bpf_override_return) it's not propagated back to actual pt_regs object. Fixing this by introducing and calling ftrace_partial_regs_update function to propagate the values of changed registers (ip and stack). Reported-by: Mahe Tardy <mahe.tardy@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org
2026-01-15	Merge tag 'ftrace-v6.19-rc5' of ↵	Linus Torvalds	-14/+15
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fix from Steven Rostedt: - Fix allocation accounting on boot up The ftrace records for each function that ftrace can attach to is done in a group of pages. At boot up, the number of pages are calculated and allocated. After that, the pages are filled with data. It may allocate more than needed due to some functions not being recorded (because they are unused weak functions), this too is recorded. After the data is filled in, a check is made to make sure the right number of pages were allocated. But this was off due to the assumption that the same number of entries fit per every page. Because the size of an entry does not evenly divide into PAGE_SIZE, there is a rounding error when a large number of pages is allocated to hold the events. This causes the check to fail and triggers a warning. Fix the accounting by finding out how many pages are actually allocated from the functions that allocate them and use that to see if all the pages allocated were used and the ones not used are properly freed. * tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Do not over-allocate ftrace memory
2026-01-15	sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead	Shrikanth Hegde	-4/+1
	nohz.nr_cpus was observed as contended cacheline when running enterprise workload on large systems. Fundamental scalability challenge with nohz.idle_cpus_mask and nohz.nr_cpus is the following: (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's any nohz balancing work to do, in every scheduler tick. (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() (through nohz_balancer_kick() via sched_tick()) modify (write) nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked. The characteristic frequencies are the following: (1) nohz_balancer_kick() happens at scheduler (busy)tick frequency on CPU(which has not gone idle). This is a relatively constant frequency in the ~1 kHz range or lower. (2) happens at idle enter/exit frequency on every CPU that goes to idle. This is workload dependent, but can easily be hundreds of kHz for IO-bound loads and high CPU counts. Ie. can be orders of magnitude higher than (1), in which case a cachemiss at every invocation of (1) is almost inevitable. idle exit will trigger (1) on the CPU which is coming out of idle. There's two types of costs from these functions: (A) scheduler tick cost via (1): this happens on busy CPUs too, and is thus a primary scalability cost. But the rate here is constant and typically much lower than (B), hence the absolute benefit to workload scalability will be lower as well. (B) idle cost via (2): going-to-idle and coming-from-idle costs are secondary concerns, because they impact power efficiency more than they impact scalability. But in terms of absolute cost this scales up with nr_cpus as well, and a much faster rate, and thus may also approach and negatively impact system limits like memory bus/fabric bandwidth. Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n, the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines under typical NR_CPUS=512/2048. This implies two separate cachelines being dirtied upon idle entry / exit. nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant a functionally correct value. This means one less cacheline being dirtied in idle entry/exit path which helps to save some bus bandwidth w.r.t to those nohz functions(approx 50%). This in turn helps to improve enterprise workload throughput. On system with 480 CPUs, running "hackbench 40 process 10000 loops" (Avg of 3 runs) baseline: 0.81% hackbench [k] nohz_balance_exit_idle 0.21% hackbench [k] nohz_balancer_kick 0.09% swapper [k] nohz_run_idle_balance With patch: 0.35% hackbench [k] nohz_balance_exit_idle 0.09% hackbench [k] nohz_balancer_kick 0.07% swapper [k] nohz_run_idle_balance [Ingo Molnar: scalability analysis changlog] Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
2026-01-15	sched/fair: Change likelyhood of nohz.nr_cpus	Shrikanth Hegde	-2/+2
	These days most of the system have multi cores. The likelyhood of at least one or more CPUs in nohz (idle state) is higher. Give accurate hint to the branch predictor. Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
2026-01-15	sched/fair: Move checking for nohz cpus after time check	Shrikanth Hegde	-7/+16
	Current code does. - Read nohz.nr_cpus - Check if the time has passed to do NOHZ idle balance Instead do this. - Check if the time has passed to do NOHZ idle balance - Read nohz.nr_cpus This will skip the read most of the time in normal system usage. i.e when there are nohz.nr_cpus (system is not 100% busy). Note that when there are no idle CPUs(100% busy), even if the flag gets set to NOHZ_STATS_KICK \| NOHZ_NEXT_KICK, find_new_ilb will fail and there will be no NOHZ idle balance. In such cases there will be a very narrow window where, kick_ilb will be called un-necessarily. However current functionality is still retained. Note: This patch doesn't solve any cacheline overheads. No improvement in performance apart from saving a few cycles of reading nohz.nr_cpus Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
2026-01-15	sched/fair: Fix math notation errors in avg_vruntime comment	Zhan Xusheng	-2/+2
	The avg_vruntime comment contains a couple of mathematical notation issues: - The summation over w_i * (V - v_i) is written in an ambiguous form - The delta term refers to v instead of v0, which is inconsistent with the code and preceding explanation Fix these to make the comment mathematically correct and consistent with the implementation. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
2026-01-15	sched: Fix build for modules using set_tsk_need_resched()	Gabriele Monaco	-0/+1
	Commit adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") added a tracepoint to the need_resched action that can be triggered also by set_tsk_need_resched. This function was previously accessible from out-of-tree modules but it's no longer available because the __trace_set_need_resched() symbol is not exported (together with the tracepoint itself, which was exported in a separate patch) and building such modules fails. Export __trace_set_need_resched to modules to fix those build issues. Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15	sched/deadline: Use ENQUEUE_MOVE to allow priority change	Peter Zijlstra	-1/+1
	Pierre reported hitting balance callback warnings for deadline tasks after commit 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern"). It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL priority and subsequently trips a balance pass -- where one was not expected. From discussion with Juri and Luca, the purpose of this clause was to deal with tasks new to DL and all those sites will have MOVE set (as well as CLASS, but MOVE is move conservative at this point). Per the previous patches MOVE is audited to always run the balance callbacks, so switch enqueue_dl_entity() to use MOVE for this case. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15	sched: Deadline has dynamic priority	Peter Zijlstra	-2/+2
	While FIFO/RR have static priority, DEADLINE is a dynamic priority scheme. Notably it has static priority -1. Do not assume the priority doesn't change for deadline tasks just because the static priority doesn't change. This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate. Fixes: ff77e4685359 ("sched/rt: Fix PI handling vs. sched_setscheduler()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15	sched: Audit MOVE vs balance_callbacks	Peter Zijlstra	-2/+8
	The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15	sched: Fold rq-pin swizzle into __balance_callbacks()	Peter Zijlstra	-6/+8
	Prepare for more users needing the rq-pin swizzle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15	sched/deadline: Avoid double update_rq_clock()	Peter Zijlstra	-2/+1
	When setup_new_dl_entity() is called from enqueue_task_dl() -> enqueue_dl_entity(), the rq-clock should already be updated, and calling update_rq_clock() again is not right. Move the update_rq_clock() to the one other caller of setup_new_dl_entity(): sched_init_dl_server(). Fixes: 9f239df55546 ("sched/deadline: Initialize dl_servers after SMP") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
2026-01-15	sched/deadline: Ensure get_prio_dl() is up-to-date	Peter Zijlstra	-0/+6
	Pratheek tripped a WARN and noted the following issue: > Inspecting the set of events that led to the warning being triggered > showed the following: > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin! > > systemd-1 [008] dN.31 ...: sched_change_begin: Begin! > systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()! > systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217 > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047 > systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()! > > systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()! > systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()! > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047 > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback! > systemd-1 [008] dN.31 ...: sched_change_end: End! > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end! > systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found! > > 1. sched_change_begin() from guard(sched_change) in > do_set_cpus_allowed() stashes the priority, which for the deadline > task, is "p->dl.deadline". > 2. The dequeue of the deadline task replenishes the deadline. > 3. The task is enqueued back after guard's scope ends and since there is > no *_CLASS flags set, sched_change_end() calls > dl_sched_class->prio_changed() which compares the deadline. > 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value > differ from the stashed value and queues a balance pull callback. > 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a > do_balance_callbacks(). > 6. Grabbing the rq_lock() at subsequent __schedule() triggers the > warning since the balance pull callback was never executed before > dropping the lock. Meaning get_prio_dl() ought to update current and return an up-to-date value. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
2026-01-15	Merge tag 'mm-hotfixes-stable-2026-01-15-08-03' of ↵	Linus Torvalds	-19/+20
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: - kerneldoc fixes from Bagas Sanjaya - DAMON fixes from SeongJae - mremap VMA-related fixes from Lorenzo - various singletons - please see the changelogs for details * tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (30 commits) drivers/dax: add some missing kerneldoc comment fields for struct dev_dax mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed' mailmap: add entry for Daniel Thompson tools/testing/selftests: fix gup_longterm for unknown fs mm/page_alloc: prevent pcp corruption with SMP=n iommu/sva: include mmu_notifier.h header mm: kmsan: fix poisoning of high-order non-compound pages tools/testing/selftests: add forked (un)/faulted VMA merge tests mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too tools/testing/selftests: add tests for !tgt, src mremap() merges mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge mm/zswap: fix error pointer free in zswap_cpu_comp_prepare() mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure mm/damon/core: remove call_control in inactive contexts powerpc/watchdog: add support for hardlockup_sys_info sysctl mips: fix HIGHMEM initialization mm/hugetlb: ignore hugepage kernel args if hugepages are unsupported ...
2026-01-15	ftrace: Do not over-allocate ftrace memory	Guenter Roeck	-14/+15
	The pg_remaining calculation in ftrace_process_locs() assumes that ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE (integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g. 4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages) have significantly more capacity than 256 * 170. This leads to pg_remaining being underestimated, which in turn makes skip (derived from skipped - pg_remaining) larger than expected, causing the WARN(skip != remaining) to trigger. Extra allocated pages for ftrace: 2 with 654 skipped WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0 A similar problem in ftrace_allocate_records() can result in allocating too many pages. This can trigger the second warning in ftrace_process_locs(). Extra allocated pages for ftrace WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580 Use the actual capacity of a page group to determine the number of pages to allocate. Have ftrace_allocate_pages() return the number of allocated pages to avoid having to calculate it. Use the actual page group capacity when validating the number of unused pages due to skipped entries. Drop the definition of ENTRIES_PER_PAGE since it is no longer used. Cc: stable@vger.kernel.org Fixes: 4a3efc6baff93 ("ftrace: Update the mcount_loc check of skipped entries") Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-15	perf/core: Fix slow perf_event_task_exit() with LBR callstacks	Namhyung Kim	-2/+18
	I got a report that a task is stuck in perf_event_exit_task() waiting for global_ctx_data_rwsem. On large systems with lots threads, it'd have performance issues when it grabs the lock to iterate all threads in the system to allocate the context data. And it'd block task exit path which is problematic especially under memory pressure. perf_event_open perf_event_alloc attach_perf_ctx_data attach_global_ctx_data percpu_down_write (global_ctx_data_rwsem) for_each_process_thread alloc_task_ctx_data do_exit perf_event_exit_task percpu_down_read (global_ctx_data_rwsem) It should not hold the global_ctx_data_rwsem on the exit path. Let's skip allocation for exiting tasks and free the data carefully. Reported-by: Rosalie Fang <rosaliefang@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org
2026-01-14	powerpc/watchdog: add support for hardlockup_sys_info sysctl	Feng Tang	-1/+1
	Commit a9af76a78760 ("watchdog: add sys_info sysctls to dump sys info on system lockup") adds 'hardlock_sys_info' systcl knob for general kernel watchdog to control what kinds of system debug info to be dumped on hardlockup. Add similar support in powerpc watchdog code to make the sysctl knob more general, which also fixes a compiling warning in general watchdog code reported by 0day bot. Link: https://lkml.kernel.org/r/20251231080309.39642-1-feng.tang@linux.alibaba.com Fixes: a9af76a78760 ("watchdog: add sys_info sysctls to dump sys info on system lockup") Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512030920.NFKtekA7-lkp@intel.com/ Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14	kho: validate preserved memory map during population	Pasha Tatashin	-18/+19
	If the previous kernel enabled KHO but did not call kho_finalize() (e.g., CONFIG_LIVEUPDATE=n or userspace skipped the finalization step), the 'preserved-memory-map' property in the FDT remains empty/zero. Previously, kho_populate() would succeed regardless of the memory map's state, reserving the incoming scratch regions in memblock. However, kho_memory_init() would later fail to deserialize the empty map. By that time, the scratch regions were already registered, leading to partial initialization and subsequent list corruption (freeing scratch area twice) during kho_init(). Move the validation of the preserved memory map earlier into kho_populate(). If the memory map is empty/NULL: 1. Abort kho_populate() immediately with -ENOENT. 2. Do not register or reserve the incoming scratch memory, allowing the new kernel to reclaim those pages as standard free memory. 3. Leave the global 'kho_in' state uninitialized. Consequently, kho_memory_init() sees no active KHO context (kho_in.mem_chunks_phys is 0) and falls back to kho_reserve_scratch(), allocating fresh scratch memory as if it were a standard cold boot. Link: https://lkml.kernel.org/r/20251223140140.2090337-1-pasha.tatashin@soleen.com Fixes: de51999e687c ("kho: allow memory preservation state updates after finalization") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Closes: https://lore.kernel.org/all/20251218215613.GA17304@ranerica-svr.sc.intel.com Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14	bpf: Properly mark live registers for indirect jumps	Anton Protopopov	-0/+6
	For a `gotox rX` instruction the rX register should be marked as used in the compute_insn_live_regs() function. Fix this. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260114162544.83253-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5	Alexei Starovoitov	-54/+106
	Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14	dma/pool: Avoid allocating redundant pools	Robin Murphy	-5/+14
	On smaller systems, e.g. embedded arm64, it is common for all memory to end up in ZONE_DMA32 or even ZONE_DMA. In such cases it is redundant to allocate a nominal pool for an empty higher zone that just ends up coming from a lower zone that should already have its own pool anyway. We already have logic to skip allocating a ZONE_DMA pool when that is empty, so generalise that to save memory in the case of other zones too. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/8ab8d8a620dee0109f33f5cb63d6bfeed35aac37.1768230104.git.robin.murphy@arm.com
2026-01-14	dma/pool: Improve pool lookup	Robin Murphy	-4/+4
	If CONFIG_ZONE_DMA32 is enabled, but we have not allocated the corresponding atomic_pool_dma32, dma_guess_pool() may return the NULL value of that and fail a GFP_DMA32 allocation without trying to fall back to other pools which may exist. Furthermore, if no GFP_DMA pool exists, it is preferable to try GFP_DMA32 rather than immediately fall back to GFP_KERNEL with even less chance of success. Improve matters by encoding an explicit order of pool preference for each flag. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c846b1a2f43295cac926c7af2ce907f62baec518.1768230104.git.robin.murphy@arm.com
2026-01-14	vdso: Remove struct getcpu_cache	Thomas Weißschuh	-3/+1
	The cache parameter of getcpu() is useless nowadays for various reasons. * It is never passed by userspace for either the vDSO or syscalls. * It is never used by the kernel. * It could not be made to work on the current vDSO architecture. * The structure definition is not part of the UAPI headers. * vdso_getcpu() is superseded by restartable sequences in any case. Remove the struct and its header. As a side-effect this gets rid of an unwanted inclusion of the linux/ header namespace from vDSO code. [ tglx: Adapt to s390 upstream changes */ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
2026-01-13	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds	-0/+5
	Pull bpf fixes from Alexei Starovoitov: - Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK in riscv JIT (Menglong Dong) - Fix reference count leak in bpf_prog_test_run_xdp() (Tetsuo Handa) - Fix metadata size check in bpf_test_run() (Toke Høiland-Jørgensen) - Check that BPF insn array is not allowed as a map for const strings (Deepanshu Kartikey) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix reference count leak in bpf_prog_test_run_xdp() bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() selftests/bpf: Update xdp_context_test_run test to check maximum metadata size bpf, test_run: Subtract size of xdp_frame from allowed metadata size riscv, bpf: Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK
2026-01-13	bpf: Return EACCES for incorrect access to insn array	Anton Protopopov	-1/+1
	The insn_array_map_direct_value_addr() function currently returns -EINVAL when the offset within the map is invalid. Change this to return -EACCES, so that it is consistent with similar boundary access checks in the verifier. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13	bpf: Return proper address for non-zero offsets in insn array	Anton Protopopov	-1/+1
	The map_direct_value_addr() function of the instruction array map incorrectly adds offset to the resulting address. This is a bug, because later the resolve_pseudo_ldimm64() function adds the offset. Fix it. Corresponding selftests are added in a consequent commit. Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13	bpf: return PTR_TO_BTF_ID \| PTR_TRUSTED from BPF kfuncs by default	Matt Bobrowski	-17/+29
	Teach the BPF verifier to treat pointers to struct types returned from BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID \| PTR_TRUSTED) by default. Returning untrusted pointers to struct types from BPF kfuncs should be considered an exception only, and certainly not the norm. Update existing selftests to reflect the change in register type printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error messages). Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13	bpf: Optimize the performance of find_bpffs_btf_enums	Donglin Peng	-25/+17
	Currently, vmlinux BTF is unconditionally sorted during the build phase. The function btf_find_by_name_kind executes the binary search branch, so find_bpffs_btf_enums can be optimized by using btf_find_by_name_kind. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260109130003.3313716-10-dolinux.peng@gmail.com
2026-01-13	bpf: Skip anonymous types in type lookup for performance	Donglin Peng	-10/+7
	Currently, vmlinux and kernel module BTFs are unconditionally sorted during the build phase, with named types placed at the end. Thus, anonymous types should be skipped when starting the search. In my vmlinux BTF, the number of anonymous types is 61,747, which means the loop count can be reduced by 61,747. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260109130003.3313716-9-dolinux.peng@gmail.com
2026-01-13	btf: Verify BTF sorting	Donglin Peng	-0/+43
	This patch checks whether the BTF is sorted by name in ascending order. If sorted, binary search will be used when looking up types. Specifically, vmlinux and kernel module BTFs are always sorted during the build phase with anonymous types placed before named types, so we only need to identify the starting ID of named types. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260109130003.3313716-8-dolinux.peng@gmail.com
2026-01-13	btf: Optimize type lookup with binary search	Donglin Peng	-9/+81
	Improve btf_find_by_name_kind() performance by adding binary search support for sorted types. Falls back to linear search for compatibility. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com
2026-01-13	perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls	Jan H. Schönherr	-1/+2
	There are typically a lot of PMUs registered, but in many cases only few of them have an event registered (like the "cpu" PMU in the presence of the watchdog). As the mutex is already held, it's safe to just check for existing events before doing the cross CPU call. This change saves tens of milliseconds from kexec time (perceived as steal time during a hypervisor host update), with <2ms remaining for this step in the shutdown. There might be additional potential for parallelization or we could just disable performance monitoring during the actual shutdown and be less graceful about it. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-13	genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask	Imran Khan	-11/+23
	During CPU offlining the interrupts affined to that CPU are moved to other online CPUs, which might break the original affinity mask if the outgoing CPU was the last online CPU in that mask. This change is not propagated to irq_desc::affinity_notify(), which leaves users of the affinity notifier mechanism with stale information. Avoid this by scheduling affinity change notification work for interrupts that were affined to the CPU being offlined, if the new target CPU is not part of the original affinity mask. Since irq_set_affinity_locked() uses the same logic to schedule affinity change notification work, split out this logic into a dedicated function and use that at both places. [ tglx: Removed the EXPORT(), removed the !SMP stub, moved the prototype, added a lockdep assert instead of a comment, fixed up coding style and name space. Polished and clarified the change log ] Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260113143727.1041265-1-imran.f.khan@oracle.com
2026-01-13	simplify the callers of file_open_name()	Al Viro	-3/+1
	It accepts ERR_PTR() for name and does the right thing in that case. That allows to simplify the logics in callers, making them trivial to switch to CLASS(filename). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13	struct filename ->refcnt doesn't need to be atomic	Al Viro	-3/+3
	... or visible outside of audit, really. Note that references held in delayed_filename always have refcount 1, and from the moment of complete_getname() or equivalent point in getname...() there won't be any references to struct filename instance left in places visible to other threads. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13	get rid of audit_reusename()	Al Viro	-23/+0
	Originally we tried to avoid multiple insertions into audit names array during retry loop by a cute hack - memorize the userland pointer and if there already is a match, just grab an extra reference to it. Cute as it had been, it had problems - two identical pointers had audit aux entries merged, two identical strings did not. Having different behaviour for syscalls that differ only by addresses of otherwise identical string arguments is obviously wrong - if nothing else, compiler can decide to merge identical string literals. Besides, this hack does nothing for non-audited processes - they get a fresh copy for retry. It's not time-critical, but having behaviour subtly differ that way is bogus. These days we have very few places that import filename more than once (9 functions total) and it's easy to massage them so we get rid of all re-imports. With that done, we don't need audit_reusename() anymore. There's no need to memorize userland pointer either. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13	bpf: Remove an unused parameter in check_func_proto	Song Chen	-2/+2
	The func_id parameter is not needed in check_func_proto. This patch removes it. Signed-off-by: Song Chen <chensong_2000@189.cn> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260105155009.4581-1-chensong_2000@189.cn
2026-01-13	bpf: Recognize special arithmetic shift in the verifier	Alexei Starovoitov	-0/+39
	cilium bpf_wiregard.bpf.c when compiled with -O1 fails to load with the following verifier log: 192: (79) r2 = (u64 )(r10 -304) ; R2=pkt(r=40) R10=fp0 fp-304=pkt(r=40) ... 227: (85) call bpf_skb_store_bytes#9 ; R0=scalar() 228: (bc) w2 = w0 ; R0=scalar() R2=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff)) 230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a)) ... 232: (66) if w2 s> 0xffffffff goto pc+125 ; R2=scalar(smin=umin=umin32=0x80000000,smax=umax=umax32=0xffffff7a,smax32=-134,var_off=(0x80000000; 0x7fffff7a)) ... 238: (79) r4 = (u64 )(r10 -304) ; R4=scalar() R10=fp0 fp-304=scalar() 239: (56) if w2 != 0xffffff78 goto pc+210 ; R2=0xffffff78 // -136 ... 258: (71) r1 = (u8 )(r4 +0) R4 invalid mem access 'scalar' The error might confuse most bpf authors, since fp-304 slot had 'pkt' pointer at insn 192 and became 'scalar' at 238. That happened because bpf_skb_store_bytes() clears all packet pointers including those in the stack. On the first glance it might look like a bug in the source code, since ctx->data pointer should have been reloaded after the call to bpf_skb_store_bytes(). The relevant part of cilium source code looks like this: // bpf/lib/nodeport.h int dsr_set_ipip6() { if (ctx_adjust_hroom(...)) return DROP_INVALID; // -134 if (ctx_store_bytes(...)) return DROP_WRITE_ERROR; // -141 return 0; } bool dsr_fail_needs_reply(int code) { if (code == DROP_FRAG_NEEDED) // -136 return true; return false; } tail_nodeport_ipv6_dsr() { ret = dsr_set_ipip6(...); if (!IS_ERR(ret)) { ... } else { if (dsr_fail_needs_reply(ret)) return dsr_reply_icmp6(...); } } The code doesn't have arithmetic shift by 31 and it reloads ctx->data every time it needs to access it. So it's not a bug in the source code. The reason is DAGCombiner::foldSelectCCToShiftAnd() LLVM transformation: // If this is a select where the false operand is zero and the compare is a // check of the sign bit, see if we can perform the "gzip trick": // select_cc setlt X, 0, A, 0 -> and (sra X, size(X)-1), A // select_cc setgt X, 0, A, 0 -> and (not (sra X, size(X)-1)), A The conditional branch in dsr_set_ipip6() and its return values are optimized into BPF_ARSH plus BPF_AND: 227: (85) call bpf_skb_store_bytes#9 228: (bc) w2 = w0 229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff)) 230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a)) after insn 230 the register w2 can only be 0 or -134, but the verifier approximates it, since there is no way to represent two scalars in bpf_reg_state. After fallthough at insn 232 the w2 can only be -134, hence the branch at insn 239: (56) if w2 != -136 goto pc+210 should be always taken, and trapping insn 258 should never execute. LLVM generated correct code, but the verifier follows impossible path and rejects valid program. To fix this issue recognize this special LLVM optimization and fork the verifier state. So after insn 229: (c4) w2 s>>= 31 the verifier has two states to explore: one with w2 = 0 and another with w2 = 0xffffffff which makes the verifier accept bpf_wiregard.c A similar pattern exists were OR operation is used in place of the AND operation, the verifier detects that pattern as well by forking the state before the OR operation with a scalar in range [-1,0]. Note there are 20+ such patterns in bpf_wiregard.o compiled with -O1 and -O2, but they're rarely seen in other production bpf programs, so push_stack() approach is not a concern. Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260112201424.816836-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13	bpf: Consistently use reg_state() for register access in the verifier	Mykyta Yatsenko	-23/+19
	Replace the pattern of declaring a local regs array from cur_regs() and then indexing into it with the more concise reg_state() helper. This simplifies the code by eliminating intermediate variables and makes register access more consistent throughout the verifier. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20260113134826.2214860-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>