path: root/kernel
2025-07-01  sched: Fix preemption string of preempt_dynamic_none  [Thomas Weißschuh, 1 file, -1/+1]

Zero is a valid value for "preempt_dynamic_mode", namely "preempt_dynamic_none".

Fix the off-by-one in preempt_model_str(), so that "preempt_dynamic_none" is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).

Fixes: 8bdc5daaa01e ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
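A minimal sketch of the bug class, assuming a mode-name table indexed by the mode value; this is illustrative, not the upstream diff:

    /* Mode 0 ("none") is valid, so a "mode > 0" guard is off by one. */
    static const char * const preempt_modes[] = {
            "none", "voluntary", "full", "lazy",
    };

    static const char *preempt_mode_name(int mode)
    {
            /* buggy: if (mode > 0 && mode < ARRAY_SIZE(preempt_modes)) */
            if (mode >= 0 && mode < ARRAY_SIZE(preempt_modes))
                    return preempt_modes[mode];
            return "undef";
    }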
2025-06-30  Merge tag 'entry-split-for-arm' into core/entry  [Thomas Gleixner, 4 files, -117/+119]

Prerequisite for the ARM[64] generic entry conversion. Merge it into the entry branch so further changes can be based on it.
2025-06-30  entry: Split generic entry into generic exception and syscall entry  [Jinjie Ruan, 4 files, -117/+119]

Currently CONFIG_GENERIC_ENTRY enables both the generic exception entry logic and the generic syscall entry logic, which are otherwise loosely coupled. Introduce separate config options for these so that architectures can select the two independently. This will make it easier for architectures to migrate to generic entry code.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/20250213130007.1418890-2-ruanjinjie@huawei.com
Link: https://lore.kernel.org/all/20250624-generic-entry-split-v1-1-53d5ef4f94df@linaro.org
[Linus Walleij: rebase onto v6.16-rc1]
2025-06-30  perf/core: Fix the WARN_ON_ONCE is out of lock protected region  [Luo Gengkun, 1 file, -2/+2]

commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") tried to fix a concurrency problem between perf_cgroup_switch and perf_cgroup_event_disable, but it did not move the WARN_ON_ONCE into the lock-protected region, so the warning can still be triggered.

Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
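A minimal sketch of the fix pattern, with illustrative names (not the exact upstream code): the assertion has to run while holding the lock that serializes against perf_cgroup_event_disable(), otherwise it can observe a transient state and fire spuriously.

    static void cgroup_switch_sketch(struct perf_event_context *ctx)
    {
            /* before: a WARN_ON_ONCE() here raced with event disable */
            raw_spin_lock(&ctx->lock);
            /* after: the checked state cannot change under ctx->lock */
            WARN_ON_ONCE(!ctx->is_active);  /* condition illustrative */
            /* ... switch cgroup events ... */
            raw_spin_unlock(&ctx->lock);
    }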
2025-06-29  Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 2 files, -5/+5]

Pull perf fix from Borislav Petkov:

- Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix pending disable flow when the AUX ring buffer overruns
2025-06-28  Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds, 1 file, -7/+7]

Pull tracing fix from Steven Rostedt:

- Fix possible UAF on error path in filter_free_subsystem_filters()

  When freeing a subsystem filter, the filter for the subsystem is passed in to be freed, and all the events within the subsystem will have their filter freed too. In order to free without waiting for RCU synchronization, list items are allocated to hold what is going to be freed, so it can be freed via a call_rcu(). If the allocation of these items fails, the code calls the synchronization directly and frees after that (causing a bit of delay for the user).

  The subsystem filter is first added to this list, and then the filters for all the events under the subsystem. The bug is that if one of the allocations of the list items for the event filters fails, the code jumps to the "free_now" label, which frees the subsystem filter, then all the items on the allocated list, and then the event filters that were not added to the list yet. But because the subsystem filter was added first, it gets freed twice.

  The solution is to add the subsystem filter after the events; then, if any of the allocations fail, none of the filters is freed twice.

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error
2025-06-27  Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds, 1 file, -0/+1]

Pull misc fixes from Andrew Morton:
"16 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27  tracing: Fix filter logic error  [Edward Adam Davis, 1 file, -7/+7]

If processing of the tr->events loop fails, the filter that has already been added to filter_head will be released twice, once in free_filter_list(&head->rcu) and once in __free_filter(filter). Add the filter to filter_head only after the filters for tr->events have been added, to avoid triggering the UAF.

Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
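A simplified sketch of the ordering fix; the helper names are assumptions for illustration, not the upstream functions:

    static int queue_filters_for_free(struct list_head *head,
                                      struct event_filter *subsys_filter,
                                      struct trace_event_file *file, int nr)
    {
            int i;

            for (i = 0; i < nr; i++) {
                    if (!add_free_item(head, file[i].filter))
                            goto free_now;  /* subsystem filter not queued yet */
            }
            /* queue the subsystem filter last, after all event filters */
            if (!add_free_item(head, subsys_filter))
                    goto free_now;
            return 0;

    free_now:
            free_list_now(head);            /* frees only what was queued */
            __free_filter(subsys_filter);   /* now freed exactly once */
            return -ENOMEM;
    }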
2025-06-27  bpf: guard BTF_ID_FLAGS(bpf_cgroup_read_xattr) with CONFIG_BPF_LSM  [Eduard Zingerman, 1 file, -1/+1]

Function bpf_cgroup_read_xattr is defined in fs/bpf_fs_kfuncs.c, which is compiled only when CONFIG_BPF_LSM is set. Add a CONFIG_BPF_LSM check to the bpf_cgroup_read_xattr spec in common_btf_ids in kernel/bpf/helpers.c to avoid build failures for configs w/o CONFIG_BPF_LSM.

Build failure example:

    BTF     .tmp_vmlinux1.btf.o
  btf_encoder__tag_kfunc: failed to find kfunc 'bpf_cgroup_read_xattr' in BTF
  ...
  WARN: resolve_btfids: unresolved symbol bpf_cgroup_read_xattr
  make[2]: *** [scripts/Makefile.vmlinux:91: vmlinux.unstripped] Error 255

Fixes: 535b070f4a80 ("bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node")
Reported-by: Jake Hillion <jakehillion@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250627175309.2710973-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
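A sketch of the kind of guard described; the exact flags on the table entry are an assumption:

    /* kernel/bpf/helpers.c, inside the common_btf_ids table */
    #ifdef CONFIG_BPF_LSM
    BTF_ID_FLAGS(func, bpf_cgroup_read_xattr, KF_RET_NULL)
    #endif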
2025-06-27  timekeeping: Provide interface to control auxiliary clocks  [Thomas Gleixner, 1 file, -0/+116]

Auxiliary clocks are disabled by default and attempts to access them fail. Provide an interface to enable/disable them at run-time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.444626478@linutronix.de
2025-06-27  timekeeping: Provide update for auxiliary timekeepers  [Thomas Gleixner, 1 file, -0/+19]

Update the auxiliary timekeepers periodically. For now this is tied to the system timekeeper update from the tick. This might be revisited and moved out of the tick.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.382451331@linutronix.de
2025-06-27  timekeeping: Provide adjtimex() for auxiliary clocks  [Thomas Gleixner, 1 file, -0/+16]

The behaviour is close to clock_adjtime(CLOCK_REALTIME) with the following differences:

 1) ADJ_SETOFFSET adjusts the auxiliary clock offset
 2) ADJ_TAI is not supported
 3) Leap seconds are not supported
 4) PPS is not supported

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.317946543@linutronix.de
2025-06-27  timekeeping: Prepare do_adjtimex() for auxiliary clocks  [Thomas Gleixner, 1 file, -3/+33]

Exclude ADJ_TAI, leap seconds and PPS functionality as they make no sense in the context of auxiliary clocks, and provide a time stamp based on the actual clock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.253203783@linutronix.de
2025-06-27  timekeeping: Make do_adjtimex() reusable  [Thomas Gleixner, 1 file, -46/+56]

Split out the actual functionality of adjtimex() and make do_adjtimex() a wrapper which feeds the core timekeeper into it and handles the result, including audit, at the call site. This allows the actual functionality to be reused for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.187322876@linutronix.de
2025-06-27  timekeeping: Add auxiliary clock support to __timekeeping_inject_offset()  [Thomas Gleixner, 1 file, -8/+31]

Redirect the relative offset adjustment to the auxiliary clock offset instead of modifying CLOCK_REALTIME, which has no meaning in the context of these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.124057787@linutronix.de
2025-06-27  timekeeping: Make timekeeping_inject_offset() reusable  [Thomas Gleixner, 1 file, -11/+15]

Split out the inner workings for auxiliary clock support and feed the core time keeper into it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.059934561@linutronix.de
2025-06-27  timekeeping: Provide time setter for auxiliary clocks  [Thomas Gleixner, 1 file, -0/+44]

Add clock_settime(2) support for auxiliary clocks. The function affects the AUX offset which is added to the "monotonic" clock readout of these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.995688714@linutronix.de
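A hypothetical userspace usage sketch; the auxiliary clockid name and value are assumptions based on the series description, not verified against the merged uapi headers:

    #include <stdio.h>
    #include <time.h>

    #ifndef CLOCK_AUX
    #define CLOCK_AUX 32   /* assumed id of the first auxiliary clock */
    #endif

    int main(void)
    {
            struct timespec ts = { .tv_sec = 1000000, .tv_nsec = 0 };

            /* sets the AUX offset so the clock reads the requested time */
            if (clock_settime(CLOCK_AUX, &ts))
                    perror("clock_settime");
            return 0;
    }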
2025-06-27  timekeeping: Add minimal posix-timers support for auxiliary clocks  [Thomas Gleixner, 3 files, -0/+25]

Provide clock_getres(2) and clock_gettime(2) for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.932220594@linutronix.de
2025-06-27  timekeeping: Provide time getters for auxiliary clocks  [Thomas Gleixner, 1 file, -0/+65]

Provide interfaces similar to the ktime_get*() family which provide access to the auxiliary clocks. These interfaces have a boolean return value, which indicates whether the accessed clock is valid or not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.868342628@linutronix.de
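A kernel-side consumption sketch under the stated contract (a boolean-returning ktime_get*() style accessor); the exact function name is an assumption:

    struct timespec64 ts;

    /* returns false when the auxiliary clock is not enabled/valid */
    if (ktime_get_aux_ts64(CLOCK_AUX, &ts))
            pr_info("aux time: %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    else
            pr_warn("auxiliary clock invalid\n");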
2025-06-27  timekeeping: Update auxiliary timekeepers on clocksource change  [Thomas Gleixner, 1 file, -0/+33]

Propagate a system clocksource change to the auxiliary timekeepers so that they can pick up the new clocksource.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.803890875@linutronix.de
2025-06-27  bpf: Fix string kfuncs names in doc comments  [Viktor Malik, 1 file, -3/+3]

Documentation comments for bpf_strnlen and bpf_strcspn contained incorrect function names.

Fixes: e91370550f1f ("bpf: Add kfuncs for read-only string operations")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/bpf/20250627174759.3a435f86@canb.auug.org.au/T/#u
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20250627082001.237606-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26  Merge branch 'vfs-6.17.bpf' of https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  [Alexei Starovoitov, 2 files, -0/+8]

Merge branch 'vfs-6.17.bpf' from the vfs tree into bpf-next/master and resolve conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26  smp: Use cpumask_any_but() in smp_call_function_many_cond()  [Yury Norov [NVIDIA], 1 file, -6/+1]

smp_call_function_many_cond() open-codes cpumask_any_but().

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-3-yury.norov@gmail.com
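An illustrative before/after sketch of the simplification, not the exact upstream diff:

    /* before: open-coded "any CPU in mask other than this_cpu" */
    cpu = cpumask_first_and(mask, cpu_online_mask);
    if (cpu == this_cpu)
            cpu = cpumask_next_and(cpu, mask, cpu_online_mask);

    /* after: one helper call expresses the same thing */
    cpu = cpumask_any_but(mask, this_cpu);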
2025-06-26  smp: Improve locality in smp_call_function_any()  [Yury Norov [NVIDIA], 1 file, -16/+3]

smp_call_function_any() tries to make a local call as it's the cheapest option, or switches to a CPU in the same node. If that's not possible, the algorithm gives up and searches for any CPU, in numerical order. Instead, it can search for the best CPU based on NUMA locality, including the 2nd nearest hop (a set of equidistant nodes), and higher. sched_numa_find_nth_cpu() does exactly that, and also helps to drop most of the housekeeping code.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-2-yury.norov@gmail.com
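A simplified sketch of the locality-aware selection described above, assuming the existing sched_numa_find_nth_cpu() helper; the surrounding function is illustrative:

    static int pick_target_cpu(const struct cpumask *mask)
    {
            int cpu = smp_processor_id();

            /* a local call is the cheapest option */
            if (cpumask_test_cpu(cpu, mask))
                    return cpu;

            /*
             * Otherwise pick the nearest CPU in the mask by NUMA
             * distance, considering the 2nd nearest hop and beyond.
             */
            return sched_numa_find_nth_cpu(mask, 0, cpu_to_node(cpu));
    }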
2025-06-26  PM: Restrict swap use to later in the suspend sequence  [Mario Limonciello, 4 files, -10/+2]

Currently swap is restricted before drivers have had a chance to do their prepare() PM callbacks. Restricting swap this early means that if a driver needs to evict some content from memory into swap in its prepare() callback, it won't be able to. On AMD dGPUs this can lead to failed suspends under memory pressure, as all VRAM must be evicted to system memory or swap.

Move the swap restriction to right after all devices have had a chance to do the prepare() callback. If there is any problem with the sequence, restore swap in the appropriate dpm resume callbacks or error handling paths.

Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Nat Wittstock <nat@fardog.io>
Tested-by: Lucian Langa <lucilanga@7pot.org>
Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-26  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3  [Alexei Starovoitov, 31 files, -185/+471]

Cross-merge BPF, perf and other fixes after downstream PRs. It restores BPF CI to green after critical fix commit bc4394e5e79c ("perf: Fix the throttle error of some clock events").

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26  bpf: Add kfuncs for read-only string operations  [Viktor Malik, 1 file, -0/+382]

String operations are commonly used, so this exposes the most common ones to BPF programs. For now, we limit ourselves to operations which do not copy memory around.

Unfortunately, most in-kernel implementations assume that strings are %NUL-terminated, which is not necessarily true, and therefore we cannot use them directly in the BPF context. Instead, we open-code them using __get_kernel_nofault instead of plain dereference to make them safe, and limit the string length to XATTR_SIZE_MAX to make sure the functions terminate. When __get_kernel_nofault fails, the functions return -EFAULT. Similarly, when the size bound is reached, the functions return -E2BIG. In addition, we return -ERANGE when the passed strings are outside of the kernel address space.

Note that thanks to these dynamic safety checks, no other constraints are put on the kfunc args (they are marked with the "__ign" suffix to skip any verifier checks for them).

All of the functions return integers, including functions which normally (in kernel or libc) return pointers to the strings. The reason is that since the strings are generally treated as unsafe, the pointers couldn't be dereferenced anyway. So, instead, we return an index into the string and let the user decide what to do with it. This also nicely fits with returning various error codes when necessary (see above).

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/4b008a6212852c1b056a413f86e3efddac73551c.1750917800.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
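A minimal sketch of the pattern described, not the upstream kfunc bodies: bound the walk by XATTR_SIZE_MAX and read each byte via __get_kernel_nofault():

    /* sketch: bounded, fault-safe strlen over a possibly bad pointer */
    static int strnlen_nofault_sketch(const char *s)
    {
            char c;
            int i;

            for (i = 0; i < XATTR_SIZE_MAX; i++) {
                    /* a faulting read jumps to the error label */
                    __get_kernel_nofault(&c, s + i, char, efault);
                    if (c == '\0')
                            return i;   /* return an index, not a pointer */
            }
            return -E2BIG;              /* size bound reached */
    efault:
            return -EFAULT;             /* invalid access */
    }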
2025-06-26  perf/aux: Fix pending disable flow when the AUX ring buffer overruns  [Leo Yan, 2 files, -5/+5]

If an AUX event overruns, the event core layer intends to disable the event by setting the 'pending_disable' flag. Unfortunately, the event is not actually disabled afterwards.

In commit:

  ca6c21327c6a ("perf: Fix missing SIGTRAPs")

the 'pending_disable' flag was changed to a boolean. However, the AUX event code was not updated accordingly. The flag ends up holding a CPU number. If this number is zero, the flag is taken as false and the IRQ work is never triggered.

Later, with commit:

  2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")

a new IRQ work 'pending_disable_irq' was introduced to handle event disabling. The AUX event path was not updated to kick off the work queue.

To fix this bug, when an AUX ring buffer overrun is detected, call perf_event_disable_inatomic() to initiate the pending disable flow. Also update the outdated comment for setting the flag, to reflect the boolean values (0 or 1).

Fixes: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com
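A sketch of the change as described; the overrun-detection context is illustrative, not the exact upstream code:

    /* on AUX ring-buffer overrun, in the event's output path */
    if (aux_overrun_detected) {
            /*
             * before: event->pending_disable = smp_processor_id();
             * on CPU 0 the value is zero, read as "false", so the
             * disable IRQ work was never kicked off.
             */
            perf_event_disable_inatomic(event);  /* after */
    }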
2025-06-25  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  [Linus Torvalds, 4 files, -7/+10]

Pull bpf fixes from Alexei Starovoitov:

- Fix use-after-free in libbpf when map is resized (Adin Scannell)
- Fix verifier assumptions about 2nd argument of bpf_sysctl_get_name (Jerome Marchand)
- Fix verifier assumption of nullness of d_inode in dentry (Song Liu)
- Fix global starvation of LRU map (Willem de Bruijn)
- Fix potential NULL dereference in btf_dump__free (Yuan Chen)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: adapt one more case in test_lru_map to the new target_free
  libbpf: Fix possible use-after-free for externs
  selftests/bpf: Convert test_sysctl to prog_tests
  bpf: Specify access type of bpf_sysctl_get_name args
  libbpf: Fix null pointer dereference in btf_dump__free on allocation failure
  bpf: Adjust free target to avoid global starvation of LRU map
  bpf: Mark dentry->d_inode as trusted_or_null
2025-06-25  Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  [Linus Torvalds, 1 file, -29/+34]

Pull mount fixes from Al Viro:
"Several mount-related fixes"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  userns and mnt_idmap leak in open_tree_attr(2)
  attach_recursive_mnt(): do not lock the covering tree when sliding something under it
  replace collect_mounts()/drop_collected_mounts() with a safer variant
2025-06-25  sched_ext: Drop kfuncs marked for removal in 6.15  [Jake Hillion, 1 file, -69/+2]

sched_ext performed a kfunc renaming pass in 6.13 and kept the old names around for compatibility with old binaries. These were scheduled for cleanup in 6.15 but were missed. Submitting for cleanup in for-next.

Removed the kfuncs, their flags, and any references I could find to them in doc comments. Left the entries in include/scx/compat.bpf.h as they're still useful to make new binaries compatible with old kernels.

Tested by applying to my kernel. It builds and a modern version of scx_lavd loads fine.

Signed-off-by: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-25  crashdump: add CONFIG_KEYS dependency  [Arnd Bergmann, 1 file, -0/+1]

The dm_crypt code fails to build without CONFIG_KEYS:

  kernel/crash_dump_dm_crypt.c: In function 'restore_dm_crypt_keys_to_thread_keyring':
  kernel/crash_dump_dm_crypt.c:105:9: error: unknown type name 'key_ref_t'; did you mean 'key_ref_put'?

There is a mix of 'select KEYS' and 'depends on KEYS' in Kconfig, so there is no single obvious solution here, but generally using 'depends on' makes more sense and is less likely to cause dependency loops.

Link: https://lkml.kernel.org/r/20250620112140.3396316-1-arnd@kernel.org
Fixes: 62f17d9df692 ("crash_dump: retrieve dm crypt keys in kdump kernel")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25  bpf: add btf_type_is_i{32,64} helpers  [Anton Protopopov, 4 files, -38/+31]

There are places in BPF code which check if a BTF type is an integer of particular size. This code can be made simpler by using helpers. Add new btf_type_is_i{32,64} helpers, and simplify code in a few files. (Suggested by Eduard for a patch which copy-pasted such a check [1].)

v1 -> v2:
 * export less generic helpers (Eduard)
 * make subject less generic than in [v1] (Eduard)

[1] https://lore.kernel.org/bpf/7edb47e73baa46705119a23c6bf4af26517a640f.camel@gmail.com/
[v1] https://lore.kernel.org/bpf/20250624193655.733050-1-a.s.protopopov@gmail.com/

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250625151621.1000584-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
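A sketch of what such helpers might look like, built from the existing BTF accessors; the exact upstream bodies are an assumption:

    #include <linux/btf.h>

    static bool btf_type_is_int_bits(const struct btf_type *t, u8 bits)
    {
            u32 int_data;

            if (!btf_type_is_int(t))
                    return false;
            /* the int encoding word follows struct btf_type */
            int_data = *(u32 *)(t + 1);
            return BTF_INT_BITS(int_data) == bits && t->size == bits / 8;
    }

    bool btf_type_is_i32(const struct btf_type *t)
    {
            return btf_type_is_int_bits(t, 32);
    }

    bool btf_type_is_i64(const struct btf_type *t)
    {
            return btf_type_is_int_bits(t, 64);
    }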
2025-06-25  bpf: allow void* cast using bpf_rdonly_cast()  [Eduard Zingerman, 1 file, -12/+61]

Introduce support for `bpf_rdonly_cast(v, 0)`, which casts the value `v` to an untyped, untrusted pointer, logically similar to a `void *`. The memory pointed to by such a pointer is treated as read-only. As with other untrusted pointers, memory access violations on loads return zero instead of causing a fault.

Technically:
- The resulting pointer is represented as a register of type `PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED` with size zero.
- Offsets within such pointers are not tracked.
- Same load instructions are allowed to have both `PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED` and `PTR_TO_BTF_ID` as the base pointer types. In such cases, `bpf_insn_aux_data->ptr_type` is considered the weaker of the two: `PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED`.

The following constraints apply to the new pointer type:
- can be used as a base for LDX instructions;
- can't be used as a base for ST/STX or atomic instructions;
- can't be used as parameter for kfuncs or helpers.

These constraints are enforced by existing handling of the `MEM_RDONLY` flag and `PTR_TO_MEM` of size zero.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250625182414.30659-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
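A hypothetical BPF-program usage sketch of the new cast mode; the hook point and field offset are illustrative:

    SEC("fentry/__x64_sys_nanosleep")
    int probe(void *ctx)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            /* untyped, untrusted, read-only pointer */
            void *p = bpf_rdonly_cast(task, 0);
            int v = *(int *)(p + 16);   /* LDX ok; bad loads read as zero */

            /* *(int *)(p + 16) = 1;    ST/STX on this pointer is rejected */
            /* bpf_some_kfunc(p);       kfunc args of this type are rejected */
            return v;
    }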
2025-06-25  bpf: add bpf_features enum  [Eduard Zingerman, 1 file, -0/+6]

This commit adds a kernel-side enum for use in conjunction with the BTF CO-RE bpf_core_enum_value_exists. The goal of the enum is to assist with detection of available BPF features. Intended usage looks as follows:

  if (bpf_core_enum_value_exists(enum bpf_features, BPF_FEAT_<f>))
          ... use feature f ...

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250625182414.30659-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-25  bpf: Add range tracking for BPF_NEG  [Song Liu, 2 files, -1/+21]

Add range tracking for instruction BPF_NEG. Without this logic, a trivial program like the following will fail:

  volatile bool found_value_b;

  SEC("lsm.s/socket_connect")
  int BPF_PROG(test_socket_connect)
  {
          if (!found_value_b)
                  return -1;
          return 0;
  }

with verifier log:

  "At program exit the register R0 has smin=0 smax=4294967295 should have been in [-4095, 0]".

This is because range information is lost in BPF_NEG:

  0: R1=ctx() R10=fp0
  ; if (!found_value_b) @ xxxx.c:24
  0: (18) r1 = 0xffa00000011e7048       ; R1_w=map_value(...)
  2: (71) r0 = *(u8 *)(r1 +0)           ; R0_w=scalar(smin32=0,smax=255)
  3: (a4) w0 ^= 1                       ; R0_w=scalar(smin32=0,smax=255)
  4: (84) w0 = -w0                      ; R0_w=scalar(range info lost)

Note that the log above is manually modified to highlight relevant bits.

Fix this by maintaining proper range information with BPF_NEG, so that the verifier will know:

  4: (84) w0 = -w0                      ; R0_w=scalar(smin32=-255,smax=0)

Also updated selftests based on the expected behavior.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250625164025.3310203-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
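The underlying rule is simple: negating a value known to lie in [smin, smax] yields one in [-smax, -smin]. A sketch, assuming the boundary overflow case is handled by giving up on bounds (the upstream verifier code is more involved):

    static void scalar_min_max_neg_sketch(struct bpf_reg_state *reg)
    {
            s64 smin = reg->smin_value, smax = reg->smax_value;

            if (smin == S64_MIN) {
                    /* -S64_MIN overflows; fall back to unbounded */
                    reg->smin_value = S64_MIN;
                    reg->smax_value = S64_MAX;
                    return;
            }
            reg->smin_value = -smax;
            reg->smax_value = -smin;
    }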
2025-06-25  rcutorture: Fix rcutorture_one_extend_check() splat in RT kernels  [Zqiang, 1 file, -3/+6]

For kernels built with CONFIG_PREEMPT_RT=y, running rcutorture tests resulted in the following splat:

  [   68.797425] rcutorture_one_extend_check during change: Current 0x1  To add 0x1  To remove 0x0  preempt_count() 0x0
  [   68.797533] WARNING: CPU: 2 PID: 512 at kernel/rcu/rcutorture.c:1993 rcutorture_one_extend_check+0x419/0x560 [rcutorture]
  [   68.797601] Call Trace:
  [   68.797602]  <TASK>
  [   68.797619]  ? lockdep_softirqs_off+0xa5/0x160
  [   68.797631]  rcutorture_one_extend+0x18e/0xcc0 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
  [   68.797646]  ? local_clock+0x19/0x40
  [   68.797659]  rcu_torture_one_read+0xf0/0x280 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
  [   68.797678]  ? __pfx_rcu_torture_one_read+0x10/0x10 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
  [   68.797804]  ? __pfx_rcu_torture_timer+0x10/0x10 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
  [   68.797815] rcu-torture: rcu_torture_reader task started
  [   68.797824] rcu-torture: Creating rcu_torture_reader task
  [   68.797824]  rcu_torture_reader+0x238/0x580 [rcutorture 2466dbd2ff34dbaa36049cb323a80c3306ac997c]
  [   68.797836]  ? kvm_sched_clock_read+0x15/0x30

Disabling BH does not change the corresponding SOFTIRQ bits in preempt_count() on RT kernels, so this commit uses softirq_count() to check whether BH is disabled.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
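A sketch of the check change; the surrounding assertion is illustrative. softirq_count() accounts for BH disabling on both RT and non-RT kernels, whereas the raw preempt_count() bits do not on RT:

    /* before: fails on PREEMPT_RT, where local_bh_disable() does not
     * touch the SOFTIRQ bits in preempt_count() */
    WARN_ON_ONCE(!(preempt_count() & SOFTIRQ_MASK));

    /* after: RT-aware check that BH is disabled */
    WARN_ON_ONCE(!softirq_count());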
2025-06-25  rcutorture: Make Trivial RCU ignore onoff_interval and shuffle_interval  [Paul E. McKenney, 1 file, -2/+13]

Trivial RCU is a textbook implementation that is not used in the Linux kernel, but tested to keep textbooks (and presentations) honest. It is so trivial that it cannot deal with either CPU hotplug or external migration from one CPU to another. This commit therefore splats whenever onoff_interval or shuffle_interval are non-zero, and then sets them to zero in order to avoid false-positive failures.

Those wishing to set these module parameters in order to force failures in Trivial RCU are free to revert this commit. Just don't expect me to be sympathetic to any resulting bug reports!

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505131651.af6e81d7-lkp@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Drop redundant "insoftirq" parameters  [Paul E. McKenney, 1 file, -19/+16]

Given that the rcutorture_one_extend_check() function now uses in_serving_softirq() and in_hardirq(), it is no longer necessary to pass insoftirq flags down the function-call stack. This commit therefore removes those flags, and, while in the area, does a bit of whitespace cleanup.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Print number of RCU up/down readers and migrations  [Paul E. McKenney, 1 file, -5/+25]

This commit prints the number of RCU up/down readers and the number of such readers that migrated from one CPU to another, along with the rest of the periodic rcu_torture_stats_print() output. These statistics are currently used only by srcu_down_read{,_fast}() and srcu_up_read{,_fast}().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Check for no up/down readers at task level  [Paul E. McKenney, 1 file, -0/+2]

The design of testing of up/down readers such as srcu_down_read() and srcu_up_read() assumes that these are tested only by the rcu_torture_updown() kthread, and never by the rcu_torture_reader() kthread. Because we all know which road is paved with good intentions, this commit adds WARN_ON_ONCE() to verify that things are going to plan.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Check for ->up_read() without matching ->down_read()  [Paul E. McKenney, 1 file, -2/+13]

This commit creates counters in the rcu_torture_one_read_state_updown structure that check for a call to ->up_read() that lacks a matching call to ->down_read(). While in the area, add end-of-run cleanup code that prevents calls to rcu_torture_updown_hrt() from happening after the test has moved on. Yes, the srcu_barrier() at the end of the test will wait for them, but this could result in confusing states, statistics, and diagnostic information. So explicitly wait for them before we get to the end-of-test output.

[ paulmck: Apply kernel test robot feedback. ]
[ joel: Apply Boqun's fix for counter increment ordering. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Complain if an ->up_read() is delayed more than 10 seconds  [Paul E. McKenney, 1 file, -3/+13]

The down/up SRCU reader testing uses an hrtimer handler to exit the SRCU read-side critical section. This might be delayed, and if delayed for too long, it can prevent the rcutorture run from completing. This commit therefore complains if the hrtimer handler is delayed for more than ten seconds.

[ paulmck, joel: Apply kernel test robot feedback to avoid false-positive complaint of excessive ->up_read() delays by using HRTIMER_MODE_HARD ]

Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Pull rcu_torture_updown() loop body into new function  [Paul E. McKenney, 1 file, -20/+24]

This is strictly a code-movement commit, pulling that part of the rcu_torture_updown() function's loop body that processes one rcu_torture_one_read_state_updown structure into a new rcu_torture_updown_one() function. The checks for the end of the torture test and the current structure being in use remain in the rcu_torture_updown() function.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Add tests for SRCU up/down reader primitives  [Paul E. McKenney, 1 file, -19/+206]

This commit adds a new rcutorture.n_up_down kernel boot parameter that specifies the number of outstanding SRCU up/down readers, which begin in kthread context and end in an hrtimer handler. There is a new kthread ("rcu_torture_updown") that scans a per-reader array looking for elements whose readers have ended. This kthread sleeps between one and two milliseconds between consecutive scans.

[ paulmck: Apply kernel test robot feedback. ]
[ paulmck: Apply Z qiang feedback. ]
[ joel: Fix build error: hrtimer_init is replaced by hrtimer_setup. ]
[ joel: Apply Boqun bug fix to drop extra up_read() call in rcu_torture_updown(). ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Make rcutorture_one_extend_check() account for hard IRQs  [Paul E. McKenney, 1 file, -4/+4]

This commit retrospectively prepares for testing of RCU readers invoked from hardware interrupt handlers (for example, HRTIMER_MODE_HARD hrtimer handlers) in kernels built with CONFIG_RCU_TORTURE_TEST_CHK_RDR_STATE=y, which is rarely used but sometimes extremely useful. This preparation involves taking early exits if in_hardirq(), and, while we are in the area, a very early exit if in_nmi(). This means that a number of insoftirq parameters are no longer needed, but that is the subject of a later commit.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505140917.8ee62cc6-lkp@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Start rcu_torture_writer() after rcu_torture_reader()  [Paul E. McKenney, 1 file, -5/+6]

Testing of rcutorture's SRCU-P scenario on a large arm64 system resulted in rcu_torture_writer() forward-progress failures, but these same tests passed on x86. After some off-list discussion of possible memory-ordering causes for these failures, Boqun showed that these were in fact due to reordering, but by the scheduler, not by the memory system. On x86, rcu_torture_writer() would have run quickly enough that by the time the rcu_torture_updown() kthread started, the rcu_torture_current variable would already be initialized, thus avoiding a bug in which a NULL value would cause rcu_torture_updown() to do an extra call to srcu_up_read_fast().

This commit therefore moves creation of the rcu_torture_writer() kthread after that of the rcu_torture_reader() kthreads. This results in deterministic failures on x86.

What about the double-srcu_up_read_fast() bug? Boqun has the fix. But let's also fix the test while we are at it!

Reported-by: Joel Fernandes <joelagnelf@nvidia.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcutorture: Print only one rtort_pipe_count splat  [Paul E. McKenney, 1 file, -0/+1]

The rcu_torture_writer() function scans the memory blocks after a stutter (or forced idle) interval, complaining about any that have not passed through ten grace periods since the start of the stutter interval. But one splat suffices, so this commit therefore stops at the first splat.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcu: Robustify rcu_is_cpu_rrupt_from_idle()  [Frederic Weisbecker, 1 file, -10/+17]

RCU relies on the context tracking nesting counter in order to determine if it is running in extended quiescent state. However the context tracking nesting counter is not completely synchronized with the actual context tracking state:

* The nesting counter is set to 1 or incremented further _after_ the actual state is set to RCU watching.
* The nesting counter is set to 0 or decremented further _before_ the actual state is set to RCU not watching.

Therefore it is safe to assume that if ct_nesting() > 0, RCU is watching. But if ct_nesting() <= 0, RCU is not watching except for tiny windows.

This hasn't been a problem so far because rcu_is_cpu_rrupt_from_idle() has only been called from interrupts. However the code is confusing and abuses the role of the context tracking nesting counter while there are more accurate indicators available. Clarify and robustify accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25  rcu/exp: Protect against early QS report  [Frederic Weisbecker, 1 file, -7/+7]

When a grace period is started, the ->expmask of each node is set up from sync_exp_reset_tree(). Then later on each leaf node also initializes its ->exp_tasks pointer.

This means that the initialization of the quiescent state of a node and the initialization of its blocking tasks happen with an unlocked node gap in-between. It happens to be fine because nothing is expected to report an exp quiescent state within this gap, since no IPI has been issued yet and every rdp's ->cpu_no_qs.b.exp should be false.

However if it were to happen by accident, the quiescent state could be reported and propagated while ignoring tasks that blocked _before_ the start of the grace period.

Prevent such trouble from happening in the future and initialize both the quiescent states mask to report and the blocked tasks head from the same node locked block.

If a task blocks within an RCU read side critical section before sync_exp_reset_tree() is called and is then unblocked between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS won't be reported because no RCU exp IPI had been issued to request it through the setting of srdp->cpu_no_qs.b.exp.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>