|
Make dma_map_page_attrs() and dma_unmap_page_attrs() respect
DMA_ATTR_MMIO.
DMA_ATTR_MMIO makes the functions behave the same as
dma_(un)map_resource():
- No swiotlb is possible
- Legacy dma_ops arches use ops->map_resource()
- No kmsan
- No arch_dma_map_phys_direct()
The prior patches have made the internal functions called here
support DMA_ATTR_MMIO.
This is also preparation for turning dma_map_resource() into an inline
calling dma_map_phys(DMA_ATTR_MMIO) to consolidate the flows.
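A minimal usage sketch, assuming the dma_map_phys()/dma_unmap_phys()
interface described above (argument order follows dma_map_resource()):
    dma_addr_t dma;

    dma = dma_map_phys(dev, mmio_phys, size, DMA_TO_DEVICE, DMA_ATTR_MMIO);
    if (dma_mapping_error(dev, dma))
            return -EIO;
    /* ... device uses the mapping ... */
    dma_unmap_phys(dev, dma, size, DMA_TO_DEVICE, DMA_ATTR_MMIO);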
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/3660e2c78ea409d6c483a215858fb3af52cd0ed3.1757423202.git.leonro@nvidia.com
|
|
Convert the KMSAN DMA handling function from page-based to physical
address-based interface.
The refactoring renames kmsan_handle_dma() parameters from accepting
(struct page *page, size_t offset, size_t size) to (phys_addr_t phys,
size_t size). The existing semantics, where callers are expected to
provide only kmap memory, are preserved here.
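A call-site sketch of the conversion (the dir argument is assumed to be
unchanged):
    /* before */
    kmsan_handle_dma(page, offset, size, dir);
    /* after */
    kmsan_handle_dma(page_to_phys(page) + offset, size, dir);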
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/3557cbaf66e935bc794f37d2b891ef75cbf2c80c.1757423202.git.leonro@nvidia.com
|
|
Convert the DMA direct mapping functions to accept physical addresses
directly instead of page+offset parameters. The functions were already
operating on physical addresses internally, so this change eliminates
the redundant page-to-physical conversion at the API boundary.
The functions dma_direct_map_page() and dma_direct_unmap_page() are
renamed to dma_direct_map_phys() and dma_direct_unmap_phys() respectively,
with their calling convention changed from (struct page *page,
unsigned long offset) to (phys_addr_t phys).
Architecture-specific functions arch_dma_map_page_direct() and
arch_dma_unmap_page_direct() are similarly renamed to
arch_dma_map_phys_direct() and arch_dma_unmap_phys_direct().
The is_pci_p2pdma_page() checks are replaced with DMA_ATTR_MMIO checks
to allow integration with dma_direct_map_resource(), and dma_direct_map_phys()
is extended to support the MMIO path as well.
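A hedged sketch of the renamed prototypes (argument lists approximate):
    dma_addr_t dma_direct_map_phys(struct device *dev, phys_addr_t phys,
                                   size_t size, enum dma_data_direction dir,
                                   unsigned long attrs);
    void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
                               size_t size, enum dma_data_direction dir,
                               unsigned long attrs);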
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/bb15a22f76dc2e26683333ff54e789606cfbfcf0.1757423202.git.leonro@nvidia.com
|
|
Rename the IOMMU DMA mapping functions to better reflect their actual
calling convention. The functions iommu_dma_map_page() and
iommu_dma_unmap_page() are renamed to iommu_dma_map_phys() and
iommu_dma_unmap_phys() respectively, as they already operate on physical
addresses rather than page structures.
The calling convention changes from accepting (struct page *page,
unsigned long offset) to (phys_addr_t phys), which eliminates the need
for page-to-physical address conversion within the functions. This
renaming prepares for the broader DMA API conversion from page-based
to physical address-based mapping throughout the kernel.
All callers are updated to pass physical addresses directly, including
dma_map_page_attrs(), scatterlist mapping functions, and DMA page
allocation helpers. The change simplifies the code by removing the
page_to_phys() + offset calculation that was previously done inside
the IOMMU functions.
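A caller-side sketch of the new convention (illustrative):
    /* before */
    dma = iommu_dma_map_page(dev, page, offset, size, dir, attrs);
    /* after */
    dma = iommu_dma_map_phys(dev, page_to_phys(page) + offset, size, dir, attrs);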
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/ed172f95f8f57782beae04f782813366894e98df.1757423202.git.leonro@nvidia.com
|
|
As a preparation for the following map_page -> map_phys API conversion,
rename trace_dma_*map_page() to trace_dma_*map_phys().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/c0c02d7d8bd4a148072d283353ba227516a76682.1757423202.git.leonro@nvidia.com
|
|
Convert the DMA debug infrastructure from page-based to physical address-based
mapping as a preparation to rely on physical address for DMA mapping routines.
The refactoring renames debug_dma_map_page() to debug_dma_map_phys() and
changes its signature to accept a phys_addr_t parameter instead of struct page
and offset. Similarly, debug_dma_unmap_page() becomes debug_dma_unmap_phys().
A new dma_debug_phy type is introduced to distinguish physical address mappings
from other debug entry types. All callers throughout the codebase are updated
to pass physical addresses directly, eliminating the need for page-to-physical
conversion in the debug layer.
This refactoring eliminates the need to convert between page pointers and
physical addresses in the debug layer, making the code more efficient and
consistent with the DMA mapping API's physical address focus.
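A hedged sketch of the interface change (argument lists approximate):
    void debug_dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
                            int direction, dma_addr_t dma_addr, unsigned long attrs);
    /* replaces */
    void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
                            size_t size, int direction, dma_addr_t dma_addr,
                            unsigned long attrs);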
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
[mszyprow: added a fixup]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/56d1a6769b68dfcbf8b26a75a7329aeb8e3c3b6a.1757423202.git.leonro@nvidia.com
Link: https://lore.kernel.org/all/20250910052618.GH341237@unreal/
|
|
dma-mapping fix for Linux 6.17
- one more fix for DMA API debugging infrastructure (Baochen Qiang)
|
|
Begin reporting arena page faults and the faulting address to the BPF
program's stderr. This patch adds support in the arm64 and x86-64 JITs;
support for other archs can be added later.
The fault handlers receive the 32 bit address in the arena region, so
the upper 32 bits of user_vm_start are added to it before printing the
address. This is what the user would expect to see, as this is what is
printed by bpf_printk() if you pass it an address returned by
bpf_arena_alloc_pages().
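A sketch of the described address reconstruction (variable names are
illustrative):
    /* low 32 bits come from the fault, high 32 bits from user_vm_start */
    u64 print_addr = (u32)fault_addr + (arena->user_vm_start & GENMASK_ULL(63, 32));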
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250911145808.58042-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
BPF streams are only valid for main programs. To make it easier to
access streams from subprogs, introduce main_prog_aux in struct
bpf_prog_aux:
prog->aux->main_prog_aux = prog->aux, for main programs, and
prog->aux->main_prog_aux = main_prog->aux, for subprograms.
Make bpf_prog_find_from_stack() use the added main_prog_aux to return
the main prog when a subprog is found on the stack.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a nasty hibernation regression introduced during the 6.16
cycle, an issue related to energy model management occurring on Intel
hybrid systems where some CPUs are offline to start with, and two
regressions in the amd-pstate driver:
- Restore a pm_restrict_gfp_mask() call in hibernation_snapshot()
that was removed incorrectly during the 6.16 development cycle
(Rafael Wysocki)
- Introduce a function for registering a perf domain without
triggering a system-wide CPU capacity update and make the
intel_pstate driver use it to avoid recurring unsuccessful
attempts to update capacities of all CPUs in the system (Rafael
Wysocki)
- Fix setting of CPPC.min_perf in the active mode with performance
governor in the amd-pstate driver to restore its expected behavior
changed recently (Gautham Shenoy)
- Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
system resume as a result of recent code changes (Mario
Limonciello)"
* tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
PM: EM: Add function for registering a PD without capacity update
cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume
cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
|
|
Compared to the generic entry code, ARM64 does additional checks
when deciding to reschedule on return from interrupt. So introduce
arch_irqentry_exit_need_resched() in the need_resched()
condition of the generic raw_irqentry_exit_cond_resched(), with
a NOP default. This will allow ARM64 to implement the architecture
specific version for switching over to the generic entry code.
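A hedged sketch of the hook and its use in the generic code (the exact
surrounding checks are simplified here):
    #ifndef arch_irqentry_exit_need_resched
    static inline bool arch_irqentry_exit_need_resched(void) { return true; }
    #endif

    /* in raw_irqentry_exit_cond_resched() */
    if (!preempt_count() && need_resched() &&
        arch_irqentry_exit_need_resched())
            preempt_schedule_irq();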
Suggested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Pull bpf fixes from Alexei Starovoitov:
"A number of fixes accumulated due to summer vacations
- Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
was misidentified as a security issue (Daniel Borkmann)
- Update the list of BPF selftests maintainers (Eduard Zingerman)
- Fix selftests warnings with icecc compiler (Ilya Leoshkevich)
- Disable XDP/cpumap direct return optimization (Jesper Dangaard
Brouer)
- Fix unexpected get_helper_proto() result in unusual configuration
BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)
- Allow fallback to interpreter when JIT support is limited (KaFai
Wan)
- Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
the simplest fix; a more involved fix is targeted at bpf-next (Kumar
Kartikeya Dwivedi)
- Fix cleanup when tcp_bpf_send_verdict() fails to allocate
psock->cork (Kuniyuki Iwashima)
- Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
discussed for bpf-next. (Leon Hwang)
- Fix XSK cq descriptor production (Maciej Fijalkowski)
- Tell memcg to use allow_spinning=false path in bpf_timer_init() to
avoid lockup in cgroup_file_notify() (Peilin Ye)
- Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Skip timer cases when bpf_timer is not supported
bpf: Reject bpf_timer for PREEMPT_RT
tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
bpf: Allow fall back to interpreter for programs with stack size <= 512
rqspinlock: Choose trylock fallback for NMI waiters
xsk: Fix immature cq descriptor production
bpf: Update the list of BPF selftests maintainers
selftests/bpf: Add tests for bpf_strnstr
selftests/bpf: Fix "expression result unused" warnings with icecc
bpf: Fix bpf_strnstr() to handle suffix match cases better
selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
bpf: Check the helper function is valid in get_helper_proto
bpf, cpumap: Disable page_pool direct xdp_return need larger scope
|
|
Merge a hibernation regression fix and an fix related to energy model
management for 6.17-rc6
* pm-sleep:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
* pm-em:
PM: EM: Add function for registering a PD without capacity update
|
|
When configuring a small audit rate limit in
/etc/audit/rules.d/audit.rules:
-a always,exit -F arch=b64 -S openat -S truncate -S ftruncate
-F exit=-EACCES -F auid>=1000 -F auid!=4294967295 -k access -r 100
And then repeatedly triggering permission denied as a normal user:
while :; do cat /proc/1/environ; done
We can see the messages in the kernel log:
[ 2531.862184] audit: rate limit exceeded
The unreclaimable slab objects start to leak quickly. With kmemleak
enabled, many call traces appear like:
unreferenced object 0xffff99144b13f600 (size 232):
comm "cat", pid 1100, jiffies 4294739144
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 8540ec4f):
kmemleak_alloc+0x4a/0x90
kmem_cache_alloc_node+0x2ea/0x390
__alloc_skb+0x174/0x1b0
audit_log_start+0x198/0x3d0
audit_log_proctitle+0x32/0x160
audit_log_exit+0x6c6/0x780
__audit_syscall_exit+0xee/0x140
syscall_exit_work+0x12b/0x150
syscall_exit_to_user_mode_prepare+0x39/0x80
syscall_exit_to_user_mode+0x11/0x260
do_syscall_64+0x8c/0x180
entry_SYSCALL_64_after_hwframe+0x78/0x80
This shows that the skb allocated in audit_log_start() and queued
onto skb_list is never freed.
In audit_log_end(), each skb is dequeued from skb_list and passed
to __audit_log_end(). However, when the audit rate limit is exceeded,
__audit_log_end() simply prints "rate limit exceeded" and returns
without processing the skb. Since the skb is already removed from
skb_list, audit_buffer_free() cannot free it later, leading to a
memory leak.
Fix this by freeing the skb when the rate limit is exceeded.
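A hedged sketch of the fix in __audit_log_end() (surrounding code is
approximate):
    if (!audit_rate_check()) {
            audit_log_lost("rate limit exceeded");
            kfree_skb(skb);   /* skb was already removed from skb_list */
            return;
    }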
Fixes: eb59d494eebd ("audit: add record for multiple task security contexts")
Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
[PM: fixes tag, subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
When CONFIG_PREEMPT_RT is enabled, the kernel warns when running the
timer selftests with './test_progs -t timer':
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
In order to avoid such a warning, reject bpf_timer in the verifier when
PREEMPT_RT is enabled.
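A hedged sketch of the verifier-side rejection (message and error code
are illustrative):
    if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
            verbose(env, "bpf_timer is not supported on PREEMPT_RT\n");
            return -EOPNOTSUPP;
    }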
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Remove redundant __GFP_NOWARN flag in kmalloc
As __GFP_NOWARN is now part of GFP_NOWAIT, it can be removed from
kmalloc as it is redundant.
- Use copy_from_user_nofault() instead of _inatomic() for trace markers
The trace_marker files are written to by user space to quickly
write into the tracing ring buffer.
Back in 2016, the get_user_pages_fast() and the kmap() logic was
replaced by a __copy_from_user_inatomic(), but didn't properly
disable page faults around it.
Since the time this was added, copy_from_user_nofault() was added
which does the required page fault disabling for us.
- Fix the assembly markup in the ftrace direct sample code
The ftrace direct sample code (which is also used for selftests), had
the size directive between the "leave" and the "ret" instead of after
the ret. This caused objtool to think the code was unreachable.
- Only call unregister_pm_notifier() on outer most fgraph registration
There was an error path in register_ftrace_graph() that did not call
unregister_pm_notifier() on error, so it was added in the error path.
The problem with that fix, is that register_pm_notifier() is only
called by the initial user of fgraph. If that succeeds, but another
fgraph registration were to fail, then unregister_pm_notifier() would
be called incorrectly.
- Fix a crash in osnoise when zero size cpumask is passed in
If a zero size CPU mask is passed in, the kmalloc() would return
ZERO_SIZE_PTR which is not checked, and the code would continue
thinking it had real memory and crash. If zero is passed in as the
size of the write, simply return 0.
- Fix possible warning in trace_pid_write()
If while processing a series of numbers passed to the "set_event_pid"
file, and one of the updates fails to allocate (triggered by a fault
injection), it can cause a warning to trigger. Check the return value
of the call to trace_pid_list_set() and break out early with an error
code if it fails.
* tag 'trace-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Silence warning when chunk allocation fails in trace_pid_write
tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
trace/fgraph: Fix error handling
ftrace/samples: Fix function size computation
tracing: Fix tracing_marker may trigger page fault during preempt_disable
trace: Remove redundant __GFP_NOWARN
|
|
Commit 12ffc3b1513e ("PM: Restrict swap use to later in the suspend
sequence") incorrectly removed a pm_restrict_gfp_mask() call from
hibernation_snapshot(), so memory allocations involving swap are not
prevented from being carried out in this code path any more which may
lead to serious breakage.
The symptoms of such breakage have become visible after adding a
shrink_shmem_memory() call to hibernation_snapshot() in commit
2640e819474f ("PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()")
which caused this problem to be much more likely to manifest itself.
However, since commit 2640e819474f was initially present in the DRM
tree that did not include commit 12ffc3b1513e, the symptoms of this
issue were not visible until merge commit 260f6f4fda93 ("Merge tag
'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel")
that exposed it through an entirely reasonable merge conflict
resolution.
Fixes: 12ffc3b1513e ("PM: Restrict swap use to later in the suspend sequence")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220555
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
|
|
to cgroup.procs
The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.
To avoid affecting other users, the per-threadgroup rwsem is only used
when favordynmods is enabled.
As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.
To alleviate the contention between cgroup process migration and
operations like process fork, this patch takes the write lock on
signal_struct->group_rwsem when writing a pid to
cgroup.procs/threads instead of holding the global write lock.
Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per-thread-group lock during
cgroup.procs/threads writes.
The locking behavior is as follows:
write cgroup.procs/threads | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()              | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)       | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section           | critical section       | critical section
up_write(&p_rwsem)         | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()            | up_read(&g_rwsem)      | cgroup_unlock()
g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.
This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.
With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.
tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
pr_warn_once().
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Later patches will introduce a new parameter `task` to
cgroup_attach_lock(), so adjust the position of cgroup_attach_lock()
within cgroup_procs_write_start() accordingly.
Between obtaining the threadgroup leader via PID and acquiring the
cgroup attach lock, the threadgroup leader may change, which could lead
to incorrect cgroup migration. Therefore, after acquiring the cgroup
attach lock, we check whether the threadgroup leader has changed, and if
so, retry the operation.
tj: Minor comment adjustments.
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Dynamic cgroup migration involving threadgroup locks can be in one of
two states: no lock held, or holding the global lock. Explicitly
declaring the different lock modes makes the code easier to
understand and facilitates future extensions of the lock modes.
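A sketch of what the explicit modes could look like (names assumed):
    enum cgroup_attach_lock_mode {
            CGRP_ATTACH_LOCK_NONE,
            CGRP_ATTACH_LOCK_GLOBAL,
    };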
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
For large values of CONFIG_NR_CPUS, the newly added kunit test fails
to build:
kernel/printk/printk_ringbuffer_kunit_test.c: In function 'test_readerwriter':
kernel/printk/printk_ringbuffer_kunit_test.c:279:1: error: the frame size of 1432 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
Change this to use cpumask_var_t and allocate it dynamically when
CONFIG_CPUMASK_OFFSTACK is set.
The variable has to be released via a KUnit action wrapper so that it is
freed when the test fails and gets aborted. The parameter type is hardcoded
to "struct cpumask *" because the macro KUNIT_DEFINE_ACTION_WRAPPER()
does not accept an array. But the function does nothing when
CONFIG_CPUMASK_OFFSTACK is not set anyway.
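A hedged sketch of the described pattern (the wrapper name is assumed):
    KUNIT_DEFINE_ACTION_WRAPPER(prbtest_free_cpumask, free_cpumask_var,
                                struct cpumask *);

    cpumask_var_t mask;

    KUNIT_ASSERT_TRUE(test, zalloc_cpumask_var(&mask, GFP_KERNEL));
    kunit_add_action_or_reset(test, prbtest_free_cpumask, mask);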
Fixes: 5ea2bcdfbf46 ("printk: ringbuffer: Add KUnit test")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/all/20250620192554.2234184-1-arnd@kernel.org # v1
[pmladek@suse.com: Correctly handle allocation failures and freeing using KUnit test API.]
Link: https://lore.kernel.org/all/20250702095157.110916-3-pmladek@suse.com # v2
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The intel_pstate driver manages CPU capacity changes itself and it does
not need an update of the capacity of all CPUs in the system to be
carried out after registering a PD.
Moreover, in some configurations (for instance, an SMT-capable
hybrid x86 system booted with nosmt in the kernel command line) the
em_check_capacity_update() call at the end of em_dev_register_perf_domain()
always fails and reschedules itself to run once again in 1 s, so
effectively it runs in vain every 1 s forever.
To address this, introduce a new variant of em_dev_register_perf_domain(),
called em_dev_register_pd_no_update(), that does not invoke
em_check_capacity_update(), and make intel_pstate use it instead of the
original.
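A hedged sketch of the driver-side change (arguments mirror
em_dev_register_perf_domain() and are illustrative):
    /* before */
    ret = em_dev_register_perf_domain(dev, nr_states, &em_cb, cpus, false);
    /* after */
    ret = em_dev_register_pd_no_update(dev, nr_states, &em_cb, cpus, false);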
Fixes: 7b010f9b9061 ("cpufreq: intel_pstate: EAS support for hybrid platforms")
Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
cause various locking issues; see the following stack trace (edited for
style) as one example:
...
[10.011566] do_raw_spin_lock.cold
[10.011570] try_to_wake_up (5) double-acquiring the same
[10.011575] kick_pool rq_lock, causing a hardlockup
[10.011579] __queue_work
[10.011582] queue_work_on
[10.011585] kernfs_notify
[10.011589] cgroup_file_notify
[10.011593] try_charge_memcg (4) memcg accounting raises an
[10.011597] obj_cgroup_charge_pages MEMCG_MAX event
[10.011599] obj_cgroup_charge_account
[10.011600] __memcg_slab_post_alloc_hook
[10.011603] __kmalloc_node_noprof
...
[10.011611] bpf_map_kmalloc_node
[10.011612] __bpf_async_init
[10.011615] bpf_timer_init (3) BPF calls bpf_timer_init()
[10.011617] bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
[10.011619] bpf__sched_ext_ops_runnable
[10.011620] enqueue_task_scx (2) BPF runs with rq_lock held
[10.011622] enqueue_task
[10.011626] ttwu_do_activate
[10.011629] sched_ttwu_pending (1) grabs rq_lock
...
The above was reproduced on bpf-next (b338cf849ec8) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking the memcg accounting code a bit to make
a bpf_timer_init() call more likely to raise an MEMCG_MAX event.
We have also run into other similar variants (both internally and on
bpf-next), including double-acquiring cgroup_file_kn_lock, the same
worker_pool::lock, etc.
As suggested by Shakeel, fix this by using __GFP_HIGH instead of
GFP_ATOMIC in __bpf_async_init(), so that e.g. if try_charge_memcg()
raises an MEMCG_MAX event, we call __memcg_memory_event() with
@allow_spinning=false and avoid calling cgroup_file_notify() there.
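A hedged sketch of the flag change in __bpf_async_init() (variable names
approximate):
    /* was GFP_ATOMIC */
    cb = bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node);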
Depends on mm patch
"memcg: skip cgroup_file_notify if spinning is not allowed":
https://lore.kernel.org/bpf/20250905201606.66198-1-shakeel.butt@linux.dev/
v0 approach s/bpf_map_kmalloc_node/bpf_mem_alloc/
https://lore.kernel.org/bpf/20250905061919.439648-1-yepeilin@google.com/
v1 approach:
https://lore.kernel.org/bpf/20250905234547.862249-1-yepeilin@google.com/
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/20250909095222.2121438-1-yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
OpenWRT users reported a regression on ARMv6 devices after updating to the
latest HEAD, where the tcpdump filter:
tcpdump "not ether host 3c37121a2b3c and not ether host 184ecbca2a3a \
and not ether host 14130b4d3f47 and not ether host f0f61cf440b7 \
and not ether host a84b4dedf471 and not ether host d022be17e1d7 \
and not ether host 5c497967208b and not ether host 706655784d5b"
fails with warning: "Kernel filter failed: No error information"
when using config:
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_JIT_DEFAULT_ON=y
The issue arises because commits:
1. "bpf: Fix array bounds error with may_goto" changed default runtime to
__bpf_prog_ret0_warn when jit_requested = 1
2. "bpf: Avoid __bpf_prog_ret0_warn when jit fails" returns error when
jit_requested = 1 but jit fails
This change restores interpreter fallback capability for BPF programs with
stack size <= 512 bytes when jit fails.
Reported-by: Felix Fietkau <nbd@nbd.name>
Closes: https://lore.kernel.org/bpf/2e267b4b-0540-45d8-9310-e127bf95fc63@nbd.name/
Fixes: 6ebc5030e0c5 ("bpf: Fix array bounds error with may_goto")
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250909144614.2991253-1-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, out of all 3 types of waiters in the rqspinlock slow path
(i.e., pending bit waiter, wait queue head waiter, and wait queue
non-head waiter), only the pending bit waiter and wait queue head
waiters apply deadlock checks and a timeout on their waiting loop. The
assumption here was that the wait queue head's forward progress would be
sufficient to identify cases where the lock owner or pending bit waiter
is stuck, and non-head waiters relying on the head waiter would prove to
be sufficient for their own forward progress.
However, the head waiter itself can be preempted by a non-head waiter
for the same lock (AA) or a different lock (ABBA) in a manner that
impedes its forward progress. In such a case, non-head waiters not
performing deadlock and timeout checks becomes insufficient, and the
system can enter a state of lockup.
This is typically not a concern with non-NMI lock acquisitions, as lock
holders which run in different contexts (IRQ, non-IRQ) use "irqsave"
variants of the lock APIs, which naturally excludes such lock holders
from preempting one another on the same CPU.
It might seem likely that a similar case may occur for rqspinlock when
programs are attached to contention tracepoints (begin, end), however,
these tracepoints either precede the enqueue into the wait queue, or
succeed it, therefore cannot be used to preempt a head waiter's waiting
loop.
We must still be careful against nested kprobe and fentry programs that
may attach to the middle of the head's waiting loop to stall forward
progress and invoke another rqspinlock acquisition that proceeds as a
non-head waiter. To this end, drop CC_FLAGS_FTRACE from the rqspinlock.o
object file.
For now, this issue is resolved by falling back to a repeated trylock on
the lock word from NMI context, while performing the deadlock checks to
break out early in case forward progress is impossible, and use the
timeout as a final fallback.
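A rough sketch of the fallback (helper names here are illustrative, not
the actual rqspinlock internals):
    if (in_nmi()) {
            /* deadlock checks and the timeout bound the trylock loop */
            while (!trylock(lock)) {
                    if (deadlock_detected(lock))
                            return -EDEADLK;
                    if (timeout_expired(&ts))
                            return -ETIMEDOUT;
                    cpu_relax();
            }
            return 0;
    }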
A more involved fix to terminate the queue when such a condition occurs
will be made as a follow up. A selftest to stress this aspect of nested
NMI/non-NMI locking attempts will be added in a subsequent patch to the
bpf-next tree when this fix lands and trees are synchronized.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 164c246571e9 ("rqspinlock: Protect waiters in queue from stalls")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250909184959.3509085-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_strnstr() should not treat the ending '\0' of s2 as a matching character
if the parameter 'len' is equal to the s2 string length, for example:
1. bpf_strnstr("openat", "open", 4) = -ENOENT
2. bpf_strnstr("openat", "open", 5) = 0
This patch makes (1) return 0, fixing not just the `len == strlen(s2)`
case but also the more general case where s2 is a suffix of the first
len characters of s1.
Fixes: e91370550f1f ("bpf: Add kfuncs for read-only string operations")
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/tencent_17DC57B9D16BC443837021BEACE84B7C1507@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Stanislav reported that in bpf_crypto_crypt() the destination dynptr's
size is not validated to be at least as large as the source dynptr's
size before calling into the crypto backend with 'len = src_len'. This
can result in an OOB write when the destination is smaller than the
source.
Concretely, in mentioned function, psrc and pdst are both linear
buffers fetched from each dynptr:
psrc = __bpf_dynptr_data(src, src_len);
[...]
pdst = __bpf_dynptr_data_rw(dst, dst_len);
[...]
err = decrypt ?
ctx->type->decrypt(ctx->tfm, psrc, pdst, src_len, piv) :
ctx->type->encrypt(ctx->tfm, psrc, pdst, src_len, piv);
The crypto backend expects pdst to be large enough for src_len bytes to
be written. Add an additional src_len > dst_len check and bail
out if it's the case. Note that these kfuncs are accessible under root
privileges only.
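A hedged sketch of the added check (error label is illustrative):
    if (src_len > dst_len) {
            err = -EINVAL;
            goto out;
    }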
Fixes: 3e1c6f35409f ("bpf: make common crypto API for TC/XDP programs")
Reported-by: Stanislav Fort <disclosure@aisle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20250829143657.318524-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- one more fix for DMA API debugging infrastructure (Baochen Qiang)
* tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-debug: don't enforce dma mapping check on noncoherent allocations
|
|
A typical set of messages that gets printed as a result of the clocksource
watchdog finding the TSC unstable usually does not contain messages
indicating CPUs being ahead of or behind the CPU from which the check is
carried out. That fact suggests that the TSC does not experience time skew
between CPUs (if the clocksource.verify_n_cpus parameter is set to a
negative value) but quantitative information is missing.
The cs_nsec_max value printed by the "CPU %d check durations" message
actually provides a worst case estimate of the time skew. If all CPUs have
been checked, the cs_nsec_max value multiplied by 2 is the maximum
possible time skew between the TSCs of any two CPUs on the system. The
worst case estimate is derived from two boundary cases:
1. No time is consumed to execute instructions between csnow_begin and
csnow_mid while all the cs_nsec_max time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being ahead would be cs_nsec_max.
2. All the cs_nsec_max time is consumed to execute instructions between
csnow_begin and csnow_mid while no time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being behind would be cs_nsec_max.
The worst case estimate assumes a system experiencing a corner case
consisting of the two boundary cases.
Always print the "CPU %d check durations" message so that the maximum
possible time skew measured by the TSC sync check can be compared to the
time skew measured by the clocksource watchdog.
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/aIuXXfdITXdI0lLp@incl
|
|
When testing softirq based hrtimers on an ARM32 board, with high resolution
mode and NOHZ inactive, softirq based hrtimers fail to expire after being
moved away from an offline CPU:
CPU0 CPU1
hrtimer_start(..., HRTIMER_MODE_SOFT);
cpu_down(CPU1) ...
hrtimers_cpu_dying()
// Migrate timers to CPU0
smp_call_function_single(CPU0, retrigger_next_event);
retrigger_next_event()
if (!highres && !nohz)
return;
As retrigger_next_event() is a NOOP when both high resolution timers and
NOHZ are inactive, CPU0's hrtimer_cpu_base::softirq_expires_next is not
updated and the migrated softirq timers never expire unless there is a
softirq based hrtimer queued on CPU0 later.
Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check
in retrigger_next_event(), which enforces a full update of the CPU base.
As this is not a fast path the extra cost does not matter.
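A hedged sketch of the change (illustrative, not the exact patch):
    static void retrigger_next_event(void *arg)
    {
            struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

    -       if (!hrtimer_hres_active(base) && !tick_nohz_active)
    -               return;

            raw_spin_lock(&base->lock);
            hrtimer_update_base(base);
            ...
    }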
[ tglx: Massaged change log ]
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com
|
|
tick_shutdown() sets the state of the clockevent device to detached
first and then invokes clockevents_exchange_device(), which in turn
invokes clockevents_switch_state().
But clockevents_switch_state() returns without invoking the device shutdown
callback as the device is already in detached state. As a consequence the
timer device is not shutdown when a CPU goes offline.
tick_shutdown() does this because it was originally invoked on an online CPU
and not on the outgoing CPU. It therefore could not access the clockevent
device of the already offlined CPU and just set the state.
Since commit 3b1596a21fbf tick_shutdown() is called on the outgoing CPU, so
the hardware device can be accessed.
Remove the state set before calling clockevents_exchange_device(), so that
the subsequent clockevents_switch_state() handles the state transition and
invokes the shutdown callback of the clockevent device.
[ tglx: Massaged change log ]
Fixes: 3b1596a21fbf ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn
|
|
Align the ordering to the one used for hrtimer_bases.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-9-3ae822e5bfbd@linutronix.de
|
|
The get_time() callbacks always need to match the bases clockid.
Instead of maintaining that association twice in hrtimer_bases,
use a helper.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-8-3ae822e5bfbd@linutronix.de
|
|
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.
Switch to the equivalent helper.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
|
|
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.
Switch to the equivalent helper.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-2-3ae822e5bfbd@linutronix.de
|
|
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.
Switch to the equivalent helpers.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-1-3ae822e5bfbd@linutronix.de
|
|
Syzkaller triggered a fault injection warning:
WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0
Modules linked in:
CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0
Tainted: [U]=USER
Hardware name: Google Compute Engine/Google Compute Engine
RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294
Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff
RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283
RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000
RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef
R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0
FS: 00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464
register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline]
register_pid_events kernel/trace/trace_events.c:2354 [inline]
event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425
vfs_write+0x24c/0x1150 fs/read_write.c:677
ksys_write+0x12b/0x250 fs/read_write.c:731
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
We can reproduce the warning by following the steps below:
1. echo 8 >> set_event_notrace_pid. tr->filtered_pids now owns one pid
and the sched_switch tracepoint is registered.
2. echo ' ' >> set_event_pid, and perform fault injection during chunk
allocation of trace_pid_list_alloc. The resulting pid_list contains no
pid and is assigned to tr->filtered_pids.
3. echo ' ' >> set_event_pid. pid_list is now NULL and is assigned to
tr->filtered_pids.
4. echo 9 >> set_event_pid. This triggers the double-register
sched_switch tracepoint warning.
The reason is that syzkaller injects a fault into the chunk allocation
in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which
may trigger a double register of the same tracepoint. This only occurs
when the system is about to crash, but to suppress this warning, add
failure handling logic for trace_pid_list_set.
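A hedged sketch of the added handling in trace_pid_write() (context
approximate):
    ret = trace_pid_list_set(pid_list, pid);
    if (ret < 0)
            break;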
Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com
Fixes: 8d6e90983ade ("tracing: Create a sparse bitmask for pid filtering")
Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-CPU wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
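A usage sketch of the new flag (queue name illustrative):
    /* explicitly request the historical per-CPU behavior */
    wq = alloc_workqueue("example_wq", WQ_PERCPU, 0);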
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-4-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-CPU wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq; if the user still uses the old wq, a warning will be
printed along with a redirect to the new wq.
The old system_unbound_wq will be kept for a few release cycles.
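A usage sketch (work item name illustrative):
    /* preferred queue for work with no locality requirement */
    queue_work(system_dfl_wq, &example_work);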
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-3-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-CPU wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_wq is a per-CPU workqueue, yet nothing in its name indicates that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-CPU wq; if the user still sticks to the old name, a warning will
be printed along with a redirect to the new wq.
This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.
The old wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-2-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix a severe slowdown regression in the timer vDSO code related to the
while() loop in __iter_div_u64_rem(), when the AUX-clock is enabled"
* tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
vdso/vsyscall: Avoid slow division loop in auxiliary clock update
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix an 'allocation from atomic context' regression in the futex
vmalloc variant"
* tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex_hash_free() back to __mmput()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
"Fix regression where PERF_EVENT_IOC_REFRESH counters miss a PMU-stop"
* tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix the POLL_HUP delivery breakage
|
|
A crash was observed with the following output:
BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
<TASK>
osnoise_cpus_write+0x7a/0x190
vfs_write+0xf8/0x410
? do_sys_openat2+0x88/0xd0
ksys_write+0x60/0xd0
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
This issue can be reproduced by the code below:
fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);
When the user passes 'count=0' to osnoise_cpus_write(), kmalloc() returns
ZERO_SIZE_PTR (16) and cpulist_parse() treats it as a normal pointer, which
triggers the NULL pointer dereference. Add a check for the parameter 'count'.
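A hedged sketch of the added check in osnoise_cpus_write():
    if (count < 1)
            return 0;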
Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com
Fixes: 17f89102fe23 ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Commit edede7a6dcd7 ("trace/fgraph: Fix the warning caused by missing
unregister notifier") added a call to unregister the PM notifier if
register_ftrace_graph() failed. It does so unconditionally. However,
the PM notifier is only registered with the first call to
register_ftrace_graph(). If the first registration was successful and
a subsequent registration failed, the notifier is now unregistered even
if ftrace graphs are still registered.
Fix the problem by only unregistering the PM notifier during error handling
if there are no active fgraph registrations.
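A hedged sketch of the corrected error path (symbol names assumed from
the fgraph code):
    if (!ftrace_graph_active)
            unregister_pm_notifier(&ftrace_suspend_notifier);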
Fixes: edede7a6dcd7 ("trace/fgraph: Fix the warning caused by missing unregister notifier")
Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/
Cc: Ye Weihua <yeweihua4@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single small audit patch to fix a potential out-of-bounds read
caused by a negative array index when comparing paths"
* tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: fix out-of-bounds read in audit_compare_dname_path()
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-CPU wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_wq is a per-CPU workqueue, yet nothing in its name indicates that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-CPU wq; if the user still sticks to the old name, a warning will
be printed along with a redirect to the new wq.
This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.
The old wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-CPU wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq; if the user still uses the old wq, a warning will be
printed along with a redirect to the new wq.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull for-6.17-fixes to receive 79f919a89c9d ("cgroup: split
cgroup_destroy_wq into 3 workqueues") to resolve its conflict with
7fa33aa3b001 ("cgroup: WQ_PERCPU added to alloc_workqueue users"). The
latter adds WQ_PERCPU when creating cgroup_destroy_wq and the former splits
the workqueue into three. Resolve by applying WQ_PERCPU to the three split
workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
|