summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2024-10-10timekeeping: Add interfaces for handling timestamps with a floor valueJeff Layton-0/+104
Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees [1]. To prevent this, maintain a floor value for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. Add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Add two new public interfaces: - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. The floor value is global and updated via a single try_cmpxchg(). If that fails then the operation raced with a concurrent update. Any concurrent update must be later than the existing floor value, so any racing tasks can accept any resulting floor value without retrying. [1]: POSIX requires that files be stamped with realtime clock values, and makes no provision for dealing with backward clock jumps. If a backward realtime clock jump occurs, then files can appear to have been modified in reverse order. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20241002-mgtime-v10-1-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-09bpf: fix unpopulated name_len field in perf_event link infoTyrone Wu-7/+22
Previously when retrieving `bpf_link_info.perf_event` for kprobe/uprobe/tracepoint, the `name_len` field was not populated by the kernel, leaving it to reflect the value initially set by the user. This behavior was inconsistent with how other input/output string buffer fields function (e.g. `raw_tracepoint.tp_name_len`). This patch fills `name_len` with the actual size of the string name. Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") Signed-off-by: Tyrone Wu <wudevelops@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20241008164312.46269-1-wudevelops@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09bpf: use kvzmalloc to allocate BPF verifier environmentRik van Riel-2/+2
The kzmalloc call in bpf_check can fail when memory is very fragmented, which in turn can lead to an OOM kill. Use kvzmalloc to fall back to vmalloc when memory is too fragmented to allocate an order 3 sized bpf verifier environment. Admittedly this is not a very common case, and only happens on systems where memory has already been squeezed close to the limit, but this does not seem like much of a hot path, and it's a simple enough fix. Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09tracing: Use atomic64_inc_return() in trace_clock_counter()Uros Bizjak-1/+1
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241007085651.48544-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09trace/trace_event_perf: remove duplicate samples on the first tracepoint eventLevi Yun-0/+6
When a tracepoint event is created with attr.freq = 1, 'hwc->period_left' is not initialized correctly. As a result, in the perf_swevent_overflow() function, when the first time the event occurs, it calculates the event overflow and the perf_swevent_set_period() returns 3, this leads to the event are recorded for three duplicate times. Step to reproduce: 1. Enable the tracepoint event & starting tracing $ echo 1 > /sys/kernel/tracing/events/module/module_free $ echo 1 > /sys/kernel/tracing/tracing_on 2. Record with perf $ perf record -a --strict-freq -F 1 -e "module:module_free" 3. Trigger module_free event. $ modprobe -i sunrpc $ modprobe -r sunrpc Result: - Trace pipe result: $ cat trace_pipe modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc - perf sample: modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc By setting period_left via perf_swevent_set_period() as other sw_event did, This problem could be solved. After patch: - Trace pipe result: $ cat trace_pipe modprobe 1153096 [068] 613468.867774: module:module_free: xfs - perf sample modprobe 1153096 [068] 613468.867794: module:module_free: xfs Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment") Signed-off-by: Levi Yun <yeoreum.yun@arm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09bpf: Check the remaining info_cnt before repeating btf fieldsHou Tao-4/+10
When trying to repeat the btf fields for array of nested struct, it doesn't check the remaining info_cnt. The following splat will be reported when the value of ret * nelems is greater than BTF_FIELDS_MAX: ------------[ cut here ]------------ UBSAN: array-index-out-of-bounds in ../kernel/bpf/btf.c:3951:49 index 11 is out of range for type 'btf_field_info [11]' CPU: 6 UID: 0 PID: 411 Comm: test_progs ...... 6.11.0-rc4+ #1 Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ... Call Trace: <TASK> dump_stack_lvl+0x57/0x70 dump_stack+0x10/0x20 ubsan_epilogue+0x9/0x40 __ubsan_handle_out_of_bounds+0x6f/0x80 ? kallsyms_lookup_name+0x48/0xb0 btf_parse_fields+0x992/0xce0 map_create+0x591/0x770 __sys_bpf+0x229/0x2410 __x64_sys_bpf+0x1f/0x30 x64_sys_call+0x199/0x9f0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fea56f2cc5d ...... </TASK> ---[ end trace ]--- Fix it by checking the remaining info_cnt in btf_repeat_fields() before repeating the btf fields. Fixes: 64e8ee814819 ("bpf: look into the types of the fields of a struct type recursively.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241008071114.3718177-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of ↵Linus Torvalds-4/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes, 5 of which are c:stable. All singletons, about half of which are MM" * tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: delete comments for "value" member of 'struct zswap_entry'. CREDITS: sort alphabetically by name secretmem: disable memfd_secret() if arch cannot set direct map .mailmap: update Fangrui's email mm/huge_memory: check pmd_special() only after pmd_present() resource, kunit: fix user-after-free in resource_test_region_intersects() fs/proc/kcore.c: allow translation of physical memory addresses selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test device-dax: correct pgoff align in dax_set_mapping() kthread: unpark only parked kthread Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN" bcachefs: do not use PF_MEMALLOC_NORECLAIM
2024-10-09tracing/perf: Add might_fault check to syscall probesMathieu Desnoyers-0/+2
Add a might_fault() check to validate that the perf sys_enter/sys_exit probe callbacks are indeed called from a context where page faults can be handled. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-8-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09tracing/ftrace: Add might_fault check to syscall probesMathieu Desnoyers-0/+2
Add a might_fault() check to validate that the ftrace sys_enter/sys_exit probe callbacks are indeed called from a context where page faults can be handled. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-7-mathieu.desnoyers@efficios.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09tracing/perf: disable preemption in syscall probeMathieu Desnoyers-0/+12
In preparation for allowing system call enter/exit instrumentation to handle page faults, make sure that perf can handle this change by explicitly disabling preemption within the perf system call tracepoint probes to respect the current expectations within perf ring buffer code. This change does not yet allow perf to take page faults per se within its probe, but allows its existing probes to adapt to the upcoming change. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-4-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09tracing/ftrace: disable preemption in syscall probeMathieu Desnoyers-0/+12
In preparation for allowing system call enter/exit instrumentation to handle page faults, make sure that ftrace can handle this change by explicitly disabling preemption within the ftrace system call tracepoint probes to respect the current expectations within ftrace ring buffer code. This change does not yet allow ftrace to take page faults per se within its probe, but allows its existing probes to adapt to the upcoming change. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-3-mathieu.desnoyers@efficios.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09resource, kunit: fix user-after-free in resource_test_region_intersects()Huang Ying-4/+14
In resource_test_insert_resource(), the pointer is used in error message after kfree(). This is user-after-free. To fix this, we need to call kunit_add_action_or_reset() to schedule memory freeing after usage. But kunit_add_action_or_reset() itself may fail and free the memory. So, its return value should be checked and abort the test for failure. Then, we found that other usage of kunit_add_action_or_reset() in resource_test_region_intersects() needs to be fixed too. We fix all these user-after-free bugs in this patch. Link: https://lkml.kernel.org/r/20240930070611.353338-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Kees Bakker <kees@ijzerbout.nl> Closes: https://lore.kernel.org/lkml/87ldzaotcg.fsf@yhuang6-desk2.ccr.corp.intel.com/ Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09kthread: unpark only parked kthreadFrederic Weisbecker-0/+2
Calling into kthread unparking unconditionally is mostly harmless when the kthread is already unparked. The wake up is then simply ignored because the target is not in TASK_PARKED state. However if the kthread is per CPU, the wake up is preceded by a call to kthread_bind() which expects the task to be inactive and in TASK_PARKED state, which obviously isn't the case if it is unparked. As a result, calling kthread_stop() on an unparked per-cpu kthread triggers such a warning: WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525 <TASK> kthread_stop+0x17a/0x630 kernel/kthread.c:707 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769 ops_exit_list net/core/net_namespace.c:178 [inline] cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fix this with skipping unecessary unparking while stopping a kthread. Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org Fixes: 5c25b5ff89f0 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09ring-buffer: Do not have boot mapped buffers hook to CPU hotplugSteven Rostedt-3/+6
The boot mapped ring buffer has its buffer mapped at a fixed location found at boot up. It is not dynamic. It cannot grow or be expanded when new CPUs come online. Do not hook fixed memory mapped ring buffers to the CPU hotplug callback, otherwise it can cause a crash when it tries to add the buffer to the memory that is already fully occupied. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home Fixes: be68d63a139bd ("ring-buffer: Add ring_buffer_alloc_range()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09locking/ww_mutex: Adjust to lockdep nest_lock requirementsThomas Hellström-3/+5
When using mutex_acquire_nest() with a nest_lock, lockdep refcounts the number of acquired lockdep_maps of mutexes of the same class, and also keeps a pointer to the first acquired lockdep_map of a class. That pointer is then used for various comparison-, printing- and checking purposes, but there is no mechanism to actively ensure that lockdep_map stays in memory. Instead, a warning is printed if the lockdep_map is freed and there are still held locks of the same lock class, even if the lockdep_map itself has been released. In the context of WW/WD transactions that means that if a user unlocks and frees a ww_mutex from within an ongoing ww transaction, and that mutex happens to be the first ww_mutex grabbed in the transaction, such a warning is printed and there might be a risk of a UAF. Note that this is only problem when lockdep is enabled and affects only dereferences of struct lockdep_map. Adjust to this by adding a fake lockdep_map to the acquired context and make sure it is the first acquired lockdep map of the associated ww_mutex class. Then hold it for the duration of the WW/WD transaction. This has the side effect that trying to lock a ww mutex *without* a ww_acquire_context but where a such context has been acquire, we'd see a lockdep splat. The test-ww_mutex.c selftest attempts to do that, so modify that particular test to not acquire a ww_acquire_context if it is not going to be used. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241009092031.6356-1-thomas.hellstrom@linux.intel.com
2024-10-09bpf: Constify ctl_table argument of filter functionThomas Weißschuh-1/+1
The sysctl core is moving to allow "struct ctl_table" in read-only memory. As a preparation for that all functions handling "struct ctl_table" need to be able to work with "const struct ctl_table". As __cgroup_bpf_run_filter_sysctl() does not modify its table, it can be adapted trivially. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2024-10-08tracepoint: Remove SRCU protectionSteven Rostedt-50/+1
With the removal of the trace_*_rcuidle() tracepoints, there is no reason to protect tracepoints with SRCU. The reason the SRCU protection was added, was because it can protect tracepoints when RCU is not "watching". Now that tracepoints are only used when RCU is watching, remove the SRCU protection. It just made things more complex and confusing anyway. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241003184220.0dc21d35@gandalf.local.home Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08tracing: Remove definition of trace_*_rcuidle()Steven Rostedt-20/+6
The trace_*_rcuidle() variant of a tracepoint was to handle places where a tracepoint was located but RCU was not "watching". All those locations have been removed, and RCU should be watching where all tracepoints are located. We can now remove the trace_*_rcuidle() variant. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241003181629.36209057@gandalf.local.home Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08tracepoints: Use new static branch APIJosh Poimboeuf-5/+5
The old static key API is deprecated. Switch to the new one. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/7a08dae3c5eddb14b13864923c1b58ac1f4af83c.1728414936.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08Merge tag 'sched_ext-for-6.12-rc2-fixes' of ↵Linus Torvalds-19/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - ops.enqueue() didn't have a way to tell whether select_task_rq_scx() and thus ops.select() were skipped. Some schedulers were incorrectly using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using it. - Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit() - Fix error information clobbering during load - Add missing __weak markers to BPF helper declarations - Doc update * tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Documentation: Update instructions for running example schedulers sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called sched/core: Make select_task_rq() take the pointer to wake_flags instead of value sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init() sched_ext: Improve error reporting during loading sched_ext: Add __weak markers to BPF helper function decalarations
2024-10-08bpf, lsm: Remove bpf_lsm_key_free hookThomas Weißschuh-4/+0
The key_free LSM hook has been removed. Remove the corresponding BPF hook. Avoid warnings during the build: BTFIDS vmlinux WARN: resolve_btfids: unresolved symbol bpf_lsm_key_free Fixes: 5f8d28f6d7d5 ("lsm: infrastructure management of the key security blob") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241005-lsm-key_free-v1-1-42ea801dbd63@weissschuh.net
2024-10-08tracing: Remove TRACE_EVENT_FL_FILTERED logicZheng Yejian-74/+20
After commit dcb0b5575d24 ("tracing: Remove TRACE_EVENT_FL_USE_CALL_FILTER logic"), no one's going to set the TRACE_EVENT_FL_FILTERED or change the call->filter, so remove related logic. Link: https://lore.kernel.org/20240911010026.2302849-1-zhengyejian@huaweicloud.com Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08tracing/branch-profiler: Replace deprecated strncpy with strscpyJustin Stitt-4/+2
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Both of these fields want to be NUL-terminated as per their use in printk: F_printk("%u:%s:%s (%u)%s", __entry->line, __entry->func, __entry->file, __entry->correct, __entry->constant ? " CONSTANT" : "") Use strscpy() as it NUL-terminates the destination buffer, so it doesn't have to be done manually. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/20240826-strncpy-kernel-trace-trace_branch-c-v1-1-b2c14f2e9e84@google.com Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08ftrace: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())Li Chen-7/+3
Use this_cpu_ptr() instead of open coding the equivalent in various ftrace functions. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/87y14t6ofi.wl-me@linux.beauty Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-08cgroup/rstat: Tracking cgroup-level niced CPU timeJoshua Hahn-5/+14
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite already being tracked). This patch exposes niced CPU time to the userspace, allowing users to get a better understanding of their hardware limits and can facilitate more informed workload distribution. A new field 'ntime' is added to struct cgroup_base_stat as opposed to struct task_cputime to minimize footprint. Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-08cgroup/bpf: use a dedicated workqueue for cgroup bpf destructionChen Ridong-1/+18
A hung_task problem shown below was found: INFO: task kworker/0:0:8 blocked for more than 327 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Workqueue: events cgroup_bpf_release Call Trace: <TASK> __schedule+0x5a2/0x2050 ? find_held_lock+0x33/0x100 ? wq_worker_sleeping+0x9e/0xe0 schedule+0x9f/0x180 schedule_preempt_disabled+0x25/0x50 __mutex_lock+0x512/0x740 ? cgroup_bpf_release+0x1e/0x4d0 ? cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? cgroup_bpf_release+0x1e/0x4d0 ? mutex_lock_nested+0x2b/0x40 ? __pfx_delay_tsc+0x10/0x10 mutex_lock_nested+0x2b/0x40 cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 ? process_scheduled_works+0x161/0x8a0 process_scheduled_works+0x23a/0x8a0 worker_thread+0x231/0x5b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x14d/0x1c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x59/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> This issue can be reproduced by the following pressuse test: 1. A large number of cpuset cgroups are deleted. 2. Set cpu on and off repeatly. 3. Set watchdog_thresh repeatly. The scripts can be obtained at LINK mentioned above the signature. The reason for this issue is cgroup_mutex and cpu_hotplug_lock are acquired in different tasks, which may lead to deadlock. It can lead to a deadlock through the following steps: 1. A large number of cpusets are deleted asynchronously, which puts a large number of cgroup_bpf_release works into system_wq. The max_active of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are cgroup_bpf_release works, and many cgroup_bpf_release works will be put into inactive queue. As illustrated in the diagram, there are 256 (in the acvtive queue) + n (in the inactive queue) works. 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put smp_call_on_cpu work into system_wq. However step 1 has already filled system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has to wait until the works that were put into the inacvtive queue earlier have executed (n cgroup_bpf_release), so it will be blocked for a while. 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. 4. Cpusets that were deleted at step 1 put cgroup_release works into cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. When cgroup_metux is acqured by work at css_killed_work_fn, it will call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked for step 3. 5. At this moment, there are 256 works in active queue that are cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as a result, all of them are blocked. Consequently, sscs.work can not be executed. Ultimately, this situation leads to four processes being blocked, forming a deadlock. system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) ... 2000+ cgroups deleted asyn 256 actives + n inactives __lockup_detector_reconfigure P(cpu_hotplug_lock.read) put sscs.work into system_wq 256 + n + 1(sscs.work) sscs.work wait to be executed warting sscs.work finish percpu_down_write P(cpu_hotplug_lock.write) ...blocking... css_killed_work_fn P(cgroup_mutex) cpuset_css_offline P(cpu_hotplug_lock.read) ...blocking... 256 cgroup_bpf_release mutex_lock(&cgroup_mutex); ..blocking... To fix the problem, place cgroup_bpf_release works on a dedicated workqueue which can break the loop and solve the problem. System wqs are for misc things which shouldn't create a large number of concurrent work items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work items, it should use its own dedicated workqueue. Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Cc: stable@vger.kernel.org # v5.3+ Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t Tested-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-07bpf: Fix memory leak in bpf_core_applyJiri Olsa-0/+1
We need to free specs properly. Fixes: 3d2786d65aaa ("bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20241007160958.607434-1-jolsa@kernel.org
2024-10-07sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTEDTejun Heo-0/+1
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to tell whether ops.select_cpu() was called. This is incorrect as ops.select_cpu() can be skipped in the wakeup path and leads to e.g. incorrectly skipping direct dispatch for tasks that are bound to a single CPU. sched core has been updated to specify ENQUEUE_RQ_SELECTED if ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update scx_qmap to test it instead of SCX_ENQ_WAKEUP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-07sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() ↵Tejun Heo-2/+9
was called During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or migration is disabled. sched_ext schedulers may perform operations such as direct dispatch from ->select_task_rq() path and it is useful for them to know whether ->select_task_rq() was skipped in the ->enqueue_task() path. Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose and end up assuming incorrectly that ->select_task_rq() was called for tasks that are bound to a single CPU or migration disabled. Make select_task_rq() indicate whether ->select_task_rq() was called by setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that to ENQUEUE_RQ_SELECTED for ->enqueue_task(). This will be used by sched_ext to fix ->select_task_rq() skip detection. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-07sched/core: Make select_task_rq() take the pointer to wake_flags instead of ↵Tejun Heo-5/+8
value This will be used to allow select_task_rq() to indicate whether ->select_task_rq() was called by modifying *wake_flags. This makes try_to_wake_up() call all functions that take wake_flags with WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is more consistent, and, as the flag is only tested by ->select_task_rq() implementations, it doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-07remove pointless includes of <linux/fdtable.h>Al Viro-7/+0
some of those used to be needed, some had been cargo-culted for no reason... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-07get rid of ...lookup...fdget_rcu() familyAl Viro-8/+2
Once upon a time, predecessors of those used to do file lookup without bumping a refcount, provided that caller held rcu_read_lock() across the lookup and whatever it wanted to read from the struct file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU, that stopped being feasible and these primitives started to bump the file refcount for lookup result, requiring the caller to call fput() afterwards. But that turned them pointless - e.g. rcu_read_lock(); file = lookup_fdget_rcu(fd); rcu_read_unlock(); is equivalent to file = fget_raw(fd); and all callers of lookup_fdget_rcu() are of that form. Similarly, task_lookup_fdget_rcu() calls can be replaced with calling fget_task(). task_lookup_next_fdget_rcu() doesn't have direct counterparts, but its callers would be happier if we replaced it with an analogue that deals with RCU internally. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-07uprobes: fold xol_take_insn_slot() into xol_get_insn_slot()Oleg Nesterov-13/+4
After the previous change xol_take_insn_slot() becomes trivial, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241001142503.GA13633@redhat.com
2024-10-07uprobes: kill xol_area->slot_countOleg Nesterov-14/+15
Add the new helper, xol_get_slot_nr() which does find_first_zero_bit() + test_and_set_bit(). xol_take_insn_slot() can wait for the "xol_get_slot_nr() < UINSNS_PER_PAGE" event instead of "area->slot_count < UINSNS_PER_PAGE". So we can kill area->slot_count and avoid atomic_inc() + atomic_dec(), this simplifies the code and can slightly improve the performance. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241001142458.GA13629@redhat.com
2024-10-07uprobes: deny mremap(xol_vma)Oleg Nesterov-13/+17
kernel/events/uprobes.c assumes that xol_area->vaddr is always correct but a malicious application can remap its "[uprobes]" vma to another adress to confuse the kernel. Introduce xol_mremap() to make this impossible. With this change utask->xol_vaddr in xol_free_insn_slot() can't be invalid, we can turn the offset check into WARN_ON_ONCE(offset >= PAGE_SIZE). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144258.GA9492@redhat.com
2024-10-07uprobes: pass utask to xol_get_insn_slot() and xol_free_insn_slot()Oleg Nesterov-9/+8
Add the "struct uprobe_task *utask" argument to xol_get_insn_slot() and xol_free_insn_slot(), their callers already have it so we can avoid the unnecessary dereference and simplify the code. Kill the "tsk" argument of xol_free_insn_slot(), it is always current. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144253.GA9487@redhat.com
2024-10-07uprobes: move the initialization of utask->xol_vaddr from pre_ssout() to ↵Oleg Nesterov-14/+8
xol_get_insn_slot() This simplifies the code and makes xol_get_insn_slot() symmetric with xol_free_insn_slot() which clears utask->xol_vaddr. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144248.GA9483@redhat.com
2024-10-07uprobes: simplify xol_take_insn_slot() and its callerOleg Nesterov-13/+4
The do / while (slot_nr >= UINSNS_PER_PAGE) loop in xol_take_insn_slot() makes no sense, the checked condition is always true. Change this code to use the "for (;;)" loop, this way we do not need to change slot_nr if test_and_set_bit() fails. Also, kill the unnecessary xol_vaddr != NULL check in xol_get_insn_slot(), xol_take_insn_slot() never returns NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144244.GA9480@redhat.com
2024-10-07uprobes: kill the unnecessary put_uprobe/xol_free_insn_slot in ↵Oleg Nesterov-19/+5
uprobe_free_utask() If pre_ssout() succeeds and sets utask->active_uprobe and utask->xol_vaddr the task must not exit until it calls handle_singlestep() which does the necessary put_uprobe() and xol_free_insn_slot(). Remove put_uprobe() and xol_free_insn_slot() from uprobe_free_utask(). With this change xol_free_insn_slot() can't hit xol_area/utask/xol_vaddr == NULL, we can kill the unnecessary checks checks and simplify this function more. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144239.GA9475@redhat.com
2024-10-07uprobes: sanitiize xol_free_insn_slot()Oleg Nesterov-12/+9
1. Clear utask->xol_vaddr unconditionally, even if this addr is not valid, xol_free_insn_slot() should never return with utask->xol_vaddr != NULL. 2. Add a comment to explain why do we need to validate slot_addr. 3. Simplify the validation above. We can simply check offset < PAGE_SIZE, unsigned underflows are fine, it should work if slot_addr < area->vaddr. 4. Kill the unnecessary "slot_nr >= UINSNS_PER_PAGE" check, slot_nr must be valid if offset < PAGE_SIZE. The next patches will cleanup this function even more. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144235.GA9471@redhat.com
2024-10-07uprobes: don't abuse get_utask() in pre_ssout() and prepare_uretprobe()Oleg Nesterov-11/+3
handle_swbp() calls get_utask() before prepare_uretprobe() or pre_ssout() can be called, they can simply use current->utask which can't be NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240929144230.GA9468@redhat.com
2024-10-07uprobes: switch to RCU Tasks Trace flavor for better performanceAndrii Nakryiko-22/+16
This patch switches uprobes SRCU usage to RCU Tasks Trace flavor, which is optimized for more lightweight and quick readers (at the expense of slower writers, which for uprobes is a fine tradeof) and has better performance and scalability with number of CPUs. Similarly to baseline vs SRCU, we've benchmarked SRCU-based implementation vs RCU Tasks Trace implementation. SRCU ==== uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu) uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu) uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu) uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu) uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu) uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu) uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu) uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu) uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu) uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu) uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu) uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu) uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu) uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu) RCU Tasks Trace =============== uprobe-nop ( 1 cpus): 3.396 ± 0.002M/s ( 3.396M/s/cpu) uprobe-nop ( 2 cpus): 4.271 ± 0.006M/s ( 2.135M/s/cpu) uprobe-nop ( 4 cpus): 8.499 ± 0.015M/s ( 2.125M/s/cpu) uprobe-nop ( 8 cpus): 10.355 ± 0.028M/s ( 1.294M/s/cpu) uprobe-nop (16 cpus): 7.615 ± 0.099M/s ( 0.476M/s/cpu) uprobe-nop (32 cpus): 4.430 ± 0.007M/s ( 0.138M/s/cpu) uprobe-nop (64 cpus): 6.887 ± 0.020M/s ( 0.108M/s/cpu) uretprobe-nop ( 1 cpus): 2.174 ± 0.001M/s ( 2.174M/s/cpu) uretprobe-nop ( 2 cpus): 2.853 ± 0.001M/s ( 1.426M/s/cpu) uretprobe-nop ( 4 cpus): 4.913 ± 0.002M/s ( 1.228M/s/cpu) uretprobe-nop ( 8 cpus): 5.883 ± 0.002M/s ( 0.735M/s/cpu) uretprobe-nop (16 cpus): 5.147 ± 0.001M/s ( 0.322M/s/cpu) uretprobe-nop (32 cpus): 3.738 ± 0.008M/s ( 0.117M/s/cpu) uretprobe-nop (64 cpus): 4.397 ± 0.002M/s ( 0.069M/s/cpu) Peak throughput for uprobes increases from 8 mln/s to 10.3 mln/s (+28%!), and for uretprobes from 5.3 mln/s to 5.8 mln/s (+11%), as we have more work to do on uretprobes side. Even single-thread (no contention) performance is slightly better: 3.276 mln/s to 3.396 mln/s (+3.5%) for uprobes, and 2.055 mln/s to 2.174 mln/s (+5.8%) for uretprobes. We also select TASKS_TRACE_RCU for UPROBES in Kconfig due to the new dependency. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20240910174312.3646590-1-andrii@kernel.org
2024-10-07sched/wait: Remove unused bit_wait_io_timeoutDr. David Alan Gilbert-14/+0
bit_wait_io_timeout has been unused since 2016's commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit") Remove it. Signed-off-by: "Dr. David Alan Gilbert" <linux@treblig.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241001234016.231696-1-linux@treblig.org
2024-10-07sched/fair: fix the comment for PREEMPT_SHORTHuang Shijie-1/+1
We do not have RESPECT_SLICE, we only have RUN_TO_PARITY. Change RESPECT_SLICE to RUN_TO_PARITY, makes it more clear. Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Link: https://lkml.kernel.org/r/20241001070456.10939-1-shijie@os.amperecomputing.com
2024-10-07sched/fair: optimize the PLACE_LAG when se->vlag is zeroHuang Shijie-1/+1
When PLACE_LAG is enabled, from the relationship: vl_i = (W + w_i)*vl'_i / W we know that if vl'_i(se->vlag) is zero, the vl_i is zero too. So if se->vlag is zero, there is no need to waste cycles to do the calculation. Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Link: https://lkml.kernel.org/r/20241001070021.10626-1-shijie@os.amperecomputing.com
2024-10-07sched/fair: remove the DOUBLE_TICK featureHuang Shijie-7/+0
The patch "5e963f2bd46 sched/fair: Commit to EEVDF" removed the code following the DOUBLE_TICK: - - if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1) - check_preempt_tick(cfs_rq, curr); The DOUBLE_TICK feature becomes dead code now, so remove it. Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Christoph Lameter (Ampere)" <cl@linux.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20241001065451.10356-1-shijie@os.amperecomputing.com
2024-10-07softirq: use bit waits instead of var waits.NeilBrown-10/+4
The waiting in softirq.c is always waiting for a bit to be cleared. This makes the bit wait functions seem more suitable. By switching over we can rid of all explicit barriers. We also use wait_on_bit_lock() to avoid an explicit loop. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240925053405.3960701-8-neilb@suse.de
2024-10-07sched: Document wait_var_event() family of functions and wake_up_var()NeilBrown-0/+30
wake_up_var(), wait_var_event() and related interfaces are not documented but have important ordering requirements. This patch adds documentation and makes these requirements explicit. The return values for those wait_var_event_* functions which return a value are documented. Note that these are, perhaps surprisingly, sometimes different from comparable wait_on_bit() functions. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240925053405.3960701-4-neilb@suse.de
2024-10-07sched: Improve documentation for wake_up_bit/wait_on_bit family of functionsNeilBrown-13/+21
This patch revises the documention for wake_up_bit(), clear_and_wake_up_bit(), and all the wait_on_bit() family of functions. The new documentation places less emphasis on the pool of waitqueues used (an implementation detail) and focuses instead on details of how the functions behave. The barriers included in the wait functions and clear_and_wake_up_bit() and those required for wake_up_bit() are spelled out more clearly. The error statuses returned are given explicitly. The fact that the wait_on_bit_lock() function sets the bit is made more obvious. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240925053405.3960701-3-neilb@suse.de
2024-10-07sched: change wake_up_bit() and related function to expect unsigned long *NeilBrown-6/+6
wake_up_bit() currently allows a "void *". While this isn't strictly a problem as the address is never dereferenced, it is inconsistent with the corresponding wait_on_bit() which requires "unsigned long *" and does dereference the pointer. Any code that needs to wait for a change in something other than an unsigned long would be better served by wake_up_var()/wait_var_event(). This patch changes all related "void *" to "unsigned long *". Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240925053405.3960701-2-neilb@suse.de