| Age | Commit message (Collapse) | Author | Lines |
|
rcu_dereference_rtnl() does not clearly tell whether the caller
is under RCU or RTNL.
Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get()
in the future.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
into clk-for-6.19
Merge Kaanapali RPMh, TCSR and global clock controllers through a topic
branch, so they can be made available in the DeviceTree branch as well.
|
|
Add device tree bindings for the global clock controller on Qualcomm
Kaanapali platform.
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Taniya Das <taniya.das@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251030-gcc_kaanapali-v2-v2-3-a774a587af6f@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Add unique ID for Qualcomm QCS6490 SoC.
Signed-off-by: Komal Bajaj <komal.bajaj@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251103-qcs6490_soc_id-v1-1-c139dd1e32c8@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
finish_task_switch()
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19
Pull cgroup/for-6.19 to receive:
16dad7801aad ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()")
260fbcb92bbe ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()")
d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
These are needed for the sched_ext cgroup exit ordering fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.
However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.
Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.
This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add the listns() system call to all architectures.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing existing namespace apis.
Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:
1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
running process but are kept alive by file descriptors, bind mounts,
or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. No ordering or ownership.
5. No filtering per namespace type: Must always iterate and check all
namespaces.
The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.
/*
* @req: Pointer to struct ns_id_req specifying search parameters
* @ns_ids: User buffer to receive namespace IDs
* @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
* @flags: Reserved for future use (must be 0)
*/
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code
/*
* @size: Structure size
* @ns_id: Starting point for iteration; use 0 for first call, then
* use the last returned ID for subsequent calls to paginate
* @ns_type: Bitmask of namespace types to include (from enum ns_type):
* 0: Return all namespace types
* MNT_NS: Mount namespaces
* NET_NS: Network namespaces
* USER_NS: User namespaces
* etc. Can be OR'd together
* @user_ns_id: Filter results to namespaces owned by this user namespace:
* 0: Return all namespaces (subject to permission checks)
* LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
* Other value: Namespaces owned by the specified user namespace ID
*/
struct ns_id_req {
__u32 size; /* sizeof(struct ns_id_req) */
__u32 spare; /* Reserved, must be 0 */
__u64 ns_id; /* Last seen namespace ID (for pagination) */
__u32 ns_type; /* Filter by namespace type(s) */
__u32 spare2; /* Reserved, must be 0 */
__u64 user_ns_id; /* Filter by owning user namespace */
};
Example 1: List all namespaces
void list_all_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0, /* Start from beginning */
.ns_type = 0, /* All types */
.user_ns_id = 0, /* All user namespaces */
};
uint64_t ids[100];
ssize_t ret;
printf("All namespaces in the system:\n");
do {
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
break;
}
for (ssize_t i = 0; i < ret; i++)
printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]);
/* Continue from last seen ID */
if (ret > 0)
req.ns_id = ids[ret - 1];
} while (ret == 100); /* Buffer was full, more may exist */
}
Example 2: List network namespaces only
void list_network_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS, /* Only network namespaces */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Network namespaces: %zd found\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" netns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 3: List namespaces owned by current user namespace
void list_owned_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0, /* All types */
.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Namespaces owned by my user namespace: %zd\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" ns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 4: List multiple namespace types
void list_network_and_mount_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS | MNT_NS, /* Network and mount */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
printf("Network and mount namespaces: %zd found\n", ret);
}
Example 5: Pagination through large namespace sets
void list_all_with_pagination(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0,
.user_ns_id = 0,
};
uint64_t ids[50];
size_t total = 0;
ssize_t ret;
printf("Enumerating all namespaces with pagination:\n");
while (1) {
ret = listns(&req, ids, 50, 0);
if (ret < 0) {
perror("listns");
break;
}
if (ret == 0)
break; /* No more namespaces */
total += ret;
printf(" Batch: %zd namespaces\n", ret);
/* Last ID in this batch becomes start of next batch */
req.ns_id = ids[ret - 1];
if (ret < 50)
break; /* Partial batch = end of results */
}
printf("Total: %zu namespaces\n", total);
}
Permission Model
listns() respects namespace isolation and capabilities:
(1) Global listing (user_ns_id = 0):
- Requires CAP_SYS_ADMIN in the namespace's owning user namespace
- OR the namespace must be in the caller's namespace context (e.g.,
a namespace the caller is currently using)
- User namespaces additionally allow listing if the caller has
CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
- Requires CAP_SYS_ADMIN in the specified owner user namespace
- OR the namespace must be in the caller's namespace context
- This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
- Only "active" namespaces are listed
- A namespace is active if it has a non-zero __ns_ref_active count
- This includes namespaces used by running processes, held by open
file descriptors, or kept active by bind mounts
- Inactive namespaces (kept alive only by internal kernel
references) are not visible via listns()
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-19-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow to walk the unified namespace list completely locklessly.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-18-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The namespace tree doesn't express the ownership concept of namespace
appropriately. Maintain a list of directly owned namespaces per user
namespace. This will allow userspace and the kernel to use the listns()
system call to walk the namespace tree by owning user namespace. The
rbtree is used to find the relevant namespace entry point which allows
to continue iteration and the owner list can be used to walk the tree
completely lock free.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-16-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The initial set of namespace comes with fixed inode numbers making it
easy for userspace to identify them solely based on that information.
This has long preceeded anything here.
Similarly, let's assign fixed namespace ids for the initial namespaces.
Kill the cookie and use a sequentially increasing number. This has the
nice side-effect that the owning user namespace will always have a
namespace id that is smaller than any of it's descendant namespaces.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-15-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This will allow userspace to lookup and stat a namespace simply by its
identifier without having to know what type of namespace it is.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-13-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make it easier to spot that they belong together conceptually.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-12-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The namespace tree is, among other things, currently used to support
file handles for namespaces. When a namespace is created it is placed on
the namespace trees and when it is destroyed it is removed from the
namespace trees.
While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.
On current kernels a namespace is visible to userspace in the
following cases:
(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount).
Note that (2) only cares about direct persistence of the namespace
itself not indirectly via e.g., file->f_cred file references or
similar.
(3) The namespace is a hierarchical namespace type and is the parent of
a single or multiple child namespaces.
Case (3) is interesting because it is possible that a parent namespace
might not fulfill any of (1) or (2), i.e., is invisible to userspace but
it may still be resurrected through the NS_GET_PARENT ioctl().
Currently namespace file handles allow much broader access to namespaces
than what is currently possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.
For example, a user namespace my remain pinned by get_cred() calls to
stash the opener's credentials into file->f_cred. As it stands file
handles allow to resurrect such a users namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't do if we don't have to.
Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.
So on an idle system user namespaces can be persisted for arbitrary
amounts of time which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course excarabted on large systems with a huge number of cpus.
To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.
The problem is that namespaces may be resurrected. Which means that they
can become temporarily inactive and will be reactived some time later.
Currently the only example of this is the SIOGCSKNS socket ioctl. The
SIOCGSKNS ioctl allows to open a network namespace file descriptor based
on a socket file descriptor.
If a socket is tied to a network namespace that subsequently becomes
inactive but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS of pidfd_getfd()) then the
SIOCGSKNS ioctl will resurrect this network namespace.
So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.
Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.
The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visiblity to
namespace file handles (and in later patches to listns()).
A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.
Note that different namespaces have different visibility lifetimes on
current kernels. While most namespace are immediately released when the
last task using them exits, the user- and pid namespace are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.
The user namespace lifetime is aliged with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.
The active reference counts of the user- and pid namespace are
decremented once the task is reaped.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-11-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The current naming is very misleading as this really isn't exiting all
of the task's namespaces. It is only exiting the namespaces that hang of
off nsproxy. Reflect that in the name.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-10-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Implement ns_ref_read() the same way as ns_ref_{get,put}().
No point in making that any more special or different from the other
helpers.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-9-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make sure that the list is always initialized for initial namespaces.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-8-2e6f823ebdc0@kernel.org
Fixes: 885fc8ac0a4d ("nstree: make iterator generic")
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add an initializer that can be used for the ns common initialization for
static namespace such as most init namespaces.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/87ecqhy2y5.ffs@tglx
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
I authored the files a short while ago.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL
solves. The uapi can also be slightly adjusted to accommodate future
extensions. Remove the feature for now, it'll be reworked for the next
release.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for uring_cmd implementations to implement functions
with the io_req_tw_func_t signature, introduce a wrapper struct
io_tw_req to hide the struct io_kiocb * argument. The intention is for
only the io_uring core to access the inner struct io_kiocb *. uring_cmd
implementations should instead call a helper from io_uring/cmd.h to
convert struct io_tw_req to struct io_uring_cmd *.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, REQ_OP_WRITE_ZEROES operations are not handled in the
blktrace infrastructure, resulting in incorrect or missing operation
labels in ftrace blktrace output. This manifests as write-zeroes
operations appearing with incorrect labels like "N" instead of a
proper "WZ" designation.
This patch adds complete support for REQ_OP_WRITE_ZEROES across the
blktrace infrastructure:
Add BLK_TC_WRITE_ZEROES trace category in blktrace_api.h and update
BLK_TC_END_V2 marker accordingly
Map REQ_OP_WRITE_ZEROES to BLK_TC_WRITE_ZEROES in __blk_add_trace()
to ensure proper trace event categorization
Update fill_rwbs() to generate "WZ" label for write-zeroes operations
in ftrace output, making them easily identifiable
Add "write-zeroes" string mapping in act_to_str array for debugfs
filter interface
Update blk_fill_rwbs() to handle REQ_OP_WRITE_ZEROES for block layer
event tracing
With this fix, write-zeroes operations are now correctly traced and
displayed.
===========================================================
BEFORE THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1
blkdiscard-3809 [030] ..... 1212.253701: block_bio_queue: 259,0 NS 0 + 80 [blkdiscard]
blkdiscard-3809 [030] ..... 1212.253703: block_getrq: 259,0 NS 0 + 80 [blkdiscard]
blkdiscard-3809 [030] ..... 1212.253704: block_io_start: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
blkdiscard-3809 [030] ..... 1212.253704: block_plug: [blkdiscard]
blkdiscard-3809 [030] ..... 1212.253706: block_unplug: [blkdiscard] 1
blkdiscard-3809 [030] ..... 1212.253706: block_rq_insert: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/30:1H-566 [030] ..... 1212.253726: block_rq_issue: 259,0 NS 40960 () 0 + 80 be,0,4 [kworker/30:1H]
<idle>-0 [030] d.h1. 1212.253957: block_rq_complete: 259,0 NS () 0 + 80 be,0,4 [0]
<idle>-0 [030] dNh1. 1212.253960: block_io_done: 259,0 NS 0 () 0 + 0 none,0,0 [swapper/30]
Trace Event Breakdown:
Event | Device | Op | Sector | Sectors | Byte Size | Calculation
block_bio_queue | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960
block_getrq | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960
block_io_start | 259,0 | NS | 0 | 80 | 40960 | Direct from trace
block_rq_insert | 259,0 | NS | 0 | 80 | 40960 | Direct from trace
block_rq_issue | 259,0 | NS | 0 | 80 | 40960 | Direct from trace
block_rq_complete | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960
block_io_done | 259,0 | NS | 0 | 0 | 0 | Completion (no data)
Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes
===========================================================
AFTER THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1
blkdiscard-2477 [020] ..... 960.989131: block_bio_queue: 259,0 WZS 0 + 80 [blkdiscard]
blkdiscard-2477 [020] ..... 960.989134: block_getrq: 259,0 WZS 0 + 80 [blkdiscard]
blkdiscard-2477 [020] ..... 960.989135: block_io_start: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
blkdiscard-2477 [020] ..... 960.989138: block_plug: [blkdiscard]
blkdiscard-2477 [020] ..... 960.989140: block_unplug: [blkdiscard] 1
blkdiscard-2477 [020] ..... 960.989141: block_rq_insert: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/20:1H-736 [020] ..... 960.989166: block_rq_issue: 259,0 WZS 40960 () 0 + 80 be,0,4 [kworker/20:1H]
<idle>-0 [020] d.h1. 960.989476: block_rq_complete: 259,0 WZS () 0 + 80 be,0,4 [0]
<idle>-0 [020] dNh1. 960.989482: block_io_done: 259,0 WZS 0 () 0 + 0 none,0,0 [swapper/20]
Trace Event Breakdown:
Event | Device | Op | Sector | Sectors | Byte Size | Calculation
block_bio_queue | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960
block_getrq | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960
block_io_start | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace
block_rq_insert | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace
block_rq_issue | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace
block_rq_complete | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960
block_io_done | 259,0 | WZS | 0 | 0 | 0 | Completion (no data)
Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes
Tested with ftrace blktrace on NVMe devices using blkdiscard with
the -z (write-zeroes) flag.
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The vbi_fops stored in struct saa7146_ext_vv is a full
v4l2_file_operations, but only its .write field is used. Replace it with
a single vbi_write function pointer to save memory.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
|
|
Retain the constness of the graph objects and interfaces in macros to
obtain their containers, by switching to container_of_const().
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
|
|
Retain the constness of the object in media_entity_to_video_device() and
to_video_device(), by switching to container_of_const().
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
|
|
Retain the constness of the object in media_entity_to_v4l2_subdev(), by
switching to container_of_const().
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
|
|
ASM GOTO is miscompiled by GCC when it is used inside a auto cleanup scope:
bool foo(u32 __user *p, u32 val)
{
scoped_guard(pagefault)
unsafe_put_user(val, p, efault);
return true;
efault:
return false;
}
e80: e8 00 00 00 00 call e85 <foo+0x5>
e85: 65 48 8b 05 00 00 00 00 mov %gs:0x0(%rip),%rax
e8d: 83 80 04 14 00 00 01 addl $0x1,0x1404(%rax) // pf_disable++
e94: 89 37 mov %esi,(%rdi)
e96: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable--
e9d: b8 01 00 00 00 mov $0x1,%eax // success
ea2: e9 00 00 00 00 jmp ea7 <foo+0x27> // ret
ea7: 31 c0 xor %eax,%eax // fail
ea9: e9 00 00 00 00 jmp eae <foo+0x2e> // ret
which is broken as it leaks the pagefault disable counter on failure.
Clang at least fails the build.
Linus suggested to add a local label into the macro scope and let that
jump to the actual caller supplied error label.
__label__ local_label; \
arch_unsafe_get_user(x, ptr, local_label); \
if (0) { \
local_label: \
goto label; \
That works for both GCC and clang.
clang:
c80: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
c85: 65 48 8b 0c 25 00 00 00 00 mov %gs:0x0,%rcx
c8e: ff 81 04 14 00 00 incl 0x1404(%rcx) // pf_disable++
c94: 31 c0 xor %eax,%eax // set retval to false
c96: 89 37 mov %esi,(%rdi) // write
c98: b0 01 mov $0x1,%al // set retval to true
c9a: ff 89 04 14 00 00 decl 0x1404(%rcx) // pf_disable--
ca0: 2e e9 00 00 00 00 cs jmp ca6 <foo+0x26> // ret
The exception table entry points correctly to c9a
GCC:
f70: e8 00 00 00 00 call f75 <baz+0x5>
f75: 65 48 8b 05 00 00 00 00 mov %gs:0x0(%rip),%rax
f7d: 83 80 04 14 00 00 01 addl $0x1,0x1404(%rax) // pf_disable++
f84: 8b 17 mov (%rdi),%edx
f86: 89 16 mov %edx,(%rsi)
f88: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable--
f8f: b8 01 00 00 00 mov $0x1,%eax // success
f94: e9 00 00 00 00 jmp f99 <baz+0x29> // ret
f99: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable--
fa0: 31 c0 xor %eax,%eax // fail
fa2: e9 00 00 00 00 jmp fa7 <baz+0x37> // ret
The exception table entry points correctly to f99
So both compilers optimize out the extra goto and emit correct and
efficient code.
Provide a generic wrapper to do that to avoid modifying all the affected
architecture specific implementation with that workaround.
The only change required for architectures is to rename unsafe_*_user() to
arch_unsafe_*_user(). That's done in subsequent changes.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/877bweujtn.ffs@tglx
|
|
max_user_freq has not been related to the hardware RTC since commit
6610e0893b8b ("RTC: Rework RTC code to use timerqueue for events"). Stop
setting it from individual driver to avoid confusing new contributors.
Acked-by: Joshua Kinard <linux@kumba.dev>
Link: https://patch.msgid.link/20251101-max_user_freq-v1-2-c9a274fd6883@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
one of the SMT siblings is not idle on a multi-threaded implementation.
So don't use it on SMT cores.
Introduce topology_core_has_smt() for knowing the SMT implementation and
cached it in arm_pmu::has_smt during allocation.
When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
patch we'll get:
[root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
CPU2 2880457316 cycles
CPU3 2880459810 cycles
1.254688470 seconds time elapsed
With this patch the idle state of CPU3 is observed as expected:
[root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
CPU2 2558580492 cycles
CPU3 305749 cycles
1.113626410 seconds time elapsed
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Correct most kernel-doc warnings in include/linux/mtd/spear_smi.h
by adding a leading '@' to the description of struct members.
Add a new description for the missing @np member.
Warning: spear_smi.h:48 struct member 'name' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:48 struct member 'mem_base' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:48 struct member 'size' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:48 struct member 'partitions' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:48 struct member 'nr_partitions' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:48 struct member 'fast_mode' not described
in 'spear_smi_flash_info'
Warning: spear_smi.h:62 struct member 'clk_rate' not described
in 'spear_smi_plat_data'
Warning: spear_smi.h:62 struct member 'num_flashes' not described
in 'spear_smi_plat_data'
Warning: spear_smi.h:62 struct member 'board_flash_info' not described
in 'spear_smi_plat_data'
Warning: spear_smi.h:62 struct member 'np' not described
in 'spear_smi_plat_data'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
|
|
We have differing implementations for display runtime pm in i915 and xe
drivers. Add struct of function pointers into display_parent_interface
which will contain used implementation of runtime pm.
v2:
- add _interface suffix to rpm function pointer struct
- add struct ref_tracker forward declaration
- use kernel-doc comments
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Link: https://patch.msgid.link/20251030202836.1815680-3-jouni.hogander@intel.com
|
|
Let's gradually start calling i915 and xe parent, or core, drivers from
display via function pointers passed at display probe.
Going forward, the struct intel_display_parent_interface is expected to
include const pointers to sub-structs by functionality, for example:
struct intel_display_rpm {
struct ref_tracker *(*get)(struct drm_device *drm);
/* ... */
};
struct intel_display_parent_interface {
/* ... */
const struct intel_display_rpm *rpm;
};
This is a baby step towards not building display as part of both i915
and xe drivers, but rather making it an independent driver interfacing
with the two.
v3: useless include additions dropped
v2: unrelated include removal dropped
Cc: Jouni Högander <jouni.hogander@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Link: https://patch.msgid.link/20251030202836.1815680-2-jouni.hogander@intel.com
|
|
Convert 9p to the new mount API. This patch consolidates all parsing
into fs/9p/v9fs.c, which stores all results into a filesystem context
which can be passed to the various transports as needed.
Some of the parsing helper functions such as get_cache_mode() have been
eliminated in favor of using the new mount API's enum param type,
for simplicity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-5-sandeen@redhat.com>
[ Dominique: handled source explicitly as per follow-up discussion ]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This patch creates a new v9fs_context structure which includes
new p9_session_opts and p9_client_opts structures, as well as
re-using the existing p9_fd_opts and p9_rdma_opts to store options
during parsing. The new structure will be used in the next
commit to pass all parsed options to the appropriate transports.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-4-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
With the new mount API all option parsing will need to happen
in fs/v9fs.c, so move some existing data structures and macros
to header files to facilitate this. Rename some to reflect
the transport they are used for (rdma, fd, etc), for clarity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-3-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
296b67059 removed fsparam_u32hex because there were no callers
(yet) and it didn't build due to using the nonexistent symbol
fs_param_is_u32_hex.
fs/9p will need this parser, so add it back with the appropriate
fix (use fs_param_is_u32).
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-2-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
'->def' is only ever used as a true/false flag
Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Message-ID: <20251103-v9fs_trans_def_bool-v1-1-f33dc7ed9e81@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
While developing a 9P server (https://github.com/Barre/ZeroFS) and
testing it under high-load, I was running into allocation failures.
The failures occur even with plenty of free memory available because
kmalloc requires contiguous physical memory.
This results in errors like:
ls: page allocation failure: order:7, mode:0x40c40(GFP_NOFS|__GFP_COMP)
This patch introduces a transport capability flag (supports_vmalloc)
that indicates whether a transport can work with vmalloc'd buffers
(non-physically contiguous memory). Transports requiring DMA should
leave this flag as false.
The fd-based transports (tcp, unix, fd) set this flag to true, and
p9_fcall_init will use kvmalloc instead of kmalloc for these
transports. This allows the allocator to fall back to vmalloc when
contiguous physical memory is not available.
Additionally, if kmem_cache_alloc fails, the code falls back to
kvmalloc for transports that support it.
Signed-off-by: Pierre Barre <pierre@barre.sh>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <d2017c29-11fb-44a5-bd0f-4204329bbefb@app.fastmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Mike Christie <michael.christie@oracle.com> says:
The following patches were made over Linus tree. They fix/improve the
stats used in the main IO path. The first patch fixes an issue where
I made some stats u32 when they should have stayed u64. The rest of
the patches improve the handling of RW/num_cmds stats to reduce code
duplication and improve performance.
Link: https://patch.msgid.link/20250917221338.14813-1-michael.christie@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The atomic use in the main I/O path is causing perf issues when using
higher performance backend devices and multiple queues (more than
10 when using vhost-scsi) like with this fio workload:
[global]
bs=4K
iodepth=128
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
name=standard-iops
rw=randread
numjobs=16
cpus_allowed=0-15
To fix this issue, move the LUN stats to per CPU.
Note: I forgot to include this patch with the delayed/ordered per CPU
tracking and per device/device entry per CPU stats. With this patch you
get the full 33% improvements when using fast backends, multiple queues
and multiple IO submiters.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Link: https://patch.msgid.link/20250917221338.14813-4-michael.christie@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
In commit 9cf2317b795d ("scsi: target: Move I/O path stats to per CPU")
I saw we sometimes use %u and also misread the spec. As a result I
thought all the stats were supposed to be 32-bit only. However, for the
majority of cases we support currently, the spec specifies u64 bit
stats. This patch converts the stats changed in the commit above to u64.
Fixes: 9cf2317b795d ("scsi: target: Move I/O path stats to per CPU")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Link: https://patch.msgid.link/20250917221338.14813-2-michael.christie@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
John Garry <john.g.garry@oracle.com> says:
This is a reposting of Mike's atomic writes support for the SCSI
target.
Again, we are now only supporting target_core_iblock. It's implemented
similar to UNMAP where we do not do any emulation and instead pass the
operation to the block layer.
Link: https://patch.msgid.link/20251020103820.2917593-1-john.g.garry@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Add the core LIO code to process the WRITE_ATOMIC_16 command.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
jpg: fix return code from sbc_check_atomic, reformat
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20251020103820.2917593-5-john.g.garry@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Add a helper function that sets up the atomic value based on a
block_device similar to what we do for unmap.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
jpg: Set atomic alignment, drop atomic_supported reference
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20251020103820.2917593-4-john.g.garry@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Add atomic fields to the se_device and export them in configfs.
Initially only target_core_iblock will be supported and we will inherit
all the settings from the block layer.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
jpg: Stop being allowed to configure atomic write alignment,
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20251020103820.2917593-3-john.g.garry@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Rename target_configure_unmap_from_queue() to
target_configure_unmap_from_bdev() since it now takes a bdev.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20251020103820.2917593-2-john.g.garry@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|