path: root/drivers/block/ublk_drv.c
Commit history for this file, newest first. Each entry lists the date, commit subject, author, and diffstat (files changed, -deleted/+added lines).
2025-10-02  Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds, 1 file, -115/+121)

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)

 - MD pull request via Yu:
     - Add support for a lockless bitmap. A key feature of the new
       bitmap is that the IO fastpath is lockless. If a user issues
       lots of write IO to the same bitmap bit in a short time, only
       the first write has the additional overhead of updating the
       bitmap bit; there is no additional overhead for the following
       writes. Because only written data is resynced or recovered,
       there is no need to do a full disk resync/recovery when
       creating a new array or replacing a disk.

 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather
   than struct block_device

 - Rust block changes via Andreas. This series adds configuration via
   configfs and remote completion to the rnull driver. The series also
   includes a set of changes to the rust block device driver API: a
   few cleanup patches, and a few features supporting the rnull
   changes. The series removes the raw buffer formatting logic from
   `kernel::block` and improves the logic available in
   `kernel::string` to support the same use as the removed logic.

 - floppy arch cleanups

 - Reduce the number of dereferences needed for ublk commands

 - Restrict supported sockets for nbd. Mostly done to eliminate a
   class of issues perpetually reported by syzbot using nonsensical
   socket setups

 - A few s390 dasd block fixes

 - Fix a few issues around atomic writes

 - Improve DMA interaction for integrity requests

 - Improve how iovecs are treated with regard to O_DIRECT alignment
   constraints. We used to require each segment to adhere to the
   constraints; now only the request as a whole needs to

 - Clean up and improve p2p support, enabling use of p2p for metadata
   payloads

 - Improve locking of request lookup, using SRCU where appropriate

 - Use page references properly for brd, avoiding very long RCU
   sections

 - Fix ordering of recursively submitted IOs

 - Clean up and improve updating nr_requests for a live device

 - Various fixes and cleanups

* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-10-02  Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds, 1 file, -3/+3)

Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than
   stuffing them into struct io_kiocb. These types of buffers must
   always be fully consumed or recycled in the current context, and
   leaving them in struct io_kiocb is hence not a good idea as that
   struct has a vastly different lifetime. Basically just an
   architecture cleanup that can help prevent issues with ring
   provided buffers in the future

 - Support for mixed CQE sizes in the same ring. Before this change, a
   CQ ring either used the default 16b CQEs, or it was set up with 32b
   CQEs using IORING_SETUP_CQE32. For use cases where a few 32b CQEs
   were needed, this caused everything else to use big CQEs as well.
   This is wasteful both in terms of memory usage and memory bandwidth
   for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may
   use request types that post both normal 16b and big 32b CQEs on the
   same ring

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency by providing a persistent request type
   that can trigger multiple CQEs

 - Add initial support for ring feature querying. We had basic support
   for probe operations, but the API isn't great. Rather than expand
   that, add support for QUERY, which is easily expandable and can
   cover a lot more cases than the existing probe support. This will
   help applications get a better idea of what operations are
   supported on a given host

 - zcrx improvements from Pavel:
     - Improve refill entry alignment for better caching
     - Various cleanups, especially around deduplicating normal memory
       vs dmabuf setup
     - Generalisation of the niov size (Patch 12). It's still hard
       coded to PAGE_SIZE on init, but will let the user specify the
       rx buffer length on setup
     - Syscall / synchronous buffer return. It'll be used as a slow
       fallback path for returning buffers when the refill queue is
       full. Useful for tolerating slight queue size misconfiguration
       or inconsistent load
     - Accounting more memory to cgroups
     - Additional independent cleanups that will also be useful for
       multi-area support

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-09-24  ublk: remove redundant zone op check in ublk_setup_iod()  (Caleb Sander Mateos, 1 file, -5/+0)

ublk_setup_iod() first checks whether the request is a zoned operation issued to a device without zoned support, and returns BLK_STS_IOERR if so. However, such a request would already hit the default case in the subsequent switch statement and fail the ublk_queue_is_zoned() check, which also results in a return of BLK_STS_IOERR. So remove the redundant early check for unsupported zone ops.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
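For context, a rough sketch (not the verbatim kernel code) of the control flow the commit relies on: zone-specific opcodes either fall into the default case or fail the zoned-queue check, so the separate up-front check added nothing.

    static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
    {
        switch (req_op(req)) {
        case REQ_OP_READ:
        case REQ_OP_WRITE:
        case REQ_OP_FLUSH:
        case REQ_OP_DISCARD:
            break;
        case REQ_OP_ZONE_APPEND:
            if (!ublk_queue_is_zoned(ubq))
                return BLK_STS_IOERR;  /* same result as the removed check */
            break;
        default:
            return BLK_STS_IOERR;      /* catches unsupported zone ops too */
        }
        /* ... fill in the ublksrv_io_desc for this request ... */
        return BLK_STS_OK;
    }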
2025-09-23  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()  (Caleb Sander Mateos, 1 file, -3/+3)

Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split out a separate io_uring_cmd_done32() helper for ->uring_cmd() implementations that return 32-byte CQEs. The res2 value passed to io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores it when is_cqe32 is passed as false. So drop the parameter from io_uring_cmd_done() to simplify the callers and clarify that it's not possible to return an extra value beyond the 32-bit CQE result.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
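As a sketch, the signature change described above (paraphrased from the commit message; the exact prototypes in the tree may differ slightly):

    /* before: res2 was accepted but ignored for 16b CQE completions */
    void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, u64 res2,
                           unsigned issue_flags);

    /* after: 16b completions drop res2 entirely ... */
    void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
                           unsigned issue_flags);

    /* ... and 32b completions use the helper split out in 79525b51acc1 */
    void io_uring_cmd_done32(struct io_uring_cmd *cmd, ssize_t ret, u64 res2,
                             unsigned issue_flags);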
2025-09-20  ublk: don't access ublk_queue in ublk_unmap_io()  (Caleb Sander Mateos, 1 file, -10/+14)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_unmap_io() is a frequent cache miss. Pass to __ublk_complete_rq() whether the ublk server's data buffer needs to be copied to the request.

In the callers __ublk_fail_req() and ublk_ch_uring_cmd_local(), get the flags from the ublk_device instead, as its flags have just been read. In ublk_put_req_ref(), pass false since all the features that require reference counting disable copying of the data buffer upon completion.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass ublk_io to __ublk_complete_rq()  (Caleb Sander Mateos, 1 file, -6/+5)

All callers of __ublk_complete_rq() already know the ublk_io. Pass it in to avoid looking it up again.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_need_complete_req()  (Caleb Sander Mateos, 1 file, -3/+3)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_need_complete_req() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()  (Caleb Sander Mateos, 1 file, -4/+4)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_check_commit_and_fetch() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't pass ublk_queue to ublk_fetch()  (Caleb Sander Mateos, 1 file, -3/+2)

ublk_fetch() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_config_io_buf()  (Caleb Sander Mateos, 1 file, -5/+5)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_config_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_check_fetch_buf()  (Caleb Sander Mateos, 1 file, -4/+4)

Obtain the ublk device flags from the ublk_device to avoid needing to access the ublk_queue, which may be a cache miss.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass q_id and tag to __ublk_check_and_get_req()  (Caleb Sander Mateos, 1 file, -13/+11)

__ublk_check_and_get_req() only uses its ublk_queue argument to get the q_id and tag. Pass those arguments explicitly to save an access to the ublk_queue.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_daemon_register_io_buf()  (Caleb Sander Mateos, 1 file, -1/+1)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_daemon_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_register_io_buf()  (Caleb Sander Mateos, 1 file, -1/+1)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass ublk_device to ublk_register_io_buf()  (Caleb Sander Mateos, 1 file, -4/+6)

Avoid repeating the 2 dereferences to get the ublk_device from the io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to ublk_register_io_buf().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't dereference ublk_queue in ublk_check_and_get_req()  (Caleb Sander Mateos, 1 file, -2/+2)

For ublk servers with many ublk queues, accessing the ublk_queue in ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and queue depth from the ublk_device instead, which is accessed just before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()  (Caleb Sander Mateos, 1 file, -1/+1)

For ublk servers with many ublk queues, accessing the ublk_queue to handle a ublk command is a frequent cache miss. Get the queue depth from the ublk_device instead, which is accessed just before.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: add helpers to check ublk_device flags  (Caleb Sander Mateos, 1 file, -0/+34)

Introduce ublk_device analogues of the ublk_queue flag helpers:

- ublk_support_zero_copy()    -> ublk_dev_support_zero_copy()
- ublk_support_auto_buf_reg() -> ublk_dev_support_auto_buf_reg()
- ublk_support_user_copy()    -> ublk_dev_support_user_copy()
- ublk_need_map_io()          -> ublk_dev_need_map_io()
- ublk_need_req_ref()         -> ublk_dev_need_req_ref()
- ublk_need_get_data()        -> ublk_dev_need_get_data()

These will be used in subsequent changes to avoid accessing the ublk_queue just for the flags, and instead use the ublk_device.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
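A minimal sketch of what one such device-level helper looks like, assuming the feature flags are readable from ub->dev_info.flags (UBLK_F_NEED_GET_DATA is the real uapi flag; the exact helper body in ublk_drv.c may differ):

    static inline bool ublk_dev_need_get_data(const struct ublk_device *ub)
    {
        /* read the per-device flags copy instead of chasing ublk_queue */
        return ub->dev_info.flags & UBLK_F_NEED_GET_DATA;
    }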
2025-09-20  ublk: don't pass ublk_queue to __ublk_fail_req()  (Caleb Sander Mateos, 1 file, -3/+3)

__ublk_fail_req() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't pass q_id to ublk_queue_cmd_buf_size()  (Caleb Sander Mateos, 1 file, -7/+5)

ublk_queue_cmd_buf_size() only needs the queue depth, which is the same for all queues. Get the queue depth from the ublk_device instead so the q_id parameter can be dropped.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
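A sketch of the simplified helper, assuming one ublksrv_io_desc (the real uapi descriptor type) per I/O and a page-aligned buffer; the exact expression in the driver may differ:

    static size_t ublk_queue_cmd_buf_size(struct ublk_device *ub)
    {
        /* queue depth is uniform across queues, so no q_id is needed */
        return PAGE_ALIGN(ub->dev_info.queue_depth *
                          sizeof(struct ublksrv_io_desc));
    }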
2025-09-20  ublk: remove ubq check in ublk_check_and_get_req()  (Caleb Sander Mateos, 1 file, -3/+0)

ublk_get_queue() never returns a NULL pointer, so there's no need to check its return value in ublk_check_and_get_req(). Drop the check.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10  ublk: consolidate nr_io_ready and nr_queues_ready  (Caleb Sander Mateos, 1 file, -16/+12)

ublk_mark_io_ready() tracks whether all the ublk_device's I/Os have been fetched by incrementing the ublk_queue's nr_io_ready count, and incrementing the ublk_device's nr_queues_ready count if the whole queue is ready. Simplify the logic by just tracking the total number of fetched I/Os on each ublk_device. When this count reaches nr_hw_queues * queue_depth, the ublk_device is ready to receive I/O.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
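A sketch of the consolidated readiness check; the counter name nr_io_ready comes from the commit text, and its placement on struct ublk_device is the change being described:

    /* called under the device mutex, after another I/O is fetched */
    static bool ublk_dev_ready(const struct ublk_device *ub)
    {
        u32 total = (u32)ub->dev_info.nr_hw_queues * ub->dev_info.queue_depth;

        return ub->nr_io_ready == total;
    }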
2025-09-03  ublk: inline __ublk_ch_uring_cmd()  (Caleb Sander Mateos, 1 file, -39/+23)

ublk_ch_uring_cmd_local() is a thin wrapper around __ublk_ch_uring_cmd() that copies the ublksrv_io_cmd from user-mapped memory to the stack using READ_ONCE(). This ublksrv_io_cmd is passed by pointer to __ublk_ch_uring_cmd(), and __ublk_ch_uring_cmd() is a large function unlikely to be inlined, so __ublk_ch_uring_cmd() will have to load the ublksrv_io_cmd fields back from the stack.

Inline __ublk_ch_uring_cmd() into ublk_ch_uring_cmd_local() and load the ublksrv_io_cmd fields into local variables with READ_ONCE(). This allows the compiler to delay loading the fields until they are needed and choose whether to store them in registers or on the stack.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250808153251.282107-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
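The shape of the inlined loads, as a sketch (io_uring_sqe_cmd() and struct ublksrv_io_cmd are real APIs; which fields get hoisted into locals is up to the compiler and the actual patch):

    const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
    u16 q_id   = READ_ONCE(ub_cmd->q_id);
    u16 tag    = READ_ONCE(ub_cmd->tag);
    s32 result = READ_ONCE(ub_cmd->result);
    u64 addr   = READ_ONCE(ub_cmd->addr); /* union with zone_append_lba */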
2025-08-28  ublk: avoid ublk_io_release() called after ublk char dev is closed  (Ming Lei, 1 file, -2/+70)

When running test_stress_04.sh, the following warning is triggered:

    WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]

This happens when the daemon is abruptly killed:

- some references may still be held, because registering an IO buffer
  doesn't grab a ublk char device reference, OR

- io->task_registered_buffers won't be cleared, because the io buffer
  is released from non-daemon context

For zero-copy and auto buffer register modes, the I/O reference crosses syscalls, so it may not be dropped naturally when the ublk server is killed abruptly. However, when the io_uring context is released, the reference is guaranteed to be dropped eventually, see io_sqe_buffers_unregister() called from io_ring_ctx_free().

Fix this by adding ublk_drain_io_references(), which:

- waits for active I/O references to be dropped asynchronously by
  scheduling a work function, avoiding a release dependency between
  the ublk device and the io_uring file

- reinitializes io->ref and io->task_registered_buffers to a clean
  state

This ensures the reference count state is clean when ublk_queue_reinit() is called, preventing the warning and a potential use-after-free.

Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250827121602.2619736-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  ublk: check for unprivileged daemon on each I/O fetch  (Caleb Sander Mateos, 1 file, -9/+7)

Commit ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon") allowed each ublk I/O to have an independent daemon task. However, nr_privileged_daemon is only computed based on whether the last I/O fetched in each ublk queue has an unprivileged daemon task. Fix this by checking whether every fetched I/O's daemon is privileged. Change nr_privileged_daemon from a count of queues to a boolean indicating whether any I/Os have an unprivileged daemon.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250808155216.296170-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11  ublk: don't quiesce in ublk_ch_release  (Uday Shankar, 1 file, -7/+5)

ublk_ch_release currently quiesces the device's request_queue while setting force_abort/fail_io. This avoids data races by preventing concurrent reads from the I/O path, but is not strictly needed - at this point, canceling is already set and guaranteed to be observed by any concurrently executing I/Os, so they will be handled properly even if the changes to force_abort/fail_io propagate to the I/O path later.

Remove the quiesce/unquiesce calls from ublk_ch_release. This makes the writes to force_abort/fail_io concurrent with the reads in the I/O path, so make the accesses atomic.

Before this change, the call to blk_mq_quiesce_queue was responsible for most (90%) of the runtime of ublk_ch_release. With that call eliminated, ublk_ch_release runs much faster. Here is a comparison of the total time spent in calls to ublk_ch_release when a server handling 128 devices exits, before and after this change:

    before: 1.11s
    after:  0.09s

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250808-ublk_quiesce2-v1-1-f87ade33fa3d@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
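A sketch of the resulting access pattern; the field names force_abort/fail_io are from the commit, and READ_ONCE/WRITE_ONCE is one way to make the now-concurrent accesses atomic (the actual patch may mark them differently):

    /* writer, ublk_ch_release(): no quiesce around the stores anymore */
    WRITE_ONCE(ubq->force_abort, true);
    WRITE_ONCE(ubq->fail_io, true);

    /* reader, I/O path: canceling is already observed, so a late read
     * of these flags is benign */
    if (READ_ONCE(ubq->force_abort) || READ_ONCE(ubq->fail_io))
        return BLK_STS_IOERR; /* abort/fail instead of dispatching */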
2025-07-28  Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux  (Linus Torvalds, 1 file, -222/+356)

Pull block updates from Jens Axboe:

 - MD pull request via Yu:
     - call del_gendisk synchronously (Xiao)
     - cleanup unused variable (John)
     - cleanup workqueue flags (Ryo)
     - fix that a faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
     - try PCIe function level reset on init failure (Keith Busch)
     - log TLS handshake failures at error level (Maurizio Lombardi)
     - pci-epf: do not complete commands twice if nvmet_req_init()
       fails (Rick Wertenbroek)
     - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver. This has been more than a decade
   coming at this point, and some recently revealed breakages that had
   it causing issues even for cases where it isn't required made me
   re-pull the trigger on this one. It's known broken and nobody has
   stepped up to maintain the code

 - Series for ublk supporting batch commands, enabling the use of
   multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator switching, which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in
   the presence of isolated CPUs

 - Don't allow updating the block size of a loop device that is
   currently under exclusive ownership/open

 - Set chunk sectors from the stacked device stripe size and use it
   for the atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2025-07-15  ublk: remove unused req argument from ublk_sub_req_ref()  (Caleb Sander Mateos, 1 file, -5/+4)

Since commit b749965edda8 ("ublk: remove ublk_commit_and_fetch()"), ublk_sub_req_ref() no longer uses its struct request *req argument. So drop the argument from ublk_sub_req_ref(), and from ublk_need_complete_req(), which only passes it to ublk_sub_req_ref().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250715154244.1626810-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: pass 'const struct ublk_io *' to ublk_[un]map_io()  (Ming Lei, 1 file, -2/+2)

Pass 'const struct ublk_io *' to ublk_[un]map_io(), since only io->addr and io->res are read in the two helpers.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: remove ublk_commit_and_fetch()  (Ming Lei, 1 file, -18/+18)

Remove ublk_commit_and_fetch() and open code request completion. Consolidate accesses to struct ublk_io in UBLK_IO_COMMIT_AND_FETCH_REQ. When the ublk_io daemon task restriction is relaxed in the future, ublk_io will need to be protected by a lock. Unregister the auto-registered buffer and complete the request last, as these don't need to happen under the lock.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: add helper ublk_check_fetch_buf()  (Ming Lei, 1 file, -13/+19)

Add a helper ublk_check_fetch_buf() to validate UBLK_IO_FETCH_REQ's addr. This doesn't require access to the ublk_io, so it can be done before taking the ublk_device mutex. This also fixes a missing -EINVAL return value in case of early failure from ublk_fetch().

Fixes: b69b8edfb27d ("ublk: properly serialize all FETCH_REQs")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: store auto buffer register data into `struct ublk_io`  (Ming Lei, 1 file, -18/+12)

The space of `io->addr` can be shared between the auto buffer register data and the user space buffer address. So store the auto buffer register data in `struct ublk_io`.

This prepares for supporting batch IO, in which many ublk IOs share a single uring_cmd, so auto buffer register data can't be stored in the uring_cmd pdu.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
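A sketch of the space sharing (struct ublk_auto_buf_reg is the real uapi type; the surrounding layout of struct ublk_io is illustrative):

    struct ublk_io {
        /* the server buffer address and the auto buffer register data
         * are never needed at the same time, so they can share storage */
        union {
            __u64 addr;                    /* ublk server buffer address */
            struct ublk_auto_buf_reg buf;  /* auto buffer register data */
        };
        /* ... remaining fields ... */
    };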
2025-07-15  ublk: move auto buffer register handling into one dedicated helper  (Ming Lei, 1 file, -56/+71)

Move the check and clearing of UBLK_IO_FLAG_AUTO_BUF_REG into ublk_handle_auto_buf_reg(), and return the buffer index from this helper. Also move ublk_set_auto_buf_reg() into this single helper.

Add ublk_config_io_buf() for setting up the ublk io buffer, covering both ublk buffer copy and auto buffer register.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: avoid to pass `struct ublksrv_io_cmd *` to ublk_commit_and_fetch()  (Ming Lei, 1 file, -15/+29)

Refactor ublk_commit_and_fetch() in the following way to remove the `struct ublksrv_io_cmd *` parameter:

- return `struct request *` from ublk_fill_io_cmd(), so that the
  request reference can be used reliably, since the request and
  io_uring_cmd references share the same storage

- move ublk_fill_io_cmd() before the call to ublk_commit_and_fetch(),
  so that ublk_fill_io_cmd() can run with the per-io lock held, for
  supporting command batching

- pass ->zone_append_lba to ublk_commit_and_fetch() directly

The main motivation is to reuse ublk_commit_and_fetch() for fetching io command batches with multishot uring_cmd.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: let ublk_fill_io_cmd() cover more things  (Ming Lei, 1 file, -4/+2)

Let ublk_fill_io_cmd() clear UBLK_IO_FLAG_OWNED_BY_SRV too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: move fake timeout logic into __ublk_complete_rq()  (Ming Lei, 1 file, -4/+1)

Almost every block driver deals with fake timeout logic around the real request completion code. The existing arrangement can also leak a request reference, so move the logic into __ublk_complete_rq(); then the completion can be skipped in the last step, like other drivers do.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  ublk: look up ublk task via its pid in timeout handler  (Ming Lei, 1 file, -8/+17)

Look up the ublk server process via its pid in the timeout handler, so we can avoid touching io->task, since touching the task structure is fragile. Killing the ublk server process remains fine, and this way is simpler.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
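A sketch of a pid-based lookup, assuming the tgid saved at open time (see the pid-validation commit below) lives in ub->ublksrv_tgid; find_task_by_vpid() and get_task_struct() are standard kernel APIs, though the actual patch may resolve the task differently:

    struct task_struct *p;

    rcu_read_lock();
    p = find_task_by_vpid(ub->ublksrv_tgid);
    if (p)
        get_task_struct(p); /* hold a reference beyond the RCU section */
    rcu_read_unlock();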
2025-07-15  ublk: validate ublk server pid  (Ming Lei, 1 file, -0/+9)

The ublk server pid (the `tgid` of the process opening the ublk device) is stored in `ublk_device->ublksrv_tgid`. This `tgid` is then checked against the `ublksrv_pid` in `ublk_ctrl_start_dev` and `ublk_ctrl_end_recovery`, which ensures that the correct ublk server pid is stored in the device info.

Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250713143415.2857561-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04  ublk: introduce and use ublk_set_canceling helper  (Uday Shankar, 1 file, -20/+34)

For performance reasons (minimizing the number of cache lines accessed in the hot path), we store the "canceling" state redundantly - there is one flag in the device, which can be considered the source of truth, and per-queue copies of that flag. This redundancy can cause confusion, and opens the door to bugs where the state is set inconsistently.

Try to guard against these bugs by introducing a ublk_set_canceling helper which is the sole mutator of both the per-device and per-queue canceling state. This helper always sets the state consistently. Use the helper in all places where we need to modify the canceling state.

No functional changes are expected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-2-3527b5339eeb@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
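A sketch of such a helper, following the commit's description of a single mutator that keeps the per-device flag (the source of truth) and the per-queue copies consistent; the locking context implied by the series is elided here:

    static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
    {
        int i;

        ub->canceling = canceling;
        for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
            ublk_get_queue(ub, i)->canceling = canceling;
    }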
2025-07-04  ublk: speed up ublk server exit handling  (Uday Shankar, 1 file, -15/+21)

Recently, we've observed a few cases where a ublk server is able to complete restart more quickly than the driver can process the exit of the previous ublk server. The new ublk server comes up, attempts recovery of the preexisting ublk devices, and observes them still in state UBLK_S_DEV_LIVE. While this is possible due to the asynchronous nature of io_uring cleanup and should therefore be handled properly in the ublk server, it is still preferable to make ublk server exit handling faster if possible, as we should strive for it to not be a limiting factor in how fast a ublk server can restart and provide service again.

Analysis of the issue showed that the vast majority of the time spent in handling the ublk server exit was in calls to blk_mq_quiesce_queue, which is essentially just a (relatively expensive) call to synchronize_rcu. The ublk server exit path currently issues an unnecessarily large number of calls to blk_mq_quiesce_queue, for two reasons:

1. It tries to call blk_mq_quiesce_queue once per ublk_queue. However,
   blk_mq_quiesce_queue targets the request_queue of the underlying
   ublk device, of which there is only one. So the number of calls is
   larger than necessary by a factor of nr_hw_queues.

2. In practice, it calls blk_mq_quiesce_queue _more_ than once per
   ublk_queue. This is because of a data race where we read
   ubq->canceling without any locking when deciding if we should call
   ublk_start_cancel. It is thus possible for two calls to
   ublk_uring_cmd_cancel_fn against the same ublk_queue to both call
   ublk_start_cancel against the same ublk_queue.

Fix this by making the "canceling" flag a per-device state. This actually matches the existing code better, as there are several places where the flag is set or cleared for all queues simultaneously, and there is the general expectation that cancellation corresponds with ublk server exit. This per-device canceling flag is then checked under a (new) lock (addressing the data race (2) above), and the queue is only quiesced if it is cleared (addressing (1) above). The result is just one call to blk_mq_quiesce_queue per ublk device.

To minimize the number of cache lines that are accessed in the hot path, the per-queue canceling flag is kept. The values of the per-device canceling flag and all per-queue canceling flags should always match.

In our setup, where one ublk server handles I/O for 128 ublk devices, each having 24 hardware queues of depth 4096, here are the results before and after this patch, where teardown time is measured from the first call to io_ring_ctx_wait_and_kill to the return from the last ublk_ch_release:

                                          before    after
    calls to blk_mq_quiesce_queue:          6469      256
    teardown time:                        11.14s    2.44s

There are still some potential optimizations here, but this takes care of a big chunk of the ublk server exit handling delay.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-1-3527b5339eeb@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01  ublk: don't queue request if the associated uring_cmd is canceled  (Ming Lei, 1 file, -5/+6)

Commit 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task") needs to dereference `io->cmd` to check whether the IO can be added to the current batch, see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However, `io->cmd` may become invalid after the uring_cmd is canceled.

Fix this by only queueing the IO when ublk_prep_req() returns `BLK_STS_OK`, in which case `io->cmd` is guaranteed to be valid.

Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: cache-align struct ublk_io  (Caleb Sander Mateos, 1 file, -1/+1)

struct ublk_io is already 56 bytes on 64-bit architectures, so round it up to a full cache line (typically 64 bytes). This ensures a single ublk_io doesn't span multiple cache lines and prevents false sharing if consecutive ublk_io's are accessed by different daemon tasks.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-15-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
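The standard kernel idiom for this is ____cacheline_aligned_in_smp, which aligns the struct and pads its size to L1_CACHE_BYTES; a sketch, assuming this is the mechanism the patch uses:

    struct ublk_io {
        /* ... fields totaling 56 bytes on 64-bit ... */
    } ____cacheline_aligned_in_smp; /* sizeof is rounded up to a cache
                                     * line, so consecutive array entries
                                     * never share one */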
2025-06-30  ublk: remove ubq checks from ublk_{get,put}_req_ref()  (Caleb Sander Mateos, 1 file, -24/+11)

ublk_get_req_ref() and ublk_put_req_ref() currently call ublk_need_req_ref(ubq) to check whether the ublk device features require reference counting of its requests. However, all callers already know that reference counting is required:

- __ublk_check_and_get_req() is only called from
  ublk_check_and_get_req() if user copy is enabled, and from
  ublk_register_io_buf() if zero copy is enabled

- ublk_io_release() is only called for requests registered by
  ublk_register_io_buf(), which requires zero copy

- ublk_ch_read_iter() and ublk_ch_write_iter() only call
  ublk_put_req_ref() if ublk_check_and_get_req() succeeded, which
  requires user copy to be enabled

So drop the ublk_need_req_ref() check and the ubq argument in ublk_get_req_ref() and ublk_put_req_ref().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-14-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task  (Caleb Sander Mateos, 1 file, -1/+8)

ublk_io_release() performs an expensive atomic refcount decrement. This atomic operation is unnecessary in the common case where the request's buffer is registered and unregistered on the daemon task before handling UBLK_IO_COMMIT_AND_FETCH_REQ for the I/O. So if ublk_io_release() is called on the daemon task and task_registered_buffers is positive, just decrement task_registered_buffers (nonatomically). ublk_sub_req_ref() will apply this decrement when it atomically subtracts from io->ref.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-13-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task  (Caleb Sander Mateos, 1 file, -9/+61)

ublk_register_io_buf() performs an expensive atomic refcount increment, as well as a lot of pointer chasing to look up the struct request. Create a separate ublk_daemon_register_io_buf() for the daemon task to call. Initialize ublk_io's reference count to a large number, introduce a field task_registered_buffers to count the buffers registered on the daemon task, and atomically subtract the large number minus task_registered_buffers in ublk_commit_and_fetch(). Also obtain the struct request directly from ublk_io's req field instead of looking it up in the tagset.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-12-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
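A sketch of the refcount scheme across the two optimization commits above; the sentinel name and value are hypothetical, and io->ref is assumed to be a refcount_t:

    #define UBLK_REFS_INIT (1U << 24) /* hypothetical "large number" */

    /* at fetch time: pre-grant a large pool of references */
    refcount_set(&io->ref, UBLK_REFS_INIT);

    /* daemon-task buffer register: plain increment, no atomic RMW */
    io->task_registered_buffers++;

    /* daemon-task buffer release: plain decrement, no atomic RMW */
    io->task_registered_buffers--;

    /* UBLK_IO_COMMIT_AND_FETCH_REQ: one atomic subtraction settles the
     * pool, folding in all the non-atomic daemon-task updates */
    bool last = refcount_sub_and_test(UBLK_REFS_INIT -
                                      io->task_registered_buffers,
                                      &io->ref);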
2025-06-30  ublk: return early if blk_should_fake_timeout()  (Caleb Sander Mateos, 1 file, -2/+3)

Make the unlikely case blk_should_fake_timeout() return early to reduce the indentation of the successful path.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-11-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: allow UBLK_IO_(UN)REGISTER_IO_BUF on any task  (Caleb Sander Mateos, 1 file, -5/+22)

Currently, UBLK_IO_REGISTER_IO_BUF and UBLK_IO_UNREGISTER_IO_BUF are only permitted on the ublk_io's daemon task. But this restriction is unnecessary. ublk_register_io_buf() calls __ublk_check_and_get_req() to look up the request from the tagset and atomically take a reference on the request without accessing the ublk_io. ublk_unregister_io_buf() doesn't use the q_id or tag at all. So allow these opcodes even on tasks other than io->task.

Handle UBLK_IO_UNREGISTER_IO_BUF before obtaining the ubq and io since the buffer index being unregistered is not necessarily related to the specified q_id and tag.

Add a feature flag UBLK_F_BUF_REG_OFF_DAEMON that userspace can use to determine whether the kernel supports off-daemon buffer registration.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-10-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: don't take ublk_queue in ublk_unregister_io_buf()  (Caleb Sander Mateos, 1 file, -3/+3)

UBLK_IO_UNREGISTER_IO_BUF currently requires a valid q_id and tag to be passed in the ublksrv_io_cmd. However, only the addr (registered buffer index) is actually used to unregister the buffer. There is no check that the q_id and tag are for the ublk request whose buffer is registered at the given index. To prepare to allow userspace to omit the q_id and tag, check the UBLK_F_SUPPORT_ZERO_COPY flag on the ublk_device instead of the ublk_queue.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-9-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: consolidate UBLK_IO_FLAG_{ACTIVE,OWNED_BY_SRV} checks  (Caleb Sander Mateos, 1 file, -4/+1)

UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV are mutually exclusive. So just check that UBLK_IO_FLAG_OWNED_BY_SRV is set in __ublk_ch_uring_cmd(); that implies UBLK_IO_FLAG_ACTIVE is unset.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-7-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: remove task variable from __ublk_ch_uring_cmd()  (Caleb Sander Mateos, 1 file, -3/+1)

The variable is computed from a simple expression and used once, so just replace it with the expression.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250620151008.3976463-6-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>