<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/io_uring/io_uring.c, branch v6.7</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.7</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.7'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2023-12-04T02:09:28Z</updated>
<entry>
<title>io_uring: fix mutex_unlock with unreferenced ctx</title>
<updated>2023-12-04T02:09:28Z</updated>
<author>
<name>Pavel Begunkov</name>
<email>asml.silence@gmail.com</email>
</author>
<published>2023-12-03T15:37:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f7b32e785042d2357c5abc23ca6db1b92c91a070'/>
<id>urn:sha1:f7b32e785042d2357c5abc23ca6db1b92c91a070</id>
<content type='text'>
Callers of mutex_unlock() have to make sure that the mutex stays alive
for the whole duration of the function call. For io_uring that means
that the following pattern is not valid unless we ensure that the
context outlives the mutex_unlock() call.

mutex_lock(&amp;ctx-&gt;uring_lock);
req_put(req); // typically via io_req_task_submit()
mutex_unlock(&amp;ctx-&gt;uring_lock);

Most contexts are fine: io-wq pins requests, syscalls hold the file,
task works are taking ctx references and so on. However, the task work
fallback path doesn't follow the rule.

Cc:  &lt;stable@vger.kernel.org&gt;
Fixes: 04fc6c802d ("io_uring: save ctx put/get for task_work submit")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Link: https://lore.kernel.org/io-uring/CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com/
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: use fget/fput consistently</title>
<updated>2023-11-28T18:56:29Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-28T17:29:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=73363c262d6a7d26063da96610f61baf69a70f7c'/>
<id>urn:sha1:73363c262d6a7d26063da96610f61baf69a70f7c</id>
<content type='text'>
Normally within a syscall it's fine to use fdget/fdput for grabbing a
file from the file table, and it's fine within io_uring as well. We do
that via io_uring_enter(2), io_uring_register(2), and then also for
cancel which is invoked from the latter. io_uring cannot close its own
file descriptors as that is explicitly rejected, and for the cancel
side of things, the file itself is just used as a lookup cookie.

However, it is more prudent to ensure that full references are always
grabbed. For anything threaded, either explicitly in the application
itself or through use of the io-wq worker threads, this is what happens
anyway. Generalize it and use fget/fput throughout.

Also see the below link for more details.

Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: free io_buffer_list entries via RCU</title>
<updated>2023-11-28T18:45:02Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-28T00:54:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5cf4f52e6d8aa2d3b7728f568abbf9d42a3af252'/>
<id>urn:sha1:5cf4f52e6d8aa2d3b7728f568abbf9d42a3af252</id>
<content type='text'>
mmap_lock nests under uring_lock out of necessity, as we may be doing
user copies with uring_lock held. However, for mmap of provided buffer
rings, we attempt to grab uring_lock with mmap_lock already held from
do_mmap(). This makes lockdep, rightfully, complain:

WARNING: possible circular locking dependency detected
6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted
------------------------------------------------------
buf-ring.t/442 is trying to acquire lock:
ffff00020e1480a8 (&amp;ctx-&gt;uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140

but task is already holding lock:
ffff0000dc226190 (&amp;mm-&gt;mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #1 (&amp;mm-&gt;mmap_lock){++++}-{3:3}:
       __might_fault+0x90/0xbc
       io_register_pbuf_ring+0x94/0x488
       __arm64_sys_io_uring_register+0x8dc/0x1318
       invoke_syscall+0x5c/0x17c
       el0_svc_common.constprop.0+0x108/0x130
       do_el0_svc+0x2c/0x38
       el0_svc+0x4c/0x94
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c

-&gt; #0 (&amp;ctx-&gt;uring_lock){+.+.}-{3:3}:
       __lock_acquire+0x19a0/0x2d14
       lock_acquire+0x2e0/0x44c
       __mutex_lock+0x118/0x564
       mutex_lock_nested+0x20/0x28
       io_uring_validate_mmap_request.isra.0+0x4c/0x140
       io_uring_mmu_get_unmapped_area+0x3c/0x98
       get_unmapped_area+0xa4/0x158
       do_mmap+0xec/0x5b4
       vm_mmap_pgoff+0x158/0x264
       ksys_mmap_pgoff+0x1d4/0x254
       __arm64_sys_mmap+0x80/0x9c
       invoke_syscall+0x5c/0x17c
       el0_svc_common.constprop.0+0x108/0x130
       do_el0_svc+0x2c/0x38
       el0_svc+0x4c/0x94
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c

From that mmap(2) path, we really just need to ensure that the buffer
list doesn't go away from underneath us. For the lower indexed entries,
they never go away until the ring is freed and we can always sanely
reference those as long as the caller has a file reference. For the
higher indexed ones in our xarray, we just need to ensure that the
buffer list remains valid while we return the address of it.

Free the higher indexed io_buffer_list entries via RCU. With that we can
avoid needing -&gt;uring_lock inside mmap(2), and simply hold the RCU read
lock around the buffer list lookup and address check.

To ensure that the arrayed lookup either returns a valid fully formulated
entry via RCU lookup, add an 'is_ready' flag that we access with store
and release memory ordering. This isn't needed for the xarray lookups,
but doesn't hurt either. Since this isn't a fast path, retain it across
both types. Similarly, for the allocated array inside the ctx, ensure
we use the proper load/acquire as setup could in theory be running in
parallel with mmap.

While in there, add a few lockdep checks for documentation purposes.

Cc: stable@vger.kernel.org
Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/kbuf: defer release of mapped buffer rings</title>
<updated>2023-11-28T14:56:16Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-27T23:47:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c392cbecd8eca4c53f2bf508731257d9d0a21c2d'/>
<id>urn:sha1:c392cbecd8eca4c53f2bf508731257d9d0a21c2d</id>
<content type='text'>
If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the
kernel allocates the memory for it and the application is expected to
mmap(2) this memory. However, io_uring uses remap_pfn_range() for this
operation, so we cannot rely on normal munmap/release on freeing them
for us.

Stash an io_buf_free entry away for each of these, if any, and provide
a helper to free them post -&gt;release().

Cc: stable@vger.kernel.org
Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: enable io_mem_alloc/free to be used in other parts</title>
<updated>2023-11-28T03:53:52Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-28T03:53:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=edecf1689768452ba1a64b7aaf3a47a817da651a'/>
<id>urn:sha1:edecf1689768452ba1a64b7aaf3a47a817da651a</id>
<content type='text'>
In preparation for using these helpers, make them non-static and add
them to our internal header.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP</title>
<updated>2023-11-28T00:10:56Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-28T00:08:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6f007b1406637d3d73d42e41d7e8d9b245185e69'/>
<id>urn:sha1:6f007b1406637d3d73d42e41d7e8d9b245185e69</id>
<content type='text'>
This flag only applies to the SQ and CQ rings, it's perfectly valid
to use a mmap approach for the provided ring buffers. Move the
check into where it belongs.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP</title>
<updated>2023-11-27T15:28:56Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-11-25T04:02:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=820d070feb668aab5bc9413c285a1dda2a70e076'/>
<id>urn:sha1:820d070feb668aab5bc9413c285a1dda2a70e076</id>
<content type='text'>
io_sqes_map() is used rather than io_mem_alloc(), if the application
passes in memory for mapping rather than have the kernel allocate it and
then mmap(2) the ranges. This then calls __io_uaddr_map() to perform the
page mapping and pinning, which checks if we end up with the same pages,
if more than one page is mapped. But this check is incorrect and only
checks if the first and last pages are the same, where it really should
be checking if the mapped pages are contigous. This allows mapping a
single normal page, or a huge page range.

Down the line we can add support for remapping pages to be virtually
contigous, which is really all that io_uring cares about.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux</title>
<updated>2023-11-01T21:25:08Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-11-01T21:25:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4de520f1fcefd4ebb7dddcf28bde1b330c2f6b5d'/>
<id>urn:sha1:4de520f1fcefd4ebb7dddcf28bde1b330c2f6b5d</id>
<content type='text'>
Pull io_uring futex support from Jens Axboe:
 "This adds support for using futexes through io_uring - first futex
  wake and wait, and then the vectored variant of waiting, futex waitv.

  For both wait/wake/waitv, we support the bitset variant, as the
  'normal' variants can be easily implemented on top of that.

  PI and requeue are not supported through io_uring, just the above
  mentioned parts. This may change in the future, but in the spirit of
  keeping this small (and based on what people have been asking for),
  this is what we currently have.

  Wake support is pretty straight forward, most of the thought has gone
  into the wait side to avoid needing to offload wait operations to a
  blocking context. Instead, we rely on the usual callbacks to retry and
  post a completion event, when appropriate.

  As far as I can recall, the first request for futex support with
  io_uring came from Andres Freund, working on postgres. His aio rework
  of postgres was one of the early adopters of io_uring, and futex
  support was a natural extension for that. This is relevant from both a
  usability point of view, as well as for effiency and performance. In
  Andres's words, for the former:

     Futex wait support in io_uring makes it a lot easier to avoid
     deadlocks in concurrent programs that have their own buffer pool:
     Obviously pages in the application buffer pool have to be locked
     during IO. If the initiator of IO A needs to wait for a held lock
     B, the holder of lock B might wait for the IO A to complete. The
     ability to wait for a lock and IO completions at the same time
     provides an efficient way to avoid such deadlocks

  and in terms of effiency, even without unlocking the full potential
  yet, Andres says:

     Futex wake support in io_uring is useful because it allows for more
     efficient directed wakeups. For some "locks" postgres has queues
     implemented in userspace, with wakeup logic that cannot easily be
     implemented with FUTEX_WAKE_BITSET on a single "futex word"
     (imagine waiting for journal flushes to have completed up to a
     certain point).

     Thus a "lock release" sometimes need to wake up many processes in a
     row. A quick-and-dirty conversion to doing these wakeups via
     io_uring lead to a 3% throughput increase, with 12% fewer context
     switches, albeit in a fairly extreme workload"

* tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux:
  io_uring: add support for vectored futex waits
  futex: make the vectored futex operations available
  futex: make futex_parse_waitv() available as a helper
  futex: add wake_data to struct futex_q
  io_uring: add support for futex wake and wait
  futex: abstract out a __futex_wake_mark() helper
  futex: factor out the futex wake handling
  futex: move FUTEX2_VALID_MASK to futex.h
</content>
</entry>
<entry>
<title>Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux</title>
<updated>2023-11-01T21:09:19Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-11-01T21:09:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ffa059b262ba72571e7fefe7fa2b4ebb6776b277'/>
<id>urn:sha1:ffa059b262ba72571e7fefe7fa2b4ebb6776b277</id>
<content type='text'>
Pull io_uring updates from Jens Axboe:
 "This contains the core io_uring updates, of which there are not many,
  and adds support for using WAITID through io_uring and hence not
  needing to block on these kinds of events.

  Outside of that, tweaks to the legacy provided buffer handling and
  some cleanups related to cancelations for uring_cmd support"

* tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux:
  io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups
  io_uring/kbuf: Use slab for struct io_buffer objects
  io_uring/kbuf: Allow the full buffer id space for provided buffers
  io_uring/kbuf: Fix check of BID wrapping in provided buffers
  io_uring/rsrc: cleanup io_pin_pages()
  io_uring: cancelable uring_cmd
  io_uring: retain top 8bits of uring_cmd flags for kernel internal use
  io_uring: add IORING_OP_WAITID support
  exit: add internal include file with helpers
  exit: add kernel_waitid_prepare() helper
  exit: move core of do_wait() into helper
  exit: abstract out should_wake helper for child_wait_callback()
  io_uring/rw: add support for IORING_OP_READ_MULTISHOT
  io_uring/rw: mark readv/writev as vectored in the opcode definition
  io_uring/rw: split io_read() into a helper
</content>
</entry>
<entry>
<title>io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring address</title>
<updated>2023-10-18T15:22:14Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-10-18T14:09:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8b51a3956d44ea6ade962874ade14de9a7d16556'/>
<id>urn:sha1:8b51a3956d44ea6ade962874ade14de9a7d16556</id>
<content type='text'>
If we specify a valid CQ ring address but an invalid SQ ring address,
we'll correctly spot this and free the allocated pages and clear them
to NULL. However, we don't clear the ring page count, and hence will
attempt to free the pages again. We've already cleared the address of
the page array when freeing them, but we don't check for that. This
causes the following crash:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Oops [#1]
Modules linked in:
CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56
Hardware name: ucbbar,riscvemu-bare (DT)
Workqueue: events_unbound io_ring_exit_work
epc : io_pages_free+0x2a/0x58
 ra : io_rings_free+0x3a/0x50
 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0
 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d
 [&lt;ffffffff808811a2&gt;] io_pages_free+0x2a/0x58
 [&lt;ffffffff80881406&gt;] io_rings_free+0x3a/0x50
 [&lt;ffffffff80882176&gt;] io_ring_exit_work+0x37e/0x424
 [&lt;ffffffff80027234&gt;] process_one_work+0x10c/0x1f4
 [&lt;ffffffff8002756e&gt;] worker_thread+0x252/0x31c
 [&lt;ffffffff8002f5e4&gt;] kthread+0xc4/0xe0
 [&lt;ffffffff8000332a&gt;] ret_from_fork+0xa/0x1c

Check for a NULL array in io_pages_free(), but also clear the page counts
when we free them to be on the safer side.

Reported-by: rtm@csail.mit.edu
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
