<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/drivers/infiniband, branch for-next</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=for-next</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=for-next'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-01-03T19:09:35Z</updated>
<entry>
<title>Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma</title>
<updated>2025-01-03T19:09:35Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-03T19:09:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dea3165f989b259bc1293d9dfc50053223f5f4b2'/>
<id>urn:sha1:dea3165f989b259bc1293d9dfc50053223f5f4b2</id>
<content type='text'>
Pull rdma fixes from Jason Gunthorpe:
 "A lot of fixes accumulated over the holiday break:

   - Static tool fixes, value is already proven to be NULL, possible
     integer overflow

   - Many bnxt_re fixes:
      - Crashes due to a mismatch in the maximum SGE list size
      - Don't waste memory for user QPs by creating kernel-only
        structures
      - Fix compatability issues with older HW in some of the new HW
        features recently introduced: RTS-&gt;RTS feature, work around 9096
      - Do not allow destroy_qp to fail
      - Validate QP MTU against device limits
      - Add missing validation on madatory QP attributes for RTR-&gt;RTS
      - Report port_num in query_qp as required by the spec
      - Fix creation of QPs of the maximum queue size, and in the
        variable mode
      - Allow all QPs to be used on newer HW by limiting a work around
        only to HW it affects
      - Use the correct MSN table size for variable mode QPs
      - Add missing locking in create_qp() accessing the qp_tbl
      - Form WQE buffers correctly when some of the buffers are 0 hop
      - Don't crash on QP destroy if the userspace doesn't setup the
        dip_ctx
      - Add the missing QP flush handler call on the DWQE path to avoid
        hanging on error recovery
      - Consistently use ENXIO for return codes if the devices is
        fatally errored

   - Try again to fix VLAN support on iwarp, previous fix was reverted
     due to breaking other cards

   - Correct error path return code for rdma netlink events

   - Remove the seperate net_device pointer in siw and rxe which
     syzkaller found a way to UAF

   - Fix a UAF of a stack ib_sge in rtrs

   - Fix a regression where old mlx5 devices and FW were wrongly
     activing new device features and failing"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
  RDMA/mlx5: Enable multiplane mode only when it is supported
  RDMA/bnxt_re: Fix error recovery sequence
  RDMA/rtrs: Ensure 'ib_sge list' is accessible
  RDMA/rxe: Remove the direct link to net_device
  RDMA/hns: Fix missing flush CQE for DWQE
  RDMA/hns: Fix warning storm caused by invalid input in IO path
  RDMA/hns: Fix accessing invalid dip_ctx during destroying QP
  RDMA/hns: Fix mapping error of zero-hop WQE buffer
  RDMA/bnxt_re: Fix the locking while accessing the QP table
  RDMA/bnxt_re: Fix MSN table size for variable wqe mode
  RDMA/bnxt_re: Add send queue size check for variable wqe
  RDMA/bnxt_re: Disable use of reserved wqes
  RDMA/bnxt_re: Fix max_qp_wrs reported
  RDMA/siw: Remove direct link to net_device
  RDMA/nldev: Set error code in rdma_nl_notify_event
  RDMA/bnxt_re: Fix reporting hw_ver in query_device
  RDMA/bnxt_re: Fix to export port num to ib_query_qp
  RDMA/bnxt_re: Fix setting mandatory attributes for modify_qp
  RDMA/bnxt_re: Add check for path mtu in modify_qp
  RDMA/bnxt_re: Fix the check for 9060 condition
  ...
</content>
</entry>
<entry>
<title>RDMA/mlx5: Enable multiplane mode only when it is supported</title>
<updated>2025-01-03T13:17:19Z</updated>
<author>
<name>Mark Zhang</name>
<email>markzhang@nvidia.com</email>
</author>
<published>2024-12-19T12:23:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=45d339fefaa3dcd237038769e0d34584fb867390'/>
<id>urn:sha1:45d339fefaa3dcd237038769e0d34584fb867390</id>
<content type='text'>
Driver queries vport_cxt.num_plane and enables multiplane when it is
greater then 0, but some old FWs (versions from x.40.1000 till x.42.1000),
report vport_cxt.num_plane = 1 unexpectedly.

Fix it by querying num_plane only when HCA_CAP2.multiplane bit is set.

Fixes: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
Link: https://patch.msgid.link/r/1ef901acdf564716fcf550453cf5e94f343777ec.1734610916.git.leon@kernel.org
Cc: stable@vger.kernel.org
Reported-by: Francesco Poli &lt;invernomuto@paranoici.org&gt;
Closes: https://lore.kernel.org/all/nvs4i2v7o6vn6zhmtq4sgazy2hu5kiulukxcntdelggmznnl7h@so3oul6uwgbl/
Signed-off-by: Mark Zhang &lt;markzhang@nvidia.com&gt;
Signed-off-by: Leon Romanovsky &lt;leonro@nvidia.com&gt;
Reviewed-by: Michal Swiatkowski &lt;michal.swiatkowski@linux.intel.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
</content>
</entry>
<entry>
<title>RDMA/bnxt_re: Fix error recovery sequence</title>
<updated>2024-12-31T13:28:36Z</updated>
<author>
<name>Kalesh AP</name>
<email>kalesh-anakkur.purayil@broadcom.com</email>
</author>
<published>2024-12-31T02:50:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e6178bf78d0378c2d397a6aafaf4882d0af643fa'/>
<id>urn:sha1:e6178bf78d0378c2d397a6aafaf4882d0af643fa</id>
<content type='text'>
Fixed to return ENXIO from __send_message_basic_sanity()
to indicate that device is in error state. In the case of
ERR_DEVICE_DETACHED state, the driver should not post the
commands to the firmware as it will time out eventually.

Removed bnxt_re_modify_qp() call from bnxt_re_dev_stop()
as it is a no-op.

Fixes: cc5b9b48d447 ("RDMA/bnxt_re: Recover the device when FW error is detected")
Signed-off-by: Kalesh AP &lt;kalesh-anakkur.purayil@broadcom.com&gt;
Signed-off-by: Kashyap Desai &lt;kashyap.desai@broadcom.com&gt;
Link: https://patch.msgid.link/20241231025008.2267162-1-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Selvin Xavier &lt;selvin.xavier@broadcom.com&gt;
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/rtrs: Ensure 'ib_sge list' is accessible</title>
<updated>2024-12-31T13:28:26Z</updated>
<author>
<name>Li Zhijian</name>
<email>lizhijian@fujitsu.com</email>
</author>
<published>2024-12-31T01:34:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fb514b31395946022f13a08e06a435f53cf9e8b3'/>
<id>urn:sha1:fb514b31395946022f13a08e06a435f53cf9e8b3</id>
<content type='text'>
Move the declaration of the 'ib_sge list' variable outside the
'always_invalidate' block to ensure it remains accessible for use
throughout the function.

Previously, 'ib_sge list' was declared within the 'always_invalidate'
block, limiting its accessibility, then caused a
'BUG: kernel NULL pointer dereference'[1].
 ? __die_body.cold+0x19/0x27
 ? page_fault_oops+0x15a/0x2d0
 ? search_module_extables+0x19/0x60
 ? search_bpf_extables+0x5f/0x80
 ? exc_page_fault+0x7e/0x180
 ? asm_exc_page_fault+0x26/0x30
 ? memcpy_orig+0xd5/0x140
 rxe_mr_copy+0x1c3/0x200 [rdma_rxe]
 ? rxe_pool_get_index+0x4b/0x80 [rdma_rxe]
 copy_data+0xa5/0x230 [rdma_rxe]
 rxe_requester+0xd9b/0xf70 [rdma_rxe]
 ? finish_task_switch.isra.0+0x99/0x2e0
 rxe_sender+0x13/0x40 [rdma_rxe]
 do_task+0x68/0x1e0 [rdma_rxe]
 process_one_work+0x177/0x330
 worker_thread+0x252/0x390
 ? __pfx_worker_thread+0x10/0x10

This change ensures the variable is available for subsequent operations
that require it.

[1] https://lore.kernel.org/linux-rdma/6a1f3e8f-deb0-49f9-bc69-a9b03ecfcda7@fujitsu.com/

Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Li Zhijian &lt;lizhijian@fujitsu.com&gt;
Link: https://patch.msgid.link/20241231013416.1290920-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/rxe: Remove the direct link to net_device</title>
<updated>2024-12-24T09:36:40Z</updated>
<author>
<name>Zhu Yanjun</name>
<email>yanjun.zhu@linux.dev</email>
</author>
<published>2024-12-20T22:23:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2ac5415022d16d63d912a39a06f32f1f51140261'/>
<id>urn:sha1:2ac5415022d16d63d912a39a06f32f1f51140261</id>
<content type='text'>
The similar patch in siw is in the link:
https://git.kernel.org/rdma/rdma/c/16b87037b48889

This problem also occurred in RXE. The following analyze this problem.
In the following Call Traces:
"
BUG: KASAN: slab-use-after-free in dev_get_flags+0x188/0x1d0 net/core/dev.c:8782
Read of size 4 at addr ffff8880554640b0 by task kworker/1:4/5295

CPU: 1 UID: 0 PID: 5295 Comm: kworker/1:4 Not tainted
6.12.0-rc3-syzkaller-00399-g9197b73fd7bb #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Workqueue: infiniband ib_cache_event_task
Call Trace:
 &lt;TASK&gt;
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 dev_get_flags+0x188/0x1d0 net/core/dev.c:8782
 rxe_query_port+0x12d/0x260 drivers/infiniband/sw/rxe/rxe_verbs.c:60
 __ib_query_port drivers/infiniband/core/device.c:2111 [inline]
 ib_query_port+0x168/0x7d0 drivers/infiniband/core/device.c:2143
 ib_cache_update+0x1a9/0xb80 drivers/infiniband/core/cache.c:1494
 ib_cache_event_task+0xf3/0x1e0 drivers/infiniband/core/cache.c:1568
 process_one_work kernel/workqueue.c:3229 [inline]
 process_scheduled_works+0xa65/0x1850 kernel/workqueue.c:3310
 worker_thread+0x870/0xd30 kernel/workqueue.c:3391
 kthread+0x2f2/0x390 kernel/kthread.c:389
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 &lt;/TASK&gt;
"

1). In the link [1],

"
 infiniband syz2: set down
"

This means that on 839.350575, the event ib_cache_event_task was sent andi
queued in ib_wq.

2). In the link [1],

"
 team0 (unregistering): Port device team_slave_0 removed
"

It indicates that before 843.251853, the net device should be freed.

3). In the link [1],

"
 BUG: KASAN: slab-use-after-free in dev_get_flags+0x188/0x1d0
"

This means that on 850.559070, this slab-use-after-free problem occurred.

In all, on 839.350575, the event ib_cache_event_task was sent and queued
in ib_wq,

before 843.251853, the net device veth was freed.

on 850.559070, this event was executed, and the mentioned freed net device
was called. Thus, the above call trace occurred.

[1] https://syzkaller.appspot.com/x/log.txt?x=12e7025f980000

Reported-by: syzbot+4b87489410b4efd181bf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4b87489410b4efd181bf
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Zhu Yanjun &lt;yanjun.zhu@linux.dev&gt;
Link: https://patch.msgid.link/20241220222325.2487767-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/hns: Fix missing flush CQE for DWQE</title>
<updated>2024-12-23T14:58:30Z</updated>
<author>
<name>Chengchang Tang</name>
<email>tangchengchang@huawei.com</email>
</author>
<published>2024-12-20T05:52:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e3debdd48423d3d75b9d366399228d7225d902cd'/>
<id>urn:sha1:e3debdd48423d3d75b9d366399228d7225d902cd</id>
<content type='text'>
Flush CQE handler has not been called if QP state gets into errored
mode in DWQE path. So, the new added outstanding WQEs will never be
flushed.

It leads to a hung task timeout when using NFS over RDMA:
    __switch_to+0x7c/0xd0
    __schedule+0x350/0x750
    schedule+0x50/0xf0
    schedule_timeout+0x2c8/0x340
    wait_for_common+0xf4/0x2b0
    wait_for_completion+0x20/0x40
    __ib_drain_sq+0x140/0x1d0 [ib_core]
    ib_drain_sq+0x98/0xb0 [ib_core]
    rpcrdma_xprt_disconnect+0x68/0x270 [rpcrdma]
    xprt_rdma_close+0x20/0x60 [rpcrdma]
    xprt_autoclose+0x64/0x1cc [sunrpc]
    process_one_work+0x1d8/0x4e0
    worker_thread+0x154/0x420
    kthread+0x108/0x150
    ret_from_fork+0x10/0x18

Fixes: 01584a5edcc4 ("RDMA/hns: Add support of direct wqe")
Signed-off-by: Chengchang Tang &lt;tangchengchang@huawei.com&gt;
Signed-off-by: Junxian Huang &lt;huangjunxian6@hisilicon.com&gt;
Link: https://patch.msgid.link/20241220055249.146943-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/hns: Fix warning storm caused by invalid input in IO path</title>
<updated>2024-12-23T14:58:30Z</updated>
<author>
<name>Chengchang Tang</name>
<email>tangchengchang@huawei.com</email>
</author>
<published>2024-12-20T05:52:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fa5c4ba8cdbfd2c2d6422e001311c8213283ebbf'/>
<id>urn:sha1:fa5c4ba8cdbfd2c2d6422e001311c8213283ebbf</id>
<content type='text'>
WARN_ON() is called in the IO path. And it could lead to a warning
storm. Use WARN_ON_ONCE() instead of WARN_ON().

Fixes: 12542f1de179 ("RDMA/hns: Refactor process about opcode in post_send()")
Signed-off-by: Chengchang Tang &lt;tangchengchang@huawei.com&gt;
Signed-off-by: Junxian Huang &lt;huangjunxian6@hisilicon.com&gt;
Link: https://patch.msgid.link/20241220055249.146943-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/hns: Fix accessing invalid dip_ctx during destroying QP</title>
<updated>2024-12-23T14:58:30Z</updated>
<author>
<name>Chengchang Tang</name>
<email>tangchengchang@huawei.com</email>
</author>
<published>2024-12-20T05:52:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0572eccf239ce4bd89bd531767ec5ab20e249290'/>
<id>urn:sha1:0572eccf239ce4bd89bd531767ec5ab20e249290</id>
<content type='text'>
If it fails to modify QP to RTR, dip_ctx will not be attached. And
during detroying QP, the invalid dip_ctx pointer will be accessed.

Fixes: faa62440a577 ("RDMA/hns: Fix different dgids mapping to the same dip_idx")
Signed-off-by: Chengchang Tang &lt;tangchengchang@huawei.com&gt;
Signed-off-by: Junxian Huang &lt;huangjunxian6@hisilicon.com&gt;
Link: https://patch.msgid.link/20241220055249.146943-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/hns: Fix mapping error of zero-hop WQE buffer</title>
<updated>2024-12-23T14:58:30Z</updated>
<author>
<name>wenglianfa</name>
<email>wenglianfa@huawei.com</email>
</author>
<published>2024-12-20T05:52:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8673a6c2d9e483dfeeef83a1f06f59e05636f4d1'/>
<id>urn:sha1:8673a6c2d9e483dfeeef83a1f06f59e05636f4d1</id>
<content type='text'>
Due to HW limitation, the three region of WQE buffer must be mapped
and set to HW in a fixed order: SQ buffer, SGE buffer, and RQ buffer.

Currently when one region is zero-hop while the other two are not,
the zero-hop region will not be mapped. This violate the limitation
above and leads to address error.

Fixes: 38389eaa4db1 ("RDMA/hns: Add mtr support for mixed multihop addressing")
Signed-off-by: wenglianfa &lt;wenglianfa@huawei.com&gt;
Signed-off-by: Junxian Huang &lt;huangjunxian6@hisilicon.com&gt;
Link: https://patch.msgid.link/20241220055249.146943-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDMA/bnxt_re: Fix the locking while accessing the QP table</title>
<updated>2024-12-19T11:57:19Z</updated>
<author>
<name>Selvin Xavier</name>
<email>selvin.xavier@broadcom.com</email>
</author>
<published>2024-12-17T10:26:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9272cba0ded71b5a2084da3004ec7806b8cb7fd2'/>
<id>urn:sha1:9272cba0ded71b5a2084da3004ec7806b8cb7fd2</id>
<content type='text'>
QP table handling is synchronized with destroy QP and Async
event from the HW. The same needs to be synchronized
during create_qp also. Use the same lock in create_qp also.

Fixes: 76d3ddff7153 ("RDMA/bnxt_re: synchronize the qp-handle table array")
Fixes: f218d67ef004 ("RDMA/bnxt_re: Allow posting when QPs are in error")
Fixes: 84cf229f4001 ("RDMA/bnxt_re: Fix the qp table indexing")
Signed-off-by: Selvin Xavier &lt;selvin.xavier@broadcom.com&gt;
Link: https://patch.msgid.link/20241217102649.1377704-6-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky &lt;leon@kernel.org&gt;
</content>
</entry>
</feed>
