linux/block, branch v5.14

Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block

2021-08-27T23:08:29Z

Pull block fixes from Jens Axboe: - Revert the mq-deadline priority handling, it's causing serious performance regressions. While experimental patches exists to fix this up, it's too late to do so now. Revert it and re-do it properly for 5.15 instead. - Fix a NULL vs IS_ERR() regression in this release (Dan) - Fix a mq-deadline accounting regression in this release (Bart) - Mark cryptoloop as deprecated. It's broken and dm-crypt fully supports it, and it's actively intefering with loop. Plan on removal for 5.16 (Christoph) * tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block: cryptoloop: add a deprecation warning pd: fix a NULL vs IS_ERR() check Revert "block/mq-deadline: Prioritize high-priority requests" mq-deadline: Fix request accounting

Revert "block/mq-deadline: Prioritize high-priority requests"

2021-08-26T18:59:44Z

This reverts commit fb926032b3209300f9dc454a36b8299582ae545c. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by: Zhen Lei Signed-off-by: Jens Axboe

mq-deadline: Fix request accounting

2021-08-24T22:18:01Z

The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents that lower priority I/O is delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel Cc: Damien Le Moal Cc: Hannes Reinecke Reported-by: Niklas Cassel Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") Signed-off-by: Bart Van Assche Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by: Niklas Cassel Tested-by: Niklas Cassel Signed-off-by: Jens Axboe

Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block

2021-08-21T15:11:22Z

Pull block fixes from Jens Axboe: "Three fixes from Ming Lei that should go into 5.14: - Fix for a kernel panic when iterating over tags for some cases where a flush request is present, a regression in this cycle. - Request timeout fix - Fix flush request checking" * tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block: blk-mq: fix is_flush_rq blk-mq: fix kernel panic during iterating over flush request blk-mq: don't grab rq's refcount in blk_mq_check_expired()

blk-mq: fix is_flush_rq

2021-08-18T02:17:34Z

is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the following check: hctx->fq->flush_rq == req but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because: 1) memory re-order in blk_mq_rq_ctx_init(): rq->mq_hctx = data->hctx; ... refcount_set(&rq->ref, 1); OR 2) tag re-use and ->rqs[] isn't updated with new request. Fix the issue by re-writing is_flush_rq() as: return rq->end_io == flush_end_io; which turns out simpler to follow and immune to data race since we have ordered WRITE rq->end_io and refcount_set(&rq->ref, 1). Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Cc: "Blank-Burian, Markus, Dr." Cc: Yufen Yu Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com Signed-off-by: Jens Axboe

blk-mq: fix kernel panic during iterating over flush request

2021-08-17T14:33:32Z

For fixing use-after-free during iterating over requests, we grabbed request's refcount before calling ->fn in commit 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter"). Turns out this way may cause kernel panic when iterating over one flush request: 1) old flush request's tag is just released, and this tag is reused by one new request, but ->rqs[] isn't updated yet 2) the flush request can be re-used for submitting one new flush command, so blk_rq_init() is called at the same time 3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called, flush_rq->end_io may not be updated yet, so NULL pointer dereference is triggered in blk_mq_put_rq_ref(). Fix the issue by calling refcount_set(&flush_rq->ref, 1) after flush_rq->end_io is set. So far the only other caller of blk_rq_init() is scsi_ioctl_reset() in which the request doesn't enter block IO stack and the request reference count isn't used, so the change is safe. Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reported-by: "Blank-Burian, Markus, Dr." Tested-by: "Blank-Burian, Markus, Dr." Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig Reviewed-by: John Garry Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com Signed-off-by: Jens Axboe

blk-mq: don't grab rq's refcount in blk_mq_check_expired()

2021-08-17T14:32:45Z

Inside blk_mq_queue_tag_busy_iter() we already grabbed request's refcount before calling ->fn(), so needn't to grab it one more time in blk_mq_check_expired(). Meantime remove extra request expire check in blk_mq_check_expired(). Cc: Keith Busch Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig Reviewed-by: John Garry Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com Signed-off-by: Jens Axboe

Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block

2021-08-13T23:36:42Z

Pull block fixes from Jens Axboe: "A few fixes for block that should go into 5.14: - Revert the mq-deadline cgroup addition. More work is needed on this front, let's revert it for now and get it right before having it in a released kernel (Tejun) - blk-iocost lockdep fix (Ming) - nbd double completion fix (Xie) - Fix for non-idling when clearing the shared tag flag (Yu)" * tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block: nbd: Aovid double completion of a request blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED Revert "block/mq-deadline: Add cgroup support" blk-iocost: fix lockdep warning on blkcg->lock

blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED

2021-08-13T14:01:34Z

We run a test that delete and recover devcies frequently(two devices on the same host), and we found that 'active_queues' is super big after a period of time. If device a and device b share a tag set, and a is deleted, then blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is only one queue that are using the tag set. However, if b is still active, the active_queues of b might never be cleared even if b is deleted. Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared. Signed-off-by: Yu Kuai Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com Signed-off-by: Jens Axboe

Revert "block/mq-deadline: Add cgroup support"

2021-08-11T19:47:26Z

This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche Cc: Jens Axboe Signed-off-by: Tejun Heo Signed-off-by: Jens Axboe