linux/kernel/workqueue.c, branch v7.1-rc2

Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

2026-04-15T17:32:08Z

Pull workqueue updates from Tejun Heo: - New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into smaller shards to improve scalability on machines with many CPUs per LLC - Misc: - system_dfl_long_wq for long unbound works - devm_alloc_workqueue() for device-managed allocation - sysfs exposure for ordered workqueues and the EFI workqueue - removal of HK_TYPE_WQ from wq_unbound_cpumask - various small fixes * tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits) workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id() workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value workqueue: avoid unguarded 64-bit division docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope workqueue: add test_workqueue benchmark module tools/workqueue: add CACHE_SHARD support to wq_dump.py workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope workqueue: add WQ_AFFN_CACHE_SHARD affinity scope workqueue: fix typo in WQ_AFFN_SMT comment workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path workqueue: Remove NULL wq WARN in __queue_delayed_work() workqueue: fix parse_affn_scope() prefix matching bug workqueue: devres: Add device-managed allocate workqueue workqueue: Add system_dfl_long_wq for long unbound works tools/workqueue/wq_dump.py: add NODE prefix to all node columns tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header efi: Allow to expose the workqueue via sysfs workqueue: Allow to expose ordered workqueues via sysfs ...

workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()

2026-04-13T16:15:26Z

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so cpu_shard_id[] is a single-element array (int[1]). In llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an unsigned int that the compiler cannot prove is always 0, triggering a -Warray-bounds warning when the result is used to index cpu_shard_id[]: kernel/workqueue.c:8321:55: warning: array subscript 1 is above array bounds of 'int[1]' [-Warray-bounds] 8321 | cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)]; | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is a false positive: sibling_cpus can never be empty here because 'c' itself is always set in it, so cpumask_first() will always return a valid CPU. However, the compiler cannot prove this statically, and the warning only manifests on UP configs where the array size is 1. Add a bounds check with WARN_ON_ONCE to silence the warning, and store the result in a local variable to make the code clearer and avoid calling cpumask_first() twice. Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/ Signed-off-by: Breno Leitao Signed-off-by: Tejun Heo

workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value

2026-04-07T18:13:19Z

use NR_STD_WORKER_POOLS for irq_work_fns[] array definition. NR_STD_WORKER_POOLS is also 2, but better to use MACRO. Initialization loop for_each_bh_worker_pool() also uses same MACRO. Signed-off-by: Maninder Singh Signed-off-by: Tejun Heo

workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope

2026-04-01T20:24:18Z

Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound workqueues. On systems where many CPUs share one LLC, the previous default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool, causing heavy spinlock contention on pool->lock. WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing a better balance between locality and contention. Users can revert to the previous behavior with workqueue.default_affinity_scope=cache. On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single shard covering the entire LLC, making it functionally identical to the previous CACHE default. The sharding only activates when an LLC has more than 8 cores. Signed-off-by: Breno Leitao Signed-off-by: Tejun Heo

workqueue: add WQ_AFFN_CACHE_SHARD affinity scope

2026-04-01T20:24:18Z

On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system. The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity. Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores. The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type(). Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread), show cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to cache scope on this 72-core single-LLC system. Suggested-by: Tejun Heo Signed-off-by: Breno Leitao Signed-off-by: Tejun Heo

workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works

2026-04-01T20:18:22Z

In unplug_oldest_pwq(), the first inactive work item on the pool_workqueue is activated correctly. However, if multiple inactive works exist on the same pool_workqueue, subsequent works fail to activate because wq_node_nr_active.pending_pwqs is empty — the list insertion is skipped when the pool_workqueue is plugged. Fix this by checking for additional inactive works in unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs accordingly. Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org Cc: Carlos Santa Cc: Ryan Neph Cc: Lai Jiangshan Cc: Waiman Long Cc: linux-kernel@vger.kernel.org Signed-off-by: Matthew Brost Signed-off-by: Tejun Heo Acked-by: Waiman Long

workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask

2026-03-31T21:43:25Z

For historical reason, wq_unbound_cpumask is initially set as intersection of HK_TYPE_DOMAIN, HK_TYPE_WQ and workqueue.unbound_cpus boot command line option. At run time, users can update the unbound cpumask via the /sys/devices/virtual/workqueue/cpumask sysfs file. Creation and modification of cpuset isolated partitions will also update wq_unbound_cpumask based on the latest HK_TYPE_DOMAIN cpumask. The HK_TYPE_WQ cpumask is out of the picture with these runtime updates. Complete the transition by taking HK_TYPE_WQ out from the workqueue code and make it depends on HK_TYPE_DOMAIN only from the housekeeping side. The final goal is to eliminate HK_TYPE_WQ as a housekeeping cpumask type. Signed-off-by: Waiman Long Acked-by: Frederic Weisbecker Signed-off-by: Tejun Heo

workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path

2026-03-25T15:52:01Z

When alloc_and_link_pwqs() fails partway through the per-cpu allocation loop, some pool_workqueues may have already been linked into wq->pwqs via link_pwq(). The error path frees these pwqs with kmem_cache_free() but never removes them from the wq->pwqs list, leaving dangling pointers in the list. Currently this is not exploitable because the workqueue was never added to the global workqueues list and the caller frees the wq immediately after. However, this makes sure that alloc_and_link_pwqs() doesn't leave any half-baked structure, which may have side effects if not properly cleaned up. Fix this by unlinking each pwq from wq->pwqs before freeing it. No locking is needed as the workqueue has not been published yet, thus no concurrency is possible. Signed-off-by: Breno Leitao Signed-off-by: Tejun Heo

workqueue: Better describe stall check

2026-03-25T15:51:02Z

Try to be more explicit why the workqueue watchdog does not take pool->lock by default. Spin locks are full memory barriers which delay anything. Obviously, they would primary delay operations on the related worker pools. Explain why it is enough to prevent the false positive by re-checking the timestamp under the pool->lock. Finally, make it clear what would be the alternative solution in __queue_work() which is a hotter path. Signed-off-by: Petr Mladek Acked-by: Song Liu Signed-off-by: Tejun Heo

workqueue: Fix false positive stall reports

2026-03-22T04:34:59Z

On weakly ordered architectures (e.g., arm64), the lockless check in wq_watchdog_timer_fn() can observe a reordering between the worklist insertion and the last_progress_ts update. Specifically, the watchdog can see a non-empty worklist (from a list_add) while reading a stale last_progress_ts value, causing a false positive stall report. This was confirmed by reading pool->last_progress_ts again after holding pool->lock in wq_watchdog_timer_fn(): workqueue watchdog: pool 7 false positive detected! lockless_ts=4784580465 locked_ts=4785033728 diff=453263ms worklist_empty=0 To avoid slowing down the hot path (queue_work, etc.), recheck last_progress_ts with pool->lock held. This will eliminate the false positive with minimal overhead. Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it. Fixes: 82607adcf9cd ("workqueue: implement lockup detector") Cc: stable@vger.kernel.org # v4.5+ Assisted-by: claude-code:claude-opus-4-6 Signed-off-by: Song Liu Signed-off-by: Tejun Heo