linux/kernel/cgroup, branch v6.13

cgroup/cpuset: remove kernfs active break

2025-01-09T01:54:39Z

A warning was found: WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828 CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0 RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202 RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04 RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180 R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08 R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0 FS: 00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: kernfs_drain+0x15e/0x2f0 __kernfs_remove+0x165/0x300 kernfs_remove_by_name_ns+0x7b/0xc0 cgroup_rm_file+0x154/0x1c0 cgroup_addrm_files+0x1c2/0x1f0 css_clear_dir+0x77/0x110 kill_css+0x4c/0x1b0 cgroup_destroy_locked+0x194/0x380 cgroup_rmdir+0x2a/0x140 It can be explained by: rmdir echo 1 > cpuset.cpus kernfs_fop_write_iter // active=0 cgroup_rm_file kernfs_remove_by_name_ns kernfs_get_active // active=1 __kernfs_remove // active=0x80000002 kernfs_drain cpuset_write_resmask wait_event //waiting (active == 0x80000001) kernfs_break_active_protection // active = 0x80000001 // continue kernfs_unbreak_active_protection // active = 0x80000002 ... kernfs_should_drain_open_files // warning occurs kernfs_put_active This warning is caused by 'kernfs_break_active_protection' when it is writing to cpuset.cpus, and the cgroup is removed concurrently. The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()") made cpuset_hotplug_workfn asynchronous, This change involves calling flush_work(), which can create a multiple processes circular locking dependency that involve cgroup_mutex, potentially leading to a deadlock. To avoid deadlock. the commit 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") added 'kernfs_break_active_protection' in the cpuset_write_resmask. This could lead to this warning. After the commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), the cpuset_write_resmask no longer needs to wait the hotplug to finish, which means that concurrent hotplug and cpuset operations are no longer possible. Therefore, the deadlock doesn't exist anymore and it does not have to 'break active protection' now. To fix this warning, just remove kernfs_break_active_protection operation in the 'cpuset_write_resmask'. Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively") Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") Reported-by: Ji Fa Signed-off-by: Chen Ridong Acked-by: Waiman Long Signed-off-by: Tejun Heo

cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains

2024-12-11T15:45:52Z

Isolated CPUs are not allowed to be used in a non-isolated partition. The only exception is the top cpuset which is allowed to contain boot time isolated CPUs. Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") introduces a simplified scheme of including only partition roots in sched domain generation. However, it does not properly account for this exception case. This can result in leakage of isolated CPUs into a sched domain. Fix it by making sure that isolated CPUs are excluded from the top cpuset before generating sched domains. Also update the way the boot time isolated CPUs are handled in test_cpuset_prs.sh to make sure that those isolated CPUs are really isolated instead of just skipping them in the tests. Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") Signed-off-by: Waiman Long Signed-off-by: Tejun Heo

cgroup/cpuset: Remove stale text

2024-12-11T06:38:41Z

Task's cpuset pointer was removed by commit 8793d854edbc ("Task Control Groups: make cpusets a client of cgroups") Paragraph "The task_lock() exception ...." was removed by commit 2df167a300d7 ("cgroups: update comments in cpuset.c") Remove stale text: We also require taking task_lock() when dereferencing a task's cpuset pointer. See "The task_lock() exception", at the end of this comment. Accessing a task's cpuset should be done in accordance with the guidelines for accessing subsystem state in kernel/cgroup.c and reformat. Co-developed-by: Michal Koutný Co-developed-by: Waiman Long Signed-off-by: Costa Shulyupin Acked-by: Waiman Long Signed-off-by: Tejun Heo

Merge tag 'cgroup-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2024-11-20T17:54:49Z

Pull cgroup updates from Tejun Heo: - cpu.stat now also shows niced CPU time - Freezer and cpuset optimizations - Other misc changes * tag 'cgroup-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()" MAINTAINERS: remove Zefan Li cgroup/freezer: Add cgroup CGRP_FROZEN flag update helper cgroup/freezer: Reduce redundant traversal for cgroup_freeze cgroup/bpf: only cgroup v2 can be attached by bpf programs Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline" selftests/cgroup: Fix compile error in test_cpu.c cgroup/rstat: Selftests for niced CPU statistics cgroup/rstat: Tracking cgroup-level niced CPU time cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2024-11-18T20:24:06Z

Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...

cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing

2024-11-14T18:44:03Z

With some recent proposed changes [1] in the deadline server code, it has caused a test failure in test_cpuset_prs.sh when a change is being made to an isolated partition. This is due to failing the cpuset_cpumask_can_shrink() check for SCHED_DEADLINE tasks at validate_change(). This is actually a false positive as the failed test case involves an isolated partition with load balancing disabled. The deadline check is not meaningful in this case and the users should know what they are doing. Fix this by doing the cpuset_cpumask_can_shrink() check only when loading balanced is enabled. Also change its arguments to use effective_cpus for the current cpuset and user_xcpus() as an approiximation for the target effective_cpus as the real effective_cpus hasn't been fully computed yet as this early stage. As the check isn't comprehensive, there may be false positives or negatives. We may have to revise the code to do a more thorough check in the future if this becomes a concern. [1] https://lore.kernel.org/lkml/82be06c1-6d6d-4651-86c9-bcc828cbcb80@redhat.com/T/#t Signed-off-by: Waiman Long Signed-off-by: Tejun Heo

cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set

2024-11-12T19:07:38Z

Currently the cpuset code uses group_subsys_on_dfl() to check if we are running with cgroup v2. If CONFIG_CPUSETS_V1 isn't set, there is really no need to do this check and we can optimize out some of the unneeded v1 specific code paths. Introduce a new cpuset_v2() and use it to replace the cgroup_subsys_on_dfl() check to further optimize the code. Signed-off-by: Waiman Long Signed-off-by: Tejun Heo

cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation

2024-11-12T19:07:09Z

Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug"), there is only one rebuild_sched_domains_locked() call per hotplug operation. However, writing to the various cpuset control files may still casue more than one rebuild_sched_domains_locked() call to happen in some cases. Juri had found that two rebuild_sched_domains_locked() calls in update_prstate(), one from update_cpumasks_hier() and another one from update_partition_sd_lb() could cause cpuset partition to be created with null total_bw for DL tasks. IOW, DL tasks may not be scheduled correctly in such a partition. A sample command sequence that can reproduce null total_bw is as follows. # echo Y >/sys/kernel/debug/sched/verbose # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control # mkdir /sys/fs/cgroup/test # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition Fix this double rebuild_sched_domains_locked() calls problem by replacing existing calls with cpuset_force_rebuild() except the rebuild_sched_domains_cpuslocked() call at the end of cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is now done at the end of cpuset_write_resmask() and update_prstate() to determine if rebuild_sched_domains_locked() should be called or not. The cpuset v1 code can still call rebuild_sched_domains_locked() directly as double rebuild_sched_domains_locked() calls is not possible. Reported-by: Juri Lelli Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/ Signed-off-by: Waiman Long Tested-by: Juri Lelli Signed-off-by: Tejun Heo

cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"

2024-11-12T19:07:01Z

Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier()") to allow for an alternative way to suppress unnecessary rebuild_sched_domains_locked() calls in update_cpumasks_hier() and elsewhere in a following commit. Signed-off-by: Waiman Long Signed-off-by: Tejun Heo

css_set_fork(): switch to CLASS(fd_raw, ...)

2024-11-03T06:28:07Z

reference acquired there by fget_raw() is not stashed anywhere - we could as well borrow instead. Reviewed-by: Christian Brauner Signed-off-by: Al Viro