<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/cgroup, branch v5.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.2</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2019-06-19T15:09:08Z</updated>
<entry>
<title>treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 451</title>
<updated>2019-06-19T15:09:08Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2019-06-04T08:10:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f85d208658468b1a298f31daddb05a7684969cd4'/>
<id>urn:sha1:f85d208658468b1a298f31daddb05a7684969cd4</id>
<content type='text'>
Based on 1 normalized pattern(s):

  this file is subject to the terms and conditions of version 2 of the
  gnu general public license see the file copying in the main
  directory of the linux distribution for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Enrico Weigelt &lt;info@metux.net&gt;
Reviewed-by: Allison Randal &lt;allison@lohutok.net&gt;
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081200.872755311@linutronix.de
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2019-06-15T03:46:14Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-06-15T03:46:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0011572c883082a95e02d47f45fc4a42dc0e8634'/>
<id>urn:sha1:0011572c883082a95e02d47f45fc4a42dc0e8634</id>
<content type='text'>
Pull cgroup fixes from Tejun Heo:
 "This has an unusually high density of tricky fixes:

   - task_get_css() could deadlock when it races against a dying cgroup.

   - cgroup.procs didn't list thread group leaders with live threads.

     This could mislead readers to think that a cgroup is empty when
     it's not. Fixed by making PROCS iterator include dead tasks. I made
     a couple mistakes making this change and this pull request contains
     a couple follow-up patches.

   - When cpusets run out of online cpus, it updates cpusmasks of member
     tasks in bizarre ways. Joel improved the behavior significantly"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: restore sanity to cpuset_cpus_allowed_fallback()
  cgroup: Fix css_task_iter_advance_css_set() cset skip condition
  cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
  cgroup: Include dying leaders with live threads in PROCS iterations
  cgroup: Implement css_task_iter_skip()
  cgroup: Call cgroup_release() before __exit_signal()
  docs cgroups: add another example size for hugetlb
  cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
</content>
</entry>
<entry>
<title>cpuset: restore sanity to cpuset_cpus_allowed_fallback()</title>
<updated>2019-06-12T18:00:08Z</updated>
<author>
<name>Joel Savitz</name>
<email>jsavitz@redhat.com</email>
</author>
<published>2019-06-12T15:50:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d477f8c202d1f0d4791ab1263ca7657bbe5cf79e'/>
<id>urn:sha1:d477f8c202d1f0d4791ab1263ca7657bbe5cf79e</id>
<content type='text'>
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk-&gt;cpus_allowed to
the current value of task_cs(tsk)-&gt;effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off &gt; /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on &gt; /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off &gt; /sys/devices/system/cpu/smt/control
	# echo on &gt; /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)-&gt;cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long &lt;longman@redhat.com&gt;
Suggested-by: Phil Auld &lt;pauld@redhat.com&gt;
Signed-off-by: Joel Savitz &lt;jsavitz@redhat.com&gt;
Acked-by: Phil Auld &lt;pauld@redhat.com&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Fix css_task_iter_advance_css_set() cset skip condition</title>
<updated>2019-06-10T16:08:27Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-06-10T16:08:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c596687a008b579c503afb7a64fcacc7270fae9e'/>
<id>urn:sha1:c596687a008b579c503afb7a64fcacc7270fae9e</id>
<content type='text'>
While adding handling for dying task group leaders c03cd7738a83
("cgroup: Include dying leaders with live threads in PROCS
iterations") added an inverted cset skip condition to
css_task_iter_advance_css_set().  It should skip cset if it's
completely empty but was incorrectly testing for the inverse condition
for the dying_tasks list.  Fix it.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
</content>
</entry>
<entry>
<title>cgroup/bfq: revert bfq.weight symlink change</title>
<updated>2019-06-10T09:35:41Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2019-06-10T09:35:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cf8929885de318c0bf73438c9e5dde59d6536f7c'/>
<id>urn:sha1:cf8929885de318c0bf73438c9e5dde59d6536f7c</id>
<content type='text'>
There's some discussion on how to do this the best, and Tejun prefers
that BFQ just create the file itself instead of having cgroups support
a symlink feature.

Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
can be done properly for 5.3.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>cgroup: let a symlink too be created with a cftype file</title>
<updated>2019-06-07T07:29:39Z</updated>
<author>
<name>Angelo Ruocco</name>
<email>angeloruocco90@gmail.com</email>
</author>
<published>2019-05-21T08:01:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=54b7b868e826b294687c439b68ec55fe20cafe5b'/>
<id>urn:sha1:54b7b868e826b294687c439b68ec55fe20cafe5b</id>
<content type='text'>
This commit enables a cftype to have a symlink (of any name) that
points to the file associated with the cftype.

Signed-off-by: Angelo Ruocco &lt;angeloruocco90@gmail.com&gt;
Signed-off-by: Paolo Valente &lt;paolo.valente@linaro.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>cgroup: css_task_iter_skip()'d iterators must be advanced before accessed</title>
<updated>2019-06-05T16:54:34Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-06-05T16:54:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cee0c33c546a93957a52ae9ab6bebadbee765ec5'/>
<id>urn:sha1:cee0c33c546a93957a52ae9ab6bebadbee765ec5</id>
<content type='text'>
b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") introduced
css_task_iter_skip() which is used to fix task iterations skipping
dying threadgroup leaders with live threads.  Skipping is implemented
as a subportion of full advancing but css_task_iter_next() forgot to
fully advance a skipped iterator before determining the next task to
visit causing it to return invalid task pointers.

Fix it by making css_task_iter_next() fully advance the iterator if it
has been skipped since the previous iteration.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: syzbot
Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
</content>
</entry>
<entry>
<title>mm, memcg: consider subtrees in memory.events</title>
<updated>2019-06-01T22:51:31Z</updated>
<author>
<name>Chris Down</name>
<email>chris@chrisdown.name</email>
</author>
<published>2019-06-01T05:30:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9852ae3fe5293264f01c49f2571ef7688f7823ce'/>
<id>urn:sha1:9852ae3fe5293264f01c49f2571ef7688f7823ce</id>
<content type='text'>
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.

The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files.  For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour.  The same confusion applies to other counters in this file when
debugging issues.

Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.

After this patch, events are propagated up the hierarchy:

    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 0
    oom 0
    oom_kill 0
    [root@ktst ~]# systemd-run -p MemoryMax=1 true
    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 7
    oom 1
    oom_kill 1

As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set.  However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.

akpm: this is a behaviour change, so Cc:stable.  THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.

Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down &lt;chris@chrisdown.name&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Include dying leaders with live threads in PROCS iterations</title>
<updated>2019-05-31T17:38:58Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-05-31T17:38:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c03cd7738a83b13739f00546166969342c8ff014'/>
<id>urn:sha1:c03cd7738a83b13739f00546166969342c8ff014</id>
<content type='text'>
CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Topi Miettinen &lt;toiwoton@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
</content>
</entry>
<entry>
<title>cgroup: Implement css_task_iter_skip()</title>
<updated>2019-05-31T17:38:58Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-05-31T17:38:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b636fd38dc40113f853337a7d2a6885ad23b8811'/>
<id>urn:sha1:b636fd38dc40113f853337a7d2a6885ad23b8811</id>
<content type='text'>
When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call.  This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset-&gt;tasks to (to be added) cset-&gt;dying_tasks.  When we
remove a task from cset-&gt;tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.

This patch separates out skipping from advancing.  Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing.  This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.

This doesn't cause any visible behavior changes.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
</content>
</entry>
</feed>
