<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/Documentation/cgroups, branch v4.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.2</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-06-27T02:50:04Z</updated>
<entry>
<title>Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2015-06-27T02:50:04Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-06-27T02:50:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bbe179f88d39274630823a0dc07d2714fd19a103'/>
<id>urn:sha1:bbe179f88d39274630823a0dc07d2714fd19a103</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - threadgroup_lock got reorganized so that its users can pick the
   actual locking mechanism to use.  Its only user - cgroups - is
   updated to use a percpu_rwsem instead of per-process rwsem.

   This makes things a bit lighter on hot paths and allows cgroups to
   perform and fail multi-task (a process) migrations atomically.
   Multi-task migrations are used in several places including the
   unified hierarchy.

 - Delegation rule and documentation added to unified hierarchy.  This
   will likely be the last interface update from the cgroup core side
   for unified hierarchy before lifting the devel mask.

 - Some groundwork for the pids controller which is scheduled to be
   merged in the coming devel cycle.

* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: add delegation section to unified hierarchy documentation
  cgroup: require write perm on common ancestor when moving processes on the default hierarchy
  cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
  kernfs: make kernfs_get_inode() public
  MAINTAINERS: add a cgroup core co-maintainer
  cgroup: fix uninitialised iterator in for_each_subsys_which
  cgroup: replace explicit ss_mask checking with for_each_subsys_which
  cgroup: use bitmask to filter for_each_subsys
  cgroup: add seq_file forward declaration for struct cftype
  cgroup: simplify threadgroup locking
  sched, cgroup: replace signal_struct-&gt;group_rwsem with a global percpu_rwsem
  sched, cgroup: reorganize threadgroup locking
  cgroup: switch to unsigned long for bitmasks
  cgroup: reorganize include/linux/cgroup.h
  cgroup: separate out include/linux/cgroup-defs.h
  cgroup: fix some comment typos
</content>
</entry>
<entry>
<title>cgroup: add delegation section to unified hierarchy documentation</title>
<updated>2015-06-18T20:54:28Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-06-18T20:54:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8a0792ef8e01f03cb43806c6a87738bde34df713'/>
<id>urn:sha1:8a0792ef8e01f03cb43806c6a87738bde34df713</id>
<content type='text'>
v2: Rearranged paragraphs as suggested by Johannes Weiner.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
</content>
</entry>
<entry>
<title>writeback, blkio: add documentation for cgroup writeback support</title>
<updated>2015-06-17T18:47:40Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-06-16T22:48:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3e1534cf4a2a8278e811e7c84a79da1a02347b8b'/>
<id>urn:sha1:3e1534cf4a2a8278e811e7c84a79da1a02347b8b</id>
<content type='text'>
Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Li Zefan &lt;lizefan@huawei.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: cgroups@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>memcg: add per cgroup dirty page accounting</title>
<updated>2015-06-02T14:33:33Z</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2015-05-22T21:13:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c4843a7593a9df3ff5b1806084cefdfa81dd7c79'/>
<id>urn:sha1:c4843a7593a9df3ff5b1806084cefdfa81dd7c79</id>
<content type='text'>
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
per memcg memory.stat cgroupfs file.  The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback.  It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter.  The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg  task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by for mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                            +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                            +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952.  Lower is better for
all metrics, they're all wall clock or cycle counts.  The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                            baseline                              patched
  kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
  dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
  dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
  read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
  write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                            baseline                              patched
  kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
  dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
  dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
  read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
  write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                            baseline                              patched
  kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
  dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
  dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
  dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
  read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
  write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju &lt;handai.szj@gmail.com&gt;
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6</title>
<updated>2015-04-18T15:10:49Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-04-18T15:10:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d6a24d0640d609138a4e40a4ce9fd9fe7859e24c'/>
<id>urn:sha1:d6a24d0640d609138a4e40a4ce9fd9fe7859e24c</id>
<content type='text'>
Pull documentation updates from Jonathan Corbet:
 "Numerous fixes, the overdue removal of the i2o docs, some new Chinese
  translations, and, hopefully, the README fix that will end the flow of
  identical patches to that file"

* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
  Documentation/memcg: update memcg/kmem status
  Documentation: blackfin: Makefile: Typo building issue
  Documentation/vm/pagemap.txt: correct location of page-types tool
  Documentation/memory-barriers.txt: typo fix
  doc: Add guest_nice column to example output of `cat /proc/stat'
  Documentation/kernel-parameters: Move "eagerfpu" to its right place
  Documentation: gpio: Update ACPI part of the document to mention _DSD
  docs/completion.txt: Various tweaks and corrections
  doc: completion: context, scope and language fixes
  Documentation:Update Documentation/zh_CN/arm64/memory.txt
  Documentation:Update Documentation/zh_CN/arm64/booting.txt
  Documentation: Chinese translation of arm64/legacy_instructions.txt
  DocBook media: fix broken EIA hyperlink
  Documentation: tweak the maintainers entry
  README: Change gzip/bzip2 to xz compression format
  README: Update version number reference
  doc:pci: Fix typo in Documentation/PCI
  Documentation: drm: Use '-&gt;' when describing access through pointers.
  Documentation: Remove mentioning of block barriers
  Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
  ...
</content>
</entry>
<entry>
<title>Merge branch 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2015-04-13T23:47:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-04-13T23:47:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4fd48b45ffc4addd3c2963448b05417aa14abbf7'/>
<id>urn:sha1:4fd48b45ffc4addd3c2963448b05417aa14abbf7</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Rik made cpuset cooperate better with
  isolcpus and there are several other cleanup patches"

* 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset, isolcpus: document relationship between cpusets &amp; isolcpus
  cpusets, isolcpus: exclude isolcpus from load balancing in cpusets
  sched, isolcpu: make cpu_isolated_map visible outside scheduler
  cpuset: initialize cpuset a bit early
  cgroup: Use kvfree in pidlist_free()
  cgroup: call cgroup_subsys-&gt;bind on cgroup subsys initialization
</content>
</entry>
<entry>
<title>Documentation/memcg: update memcg/kmem status</title>
<updated>2015-04-11T13:23:31Z</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2015-04-01T14:30:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=197175427a221fe3200f7727ea35e261727e7228'/>
<id>urn:sha1:197175427a221fe3200f7727ea35e261727e7228</id>
<content type='text'>
Memcg/kmem reclaim support has been finally merged. Reflect this in the
documentation.

Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</content>
</entry>
<entry>
<title>cpuset, isolcpus: document relationship between cpusets &amp; isolcpus</title>
<updated>2015-03-19T18:28:20Z</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2015-03-09T16:12:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=34ebe933417e16f46bc30ea77a66e7f30d0cf0f8'/>
<id>urn:sha1:34ebe933417e16f46bc30ea77a66e7f30d0cf0f8</id>
<content type='text'>
Document the subtly changed relationship between cpusets and isolcpus.
Turns out the old documentation did not match the code...

Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Acked-by: Zefan Li &lt;lizefan@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm: memcontrol: use "max" instead of "infinity" in control knobs</title>
<updated>2015-02-28T17:57:51Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2015-02-27T23:52:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d2973697b377e8b061e179c6057e94151759a73d'/>
<id>urn:sha1:d2973697b377e8b061e179c6057e94151759a73d</id>
<content type='text'>
The memcg control knobs indicate the highest possible value using the
symbolic name "infinity", which is long and awkward to type.

Switch to the string "max", which is just as descriptive but shorter and
sweeter.

This changes a user interface, so do it before the release and before
the development flag is dropped from the default hierarchy.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcontrol: default hierarchy interface for memory</title>
<updated>2015-02-12T01:06:02Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2015-02-11T23:26:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=241994ed8649f7300667be8b13a9e04ae04e05a1'/>
<id>urn:sha1:241994ed8649f7300667be8b13a9e04ae04e05a1</id>
<content type='text'>
Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.

This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below.  The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.

The control files are thus:

  - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

  - memory.low configures the lower end of the cgroup's expected
    memory consumption range.  The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

  - memory.high configures the upper end of the cgroup's expected
    memory consumption range.  A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

  - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

  - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max.  This
    allows users to identify configuration problems when observing a
    degradation in workload performance.  An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.

For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:

  - The original lower boundary, the soft limit, is defined as a limit
    that is per default unset.  As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out.  The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior.  First off, the soft limit has no
    hierarchical meaning.  All configured groups are organized in a
    global rbtree and treated like equal peers, regardless where they
    are located in the hierarchy.  This makes subtree delegation
    impossible.  Second, the soft limit reclaim pass is so aggressive
    that it not just introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve.  A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible.  Secondly, new cgroups have no reserve per
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass.  This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes.  Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

  - The original high boundary, the hard limit, is defined as a strict
    limit that can not budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory.  The memory consumption of workloads varies
    during runtime, and that requires users to overcommit.  But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively.  When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer.  As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation.  The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded.  But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than killing the group.  Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

  - The original control file names are unwieldy and inconsistent in
    many different ways.  For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

  - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files.  But that very high number is implementation
    and architecture dependent and not very descriptive.  And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not inconsistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.

[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
