<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v3.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.0</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2011-07-20T05:09:31Z</updated>
<entry>
<title>vmscan: fix a livelock in kswapd</title>
<updated>2011-07-20T05:09:31Z</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-07-19T15:49:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4746efded84d7c5a9c8d64d4c6e814ff0cf9fb42'/>
<id>urn:sha1:4746efded84d7c5a9c8d64d4c6e814ff0cf9fb42</id>
<content type='text'>
I'm running a workload which triggers a lot of swap in a machine with 4
nodes.  After I kill the workload, I found a kswapd livelock.  Sometimes
kswapd3 or kswapd2 are keeping running and I can't access filesystem,
but most memory is free.

This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
correct check for kswapd sleeping in sleeping_prematurely").

Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
for classzone_idx.  The reason is end_zone in balance_pgdat() is 0 by
default, if all zones have watermark ok, end_zone will keep 0.

Later sleeping_prematurely() always returns true.  Because this is an
order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0.  We add a special case here.
If a zone has no page, we think it's balanced.  This fixes the livelock.

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/nommu.c: fix remap_pfn_range()</title>
<updated>2011-07-09T04:14:44Z</updated>
<author>
<name>Bob Liu</name>
<email>lliubbo@gmail.com</email>
</author>
<published>2011-07-08T22:39:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8f3b1327aa454bc8283e96bca7669c3c88b83f79'/>
<id>urn:sha1:8f3b1327aa454bc8283e96bca7669c3c88b83f79</id>
<content type='text'>
remap_pfn_range() means map physical address pfn&lt;&lt;PAGE_SHIFT to user addr.

For nommu arch it's implemented by vma-&gt;vm_start = pfn &lt;&lt; PAGE_SHIFT which
is wrong acroding the original meaning of this function.  And some driver
developer using remap_pfn_range() with correct parameter will get
unexpected result because vm_start is changed.  It should be implementd
like addr = pfn &lt;&lt; PAGE_SHIFT but which is meanless on nommu arch, this
patch just make it simply return.

Parameter name and setting of vma-&gt;vm_flags also be fixed.

Signed-off-by: Bob Liu &lt;lliubbo@gmail.com&gt;
Cc: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Acked-by: Greg Ungerer &lt;gerg@uclinux.org&gt;
Cc: Mike Frysinger &lt;vapier@gentoo.org&gt;
Cc: Bob Liu &lt;lliubbo@gmail.com&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: fix numa scan information update to be triggered by memory event</title>
<updated>2011-07-09T04:14:44Z</updated>
<author>
<name>KAMEZAWA Hiroyuki</name>
<email>kamezawa.hiroyu@jp.fujitsu.com</email>
</author>
<published>2011-07-08T22:39:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=453a9bf347f1e22a5bb3605ced43b2366921221d'/>
<id>urn:sha1:453a9bf347f1e22a5bb3605ced43b2366921221d</id>
<content type='text'>
commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
order") adds an numa node round-robin for memcg.  But the information is
updated once per 10sec.

This patch changes the update trigger from jiffies to memcg's event count.
 After this patch, numa scan information will be updated when we see 1024
events of pagein/pageout under a memcg.

[akpm@linux-foundation.org: attempt to repair code layout]
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: fix reclaimable lru check in memcg</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>KAMEZAWA Hiroyuki</name>
<email>kamezawa.hiroyu@jp.fujitsu.com</email>
</author>
<published>2011-07-08T22:39:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4d0c066d29f030d47d19678f8008933e67dd3b72'/>
<id>urn:sha1:4d0c066d29f030d47d19678f8008933e67dd3b72</id>
<content type='text'>
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
used for checking whether the memcg contains reclaimable pages or not.  If
no pages in it, the routine skips it.

But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle
"noswap" condition correctly.  This doesn't work on a swapless system.

This patch adds test_mem_cgroup_reclaimable() and replaces
mem_cgroup_local_usage().  test_mem_cgroup_reclaimable() see LRU counter
and returns correct answer to the caller.  And this new function has
"noswap" argument and can see only FILE LRU if necessary.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix kerneldoc layout]
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: __tlb_remove_page() check the correct batch</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-07-08T22:39:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0b43c3aab0137595335b08b340a3f3e5af9818a6'/>
<id>urn:sha1:0b43c3aab0137595335b08b340a3f3e5af9818a6</id>
<content type='text'>
__tlb_remove_page() switches to a new batch page, but still checks space
in the old batch.  This check always fails, and causes a forced tlb flush.

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-07-08T22:39:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=215ddd6664ced067afca7eebd2d1eb83f064ff5a'/>
<id>urn:sha1:215ddd6664ced067afca7eebd2d1eb83f064ff5a</id>
<content type='text'>
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat-&gt;classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.

There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.

This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat-&gt;classzone_idx be initialised
each time to pgdat-&gt;nr_zones - 1 to avoid re-reads being interpreted as
wakeups.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Andrew Lutomirski &lt;luto@mit.edu&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: evaluate the watermarks against the correct classzone</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-07-08T22:39:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=da175d06b437093f93109ba9e5efbe44dfdf9409'/>
<id>urn:sha1:da175d06b437093f93109ba9e5efbe44dfdf9409</id>
<content type='text'>
When deciding if kswapd is sleeping prematurely, the classzone is taken
into account but this is different to what balance_pgdat() and the
allocator are doing.  Specifically, the DMA zone will be checked based on
the classzone used when waking kswapd which could be for a GFP_KERNEL or
GFP_HIGHMEM request.  The lowmem reserve limit kicks in, the watermark is
not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in
error.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Andrew Lutomirski &lt;luto@mit.edu&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-07-08T22:39:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d7868dae893c83c50c7824bc2bc75f93d114669f'/>
<id>urn:sha1:d7868dae893c83c50c7824bc2bc75f93d114669f</id>
<content type='text'>
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks if
the zone is above a high+balance_gap threshold.  If it is, it does not
apply pressure but it unconditionally shrinks slab on a global basis which
is excessive.  In the event kswapd is being kept awake due to a high small
unreclaimable zone, it skips zone shrinking but still calls shrink_slab().

Once pressure has been applied, the check for zone being unreclaimable is
being made before the check is made if all_unreclaimable should be set.
This miss of unreclaimable can cause has_under_min_watermark_zone to be
set due to an unreclaimable zone preventing kswapd backing off on
congestion_wait().

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Andrew Lutomirski &lt;luto@mit.edu&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely</title>
<updated>2011-07-09T04:14:42Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-07-08T22:39:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=08951e545918c1594434d000d88a7793e2452a9b'/>
<id>urn:sha1:08951e545918c1594434d000d88a7793e2452a9b</id>
<content type='text'>
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

This seems to happen most with recent sandybridge laptops but it's
probably a co-incidence as some of these laptops just happen to have a
small Normal zone.  The reproduction case is almost always during copying
large files that kswapd pegs at 100% CPU until the file is deleted or
cache is dropped.

The problem is mostly down to sleeping_prematurely() keeping kswapd awake
when the highest zone is small and unreclaimable and compounded by the
fact we shrink slabs even when not shrinking zones causing a lot of time
to be spent in shrinkers and a lot of memory to be reclaimed.

Patch 1 corrects sleeping_prematurely to check the zones matching
	the classzone_idx instead of all zones.

Patch 2 avoids shrinking slab when we are not shrinking a zone.

Patch 3 notes that sleeping_prematurely is checking lower zones against
	a high classzone which is not what allocators or balance_pgdat()
	is doing leading to an artifical belief that kswapd should be
	still awake.

Patch 4 notes that when balance_pgdat() gives up on a high zone that the
	decision is not communicated to sleeping_prematurely()

This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
and 3.0-rc4 as well.  If accepted, they need to go to -stable to be picked
up by distros and this series is against 3.0-rc4.  I've cc'd people that
reported similar problems recently to see if they still suffer from the
problem and if this fixes it.

This patch: correct the check for kswapd sleeping in sleeping_prematurely()

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.

A problem occurs if the highest zone is small.  balance_pgdat() only
considers unreclaimable zones when priority is DEF_PRIORITY but
sleeping_prematurely considers all zones.  It's possible for this sequence
to occur

  1. kswapd wakes up and enters balance_pgdat()
  2. At DEF_PRIORITY, marks highest zone unreclaimable
  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
        highest zone, clearing all_unreclaimable. Highest zone
        is still unbalanced
  5. kswapd returns and calls sleeping_prematurely
  6. sleeping_prematurely looks at *all* zones, not just the ones
     being considered by balance_pgdat. The highest small zone
     has all_unreclaimable cleared but the zone is not
     balanced. all_zones_ok is false so kswapd stays awake

This patch corrects the behaviour of sleeping_prematurely to check the
zones balance_pgdat() checked.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Pádraig Brady &lt;P@draigBrady.com&gt;
Tested-by: Andrew Lutomirski &lt;luto@mit.edu&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: fix direct softlimit reclaim to be called in limit path</title>
<updated>2011-06-28T01:00:13Z</updated>
<author>
<name>KAMEZAWA Hiroyuki</name>
<email>kamezawa.hiroyu@jp.fujitsu.com</email>
</author>
<published>2011-06-27T23:18:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ac34a1a3c39da0a1b9188d12a9ce85506364ed2a'/>
<id>urn:sha1:ac34a1a3c39da0a1b9188d12a9ce85506364ed2a</id>
<content type='text'>
Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
reclaim") adds a softlimit hook to shrink_zones().  By this, soft limit
is called as

   try_to_free_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

Then, direct reclaim is memcg softlimit hint aware, now.

But, the memory cgroup's "limit" path can call softlimit shrinker.

   try_to_free_mem_cgroup_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

This will cause a global reclaim when a memcg hits limit.

This is bug. soft_limit_reclaim() should be called when
scanning_global_lru(sc) == true.

And the commit adds a variable "total_scanned" for counting softlimit
scanned pages....it's not "total".  This patch removes the variable and
update sc-&gt;nr_scanned instead of it.  This will affect shrink_slab()'s
scan condition but, global LRU is scanned by softlimit and I think this
change makes sense.

TODO: avoid too much scanning of a zone when softlimit did enough work.

Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
