<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v4.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.4</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2016-01-09T07:47:54Z</updated>
<entry>
<title>vmstat: allocate vmstat_wq before it is used</title>
<updated>2016-01-09T07:47:54Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-01-08T10:18:29Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=751e5f5c753e8d447bcf89f9e96b9616ac081628'/>
<id>urn:sha1:751e5f5c753e8d447bcf89f9e96b9616ac081628</id>
<content type='text'>
kernel test robot has reported the following crash:

  BUG: unable to handle kernel NULL pointer dereference at 00000100
  IP: [&lt;c1074df6&gt;] __queue_work+0x26/0x390
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
  Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
  CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
  Workqueue: events vmstat_shepherd
  task: cb684600 ti: cb7ba000 task.ti: cb7ba000
  EIP: 0060:[&lt;c1074df6&gt;] EFLAGS: 00010046 CPU: 0
  EIP is at __queue_work+0x26/0x390
  EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
  ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
   DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
  CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
  Stack:
  Call Trace:
    __queue_delayed_work+0xa1/0x160
    queue_delayed_work_on+0x36/0x60
    vmstat_shepherd+0xad/0xf0
    process_one_work+0x1aa/0x4c0
    worker_thread+0x41/0x440
    kthread+0xb0/0xd0
    ret_from_kernel_thread+0x21/0x40

The reason is that start_shepherd_timer schedules the shepherd work item
which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
that workqueue so if the further initialization takes more than HZ we
might end up scheduling on a NULL vmstat_wq.  This is really unlikely
but not impossible.

Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot &lt;ying.huang@linux.intel.com&gt;
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Tested-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Cc: stable@vger.kernel.org
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmstat: fix overflow in mod_zone_page_state()</title>
<updated>2015-12-30T01:45:49Z</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2015-12-29T22:54:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6cdb18ad98a49f7e9b95d538a0614cde827404b8'/>
<id>urn:sha1:6cdb18ad98a49f7e9b95d538a0614cde827404b8</id>
<content type='text'>
mod_zone_page_state() takes a "delta" integer argument.  delta contains
the number of pages that should be added or subtracted from a struct
zone's vm_stat field.

If a zone is larger than 8TB this will cause overflows.  E.g.  for a
zone with a size slightly larger than 8TB the line

    mod_zone_page_state(zone, NR_ALLOC_BATCH, zone-&gt;managed_pages);

in mm/page_alloc.c:free_area_init_core() will result in a negative
result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
contain 0x8xxxxxxx pages which will be sign extended to a negative
value.

Fix this by changing the delta argument to long type.

This could fix an early boot problem seen on s390, where we have a 9TB
system with only one node.  ZONE_DMA contains 2GB and ZONE_NORMAL the
rest.  The system is trying to allocate a GFP_DMA page but ZONE_DMA is
completely empty, so it tries to reclaim pages in an endless loop.

This was seen on a heavily patched 3.10 kernel.  One possible
explaination seem to be the overflows caused by mod_zone_page_state().
Unfortunately I did not have the chance to verify that this patch
actually fixes the problem, since I don't have access to the system
right now.  However the overflow problem does exist anyway.

Given the description that a system with slightly less than 8TB does
work, this seems to be a candidate for the observed problem.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()</title>
<updated>2015-12-30T01:45:49Z</updated>
<author>
<name>Andrew Banman</name>
<email>abanman@sgi.com</email>
</author>
<published>2015-12-29T22:54:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5f0f2887f4de9508dcf438deab28f1de8070c271'/>
<id>urn:sha1:5f0f2887f4de9508dcf438deab28f1de8070c271</id>
<content type='text'>
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range.  pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.

Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.

This also prevents a crash from offlining memory devices with missing
sections.  Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.

Signed-off-by: Andrew Banman &lt;abanman@sgi.com&gt;
Acked-by: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Russ Anderson &lt;rja@sgi.com&gt;
Cc: Alex Thorlton &lt;athorlton@sgi.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Seth Jennings &lt;sjennings@variantweb.net&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcontrol: fix possible memcg leak due to interrupted reclaim</title>
<updated>2015-12-30T01:45:49Z</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@virtuozzo.com</email>
</author>
<published>2015-12-29T22:54:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6df38689e0e9a07ff4f42c06b302e203b33667e9'/>
<id>urn:sha1:6df38689e0e9a07ff4f42c06b302e203b33667e9</id>
<content type='text'>
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup.  If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntary), we will leak it, because there is nobody to put the
reference held by the iterator.

The problem is easy to reproduce by running the following command
sequence in a loop:

    mkdir /sys/fs/cgroup/memory/test
    echo 100M &gt; /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    echo $$ &gt; /sys/fs/cgroup/memory/test/cgroup.procs
    memhog 150M
    echo $$ &gt; /sys/fs/cgroup/memory/cgroup.procs
    rmdir test

The cgroups generated by it will never get freed.

This patch fixes this issue by making mem_cgroup_iter avoid taking
reference to the current position.  In order not to hit use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of -&gt;css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it.  This callback
is called right before scheduling rcu work which will free css, so if we
access iter-&gt;position from rcu read section, we might be sure it won't
go away under us.

[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@kernel.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.19+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/zswap: change incorrect strncmp use to strcmp</title>
<updated>2015-12-18T22:25:40Z</updated>
<author>
<name>Dan Streetman</name>
<email>ddstreet@ieee.org</email>
</author>
<published>2015-12-18T22:22:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8bc8b228d076ae93398316f81eab35f3d12c0c4f'/>
<id>urn:sha1:8bc8b228d076ae93398316f81eab35f3d12c0c4f</id>
<content type='text'>
Change the use of strncmp in zswap_pool_find_get() to strcmp.

The use of strncmp is no longer correct, now that zswap_zpool_type is
not an array; sizeof() will return the size of a pointer, which isn't
the right length to compare.  We don't need to use strncmp anyway,
because the existing params and the passed in params are all guaranteed
to be null terminated, so strcmp should be used.

Signed-off-by: Dan Streetman &lt;ddstreet@ieee.org&gt;
Reported-by: Weijie Yang &lt;weijie.yang@samsung.com&gt;
Cc: Seth Jennings &lt;sjennings@variantweb.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/oom_kill.c: avoid attempting to kill init sharing same memory</title>
<updated>2015-12-12T18:15:34Z</updated>
<author>
<name>Chen Jie</name>
<email>chenjie6@huawei.com</email>
</author>
<published>2015-12-11T21:41:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a2b829d95958da2025ef844c0f53ac15ad720fac'/>
<id>urn:sha1:a2b829d95958da2025ef844c0f53ac15ad720fac</id>
<content type='text'>
It's possible that an oom killed victim shares an -&gt;mm with the init
process and thus oom_kill_process() would end up trying to kill init as
well.

This has been shown in practice:

	Out of memory: Kill process 9134 (init) score 3 or sacrifice child
	Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
	Kill process 1 (init) sharing same memory
	...
	Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

And this will result in a kernel panic.

If a process is forked by init and selected for oom kill while still
sharing init_mm, then it's likely this system is in a recoverable state.
However, it's better not to try to kill init and allow the machine to
panic due to unkillable processes.

[rientjes@google.com: rewrote changelog]
[akpm@linux-foundation.org: fix inverted test, per Ben]
Signed-off-by: Chen Jie &lt;chenjie6@huawei.com&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Hillf Danton &lt;hillf.zj@alibaba-inc.com&gt;
Cc: Ben Hutchings &lt;ben@decadent.org.uk&gt;
Cc: Li Zefan &lt;lizefan@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>tmpfs: fix shmem_evict_inode() warnings on i_blocks</title>
<updated>2015-12-12T18:15:34Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2015-12-11T21:40:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=267a4c76bbdb950688d3aeb020976c2918064584'/>
<id>urn:sha1:267a4c76bbdb950688d3aeb020976c2918064584</id>
<content type='text'>
Dmitry Vyukov provides a little program, autogenerated by syzkaller,
which races a fault on a mapping of a sparse memfd object, against
truncation of that object below the fault address: run repeatedly for a
few minutes, it reliably generates shmem_evict_inode()'s
WARN_ON(inode-&gt;i_blocks).

(But there's nothing specific to memfd here, nor to the fstat which it
happened to use to generate the fault: though that looked suspicious,
since a shmem_recalc_inode() had been added there recently.  The same
problem can be reproduced with open+unlink in place of memfd_create, and
with fstatfs in place of fstat.)

v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING")
explains one cause of such a warning (a race with shmem_writepage to
swap), and possible solutions; but we never took it further, and this
syzkaller incident turns out to have a different cause.

shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
then found to be beyond eof, looks plausible - decrementing the alloced
count that was just before incremented - but in fact can go wrong, if a
racing thread (the truncator, for example) gets its shmem_recalc_inode()
in just after our delete_from_page_cache().  delete_from_page_cache()
decrements nrpages, that shmem_recalc_inode() will balance the books by
decrementing alloced itself, then our decrement of alloced take it one
too low: leading to the WARNING when the object is finally evicted.

Once the new page has been exposed in the page cache,
shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
the accounting right in all cases (and not fall through from "trunc:" to
"decused:").  Adjust that error recovery block; and the reinitialization
of info and sbinfo can be removed too.

While we're here, fix shmem_writepage() to avoid the original issue: it
will be safe against a racing shmem_recalc_inode(), if it merely
increments swapped before the shmem_delete_from_page_cache() which
decrements nrpages (but it must then do its own shmem_recalc_inode()
before that, while still in balance, instead of after).  (Aside: why do
we shmem_recalc_inode() here in the swap path? Because its raison d'etre
is to cope with clean sparse shmem pages being reclaimed behind our
back: so here when swapping is a good place to look for that case.) But
I've not now managed to reproduce this bug, even without the patch.

I don't see why I didn't do that earlier: perhaps inhibited by the
preference to eliminate shmem_recalc_inode() altogether.  Driven by this
incident, I do now have a patch to do so at last; but still want to sit
on it for a bit, there's a couple of questions yet to be resolved.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb.c: fix resv map memory leak for placeholder entries</title>
<updated>2015-12-12T18:15:34Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2015-12-11T21:40:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dbe409e4f5e5075bd9ff7f8dd5c627abf3ee38c1'/>
<id>urn:sha1:dbe409e4f5e5075bd9ff7f8dd5c627abf3ee38c1</id>
<content type='text'>
Dmitry Vyukov reported the following memory leak

unreferenced object 0xffff88002eaafd88 (size 32):
  comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
  hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff  (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmalloc include/linux/slab.h:458
     region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
     __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
     vma_needs_reservation mm/hugetlb.c:1813
     alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
     hugetlb_no_page mm/hugetlb.c:3543
     hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
     follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
     __get_user_pages+0x542/0xf30 mm/gup.c:497
     populate_vma_page_range+0xde/0x110 mm/gup.c:919
     __mm_populate+0x1c7/0x310 mm/gup.c:969
     do_mlock+0x291/0x360 mm/mlock.c:637
     SYSC_mlock2 mm/mlock.c:658
     SyS_mlock2+0x4b/0x70 mm/mlock.c:648

Dmitry identified a potential memory leak in the routine region_chg,
where a region descriptor is not free'ed on an error path.

However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left
in the reserve map.  This is "by design" as the entry should be deleted
when the map is released.  The bug is in the region_del routine which is
used to delete entries within a specific range (and when the map is
released).  region_del did not handle the case where a placeholder entry
exactly matched the start of the range range to be deleted.  In this
case, the entry would not be deleted and leaked.  The fix is to take
these special placeholder entries into account in region_del.

The region_chg error path leak is also fixed.

Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Acked-by: Hillf Danton &lt;hillf.zj@alibaba-inc.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.3+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: hugetlb: call huge_pte_alloc() only if ptep is null</title>
<updated>2015-12-12T18:15:34Z</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2015-12-11T21:40:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0d777df5d8953293be090d9ab5a355db893e8357'/>
<id>urn:sha1:0d777df5d8953293be090d9ab5a355db893e8357</id>
<content type='text'>
Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
and check whether the obtained *ptep is a migration/hwpoison entry or
not.  And if not, then we get to call huge_pte_alloc().  This is racy
because the *ptep could turn into migration/hwpoison entry after the
huge_pte_offset() check.  This race results in BUG_ON in
huge_pte_alloc().

We don't have to call huge_pte_alloc() when the huge_pte_offset()
returns non-NULL, so let's fix this bug with moving the code into else
block.

Note that the *ptep could turn into a migration/hwpoison entry after
this block, but that's not a problem because we have another
!pte_present check later (we never go into hugetlb_no_page() in that
case.)

Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Acked-by: Hillf Danton &lt;hillf.zj@alibaba-inc.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[2.6.36+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix kerneldoc on mem_cgroup_replace_page</title>
<updated>2015-12-12T18:15:34Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2015-12-11T21:40:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=25be6a659598791d511c09ed71dca24e4844d128'/>
<id>urn:sha1:25be6a659598791d511c09ed71dca24e4844d128</id>
<content type='text'>
Whoops, I missed removing the kerneldoc comment of the lrucare arg
removed from mem_cgroup_replace_page; but it's a good comment, keep it.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
