<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/page_alloc.c, branch v5.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.0</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2019-02-21T17:01:00Z</updated>
<entry>
<title>mm, page_alloc: fix a division by zero error when boosting watermarks v2</title>
<updated>2019-02-21T17:01:00Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2019-02-21T06:19:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=94b3334cbebea34d56a7e6321c6fe9d89b309a49'/>
<id>urn:sha1:94b3334cbebea34d56a7e6321c6fe9d89b309a49</id>
<content type='text'>
Yury Norov reported that an arm64 KVM instance could not boot since
after v5.0-rc1 and could addressed by reverting the patches

  1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external
  73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")

The problem is that a division by zero error is possible if boosting
occurs very early in boot if the system has very little memory.  This
patch avoids the division by zero error.

Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Reported-by: Yury Norov &lt;yury.norov@gmail.com&gt;
Tested-by: Yury Norov &lt;yury.norov@gmail.com&gt;
Tested-by: Will Deacon &lt;will.deacon@arm.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: Use fixed constant in page_frag_alloc instead of size + 1</title>
<updated>2019-02-17T23:48:43Z</updated>
<author>
<name>Alexander Duyck</name>
<email>alexander.h.duyck@linux.intel.com</email>
</author>
<published>2019-02-15T22:44:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8644772637deb121f7ac2df690cbf83fa63d3b70'/>
<id>urn:sha1:8644772637deb121f7ac2df690cbf83fa63d3b70</id>
<content type='text'>
This patch replaces the size + 1 value introduced with the recent fix for 1
byte allocs with a constant value.

The idea here is to reduce code overhead as the previous logic would have
to read size into a register, then increment it, and write it back to
whatever field was being used. By using a constant we can avoid those
memory reads and arithmetic operations in favor of just encoding the
maximum value into the operation itself.

Fixes: 2c2ade81741c ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
Signed-off-by: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs</title>
<updated>2019-02-14T17:12:17Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2019-02-13T21:45:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2c2ade81741c66082f8211f0b96cf509cc4c0218'/>
<id>urn:sha1:2c2ade81741c66082f8211f0b96cf509cc4c0218</id>
<content type='text'>
The basic idea behind -&gt;pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc-&gt;size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset &lt; 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc-&gt;size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset &lt; 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Revert "mm, memory_hotplug: initialize struct pages for the full memory section"</title>
<updated>2019-01-28T18:35:22Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2019-01-25T18:08:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a'/>
<id>urn:sha1:4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a</id>
<content type='text'>
This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:

  Early memory node ranges
    node   1: [mem 0x0000000000001000-0x0000000000090fff]
    node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node   1: [mem 0x0000000100000000-0x0000001423ffffff]
    node   0: [mem 0x0000001424000000-0x0000002023ffffff]

This means that node0 starts in the middle of a memory section which is
also in node1.  memmap_init_zone tries to initialize padding of a
section even when it is outside of the given pfn range because there are
code paths (e.g.  memory hotplug) which assume that the full worth of
memory section is always initialized.

In this particular case, though, such a range is already intialized and
most likely already managed by the page allocator.  Scribbling over
those pages corrupts the internal state and likely blows up when any of
those pages gets used.

Reported-by: Robert Shteynfeld &lt;robert.shteynfeld@gmail.com&gt;
Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, page_alloc: do not wake kswapd with zone lock held</title>
<updated>2019-01-09T01:15:11Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2019-01-08T23:23:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=73444bc4d8f92e46a20cb6bd3342fc2ea75c6787'/>
<id>urn:sha1:73444bc4d8f92e46a20cb6bd3342fc2ea75c6787</id>
<content type='text'>
syzbot reported the following regression in the latest merge window and
it was confirmed by Qian Cai that a similar bug was visible from a
different context.

  ======================================================
  WARNING: possible circular locking dependency detected
  4.20.0+ #297 Not tainted
  ------------------------------------------------------
  syz-executor0/8529 is trying to acquire lock:
  000000005e7fb829 (&amp;pgdat-&gt;kswapd_wait){....}, at:
  __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

  but task is already holding lock:
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at: spin_lock
  include/linux/spinlock.h:329 [inline]
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at: rmqueue_bulk
  mm/page_alloc.c:2548 [inline]
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at: __rmqueue_pcplist
  mm/page_alloc.c:3021 [inline]
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at: rmqueue_pcplist
  mm/page_alloc.c:3050 [inline]
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at: rmqueue
  mm/page_alloc.c:3072 [inline]
  000000009bb7bae0 (&amp;(&amp;zone-&gt;lock)-&gt;rlock){-.-.}, at:
  get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

It appears to be a false positive in that the only way the lock ordering
should be inverted is if kswapd is waking itself and the wakeup
allocates debugging objects which should already be allocated if it's
kswapd doing the waking.  Nevertheless, the possibility exists and so
it's best to avoid the problem.

This patch flags a zone as needing a kswapd using the, surprisingly,
unused zone flag field.  The flag is read without the lock held to do
the wakeup.  It's possible that the flag setting context is not the same
as the flag clearing context or for small races to occur.  However, each
race possibility is harmless and there is no visible degredation in
fragmentation treatment.

While zone-&gt;flag could have continued to be unused, there is potential
for moving some existing fields into the flags field instead.
Particularly read-mostly ones like zone-&gt;initialized and
zone-&gt;contiguous.

Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Tested-by: Qian Cai &lt;cai@lca.pw&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page_alloc.c: allow error injection</title>
<updated>2018-12-28T20:11:51Z</updated>
<author>
<name>Benjamin Poirier</name>
<email>bpoirier@suse.com</email>
</author>
<published>2018-12-28T08:39:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=af3b854492f351d1ff3b4744a83bf5ff7eed4920'/>
<id>urn:sha1:af3b854492f351d1ff3b4744a83bf5ff7eed4920</id>
<content type='text'>
Model call chain after should_failslab().  Likewise, we can now use a
kprobe to override the return value of should_fail_alloc_page() and inject
allocation failures into alloc_page*().

This will allow injecting allocation failures using the BCC tools even
without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
fail_page_alloc= parameter, which incurs some overhead even when failures
are not being injected.  On the other hand, this patch adds an
unconditional call to should_fail_alloc_page() from page allocation
hotpath.  That overhead should be rather negligible with
CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

[vbabka@suse.cz: changelog addition]
Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
Signed-off-by: Benjamin Poirier &lt;bpoirier@suse.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Pavel Tatashin &lt;pavel.tatashin@microsoft.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, page_alloc: enable pcpu_drain with zone capability</title>
<updated>2018-12-28T20:11:51Z</updated>
<author>
<name>Wei Yang</name>
<email>richard.weiyang@gmail.com</email>
</author>
<published>2018-12-28T08:38:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d9367bd06faa1beb2951899272f925bdf877d28b'/>
<id>urn:sha1:d9367bd06faa1beb2951899272f925bdf877d28b</id>
<content type='text'>
drain_all_pages is documented to drain per-cpu pages for a given zone (if
non-NULL).  The current implementation doesn't match the description
though.  It will drain all pcp pages for all zones that happen to have
cached pages on the same cpu as the given zone.  This will lead to
premature pcp cache draining for zones that are not of any interest to the
caller - e.g.  compaction, hwpoison or memory offline.

This forces the page allocator to take locks and potential lock contention
as a result.

There is no real reason for this sub-optimal implementation.  Replace
per-cpu work item with a dedicated structure which contains a pointer to
the zone and pass it over to the worker.  This will get the zone
information all the way down to the worker function and do the right job.

[akpm@linux-foundation.org: avoid 80-col tricks]
[mhocko@suse.com: refactor the whole changelog]
Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init</title>
<updated>2018-12-28T20:11:51Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2018-12-28T08:38:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3c0c12cc8f00ca5f81acb010023b8eb13e9a7004'/>
<id>urn:sha1:3c0c12cc8f00ca5f81acb010023b8eb13e9a7004</id>
<content type='text'>
When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred
pages initialization can take a long time.  Below were the reported init
times on a 8-socket 96-core 4TB IvyBridge system.

  1) Non-debug kernel without CONFIG_KASAN
     [    8.764222] node 1 initialised, 132086516 pages in 7027ms

  2) Debug kernel with CONFIG_KASAN
     [  146.288115] node 1 initialised, 132075466 pages in 143052ms

So the page init time in a debug kernel was 20X of the non-debug kernel.
The long init time can be problematic as the page initialization is done
with interrupt disabled.  In this particular case, it caused the
appearance of following warning messages as well as NMI backtraces of all
the cores that were doing the initialization.

[   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
[   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
[   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
[   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
[   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
[   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
[   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
[   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)

The long init time was mainly caused by the call to kasan_free_pages() to
poison the newly initialized pages.  On a 4TB system, we are talking about
almost 500GB of memory probably on the same node.

In reality, we may not need to poison the newly initialized pages before
they are ever allocated.  So KASAN poisoning of freed pages before the
completion of deferred memory initialization is now disabled.  Those pages
will be properly poisoned when they are allocated or freed after deferred
pages initialization is done.

With this change, the new page initialization time became:

[   21.948010] node 1 initialised, 132075466 pages in 18702ms

This was still about double the non-debug kernel time, but was much
better than before.

Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Alexander Potapenko &lt;glider@google.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Pasha Tatashin &lt;Pavel.Tatashin@microsoft.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES</title>
<updated>2018-12-28T20:11:51Z</updated>
<author>
<name>Pingfan Liu</name>
<email>kernelfans@gmail.com</email>
</author>
<published>2018-12-28T08:38:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=125b860b251ad226b1384b6db06be37485127f69'/>
<id>urn:sha1:125b860b251ad226b1384b6db06be37485127f69</id>
<content type='text'>
Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code.
If someone adds extra migrate type, then he may forget to enlarge the
NR_PAGEBLOCK_BITS.  Hence it requires some way to fix.

NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on
two different .h file with reverse dependency, it is a little hard to
refer to MIGRATE_TYPES in pageblock-flag.h.  This patch tries to remind
such relation in compiling-time.

Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu &lt;kernelfans@gmail.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Pavel Tatashin &lt;pavel.tatashin@microsoft.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page_alloc.c: drop uneeded __meminit and __meminitdata</title>
<updated>2018-12-28T20:11:49Z</updated>
<author>
<name>Oscar Salvador</name>
<email>osalvador@suse.de</email>
</author>
<published>2018-12-28T08:37:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bbe5d9939e81277bfbfb8fdba68374024b16d5fa'/>
<id>urn:sha1:bbe5d9939e81277bfbfb8fdba68374024b16d5fa</id>
<content type='text'>
Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
free_area_init_core_hotplug"), some functions changed to only be called
during system initialization.  Concretly, free_area_init_node() and the
functions that hang from it.

Also, some variables are no longer used after the system has gone
through initialization.  So this could be considered as a late clean-up
for that patch.

This patch changes the functions from __meminit to __init, and the
variables from __meminitdata to __initdata.

In return, we get some KBs back:

Before:
  Freeing unused kernel image memory: 2472K

After:
  Freeing unused kernel image memory: 2480K

Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
Signed-off-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Pavel Tatashin &lt;pavel.tatashin@microsoft.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
