<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/page_alloc.c, branch v3.6</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.6</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.6'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2012-09-17T22:00:38Z</updated>
<entry>
<title>mm/page_alloc: fix the page address of higher page's buddy calculation</title>
<updated>2012-09-17T22:00:38Z</updated>
<author>
<name>Li Haifeng</name>
<email>omycle@gmail.com</email>
</author>
<published>2012-09-17T21:09:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0ba8f2d59304dfe69b59c034de723ad80f7ab9ac'/>
<id>urn:sha1:0ba8f2d59304dfe69b59c034de723ad80f7ab9ac</id>
<content type='text'>
The heuristic method for finding a page's buddy was introduced by commit
43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists").  But the page address of the higher page's
buddy was wrongly calculated, which makes page_is_buddy() fail forever.
IOW, with the wrong page address of the higher page's buddy, the
heuristic method is effectively disabled.

The page address of the higher page's buddy should instead be calculated
from higher_page, offset by the difference between the index of the
higher page's buddy and the index of the higher page itself.
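
A minimal sketch of the corrected calculation, using the local variable
names as they appear around this hunk in __free_one_page() (treat the
exact names as assumptions drawn from this description):

  combined_idx = buddy_idx &amp; page_idx;
  higher_page = page + (combined_idx - page_idx);
  buddy_idx = __find_buddy_index(combined_idx, order + 1);
  /* offset from higher_page, not from page, is what fixes the heuristic */
  higher_buddy = higher_page + (buddy_idx - combined_idx);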

Signed-off-by: Haifeng Li &lt;omycle@gmail.com&gt;
Signed-off-by: Gavin Shan &lt;shangw@linux.vnet.ibm.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: KyongHo Cho &lt;pullip.cho@samsung.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[2.6.38+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: Abort async compaction if locks are contended or taking too long</title>
<updated>2012-08-21T23:45:03Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-08-21T23:16:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c67fe3752abe6ab47639e2f9b836900c3dc3da84'/>
<id>urn:sha1:c67fe3752abe6ab47639e2f9b836900c3dc3da84</id>
<content type='text'>
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks.  The workload is straightforward and, in his own words:

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive.  FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
  27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
  28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
   6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
  22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
  207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
  123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
  123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
  622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
  223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0

Note that system CPU usage is very high while the rate of blocks being
written out has dropped by 42%.  He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it all shows that compaction is
contending heavily on the zone-&gt;lock and the zone-&gt;lru_lock.

Commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration") noted that it was possible for
migration to hold the lru_lock for an excessive amount of time.  Very
broadly speaking, this patch expands on that concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.
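
A hedged sketch of what such a helper could look like; the exact
signature and the cc-&gt;sync / cc-&gt;contended fields are assumptions
drawn from this description, not a verbatim copy of the patch:

  /* Returns true with the lock held, or false if async compaction
   * should abort because the lock is contended or we must reschedule. */
  static bool compact_checklock_irqsave(spinlock_t *lock,
                                        unsigned long *flags, bool locked,
                                        struct compact_control *cc)
  {
          if (need_resched() || spin_is_contended(lock)) {
                  if (locked) {
                          spin_unlock_irqrestore(lock, *flags);
                          locked = false;
                  }

                  /* async compaction aborts rather than waits */
                  if (!cc-&gt;sync) {
                          cc-&gt;contended = true;
                          return false;
                  }

                  cond_resched();
          }

          if (!locked)
                  spin_lock_irqsave(lock, *flags);
          return true;
  }

compact_trylock_irqsave() can then be a thin wrapper that calls this
with locked == false, so the lock is only taken when neither condition
holds.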

Reported-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Tested-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: correct page-&gt;pfmemalloc to fix deactivate_slab regression</title>
<updated>2012-08-21T23:45:03Z</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@intel.com</email>
</author>
<published>2012-08-21T23:16:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b121186ab1b12e2a96a945d88eae0735b4542158'/>
<id>urn:sha1:b121186ab1b12e2a96a945d88eae0735b4542158</id>
<content type='text'>
Commit cfd19c5a9ecf ("mm: only set page-&gt;pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down when
page-&gt;pfmemalloc is set, but it missed some places where it should be
set.

So, in __slab_alloc(), the resulting mismatch between page-&gt;pfmemalloc
and ALLOC_NO_WATERMARKS causes incorrect deactivate_slab() calls on our
core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes a 40% regression in our fio sync write performance.

Moving the check into get_page_from_freelist() resolves the issue.
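
A sketch of the moved check, assuming the local names around the
buffered_rmqueue() call in get_page_from_freelist():

  page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
                          migratetype);
  if (page)
          /* the flag now tracks the alloc_flags actually used here */
          page-&gt;pfmemalloc = !!(alloc_flags &amp; ALLOC_NO_WATERMARKS);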

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Tested-by: Sage Weil &lt;sage@inktank.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove node_start_pfn checking in new WARN_ON for now</title>
<updated>2012-08-02T17:37:03Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-08-02T17:37:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8783b6e2b2cb726f2734cf208d101f73ac1ba616'/>
<id>urn:sha1:8783b6e2b2cb726f2734cf208d101f73ac1ba616</id>
<content type='text'>
Borislav Petkov reports that the new warning added in commit
88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
triggers for him, and it is the node_start_pfn field that has already
been initialized once.

The call trace looks like this:

  x86_64_start_kernel -&gt;
    x86_64_start_reservations -&gt;
    start_kernel -&gt;
    setup_arch -&gt;
    paging_init -&gt;
    zone_sizes_init -&gt;
    free_area_init_nodes -&gt;
    free_area_init_node

and (with the warning replaced by debug output), Borislav sees

  On node 0 totalpages: 4193848
    DMA zone: 64 pages used for memmap
    DMA zone: 6 pages reserved
    DMA zone: 3890 pages, LIFO batch:0
    DMA32 zone: 16320 pages used for memmap
    DMA32 zone: 798464 pages, LIFO batch:31
    Normal zone: 52736 pages used for memmap
    Normal zone: 3322368 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 4423680      &lt;----
  On node 1 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 8617984      &lt;----
  On node 2 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 12812288     &lt;----
  On node 3 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31

so remove the bogus warning for now to avoid annoying people.  Minchan
Kim is looking at it.

Reported-by: Borislav Petkov &lt;bp@amd64.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove redundant initialization</title>
<updated>2012-08-01T01:42:50Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-07-31T23:46:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6527af5d1bea219d64095a5e30c1b1e0868aae16'/>
<id>urn:sha1:6527af5d1bea219d64095a5e30c1b1e0868aae16</id>
<content type='text'>
pg_data_t is zeroed before reaching free_area_init_core(), so remove the
now unnecessary initializations.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: warn if pg_data_t isn't initialized with zero</title>
<updated>2012-08-01T01:42:50Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-07-31T23:46:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=88fdf75d1bb51d85ba00c466391770056d44bc03'/>
<id>urn:sha1:88fdf75d1bb51d85ba00c466391770056d44bc03</id>
<content type='text'>
Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
when it is allocated.  Arch code and memory hotplug already initialize
pg_data_t, so this warning should never trigger.  I picked fields at
random near the beginning, middle and end of pg_data_t for the check.
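
A sketch of the kind of check this describes, placed on entry to
free_area_init_node(); the three fields shown are illustrative picks,
not necessarily the ones the patch uses:

  /* pg_data_t should still be all-zero when the MM core first sees it */
  WARN_ON(pgdat-&gt;nr_zones || pgdat-&gt;node_start_pfn ||
          pgdat-&gt;classzone_idx);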

This patch isn't about performance; it is about removing the
initialization code that would otherwise need to be added whenever a new
field is added to pg_data_t or zone.

Andrew initially suggested clearing pg_data_t out in the MM core, but
Tejun doesn't like that because, in the future, some archs may
initialize some fields in arch code and pass them into the generic MM
code, so blindly clearing it out in the MM core would be very annoying.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage</title>
<updated>2012-08-01T01:42:46Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5515061d22f0f9976ae7815864bfd22042d36848'/>
<id>urn:sha1:5515061d22f0f9976ae7815864bfd22042d36848</id>
<content type='text'>
If swap is backed by network storage such as NBD, there is a risk that
a large number of reclaimers can hang the system by consuming all of the
PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
min_free_kbytes in advance, which is fragile.

This patch throttles direct reclaimers if half of the PF_MEMALLOC
reserves are in use.  If the system is routinely getting throttled, the
administrator can increase min_free_kbytes so that degradation is
smoother, but the system will keep running.
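
A sketch of the "half the reserves" test this implies; the helper name
and the use of each zone's min watermark as its share of the PF_MEMALLOC
reserve are assumptions from the description above:

  /* true while more than half of the PF_MEMALLOC reserve is still free */
  static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
  {
          unsigned long reserve = 0, free_pages = 0;
          int i;

          for (i = 0; i &lt;= ZONE_NORMAL; i++) {
                  struct zone *zone = &amp;pgdat-&gt;node_zones[i];

                  reserve += min_wmark_pages(zone);
                  free_pages += zone_page_state(zone, NR_FREE_PAGES);
          }

          /* throttle direct reclaimers once half the reserve is used */
          return free_pages &gt; reserve / 2;
  }

Direct reclaimers would then wait on a per-node waitqueue while this
returns false, instead of diving straight into the reserves.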

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: ignore mempolicies when using ALLOC_NO_WATERMARK</title>
<updated>2012-08-01T01:42:45Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=183f6371aac2a5496a8ef2b0b0a68562652c3cdb'/>
<id>urn:sha1:183f6371aac2a5496a8ef2b0b0a68562652c3cdb</id>
<content type='text'>
The reserve is proportionally distributed over all !highmem zones in the
system.  So we need to allow an emergency allocation access to all zones.
In order to do that we need to break out of any mempolicy boundaries we
might have.
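
A sketch of how the allocator slowpath can break out of a
mempolicy-restricted zonelist for these allocations, assuming the local
names of __alloc_pages_slowpath():

  /* emergency allocations may dip into any !highmem zone, so ignore
   * whatever mempolicy-derived zonelist the caller handed us */
  if (alloc_flags &amp; ALLOC_NO_WATERMARKS)
          zonelist = node_zonelist(numa_node_id(), gfp_mask);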

In my opinion that does not break mempolicies as those are user oriented
and not system oriented.  That is, system allocations are not guaranteed
to be within mempolicy boundaries.  For instance IRQs do not even have a
mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency
allocations, which are always system allocations (as opposed to user
allocations), is OK.

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: only set page-&gt;pfmemalloc when ALLOC_NO_WATERMARKS was used</title>
<updated>2012-08-01T01:42:45Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cfd19c5a9ecf8e5e38de2603077c4330af21316e'/>
<id>urn:sha1:cfd19c5a9ecf8e5e38de2603077c4330af21316e</id>
<content type='text'>
__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark.  If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page-&gt;pfmemalloc.  This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).

This patch only sets page-&gt;pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.
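
A sketch of the narrowed marking, assuming the slowpath's local names;
only an allocation that actually required ALLOC_NO_WATERMARKS gets
flagged:

  if (page)
          page-&gt;pfmemalloc = !!(alloc_flags &amp; ALLOC_NO_WATERMARKS);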

[rientjes@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: allow PF_MEMALLOC from softirq context</title>
<updated>2012-08-01T01:42:45Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=907aed48f65efeecf91575397e3d79335d93a466'/>
<id>urn:sha1:907aed48f65efeecf91575397e3d79335d93a466</id>
<content type='text'>
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC because it is not
associated with a task and therefore has no task flags to fiddle with;
thus the gfp to alloc flags mapping ignores the task flags when in
interrupt (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery.  This patch borrows the task flags from whatever process
happens to be preempted by the softirq.  It then changes the gfp to
alloc flags mapping to not exclude task flags in softirq context, and
modifies the softirq code to save, clear and restore the PF_MEMALLOC
flag.
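
A sketch of the save/clear/restore dance in __do_softirq();
tsk_restore_flags() is shown as the kind of small helper this implies,
with the name treated as an assumption:

  /* restore only the bits named in 'flags' from 'orig_flags' */
  static inline void tsk_restore_flags(struct task_struct *task,
                                       unsigned long orig_flags,
                                       unsigned long flags)
  {
          task-&gt;flags &amp;= ~flags;
          task-&gt;flags |= orig_flags &amp; flags;
  }

  /* in __do_softirq() */
  unsigned long old_flags = current-&gt;flags;

  current-&gt;flags &amp;= ~PF_MEMALLOC;  /* don't inherit the task's flag */
  /* ... run the pending softirq handlers ... */
  tsk_restore_flags(current, old_flags, PF_MEMALLOC);  /* don't leak out */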

The save and clear ensure that the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq.  The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.  This
should be safe for the following reasons:

Softirqs can run on multiple CPUs, sure, but the same task should not
	be executing the same softirq code.  Nor should a softirq
	handler be preempted by another softirq handler, so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() and so can be
	preempted by hardware interrupts, meaning PF_MEMALLOC is
	inherited by the hard IRQ.  However, this is similar to a
	process in reclaim being preempted by a hardirq.  While
	PF_MEMALLOC is set, gfp_to_alloc_flags() distinguishes between
	hard and soft irqs and avoids giving a hardirq the
	ALLOC_NO_WATERMARKS flag.

If the softirq is deferred to ksoftirqd then its flags may be used
        instead of a normal task's, but as the softirq cannot be
        preempted, the PF_MEMALLOC flag does not leak to other code by
        accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
