<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v4.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.0</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-04-06T21:50:59Z</updated>
<entry>
<title>fix mremap() vs. ioctx_kill() race</title>
<updated>2015-04-06T21:50:59Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2015-04-06T21:48:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b2edffdd912b4205899a8efa0974dfbbc3216109'/>
<id>urn:sha1:b2edffdd912b4205899a8efa0974dfbbc3216109</id>
<content type='text'>
teach -&gt;mremap() method to return an error and have it fail for
aio mappings in process of being killed

Note that in case of -&gt;mremap() failure we need to undo move_page_tables()
we'd already done; we could call -&gt;mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ -&gt;mremap()
has done, which would be a lot more headache in general.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (patches from Andrew)</title>
<updated>2015-03-25T23:21:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-03-25T23:21:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9c8e30d12d46461c283608d3123debc4e75d4f8b'/>
<id>urn:sha1:9c8e30d12d46461c283608d3123debc4e75d4f8b</id>
<content type='text'>
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;:
  mm: numa: mark huge PTEs young when clearing NUMA hinting faults
  mm: numa: slow PTE scan rate if migration failures occur
  mm: numa: preserve PTE write permissions across a NUMA hinting fault
  mm: numa: group related processes based on VMA flags instead of page table flags
  hfsplus: fix B-tree corruption after insertion at position 0
  MAINTAINERS: add Jan as DMI/SMBIOS support maintainer
  fs/affs/file.c: unlock/release page on error
  mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
  mm/slub: fix lockups on PREEMPT &amp;&amp; !SMP kernels
  mm/memory hotplug: postpone the reset of obsolete pgdat
  MAINTAINERS: correct rtc armada38x pattern entry
  mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
  mm: fix anon_vma-&gt;degree underflow in anon_vma endless growing prevention
  drivers/rtc/rtc-mrst: fix suspend/resume
  aoe: update aoe maintainer information
</content>
</entry>
<entry>
<title>mm: numa: mark huge PTEs young when clearing NUMA hinting faults</title>
<updated>2015-03-25T23:20:31Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2015-03-25T22:55:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b7b04004ecd9e58cdc6c6ff92f251d5ac5c0adb2'/>
<id>urn:sha1:b7b04004ecd9e58cdc6c6ff92f251d5ac5c0adb2</id>
<content type='text'>
Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.

Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced.  This
could be addressed but beyond the scope of this series which is aimed at
Dave Chinners shrink workload that is unlikely to be affected by this
issue.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: numa: slow PTE scan rate if migration failures occur</title>
<updated>2015-03-25T23:20:31Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2015-03-25T22:55:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=074c238177a75f5e79af3b2cb6a84e54823ef950'/>
<id>urn:sha1:074c238177a75f5e79af3b2cb6a84e54823ef950</id>
<content type='text'>
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Dave Chinner &lt;david@fromorbit.com&gt;
Tested-by: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: numa: preserve PTE write permissions across a NUMA hinting fault</title>
<updated>2015-03-25T23:20:31Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2015-03-25T22:55:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b191f9b106ea1a24a711dbebb2925d3313da5852'/>
<id>urn:sha1:b191f9b106ea1a24a711dbebb2925d3313da5852</id>
<content type='text'>
Protecting a PTE to trap a NUMA hinting fault clears the writable bit
and further faults are needed after trapping a NUMA hinting fault to set
the writable bit again.  This patch preserves the writable bit when
trapping NUMA hinting faults.  The impact is obvious from the number of
minor faults trapped during the basis balancing benchmark and the system
CPU usage;

  autonumabench
                                             4.0.0-rc4             4.0.0-rc4
                                              baseline              preserve
  Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
  Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
  Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
  Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
  Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
  Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
  Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
  Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)

               4.0.0-rc4   4.0.0-rc4
                baseline    preserve
  User          44383.95    43971.89
  System          252.61      201.24
  Elapsed         998.68     1000.94

  Minor Faults   2597249     1981230
  Major Faults       365         364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
workload

                                      4.0.0-rc4             4.0.0-rc4
                                       baseline              preserve
  Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
  Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse.  The tidest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse.  It's not generally
safe to just mark the page writable during the fault if it's a write
fault as it may have been read-only for COW so that approach was
discarded.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Dave Chinner &lt;david@fromorbit.com&gt;
Tested-by: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: numa: group related processes based on VMA flags instead of page table flags</title>
<updated>2015-03-25T23:20:31Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2015-03-25T22:55:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bea66fbd11af1ca98ae26855eea41eda8582923e'/>
<id>urn:sha1:bea66fbd11af1ca98ae26855eea41eda8582923e</id>
<content type='text'>
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd").  It was known that the
performance in 3.19 was still better even if is far less safe.  This
series aims to restore the performance without compromising on safety.

For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top

  autonumabench
                                                3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                               vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
  Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
  Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
  Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
  Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
  Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
  Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
  Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)

                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
               vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  User        46334.60    46391.94    44383.95    43971.89    44372.12
  System        252.84      284.66      252.61      201.24      249.00
  Elapsed      1062.14     1050.96      998.68     1000.94     1026.78

Overall the system CPU usage is comparable and the test is naturally a
bit variable.  The slowing of the scanner hurts numa01 but on this
machine it is an adverse workload and patches that dramatically help it
often hurt absolutely everything else.

Due to patch 2, the fault activity is interesting

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                   2097811     2656646     2597249     1981230     1636841
  Major Faults                       362         450         365         364         365

Note the impact preserving the write bit across protection updates and
fault reduces faults.

  NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
  NUMA alloc miss                      0           0           0           0           0
  NUMA interleave hit                  0           0           0           0           0
  NUMA alloc local               1228514     1216317     1190871     1177448     1199021
  NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
  NUMA huge PMD updates           479530      468448      464868      477573      224487
  NUMA page range updates      491225557   479886983   476207932   489222218   229950144
  NUMA hint faults                659753      656503      641678      656926      294842
  NUMA hint local faults          381604      373963      360478      337585      186249
  NUMA hint local percent             57          56          56          51          63
  NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
  AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner on migratrion failures is
obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.

As xfsrepair was the reported workload here is the impact of the series
on it.

  xfsrepair
                                         3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                        vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
  Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
  Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
  Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
  Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
  Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
  Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
  Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
  Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
  Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
  Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
  Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
  CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
  CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
  CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
  CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
  Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
  Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
  Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
  Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)

The really relevant lines as syst-xfsrepair which is the system CPU
usage when running xfsrepair.  Note that on my machine the overhead was
45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
preserve the write bit across faults, it's only 2.51% higher on average.
With the full series applied, system CPU usage is 24.6% lower on
average.

Again, the impact of preserving the write bit on minor faults is obvious
and the impact of slowing scanning after migration failures is obvious
on the PTE updates.  Note also that the number of pages migrated is much
reduced even though the headline performance is comparable.

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                 153466827   254507978   249163829   153501373   105737890
  Major Faults                       610         702         690         649         724
  NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
  NUMA huge PMD updates           129294       85044      106921      127246       79887
  NUMA pages migrated           21938995    29705270    28594162    22687324    16258075

                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
  Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
  Mean sdb-await         25.92        5.09        5.33        5.02        5.22
  Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
  Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
  Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
  Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
  Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
  Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
  Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
  Max  sdb-await        168.24       39.28       49.22       44.64       65.62
  Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
  Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
  Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
  Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
  Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15

FWIW, I also checked SPECjbb in different configurations but it's
similar observations -- minor faults lower, PTE update activity lower
and performance is roughly comparable against 3.19.

This patch (of 3):

Threads that share writable data within pages are grouped together as
related tasks.  This decision is based on whether the PTE is marked
dirty which is subject to timing races between the PTE scanner update
and when the application writes the page.  If the page is file-backed,
then background flushes and sync also affect placement.  This is
unpredictable behaviour which is impossible to reason about so this
patch makes grouping decisions based on the VMA flags.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Dave Chinner &lt;david@fromorbit.com&gt;
Tested-by: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate</title>
<updated>2015-03-25T23:20:30Z</updated>
<author>
<name>Laura Abbott</name>
<email>lauraa@codeaurora.org</email>
</author>
<published>2015-03-25T22:55:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cfa869438282be84ad4110bba5027ef1fbbe71e4'/>
<id>urn:sha1:cfa869438282be84ad4110bba5027ef1fbbe71e4</id>
<content type='text'>
Commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on
isolated pageblock") changed the logic of unset_migratetype_isolate to
check the buddy allocator and explicitly call __free_pages to merge.

The page that is being freed in this path never had prep_new_page called
so set_page_refcounted is called explicitly but there is no call to
kernel_map_pages.  With the default kernel_map_pages this is mostly
harmless but if kernel_map_pages does any manipulation of the page
tables (unmapping or setting pages to read only) this may trigger a
fault:

    alloc_contig_range test_pages_isolated(ceb00, ced00) failed
    Unable to handle kernel paging request at virtual address ffffffc0cec00000
    pgd = ffffffc045fc4000
    [ffffffc0cec00000] *pgd=0000000000000000
    Internal error: Oops: 9600004f [#1] PREEMPT SMP
    Modules linked in: exfatfs
    CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
    task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
    PC is at memset+0xc8/0x1c0
    LR is at kernel_map_pages+0x1ec/0x244

Fix this by calling kernel_map_pages to ensure the page is set in the
page table properly

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Laura Abbott &lt;lauraa@codeaurora.org&gt;
Cc: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Cc: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Acked-by: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Gioh Kim &lt;gioh.kim@lge.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/slub: fix lockups on PREEMPT &amp;&amp; !SMP kernels</title>
<updated>2015-03-25T23:20:30Z</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2015-03-25T22:55:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=859b7a0e89120505c304d7afbbe90325abaa0a6b'/>
<id>urn:sha1:859b7a0e89120505c304d7afbbe90325abaa0a6b</id>
<content type='text'>
Commit 9aabf810a67c ("mm/slub: optimize alloc/free fastpath by removing
preemption on/off") introduced an occasional hang for kernels built with
CONFIG_PREEMPT &amp;&amp; !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

    do {
        tid = this_cpu_read(s-&gt;cpu_slab-&gt;tid);
        c = raw_cpu_ptr(s-&gt;cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) &amp;&amp; unlikely(tid != c-&gt;tid));

GCC 4.9 has been observed to hoist the load of c and c-&gt;tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload).  On arm64 the generated assembly
looks like:

         ldr     x4, [x0,#8]
  loop:
         ldr     x1, [x0,#8]
         cmp     x1, x4
         b.ne    loop

If the thread is preempted between the load of c-&gt;tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s-&gt;cpu_slab-&gt;tid wraps, which may be forever in the absence of
allocations/frees on the same CPU.

This patch changes the loop condition to access c-&gt;tid with READ_ONCE.
This ensures that the value is reloaded even when the compiler would
otherwise assume it could cache the value, and also ensures that the
load will not be torn.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Jesper Dangaard Brouer &lt;brouer@redhat.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Steve Capper &lt;steve.capper@linaro.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory hotplug: postpone the reset of obsolete pgdat</title>
<updated>2015-03-25T23:20:30Z</updated>
<author>
<name>Gu Zheng</name>
<email>guz.fnst@cn.fujitsu.com</email>
</author>
<published>2015-03-25T22:55:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b0dc3a342af36f95a68fe229b8f0f73552c5ca08'/>
<id>urn:sha1:b0dc3a342af36f95a68fe229b8f0f73552c5ca08</id>
<content type='text'>
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone-&gt;zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat-&gt;node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat-&gt;nr_zones and pgdat-&gt;classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng &lt;guz.fnst@cn.fujitsu.com&gt;
Reported-by: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Suggested-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Cc: Taku Izumi &lt;izumi.taku@jp.fujitsu.com&gt;
Cc: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Cc: Xie XiuQi &lt;xiexiuqi@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers</title>
<updated>2015-03-25T23:20:30Z</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2015-03-25T22:55:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f683739539e819e9b821a197d80e52258510837b'/>
<id>urn:sha1:f683739539e819e9b821a197d80e52258510837b</id>
<content type='text'>
walk_page_test() is purely pagewalk's internal stuff, and its positive
return values are not intended to be passed to the callers of pagewalk.

However, in the current code if the last vma in the do-while loop in
walk_page_range() happens to return a positive value, it leaks outside
walk_page_range().  So the user visible effect is invalid/unexpected
return value (according to the reporter, mbind() causes it.)

This patch fixes it simply by reinitializing the return value after
checked.

Another exposed interface, walk_page_vma(), already returns 0 for such
cases so no problem.

Fixes: fafaa4264eba ("pagewalk: improve vma handling")
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Signed-off-by: Kazutomo Yoshii &lt;kazutomo.yoshii@gmail.com&gt;
Reported-by: Kazutomo Yoshii &lt;kazutomo.yoshii@gmail.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
