<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v5.10</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.10</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.10'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-12-11T22:02:14Z</updated>
<entry>
<title>mm/hugetlb: clear compound_nr before freeing gigantic pages</title>
<updated>2020-12-11T22:02:14Z</updated>
<author>
<name>Gerald Schaefer</name>
<email>gerald.schaefer@linux.ibm.com</email>
</author>
<published>2020-12-11T21:36:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ba9c1201beaa86a773e83be5654602a0667e4a4d'/>
<id>urn:sha1:ba9c1201beaa86a773e83be5654602a0667e4a4d</id>
<content type='text'>
Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
added compound_nr counter to first tail struct page, overlaying with
page-&gt;mapping.  The overlay itself is fine, but while freeing gigantic
hugepages via free_contig_range(), a "bad page" check will trigger for
non-NULL page-&gt;mapping on the first tail page:

  BUG: Bad page state in process bash  pfn:380001
  page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
  aops:0x0
  flags: 0x3ffff00000000000()
  raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
  raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
  page dumped because: non-NULL mapping
  Modules linked in:
  CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
  Hardware name: IBM 3906 M03 703 (LPAR)
  Call Trace:
    show_stack+0x6e/0xe8
    dump_stack+0x90/0xc8
    bad_page+0xd6/0x130
    free_pcppages_bulk+0x26a/0x800
    free_unref_page+0x6e/0x90
    free_contig_range+0x94/0xe8
    update_and_free_page+0x1c4/0x2c8
    free_pool_huge_page+0x11e/0x138
    set_max_huge_pages+0x228/0x300
    nr_hugepages_store_common+0xb8/0x130
    kernfs_fop_write+0xd2/0x218
    vfs_write+0xb0/0x2b8
    ksys_write+0xac/0xe0
    system_call+0xe6/0x288
  Disabling lock debugging due to kernel taint

This is because only the compound_order is cleared in
destroy_compound_gigantic_page(), and compound_nr is set to
1U &lt;&lt; order == 1 for order 0 in set_compound_order(page, 0).

Fix this by explicitly clearing compound_nr for first tail page after
calling set_compound_order(page, 0).

Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
Signed-off-by: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Reviewed-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Christian Borntraeger &lt;borntraeger@de.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.9+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>kasan: fix object remaining in offline per-cpu quarantine</title>
<updated>2020-12-11T22:02:14Z</updated>
<author>
<name>Kuan-Ying Lee</name>
<email>Kuan-Ying.Lee@mediatek.com</email>
</author>
<published>2020-12-11T21:36:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6c82d45c7f0348b44e00bd7dcccfc47dec7577d1'/>
<id>urn:sha1:6c82d45c7f0348b44e00bd7dcccfc47dec7577d1</id>
<content type='text'>
We hit this issue in our internal test.  When enabling generic kasan, a
kfree()'d object is put into per-cpu quarantine first.  If the cpu goes
offline, object still remains in the per-cpu quarantine.  If we call
kmem_cache_destroy() now, slub will report "Objects remaining" error.

  =============================================================================
  BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  Disabling lock debugging due to kernel taint
  INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
  CPU: 3 PID: 176 Comm: cat Tainted: G    B             5.10.0-rc1-00007-g4525c8781ec0-dirty #10
  Hardware name: linux,dummy-virt (DT)
  Call trace:
     dump_backtrace+0x0/0x2b0
     show_stack+0x18/0x68
     dump_stack+0xfc/0x168
     slab_err+0xac/0xd4
     __kmem_cache_shutdown+0x1e4/0x3c8
     kmem_cache_destroy+0x68/0x130
     test_version_show+0x84/0xf0
     module_attr_show+0x40/0x60
     sysfs_kf_seq_show+0x128/0x1c0
     kernfs_seq_show+0xa0/0xb8
     seq_read+0x1f0/0x7e8
     kernfs_fop_read+0x70/0x338
     vfs_read+0xe4/0x250
     ksys_read+0xc8/0x180
     __arm64_sys_read+0x44/0x58
     el0_svc_common.constprop.0+0xac/0x228
     do_el0_svc+0x38/0xa0
     el0_sync_handler+0x170/0x178
     el0_sync+0x174/0x180
  INFO: Object 0x(____ptrval____) @offset=15848
  INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
     stack_trace_save+0x9c/0xd0
     set_track+0x64/0xf0
     alloc_debug_processing+0x104/0x1a0
     ___slab_alloc+0x628/0x648
     __slab_alloc.isra.0+0x2c/0x58
     kmem_cache_alloc+0x560/0x588
     test_version_show+0x98/0xf0
     module_attr_show+0x40/0x60
     sysfs_kf_seq_show+0x128/0x1c0
     kernfs_seq_show+0xa0/0xb8
     seq_read+0x1f0/0x7e8
     kernfs_fop_read+0x70/0x338
     vfs_read+0xe4/0x250
     ksys_read+0xc8/0x180
     __arm64_sys_read+0x44/0x58
     el0_svc_common.constprop.0+0xac/0x228
  kmem_cache_destroy test_module_slab: Slab cache still has objects

Register a cpu hotplug function to remove all objects in the offline
per-cpu quarantine when cpu is going offline.  Set a per-cpu variable to
indicate this cpu is offline.

[qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
  Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee &lt;Kuan-Ying.Lee@mediatek.com&gt;
Signed-off-by: Zqiang &lt;qiang.zhang@windriver.com&gt;
Suggested-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Reported-by: Guangye Yang &lt;guangye.yang@mediatek.com&gt;
Reviewed-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Alexander Potapenko &lt;glider@google.com&gt;
Cc: Matthias Brugger &lt;matthias.bgg@gmail.com&gt;
Cc: Nicholas Tang &lt;nicholas.tang@mediatek.com&gt;
Cc: Miles Chen &lt;miles.chen@mediatek.com&gt;
Cc: Qian Cai &lt;qcai@redhat.com&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>revert "mm/filemap: add static for function __add_to_page_cache_locked"</title>
<updated>2020-12-11T22:02:14Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2020-12-11T21:36:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=16c0cc0ce3059e315a0aab6538061d95a6612589'/>
<id>urn:sha1:16c0cc0ce3059e315a0aab6538061d95a6612589</id>
<content type='text'>
Revert commit 3351b16af494 ("mm/filemap: add static for function
__add_to_page_cache_locked") due to incompatibility with
ALLOW_ERROR_INJECTION which result in build errors.

Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
Tested-by: Justin Forbes &lt;jmforbes@linuxtx.org&gt;
Tested-by: Greg Thelen &lt;gthelen@google.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Michal Kubecek &lt;mkubecek@suse.cz&gt;
Cc: Alex Shi &lt;alex.shi@linux.alibaba.com&gt;
Cc: Souptick Joarder &lt;jrdr.linux@gmail.com&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Josef Bacik &lt;josef@toxicpanda.com&gt;
Cc: Tony Luck &lt;tony.luck@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/madvise: remove racy mm ownership check</title>
<updated>2020-12-09T04:57:18Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2020-12-09T04:57:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a68a0262abdaa251e12c53715f48e698a18ef402'/>
<id>urn:sha1:a68a0262abdaa251e12c53715f48e698a18ef402</id>
<content type='text'>
Jann spotted the security hole due to race of mm ownership check.

If the task is sharing the mm_struct but goes through execve() before
mm_access(), it could skip process_madvise_behavior_valid check.  That
makes *any advice hint* to reach into the remote process.

This patch removes the mm ownership check.  With it, it will lose the
ability that local process could give *any* advice hint with vector
interface for some reason (e.g., performance).  Since there is no
concrete example in upstream yet, it would be better to remove the
abiliity at this moment and need to review when such new advice comes
up.

Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mmap.c: fix mmap return value when vma is merged after call_mmap()</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Liu Zixian</name>
<email>liuzixian4@huawei.com</email>
</author>
<published>2020-12-06T06:15:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=309d08d9b3a3659ab3f239d27d4e38b670b08fc9'/>
<id>urn:sha1:309d08d9b3a3659ab3f239d27d4e38b670b08fc9</id>
<content type='text'>
On success, mmap should return the begin address of newly mapped area,
but patch "mm: mmap: merge vma after call_mmap() if possible" set
vm_start of newly merged vma to return value addr.  Users of mmap will
get wrong address if vma is merged after call_mmap().  We fix this by
moving the assignment to addr before merging vma.

We have a driver which changes vm_flags, and this bug is found by our
testcases.

Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
Signed-off-by: Liu Zixian &lt;liuzixian4@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Hongxiang Lou &lt;louhongxiang@huawei.com&gt;
Cc: Hu Shiyuan &lt;hushiyuan@huawei.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlb_cgroup: fix offline of hugetlb cgroup with reservations</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2020-12-06T06:15:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7a5bde37983d37783161681ff7c6122dfd081791'/>
<id>urn:sha1:7a5bde37983d37783161681ff7c6122dfd081791</id>
<content type='text'>
Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload
using hugetlbfs.  In this environment the issue is reproduced by:

 - Start a simple pod that uses the recently added HugePages medium
   feature (pod yaml attached)

 - Start a DPDK app. It doesn't need to run successfully (as in transfer
   packets) nor interact with real hardware. It seems just initializing
   the EAL layer (which handles hugepage reservation and locking) is
   enough to trigger the issue

 - Delete the Pod (or let it "Complete").

This would result in a kworker thread going into a tight loop (top output):

   1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

  -   63.28%     0.01%  [kernel]                    [k] worker_thread
     - 49.97% worker_thread
        - 52.64% process_one_work
           - 62.08% css_killed_work_fn
              - hugetlb_cgroup_css_offline
                   41.52% _raw_spin_lock
                 - 2.82% _cond_resched
                      rcu_all_qs
                   2.66% PageHuge
        - 0.57% schedule
           - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning.  Little else can be done on the system as the
cgroup_mutex can not be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.

The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts.  Discussion about what to do with
reservation counts when reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup for with reservation
counts.  Unfortunately, the code checking reservation counts was
incorrectly added to hugetlb_cgroup_have_usage.

To fix the issue, simply remove the check for reservation counts.  While
fixing this issue, a related bug in hugetlb_cgroup_css_offline was
noticed.  The hstate index is not reinitialized each time through the
do-while loop.  Fix this as well.

Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Reported-by: Adrian Moreno &lt;amorenoz@redhat.com&gt;
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Tested-by: Adrian Moreno &lt;amorenoz@redhat.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Sandipan Das &lt;sandipan@linux.ibm.com&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/filemap: add static for function __add_to_page_cache_locked</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@linux.alibaba.com</email>
</author>
<published>2020-12-06T06:15:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3351b16af4946fff0d46481d155fb91adb28b1f9'/>
<id>urn:sha1:3351b16af4946fff0d46481d155fb91adb28b1f9</id>
<content type='text'>
  mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

Signed-off-by: Alex Shi &lt;alex.shi@linux.alibaba.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Souptick Joarder &lt;jrdr.linux@gmail.com&gt;
Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/swapfile: do not sleep with a spin lock held</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Qian Cai</name>
<email>qcai@redhat.com</email>
</author>
<published>2020-12-06T06:14:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b11a76b37a5aa7b07c3e3eeeaae20b25475bddd3'/>
<id>urn:sha1:b11a76b37a5aa7b07c3e3eeeaae20b25475bddd3</id>
<content type='text'>
We can't call kvfree() with a spin lock held, so defer it.  Fixes a
might_sleep() runtime warning.

Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
Signed-off-by: Qian Cai &lt;qcai@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2020-12-06T06:14:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e91d8d78237de8d7120c320b3645b7100848f24d'/>
<id>urn:sha1:e91d8d78237de8d7120c320b3645b7100848f24d</id>
<content type='text'>
While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted.  With investigation, I found
below commit calls cond_resched unconditionally so it could make a
problem in atomic context if the task is reschedule.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:108
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
  3 locks held by memhog/946:
   #0: ffff9d01d4b193e8 (&amp;mm-&gt;mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
   #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
   #2: ffff9d01d56b8110 (&amp;zspage-&gt;lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
  CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
  Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.

Even though this option showed some amount improvement(e.g., 30%) in
some arm32 platforms, it has been headache to maintain since it have
abused APIs[1](e.g., unmap_kernel_range in atomic context).

Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly it
has been not default option in zsmalloc, it's time to drop the option
for better maintenance.

[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Harish Sriram &lt;harish@linux.ibm.com&gt;
Cc: Uladzislau Rezki &lt;urezki@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: list_lru: set shrinker map bit when child nr_items is not zero</title>
<updated>2020-12-06T18:19:07Z</updated>
<author>
<name>Yang Shi</name>
<email>shy828301@gmail.com</email>
</author>
<published>2020-12-06T06:14:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8199be001a470209f5c938570cc199abb012fe53'/>
<id>urn:sha1:8199be001a470209f5c938570cc199abb012fe53</id>
<content type='text'>
When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache.  The vmcore shows there are over 14M negative dentry objects on
lru, but tracing result shows they were even not scanned at all.

Further investigation shows the memcg's vfs shrinker_map bit is not set.
So the reclaimer or dropping cache just skip calling vfs shrinker.  So
we have to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in test environment, and
the problem can't be reproduced after rebooting.  But it seems there is
race between shrinker map bit clear and reparenting by code inspection.
The hypothesis is elaborated as below.

The memcg hierarchy on our production environment looks like:

                root
               /    \
          system   user

The main workloads are running under user slice's children, and it
creates and removes memcg frequently.  So reparenting happens very often
under user slice, but no task is under user slice directly.

So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:

       CPU A                            CPU B
reparent
    dst-&gt;nr_items == 0
                                 shrinker:
                                     total_objects == 0
    add src-&gt;nr_items to dst
    set_bit
                                     return SHRINK_EMPTY
                                     clear_bit
child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
                                  list_lru_del() between shrinker runs
                                     see parent's kmemcg_id
                                     dec dst-&gt;nr_items
reparent again
    dst-&gt;nr_items may go negative
    due to concurrent list_lru_del()

                                 The second run of shrinker:
                                     read nr_items without any
                                     synchronization, so it may
                                     see intermediate negative
                                     nr_items then total_objects
                                     may return 0 coincidently

                                     keep the bit cleared
    dst-&gt;nr_items != 0
    skip set_bit
    add scr-&gt;nr_item to dst

After this point dst-&gt;nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore.  And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either.  That bit is kept cleared forever.

How does list_lru_del() race with reparenting? It is because reparenting
replaces children's kmemcg_id to parent's without protecting from
nlru-&gt;lock, so list_lru_del() may see parent's kmemcg_id but actually
deleting items from child's lru, but dec'ing parent's nr_items, so the
parent's nr_items may go negative as commit 2788cf0c401c ("memcg:
reparent list_lrus and free kmemcg_id on css offline") says.

Since it is impossible that dst-&gt;nr_items goes negative and
src-&gt;nr_items goes zero at the same time, so it seems we could set the
shrinker map bit iff src-&gt;nr_items != 0.  We could synchronize
list_lru_count_one() and reparenting with nlru-&gt;lock, but it seems
checking src-&gt;nr_items in reparenting is the simplest and avoids lock
contention.

Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Yang Shi &lt;shy828301@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Roman Gushchin &lt;guro@fb.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.19]
Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
