| Age | Commit message (Collapse) | Author | Files | Lines |
|
KCSAN reports a data race on mm_cluster.hiwater_rss, which can be accessed
concurrently from various paths like page migration and memory unmapping
without synchronization.
Since hiwater_rss is a statistical field for accounting purposes, this
data race is benign. Annotate both the read and write accesses with
data_race() to make KCSAN happy.
Link: https://lkml.kernel.org/r/20250926092426.43312-1-lance.yang@linux.dev
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reported-by: syzbot+60192c8877d0bc92a92b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/68d6364e.050a0220.3390a8.000d.GAE@google.com
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Marco Elver <elver@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We added that "select MEMORY_ISOLATION" in commit ee6f509c3274 ("mm:
factor out memory isolate functions"). However, in commit add05cecef80
("mm: soft-offline: don't free target page in successful page migration")
we remove the need for it, where we removed the calls to
set_migratetype_isolate() etc.
What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages()
support. But that comes with CONFIG_MIGRATION. And
isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION.
Therefore, we can remove "select MEMORY_ISOLATION" of MEMORY_FAILURE.
Link: https://lkml.kernel.org/r/20250922143618.48640-1-xieyuanbin1@huawei.com
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Current code is not correct to get struct khugepaged_mm_slot by
mm_slot_entry() without checking mm_slot is !NULL. There is no problem
reported since slot is the first element of struct khugepaged_mm_slot.
While struct khugepaged_mm_slot is just a wrapper of struct mm_slot, there
is no need to define it.
Remove the definition of struct khugepaged_mm_slot, so there is not chance
to miss use mm_slot_entry().
[richard.weiyang@gmail.com: fix use-after-free crash]
Link: https://lkml.kernel.org/r/20250922002834.vz6ntj36e75ehkyp@master
Link: https://lkml.kernel.org/r/20250919071244.17020-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm_slot: fix the usage of mm_slot_entry", v2.
When using mm_slot in ksm, there is code like:
slot = mm_slot_lookup(mm_slots_hash, mm);
mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot);
if (mm_slot && ..) {
}
The mm_slot_entry() won't return a valid value if slot is NULL generally.
But currently it works since slot is the first element of struct
ksm_mm_slot.
To reduce the ambiguity and make it robust, access mm_slot_entry() when
slot is !NULL.
Link: https://lkml.kernel.org/r/20250919071244.17020-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250919071244.17020-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 79359d6d24df ("hugetlb: perform vmemmap optimization on a list of
pages") batches the submission of HugeTLB vmemmap optimization (HVO)
during hugepage reservation. With HVO enabled, hugepages obtained from
the buddy allocator are not submitted for optimization and their
struct-page memory is therefore not released—until the entire
reservation request has been satisfied. As a result, any struct-page
memory freed in the course of the allocation cannot be reused for the
ongoing reservation, artificially limiting the number of huge pages that
can ultimately be provided.
As commit b1222550fbf7 ("mm/hugetlb: do pre-HVO for bootmem allocated
pages") already applies early HVO to bootmem-allocated huge pages, this
patch extends the same benefit to non-bootmem pages by incrementally
submitting them for HVO as they are allocated, thereby returning
struct-page memory to the buddy allocator in real time. The change raises
the maximum 2 MiB hugepage reservation from just under 376 GB to more than
381 GB on a 384 GB x86 VM.
Link: https://lkml.kernel.org/r/20250919092353.41671-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a new selftest to verify whether the `ksm_merging_pages` counter in
`mm_struct` is not inherited by a child process after fork. This helps
ensure correctness of KSM accounting across process creation.
Link: https://lkml.kernel.org/r/e7bb17d374133bd31a3e423aa9e46e1122e74971.1758648700.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/ksm: Fix incorrect accounting of KSM counters during
fork", v3.
The first patch in this series fixes the incorrect accounting of KSM
counters such as ksm_merging_pages, ksm_rmap_items, and the global
ksm_zero_pages during fork.
The following patch add a selftest to verify the ksm_merging_pages counter
was updated correctly during fork.
Test Results
============
Without the first patch
-----------------------
# [RUN] test_fork_ksm_merging_page_count
not ok 10 ksm_merging_page in child: 32
With the first patch
--------------------
# [RUN] test_fork_ksm_merging_page_count
ok 10 ksm_merging_pages is not inherited after fork
This patch (of 2):
Currently, the KSM-related counters in `mm_struct`, such as
`ksm_merging_pages`, `ksm_rmap_items`, and `ksm_zero_pages`, are inherited
by the child process during fork. This results in inconsistent
accounting.
When a process uses KSM, identical pages are merged and an rmap item is
created for each merged page. The `ksm_merging_pages` and
`ksm_rmap_items` counters are updated accordingly. However, after a fork,
these counters are copied to the child while the corresponding rmap items
are not. As a result, when the child later triggers an unmerge, there are
no rmap items present in the child, so the counters remain stale, leading
to incorrect accounting.
A similar issue exists with `ksm_zero_pages`, which maintains both a
global counter and a per-process counter. During fork, the per-process
counter is inherited by the child, but the global counter is not
incremented. Since the child also references zero pages, the global
counter should be updated as well. Otherwise, during zero-page unmerge,
both the global and per-process counters are decremented, causing the
global counter to become inconsistent.
To fix this, ksm_merging_pages and ksm_rmap_items are reset to 0 during
fork, and the global ksm_zero_pages counter is updated with the
per-process ksm_zero_pages value inherited by the child. This ensures
that KSM statistics remain accurate and reflect the activity of each
process correctly.
Link: https://lkml.kernel.org/r/cover.1758648700.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/7b9870eb67ccc0d79593940d9dbd4a0b39b5d396.1758648700.git.donettom@linux.ibm.com
Fixes: 7609385337a4 ("ksm: count ksm merging pages for each process")
Fixes: cb4df4cae4f2 ("ksm: count allocated ksm rmap_items for each process")
Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: <stable@vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When device_register() fails in register_node(), it calls
put_device(&node->dev). This triggers node_device_release(), which calls
kfree(to_node(dev)), thereby freeing the entire node structure.
As a result, when register_node() returns an error, the node memory has
already been freed. Calling kfree(node) again in register_one_node()
leads to a double free.
This patch removes the redundant kfree(node) from register_one_node() to
prevent the double free.
Link: https://lkml.kernel.org/r/20250918054144.58980-1-donettom@linux.ibm.com
Fixes: 786eb990cfb7 ("drivers/base/node: handle error properly in register_one_node()")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Chris Mason <clm@meta.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When using vmalloc with VM_ALLOW_HUGE_VMAP flag, it will set the alignment
to PMD_SIZE internally, if it deems huge mappings to be eligible.
Therefore, setting the alignment in execmem_vmalloc is redundant. Apart
from this, it also reduces the probability of allocation in case vmalloc
fails to allocate hugepages - in the fallback case, vmalloc tries to use
the original alignment and allocate basepages, which unfortunately will
again be PMD_SIZE passed over from execmem_vmalloc, thus constraining the
search for a free space in vmalloc region.
Therefore, remove this constraint.
Link: https://lkml.kernel.org/r/20250918093453.75676-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Link: https://lkml.kernel.org/r/20250918174528.90879-1-manish1588@gmail.com
Signed-off-by: Manish Kumar <manish1588@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel currently does not mlock large folios when adding them to rmap,
stating that it is difficult to confirm that the folio is fully mapped and
safe to mlock it.
This leads to a significant undercount of Mlocked in /proc/meminfo,
causing problems in production where the stat was used to estimate system
utilization and determine if load shedding is required.
However, nowadays the caller passes a number of pages of the folio that
are getting mapped, making it easy to check if the entire folio is mapped
to the VMA.
mlock the folio on rmap if it is fully mapped to the VMA.
Mlocked in /proc/meminfo can still undercount, but the value is closer the
truth and is useful for userspace.
Link: https://lkml.kernel.org/r/20250923110711.690639-7-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, kernel only maps part of large folio that fits into
start_pgoff/end_pgoff range.
Map entire folio where possible. It will match finish_fault() behaviour
that user hits on cold page cache.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-6-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
finish_fault() uses per-page fault for file folios. This only occurs for
file folios smaller than PMD_SIZE.
The comment suggests that this approach prevents RSS inflation. However,
it only prevents RSS accounting. The folio is still mapped to the
process, and the fact that it is mapped by a single PTE does not affect
memory pressure. Additionally, the kernel's ability to map large folios
as PMD if they are large enough does not support this argument.
When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, try_to_unmap_once() only tries to mlock small folios.
Use logic similar to folio_referenced_one() to mlock large folios: only do
this for fully mapped folios and under page table lock that protects all
page table entries.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-4-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The mlock_vma_folio() function requires the page table lock to be held in
order to safely mlock the folio. However, folio_referenced_one() mlocks a
large folios outside of the page_vma_mapped_walk() loop where the page
table lock has already been dropped.
Rework the mlock logic to use the same code path inside the loop for both
large and small folios.
Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a page
table boundary.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-3-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Improve mlock tracking for large folios", v3.
The patchset includes several fixes and improvements related to mlock
tracking of large folios.
The main objective is to reduce the undercount of Mlocked memory in
/proc/meminfo and improve the accuracy of the statistics.
Patches 1-2:
These patches address a minor race condition in folio_referenced_one()
related to mlock_vma_folio().
Currently, mlock_vma_folio() is called on large folio without the page
table lock, which can result in a race condition with unmap (i.e.
MADV_DONTNEED). This can lead to partially mapped folios on the
unevictable LRU list.
While not a significant issue, I do not believe backporting is necessary.
Patch 3:
This patch adds mlocking logic similar to folio_referenced_one() to
try_to_unmap_one(), allowing for mlocking of large folios where possible.
Patch 4-5:
These patches modifies finish_fault() and faultaround to map in the entire
folio when possible, enabling efficient mlocking upon addition to the
rmap.
Patch 6:
This patch makes rmap mlock large folios if they are fully mapped,
addressing the primary source of mlock undercount for large folios.
This patch (of 6):
Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
the page is mapped across page table boundary. Unlike other PVMW_* flags,
this one is result of page_vma_mapped_walk() and not set by the caller.
folio_referenced_one() will use it to detect if it safe to mlock the
folio.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 56ae0bb349b4 ("mm: compaction: convert to use a folio in
isolate_migratepages_block()") converts api from page to folio. But the
low_pfn advance for hugetlb page seems wrong when low_pfn doesn't point to
head page.
Originally, if page is a hugetlb tail page, compound_nr() return 1, which
means low_pfn only advance one in next iteration. After the change,
low_pfn would advance more than the hugetlb range, since folio_nr_pages()
always return total number of the large page. This results in skipping
some range to isolate and then to migrate.
The worst case for alloc_contig is it does all the isolation and
migration, but finally find some range is still not isolated. And then
undo all the work and try a new range.
Advance low_pfn to the end of hugetlb.
Link: https://lkml.kernel.org/r/20250910092240.3981-1-richard.weiyang@gmail.com
Fixes: 56ae0bb349b4 ("mm: compaction: convert to use a folio in isolate_migratepages_block()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When restoring a page, no sanity checks are done to make sure the page
actually came from a kexec handover. The caller is trusted to pass in the
right address. If the caller has a bug and passes in a wrong address, an
in-use page might be "restored" and returned, causing all sorts of memory
corruption.
Harden the page restore logic by stashing in a magic number in
page->private along with the order. If the magic number does not match,
the page won't be touched. page->private is an unsigned long. The union
kho_page_info splits it into two parts, with one holding the order and the
other holding the magic number.
Link: https://lkml.kernel.org/r/20250917125725.665-2-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While KHO exposes folio as the primitive externally, internally its
restoration machinery operates on pages. This can be seen with
kho_restore_folio() for example. It performs some sanity checks and hands
it over to kho_restore_page() to do the heavy lifting of page restoration.
After the work done by kho_restore_page(), kho_restore_folio() only
converts the head page to folio and returns it. Similarly,
deserialize_bitmap() operates on the head page directly to store the
order.
Move the sanity checks for valid phys and order from the public-facing
kho_restore_folio() to the private-facing kho_restore_page(). This makes
the boundary between page and folio clearer from KHO's perspective.
While at it, drop the comment above kho_restore_page(). The comment is
misleading now. The function stopped looking like free_reserved_page()
since 12b9a2c05d1b4 ("kho: initialize tail pages for higher order folios
properly"), and now looks even more different.
Link: https://lkml.kernel.org/r/20250917125725.665-1-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The madv_populate and soft-dirty kselftests currently fail on systems
where CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
Link: https://lkml.kernel.org/r/20250917133137.62802-1-lance.yang@linux.dev
Fixes: 9f3265db6ae8 ("selftests: vm: add test for Soft-Dirty PTE bit")
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_ctx->addr_unit is respected only for physical address space
monitoring use case. Meanwhile, damon_ctx->min_sz_region is used by the
core layer for aligning regions, regardless of whether it is set for
physical address space monitoring or virtual address spaces monitoring.
And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'. Hence, if
user sets ->addr_unit on virtual address spaces monitoring mode, regions
can be unexpectedly aligned in <PAGE_SIZE granularity. It shouldn't cause
crash-like issues but make monitoring and DAMOS behavior difficult to
understand.
Fix the unexpected behavior by setting ->min_sz_region only when it is
configured for physical address space monitoring.
The issue was found from a result of Chris' experiments that thankfully
shared with me off-list.
Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org
Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently vm_area_alloc_pages() contains two cond_resched() points.
However, the page allocator already has its own in slow path so an extra
resched is not optimal because it delays the loops.
The place where CPU time can be consumed is in the VA-space search in
alloc_vmap_area(), especially if the space is really fragmented using
synthetic stress tests, after a fast path falls back to a slow one.
Move a single cond_resched() there, after dropping free_vmap_area_lock in
a slow path. This keeps fairness where it matters while removing
redundant yields from the page-allocation path.
[akpm@linux-foundation.org: tweak comment grammar]
Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This removes the last call to page_stable_node(), so delete the wrapper.
It also removes a call to trylock_page() and saves a call to
compound_head(), as well as removing a reference to folio->page.
Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Longlong Xia <xialonglong@kylinos.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.
Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.
This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.
Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.
To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.
[gourry@gourry.net: don't retry if no kswapds were active]
Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix an incorrect logic conversion in process_mrelease().
Link: https://lkml.kernel.org/r/3b7f0faf-4dbc-4d67-8a71-752fbcdf0906@lucifer.local
Fixes: 12e423ba4eae ("mm: convert core mm to mm_flags_*() accessors")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lkml.kernel.org/r/c2e28e27-d84b-4671-8784-de5fe0d14f41@lucifer.local
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MADV_COLLAPSE on a file mapping behaves inconsistently depending on if PMD
page table is installed or not.
Consider following example:
p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
err = madvise(p, 2UL << 20, MADV_COLLAPSE);
fd is a populated tmpfs file.
The result depends on the address that the kernel returns on mmap(). If
it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.
This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a
page table is missing, which causes collapse_pte_mapped_thp() to fail.
SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.
Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().
This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().
However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file. This is not a safe assumption in
all cases.
The only really sane situation in which this matters would be something
like e.g. i915_gem_dmabuf_mmap() which invokes vfs_mmap() against
obj->base.filp:
ret = vfs_mmap(obj->base.filp, vma);
if (ret)
return ret;
And then sets the VMA's file to this, should the mmap operation succeed:
vma_set_file(vma, obj->base.filp);
That is - it is the file that is intended to back the VMA mapping.
This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.
However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.
Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.
Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).
If the caller needs to differentiate between the two they therefore now
can.
While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.
This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.
This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.
Also update the VMA selftests accordingly.
Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.
As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.
While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.
To accomodate for this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.
So far, we have provided desc->file equal to vma->vm_file. However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.
To account for this, we adjust vm_area_desc to have both file and vm_file
fields. The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).
However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibilty layer.
Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.
No current f_op->mmap_prepare users assign desc->file so this is safe to
do.
This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.
While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.
This patch (of 2):
Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file. We will make this
functionality possible in a subsequent patch.
In order to prepare for this, let's update vm_area_struct to separately
provide desc->file and desc->vm_file parameters.
The desc->file parameter is the file that the hook is expected to operate
upon, and is not assignable (though the hok may wish to e.g. update the
file's accessed time for instance).
The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA"s vma->vm_file.
For now we keep desc->file, vm_file the same to remain consistent.
No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.
While we're here, make the mm_struct desc->mm pointers at immutable as
well as the desc->mm field itself.
As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare()).
We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.
Additionally, update VMA tests to accommodate changes.
Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that all actionable outcomes from checking pte_write() are gone, drop
the related references.
Link: https://lkml.kernel.org/r/20250908075028.38431-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Expand scope of khugepaged anonymous collapse", v2.
Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.
An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.
Therefore, remove this constraint.
On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an
average of 5% improvement is seen on some mmtests benchmarks, particularly
hackbench, with a maximum improvement of 12%. In the following table, (I)
denotes statistically significant improvement, (R) denotes statistically
significant regression.
+-------------------------+--------------------------------+---------------+
| mmtests/hackbench | process-pipes-1 (seconds) | -0.06% |
| | process-pipes-4 (seconds) | -0.27% |
| | process-pipes-7 (seconds) | (I) -12.13% |
| | process-pipes-12 (seconds) | (I) -5.32% |
| | process-pipes-21 (seconds) | (I) -2.87% |
| | process-pipes-30 (seconds) | (I) -3.39% |
| | process-pipes-48 (seconds) | (I) -5.65% |
| | process-pipes-79 (seconds) | (I) -6.74% |
| | process-pipes-110 (seconds) | (I) -6.26% |
| | process-pipes-141 (seconds) | (I) -4.99% |
| | process-pipes-172 (seconds) | (I) -4.45% |
| | process-pipes-203 (seconds) | (I) -3.65% |
| | process-pipes-234 (seconds) | (I) -3.45% |
| | process-pipes-256 (seconds) | (I) -3.47% |
| | process-sockets-1 (seconds) | 2.13% |
| | process-sockets-4 (seconds) | 1.02% |
| | process-sockets-7 (seconds) | -0.26% |
| | process-sockets-12 (seconds) | -1.24% |
| | process-sockets-21 (seconds) | 0.01% |
| | process-sockets-30 (seconds) | -0.15% |
| | process-sockets-48 (seconds) | 0.15% |
| | process-sockets-79 (seconds) | 1.45% |
| | process-sockets-110 (seconds) | -1.64% |
| | process-sockets-141 (seconds) | (I) -4.27% |
| | process-sockets-172 (seconds) | 0.30% |
| | process-sockets-203 (seconds) | -1.71% |
| | process-sockets-234 (seconds) | -1.94% |
| | process-sockets-256 (seconds) | -0.71% |
| | thread-pipes-1 (seconds) | 0.66% |
| | thread-pipes-4 (seconds) | 1.66% |
| | thread-pipes-7 (seconds) | -0.17% |
| | thread-pipes-12 (seconds) | (I) -4.12% |
| | thread-pipes-21 (seconds) | (I) -2.13% |
| | thread-pipes-30 (seconds) | (I) -3.78% |
| | thread-pipes-48 (seconds) | (I) -5.77% |
| | thread-pipes-79 (seconds) | (I) -5.31% |
| | thread-pipes-110 (seconds) | (I) -6.12% |
| | thread-pipes-141 (seconds) | (I) -4.00% |
| | thread-pipes-172 (seconds) | (I) -3.01% |
| | thread-pipes-203 (seconds) | (I) -2.62% |
| | thread-pipes-234 (seconds) | (I) -2.00% |
| | thread-pipes-256 (seconds) | (I) -2.30% |
| | thread-sockets-1 (seconds) | (R) 2.39% |
+-------------------------+--------------------------------+---------------+
+-------------------------+------------------------------------------------+
| mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | -0.02% |
| | sysbenchmutex-4 (usec) | -0.02% |
| | sysbenchmutex-7 (usec) | 0.00% |
| | sysbenchmutex-12 (usec) | 0.12% |
| | sysbenchmutex-21 (usec) | -0.40% |
| | sysbenchmutex-30 (usec) | 0.08% |
| | sysbenchmutex-48 (usec) | 2.59% |
| | sysbenchmutex-79 (usec) | -0.80% |
| | sysbenchmutex-110 (usec) | -3.87% |
| | sysbenchmutex-128 (usec) | (I) -4.46% |
+-------------------------+--------------------------------+---------------+
This patch (of 2):
Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.
An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.
Therefore, remove this restriction by not honouring SCAN_PAGE_RO.
Link: https://lkml.kernel.org/r/20250908075028.38431-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250908075028.38431-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_STAT calculates the idle time of a region using the region's age if
the region's nr_accesses is zero. If the nr_accesses value is non-zero
(positive), the idle time of the region becomes zero.
This means the users cannot know how warm and hot data is distributed,
using DAMON_STAT's memory_idle_ms_percentiles output. The other stat,
namely estimated_memory_bandwidth, can help understanding how the overall
access temperature of the system is, but it is still very rough
information. On production systems, actually, a significant portion of
the system memory is observed with zero idle time, and we cannot break it
down based on its internal hotness distribution.
Define the idle time of the region using its age, similar to those having
zero nr_accesses, but multiples '-1' to distinguish it. And expose that
using the same parameter interface, memory_idle_ms_percentiles.
Link: https://lkml.kernel.org/r/20250916183127.65708-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle
ages".
DAMON_STAT is intentionally providing limited information for easy
consumption of the information. From production fleet level usages, below
limitations are found, though.
The aggregation interval of DAMON_STAT represents the granularity of the
memory_idle_ms_percentiles. But the interval is auto-tuned and not
exposed to users, so users cannot know the granularity.
All memory regions of non-zero (positive) nr_accesses are treated as
having zero idle time. A significant portion of production systems have
such zero idle time. Hence breakdown of warm and hot data is nearly
impossible.
Make following changes to overcome the limitations. Expose the auto-tuned
aggregation interval with a new parameter named aggr_interval_us. Expose
the age of non-zero nr_accesses (how long >0 access frequency the region
retained) regions as a negative idle time.
This patch (of 2):
DAMON_STAT calculates the idle time for a region as the region's age
multiplied by the aggregation interval. That is, the aggregation interval
is the granularity of the idle time. Since the aggregation interval is
auto-tuned and not exposed to users, however, users cannot easily know in
what granularity the stat is made. Expose the tuned aggregation interval
in microseconds via a new parameter, aggr_interval_us.
Link: https://lkml.kernel.org/r/20250916183127.65708-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916183127.65708-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sample_mtier is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sample_prcl is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sample_wsse is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_LRU_SORT is assuming DAMON is ready to use in module_init time, and
uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_RECLAIM is assuming DAMON is ready to use in module_init time, and
uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_STAT is assuming DAMON is ready to use in module_init time, and uses
its own hack to see if it is the time. Use damon_initialized(), which is
a way for seeing if DAMON is ready to be used that is more reliable and
better to maintain instead of the hack.
Link: https://lkml.kernel.org/r/20250916033511.116366-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: define and use DAMON initialization check
function".
DAMON is initialized in subsystem initialization time, by damon_init().
If DAMON API functions are called before the initialization, the
system could crash. Actually such issues happened and were fixed [1]
in the past. For the fix, DAMON API callers have updated to check if
DAMON is initialized or not, using their own hacks. The hacks are
unnecessarily duplicated on every DAMON API callers and therefore it
would be difficult to reliably maintain in the long term.
Make it reliable and easy to maintain. For this, implement a new DAMON
core layer API function that returns if DAMON is successfully
initialized. If it returns true, it means DAMON API functions are safe
to be used. After the introduction of the new API, update DAMON API
callers to use the new function instead of their own hacks.
This patch (of 7):
If DAMON is tried to be used when it is not yet successfully initialized,
the caller could be crashed. DAMON core layer is not providing a reliable
way to see if it is successfully initialized and therefore ready to be
used, though. As a result, DAMON API callers are implementing their own
hacks to see it. The hacks simply assume DAMON should be ready on module
init time. It is not reliable as DAMON initialization can indeed fail if
KMEM_CACHE() fails, and difficult to maintain as those are duplicates.
Implement a core layer API function for better reliability and
maintainability to replace the hacks with followup commits.
Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org
Link: https://lore.kernel.org/20250909022238.2989-1-sj@kernel.org [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON section name is 'DATA ACCESS MONITOR', which implies it is only for
data access monitoring. But DAMON is now evolved for not only access
monitoring but also access-aware system operations (DAMOS). Rename the
section to simply DAMON. It might make it difficult to understand what it
does at a glance, but at least not spreading more confusion. Readers can
further refer to the documentation to better understand what really DAMON
does.
Link: https://lkml.kernel.org/r/20250916032339.115817-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The example command doesn't work [1] on the latest DAMON user-space tool,
since --damos_action option is updated to receive multiple arguments, and
hence cannot know if the final argument is for deductible monitoring
target or an argument for --damos_action option. Add --target_pid option
to let damo understand it is for target pid.
Link: https://lkml.kernel.org/r/20250916032339.115817-5-sj@kernel.org
Link: https://github.com/damonitor/damo/pull/32 [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
requirements
DAMON community meetup was having two different kinds of meetups:
reservation required ones and unrequired ones. Now the reservation
unrequested one is gone, but the documentation on the maintainer-profile
is not updated. Update.
Link: https://lkml.kernel.org/r/20250916032339.115817-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The effective quota of a scheme is initialized zero, which means there is
no quota. It is set based on user-specified time/quota/quota goals. But
the later value set is done only from the second charge window. As a
result, a scheme having a user-specified quota can work as not having the
quota (unexpectedly fast) for the first charge window. In practical and
common use cases the quota interval is not too long, and the scheme's
target access pattern is restrictive. Hence the issue should be modest.
That said, it is apparently an unintended misbehavior. Fix the problem by
setting esz on the first charge window.
Link: https://lkml.kernel.org/r/20250916032339.115817-3-sj@kernel.org
Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") # 5.16.x
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: misc fixups and improvements for 6.18", v2.
Misc fixes and improvements for DAMON that are not critical and therefore
aims to be merged into Linux 6.18-rc1.
The first patch improves DAMON's age counting for nr_accesses zero to/from
non-zero changes.
The second patch fixes an initial DAMOS apply interval delay issue that is
not realistic but still could happen on an odd setup.
The third and the fourth patches update DAMON community meetup description
and DAMON user-space tool example command for DAMOS usage, respectively.
Finally, the fifth patch updates MAINTAINERS section name for DAMON to
just DAMON.
This patch (of 5):
DAMON resets the age of a region if its nr_accesses value has
significantly changed. Specifically, the threshold is calculated as 20%
of largest nr_accesses of the current snapshot. This means that regions
changing the nr_accesses from zero to small non-zero value or from a small
non-zero value to zero will keep the age. Since many users treat zero
nr_accesses regions special, this can be confusing. Kernel code including
DAMOS' regions priority calculation and DAMON_STAT's idle time calculation
also treat zero nr_accesses regions special. Make it unconfusing by
resetting the age when the nr_accesses changes between zero and a non-zero
value.
Link: https://lkml.kernel.org/r/20250916032339.115817-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916032339.115817-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While rare, memory allocation profiling can contain inaccurate counters if
slab object extension vector allocation fails. That allocation might
succeed later but prior to that, slab allocations that would have used
that object extension vector will not be accounted for. To indicate
incorrect counters, "accurate:no" marker is appended to the call site line
in the /proc/allocinfo output. Bump up /proc/allocinfo version to reflect
the change in the file format and update documentation.
Example output with invalid counters:
allocinfo - version: 2.0
0 0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes
0 0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add
0 0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no
0 0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set
0 0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc
0 0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale
0 0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs
49152 48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no
32768 1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create
0 0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device
[surenb@google.com: document new "accurate:no" marker]
Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
[akpm@linux-foundation.org: simplification per Usama, reflow text]
[akpm@linux-foundation.org: add newline to prevent docs warning, per Randy]
Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Although the oom_reaper is delayed and it gives the oom victim chance to
clean up its address space this might take a while especially for
processes with a large address space footprint. In those cases oom_reaper
might start racing with the dying task and compete for shared resources -
e.g. page table lock contention has been observed.
Reduce those races by reaping the oom victim from the other end of the
address space.
It is also a significant improvement for process_mrelease(). When a
process is killed, process_mrelease is used to reap the killed process and
often runs concurrently with the dying task. The test data shows that
after applying the patch, lock contention is greatly reduced during the
procedure of reaping the killed process.
The test is conducted on arm64. The following basic perf numbers show
that applying this patch significantly reduces pte spin lock contention.
Without the patch:
|--99.57%-- oom_reaper
| |--73.58%-- unmap_page_range
| | |--8.67%-- [hit in function]
| | |--41.59%-- __pte_offset_map_lock
| | |--29.47%-- folio_remove_rmap_ptes
| | |--16.11%-- tlb_flush_mmu
| |--19.94%-- tlb_finish_mmu
| |--3.21%-- folio_remove_rmap_ptes
With the patch:
|--99.53%-- oom_reaper
| |--55.77%-- unmap_page_range
| | |--20.49%-- [hit in function]
| | |--58.30%-- folio_remove_rmap_ptes
| | |--11.48%-- tlb_flush_mmu
| | |--3.33%-- folio_mark_accessed
| |--32.21%-- tlb_finish_mmu
| |--6.93%-- folio_remove_rmap_ptes
| |--0.69%-- __pte_offset_map_lock
Detailed breakdowns for both scenarios are provided below. The cumulative
time for oom_reaper plus exit_mmap(victim) in both cases is also
summarized, making the performance improvements clear.
+----------------------------------------------------------------+
| Category | Applying patch | Without patch |
+-------------------------------+----------------+---------------+
| Total running time | 132.6 | 167.1 |
| (exit_mmap + reaper work) | 72.4 + 60.2 | 90.7 + 76.4 |
+-------------------------------+----------------+---------------+
| Time waiting for pte spinlock | 1.0 | 33.1 |
| (exit_mmap + reaper work) | 0.4 + 0.6 | 10.0 + 23.1 |
+-------------------------------+----------------+---------------+
| folio_remove_rmap_ptes time | 42.0 | 41.3 |
| (exit_mmap + reaper work) | 18.4 + 23.6 | 22.4 + 18.9 |
+----------------------------------------------------------------+
From this report, we can see that:
1. The reduction in total time comes mainly from the decrease in time
spent on pte spinlock and other locks.
2. oom_reaper performs more work in some areas, but at the same time,
exit_mmap also handles certain tasks more efficiently, such as
folio_remove_rmap_ptes.
Here is a more detailed perf report. [1]
Link: https://lkml.kernel.org/r/20250915162946.5515-3-zhongjinji@honor.com
Link: https://lore.kernel.org/all/20250915162619.5133-1-zhongjinji@honor.com/ [1]
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Improvements to Victim Process Thawing and OOM Reaper
Traversal Order", v10.
This patch series focuses on optimizing victim process thawing and
refining the traversal order of the OOM reaper. Since __thaw_task() is
used to thaw a single thread of the victim, thawing only one thread cannot
guarantee the exit of the OOM victim when it is frozen. Patch 1 thaw the
entire process of the OOM victim to ensure that OOM victims are able to
terminate themselves. Even if the oom_reaper is delayed, patch 2 is still
beneficial for reaping processes with a large address space footprint, and
it also greatly improves process_mrelease.
This patch (of 10):
OOM killer is a mechanism that selects and kills processes when the system
runs out of memory to reclaim resources and keep the system stable. But
the oom victim cannot terminate on its own when it is frozen, even if the
OOM victim task is thawed through __thaw_task(). This is because
__thaw_task() can only thaw a single OOM victim thread, and cannot thaw
the entire OOM victim process.
In addition, freezing_slow_path() determines whether a task is an OOM
victim by checking the task's TIF_MEMDIE flag. When a task is identified
as an OOM victim, the freezer bypasses both PM freezing and cgroup
freezing states to thaw it.
Historically, TIF_MEMDIE was a "this is the oom victim & it has access to
memory reserves" flag in the past. It has that thread vs. process
problems and tsk_is_oom_victim was introduced later to get rid of them and
other issues as well as the guarantee that we can identify the oom
victim's mm reliably for other oom_reaper.
Therefore, thaw_process() is introduced to unfreeze all threads within the
OOM victim process, ensuring that every thread is properly thawed. The
freezer now uses tsk_is_oom_victim() to determine OOM victim status,
allowing all victim threads to be unfrozen as necessary.
With this change, the entire OOM victim process will be thawed when an OOM
event occurs, ensuring that the victim can terminate on its own.
Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com
Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
static inlines
For all the usual reasons, plus a new one. Calling
(void)arch_enter_lazy_mmu_mode();
deservedly blows up.
Cc: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_lru_sort_apply_parameters() allocates a new DAMON context, stages
user-specified DAMON parameters on it, and commits to running DAMON
context at once, using damon_commit_ctx(). The code is, however, directly
updating the monitoring attributes of the running context. And the
attributes are over-written by later damon_commit_ctx() call. This means
that the monitoring attributes parameters are not really working. Fix the
wrong use of the parameter context.
Link: https://lkml.kernel.org/r/20250916031549.115326-1-sj@kernel.org
Fixes: a30969436428 ("mm/damon/lru_sort: use damon_commit_ctx()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: <stable@vger.kernel.org> [6.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The while loop doesn't execute and following warning gets generated:
protection_keys.c:561:15: warning: code will never be executed
[-Wunreachable-code]
int rpkey = alloc_random_pkey();
Let's enable the while loop such that it gets executed nr_iterations
times. Simplify the code a bit as well.
Link: https://lkml.kernel.org/r/20250912123025.1271051-3-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|