summaryrefslogtreecommitdiffstats
path: root/mm/Makefile
AgeCommit message (Collapse)AuthorLines
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12Merge tag 'mm-stable-2026-02-11-19-22' of ↵Linus Torvalds-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...
2026-02-06mm: move pte table reclaim code to memory.cDavid Hildenbrand (Red Hat)-1/+0
Some cleanups for PT table reclaim code, triggered by a false-positive warning we might start to see soon after we unlocked pt-reclaim on architectures besides x86-64. This patch (of 2): The pte-table reclaim code is only called from memory.c, while zapping pages, and it better also stays that way in the long run. If we ever have to call it from other files, we should expose proper high-level helpers for zapping if the existing helpers are not good enough. So, let's move the code over (it's not a lot) and slightly clean it up a bit by: - Renaming the functions. - Dropping the "Check if it is empty PTE page" comment, which is now self-explaining given the function name. - Making zap_pte_table_if_empty() return whether zapping worked so the caller can free it. - Adding a comment in pte_table_reclaim_possible(). - Inlining free_pte() in the last remaining user. - In zap_empty_pte_table(), switch from pmdp_get_lcokless() to pmd_clear(), we are holding the PMD PT lock. By moving the code over, compilers can also easily figure out when zap_empty_pte_table() does not initialize the pmdval variable, avoiding false-positive warnings about the variable possibly not being initialized. Link: https://lkml.kernel.org/r/20260119220708.3438514-1-david@kernel.org Link: https://lkml.kernel.org/r/20260119220708.3438514-2-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm: rename CONFIG_MEMORY_BALLOON -> CONFIG_BALLOONDavid Hildenbrand (Red Hat)-1/+1
Let's make it consistent with the naming of the files but also with the naming of CONFIG_BALLOON_MIGRATION. While at it, add a "/* CONFIG_BALLOON */". Link: https://lkml.kernel.org/r/20260119230133.3551867-24-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm: rename balloon_compaction.(c|h) to balloon.(c|h)David Hildenbrand (Red Hat)-1/+1
Even without CONFIG_BALLOON_COMPACTION this infrastructure implements basic list and page management for a memory balloon. Link: https://lkml.kernel.org/r/20260119230133.3551867-21-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26liveupdate: separate memfd support into LIVEUPDATE_MEMFDPasha Tatashin-1/+1
Decouple memfd preservation support from the core Live Update Orchestrator configuration. Previously, enabling CONFIG_LIVEUPDATE forced a dependency on CONFIG_SHMEM and unconditionally compiled memfd_luo.o. However, Live Update may be used for purposes that do not require memfd-backed memory preservation. Introduce CONFIG_LIVEUPDATE_MEMFD to gate memfd_luo.o. This moves the SHMEM and MEMFD_CREATE dependencies to the specific feature that needs them, allowing the base LIVEUPDATE option to be selected independently of shared memory support. Link: https://lkml.kernel.org/r/20251230161402.1542099-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20mm: add basic tests for lazy_mmuKevin Brodsky-0/+1
Add basic KUnit tests for the generic aspects of the lazy MMU mode: ensure that it appears active when it should, depending on how enable/disable and pause/resume pairs are nested. [akpm@linux-foundation.org: export ppc64_tlb_batch and __flush_tlb_pending to modules] [ritesh.list@gmail.com: use EXPORT_SYMBOL_IF_KUNIT()] Link: https://lkml.kernel.org/r/87a4zhkt6h.ritesh.list@gmail.com [kevin.brodsky@arm.com: move MODULE_IMPORT_NS(), add comment] Link: https://lkml.kernel.org/r/20251217163812.2633648-2-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20251215150323.2218608-15-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-22mm: introduce BPF kfuncs to deal with memcg pointersRoman Gushchin-0/+3
To effectively operate with memory cgroups in BPF there is a need to convert css pointers to memcg pointers. A simple container_of cast which is used in the kernel code can't be used in BPF because from the verifier's point of view that's a out-of-bounds memory access. Introduce helper get/put kfuncs which can be used to get a refcounted memcg pointer from the css pointer: - bpf_get_mem_cgroup, - bpf_put_mem_cgroup. bpf_get_mem_cgroup() can take both memcg's css and the corresponding cgroup's "self" css. It allows it to be used with the existing cgroup iterator which iterates over cgroup tree, not memcg tree. Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20251223044156.208250-3-roman.gushchin@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-06Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko) fixes a build issue and does some cleanup in ib/sys_info.c - "Implement mul_u64_u64_div_u64_roundup()" (David Laight) enhances the 64-bit math code on behalf of a PWM driver and beefs up the test module for these library functions - "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich) makes BPF symbol names, sizes, and line numbers available to the GDB debugger - "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang) adds a sysctl which can be used to cause additional info dumping when the hung-task and lockup detectors fire - "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu) adds a general base64 encoder/decoder to lib/ and migrates several users away from their private implementations - "rbree: inline rb_first() and rb_last()" (Eric Dumazet) makes TCP a little faster - "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin) reworks the KEXEC Handover interfaces in preparation for Live Update Orchestrator (LUO), and possibly for other future clients - "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin) increases the flexibility of KEXEC Handover. Also preparation for LUO - "Live Update Orchestrator" (Pasha Tatashin) is a major new feature targeted at cloud environments. Quoting the cover letter: This series introduces the Live Update Orchestrator, a kernel subsystem designed to facilitate live kernel updates using a kexec-based reboot. This capability is critical for cloud environments, allowing hypervisors to be updated with minimal downtime for running virtual machines. LUO achieves this by preserving the state of selected resources, such as memory, devices and their dependencies, across the kernel transition. As a key feature, this series includes support for preserving memfd file descriptors, which allows critical in-memory data, such as guest RAM or any other large memory region, to be maintained in RAM across the kexec reboot. Mike Rappaport merits a mention here, for his extensive review and testing work. - "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain) moves the kexec and kdump sysfs entries from /sys/kernel/ to /sys/kernel/kexec/ and adds back-compatibility symlinks which can hopefully be removed one day - "kho: fixes for vmalloc restoration" (Mike Rapoport) fixes a BUG which was being hit during KHO restoration of vmalloc() regions * tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits) calibrate: update header inclusion Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" vmcoreinfo: track and log recoverable hardware errors kho: fix restoring of contiguous ranges of order-0 pages kho: kho_restore_vmalloc: fix initialization of pages array MAINTAINERS: TPM DEVICE DRIVER: update the W-tag init: replace simple_strtoul with kstrtoul to improve lpj_setup KHO: fix boot failure due to kmemleak access to non-PRESENT pages Documentation/ABI: new kexec and kdump sysfs interface Documentation/ABI: mark old kexec sysfs deprecated kexec: move sysfs entries to /sys/kernel/kexec test_kho: always print restore status kho: free chunks using free_page() instead of kfree() selftests/liveupdate: add kexec test for multiple and empty sessions selftests/liveupdate: add simple kexec-based selftest for LUO selftests/liveupdate: add userspace API selftests docs: add documentation for memfd preservation via LUO mm: memfd_luo: allow preserving memfd liveupdate: luo_file: add private argument to store runtime state mm: shmem: export some functions to internal.h ...
2025-11-27mm: memfd_luo: allow preserving memfdPratyush Yadav-0/+1
The ability to preserve a memfd allows userspace to use KHO and LUO to transfer its memory contents to the next kernel. This is useful in many ways. For one, it can be used with IOMMUFD as the backing store for IOMMU page tables. Preserving IOMMUFD is essential for performing a hypervisor live update with passthrough devices. memfd support provides the first building block for making that possible. For another, applications with a large amount of memory that takes time to reconstruct, reboots to consume kernel upgrades can be very expensive. memfd with LUO gives those applications reboot-persistent memory that they can use to quickly save and reconstruct that state. While memfd is backed by either hugetlbfs or shmem, currently only support on shmem is added. To be more precise, support for anonymous shmem files is added. The handover to the next kernel is not transparent. All the properties of the file are not preserved; only its memory contents, position, and size. The recreated file gets the UID and GID of the task doing the restore, and the task's cgroup gets charged with the memory. Once preserved, the file cannot grow or shrink, and all its pages are pinned to avoid migrations and swapping. The file can still be read from or written to. Use vmalloc to get the buffer to hold the folios, and preserve it using kho_preserve_vmalloc(). This doesn't have the size limit. Link: https://lkml.kernel.org/r/20251125165850.3389713-15-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: David Matlack <dmatlack@google.com> Cc: Aleksander Lobakin <aleksander.lobakin@intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: anish kumar <yesanishhere@gmail.com> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chanwoo Choi <cw00.choi@samsung.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Wagner <wagi@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guixin Liu <kanie@linux.alibaba.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lennart Poettering <lennart@poettering.net> Cc: Leon Romanovsky <leon@kernel.org> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Lukas Wunner <lukas@wunner.de> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matthew Maurer <mmaurer@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Myugnjoo Ham <myungjoo.ham@samsung.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Song Liu <song@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: William Tu <witu@nvidia.com> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Cc: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm/hugetlb: extract sysctl into hugetlb_sysctl.cHui Zhu-1/+1
Following the extraction of sysfs code, this patch moves the sysctl interface implementation into a dedicated file to further improve code organization and maintainability of the hugetlb subsystem. The following components are moved to mm/hugetlb_sysctl.c: - proc_hugetlb_doulongvec_minmax() - hugetlb_sysctl_handler_common() - hugetlb_sysctl_handler() - hugetlb_mempolicy_sysctl_handler() (CONFIG_NUMA) - hugetlb_overcommit_handler() - hugetlb_table[] sysctl table definition - hugetlb_sysctl_init() The hugetlb_internal.h header file is updated to declare the sysctl initialization function with proper #ifdef guards for configurations without CONFIG_SYSCTL support. The Makefile is updated to compile hugetlb_sysctl.o when CONFIG_HUGETLBFS is enabled. This refactoring reduces the size of hugetlb.c and logically separates the sysctl interface from core hugetlb management code. MAINTAINERS is updated to add new file hugetlb_sysctl.c. No functional changes are introduced; all code is moved as-is from hugetlb.c with consistent formatting. Link: https://lkml.kernel.org/r/5bbee7ab5be71d0bb1aebec38642d7e83526bb7a.1762398359.git.zhuhui@kylinos.cn Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm/hugetlb: extract sysfs into hugetlb_sysfs.cHui Zhu-1/+1
Patch series "mm/hugetlb: refactor sysfs/sysctl interfaces", v5. hugetlb.c has grown significantly and become difficult to maintain. This patch series extracts the sysfs and sysctl interface code into separate dedicated files to improve code organization. The refactoring includes: - Patch 1: Extract sysfs interface into mm/hugetlb_sysfs.c - Patch 2: Extract sysctl interface into mm/hugetlb_sysctl.c No functional changes are introduced in this series. The code is moved as-is, with only minor formatting adjustments for code style consistency. This should make future maintenance and enhancements to the hugetlb subsystem easier. Testing: The patch series has been compile-tested and maintains the same functionality as the original code. This patch (of 2): Currently, hugetlb.c contains both core management logic and sysfs interface implementations, making it difficult to maintain. This patch extracts the sysfs-related code into a dedicated file to improve code organization. The following components are moved to mm/hugetlb_sysfs.c: - sysfs attribute definitions and handlers - sysfs kobject management functions - NUMA per-node hstate attribute registration Several inline helper functions and macros are moved to mm/hugetlb_internal.h: - hstate_is_gigantic_no_runtime() - next_node_allowed() - get_valid_node_allowed() - hstate_next_node_to_alloc() - hstate_next_node_to_free() - for_each_node_mask_to_alloc/to_free macros To support code sharing, these functions are changed from static to exported symbols: - remove_hugetlb_folio() - add_hugetlb_folio() - init_new_hugetlb_folio() - prep_and_add_allocated_folios() - demote_pool_huge_page() - __nr_hugepages_store_common() The Makefile is updated to compile hugetlb_sysfs.o when CONFIG_HUGETLBFS is enabled. This maintains all existing functionality while improving maintainability by separating concerns. MAINTAINERS is updated to add new file hugetlb_sysfs.c. Link: https://lkml.kernel.org/r/cover.1762398359.git.zhuhui@kylinos.cn Link: https://lkml.kernel.org/r/656a03dff7e2bb20e24e841ede81fdca01d21410.1762398359.git.zhuhui@kylinos.cn Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: remove unused zpool layerJohannes Weiner-1/+0
With zswap using zsmalloc directly, there are no more in-tree users of this code. Remove it. With zpool gone, zsmalloc is now always a simple dependency and no longer something the user needs to configure. Hide CONFIG_ZSMALLOC from the user and have zswap and zram pull it in as needed. Link: https://lkml.kernel.org/r/20250829162212.208258-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.se> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02mm: remove mm/io-mapping.cLorenzo Stoakes-1/+0
This is dead code, which was used from commit b739f125e4eb ("i915: use io_mapping_map_user") but reverted a month later by commit 0e4fe0c9f2f9 ("Revert "i915: use io_mapping_map_user"") back in 2021. Since then nobody has used it, so remove it. [akpm@linux-foundation.org: update Documentation/core-api/mm-api.rst, per Vlastimil] Link: https://lkml.kernel.org/r/20250725142901.81502-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: perform VMA allocation, freeing, duplication in mmLorenzo Stoakes-1/+1
Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward. There is a fly in the ointment - nommu - mmap.c is not compiled if CONFIG_MMU not set, and neither is vma.c. To square the circle, let's add a new file - vma_init.c. This will be compiled for both CONFIG_MMU and nommu builds, and will also form part of the VMA userland testing. This allows us to de-duplicate code, while maintaining separation of concerns and the ability for us to userland test this logic. Update the VMA userland tests accordingly, additionally adding a detach_free_vma() helper function to correctly detach VMAs before freeing them in test code, as this change was triggering the assert for this. [akpm@linux-foundation.org: remove stray newline, per Liam] Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: establish mm/vma_exec.c for shared exec/mm VMA functionalityLorenzo Stoakes-1/+1
Patch series "move all VMA allocation, freeing and duplication logic to mm", v3. Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details. Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality. We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c acceses this via the mm internal header so it is not exposed elsewhere in the kernel. As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free(). In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight. As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place. We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests. This patch (of 4): There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. [akpm@linux-foundation.org: fix comment typo] Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: rename GENERIC_PTDUMP and PTDUMP_COREAnshuman Khandual-1/+1
Platforms subscribe into generic ptdump implementation via GENERIC_PTDUMP. But generic ptdump gets enabled via PTDUMP_CORE. These configs combination is confusing as they sound very similar and does not differentiate between platform's feature subscription and feature enablement for ptdump. Rename the configs as ARCH_HAS_PTDUMP and PTDUMP making it more clear and improve readability. Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm, swap: remove swap slot cacheKairui Song-1/+1
Slot cache is no longer needed now, removing it and all related code. - vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput: Before (KB/s) (stdev) After (KB/s) (stdev) Random (4K): 424907.60 (24410.78) 414745.92 (34554.78) Random (64K): 163308.82 (11635.72) 167314.50 (18434.99) Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16) The performance changes are below noise level. - Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time: Before (s) (stdev) After (s) (stdev) make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46) make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55) Similar to above, 64k mTHP case showed a slight improvement. Link: https://lkml.kernel.org/r/20250313165935.63303-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: move hugetlb CMA code in to its own fileFrank van der Linden-0/+3
hugetlb.c contained a number of CONFIG_CMA ifdefs, and the code inside them was large enough to merit being in its own file, so move it, cleaning up things a bit. Hide some direct variable access behind functions to accommodate the move. No functional change intended. Link: https://lkml.kernel.org/r/20250228182928.2645936-28-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: zbud: remove zbudYosry Ahmed-1/+0
The zbud compressed pages allocator is rarely used, most users use zsmalloc. zbud consumes much more memory (only stores 1 or 2 compressed pages per physical page). The only advantage of zbud is a marginal performance improvement that by no means justify the memory overhead. Historically, zsmalloc had significantly worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. In a kernel build test on tmpfs in a limited cgroup, zbud 2-3% less time than zsmalloc, but at the cost of using ~32% more memory (1.5G vs 1.13G). The tradeoff does not make sense for zbud in any practical scenario. The only alleged advantage of zbud is not having the dependency on CONFIG_MMU, but CONFIG_SWAP already depends on CONFIG_MMU anyway, and zbud is only used by zswap. Remove zbud after z3fold's removal, leaving zsmalloc as the one and only zpool allocator. Leave the removal of the zpool API (and its associated config options) to a followup cleanup after no more allocators show up. Deprecating zbud for a few cycles before removing it was initially proposed [2], like z3fold was marked as deprecated for 2 cycles [3]. However, Johannes rightfully pointed out that the 2 cycles is too short for most downstream consumers, and z3fold was deprecated first only as a courtesy anyway. [1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/Z5gdnSX5Lv-nfjQL@google.com/ [3]https://lore.kernel.org/lkml/20240904233343.933462-1-yosryahmed@google.com/ Link: https://lkml.kernel.org/r/20250129180633.3501650-3-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: z3fold: remove z3foldYosry Ahmed-1/+0
Patch series "mm: zswap: remove z3fold and zbud", v2. After 2 cycles of deprecating z3fold, remove it as well as zbud (rationale in specific patches). This patch (of 2): Z3fold has been marked as deprecated for 2 cycles and no one complained, as expected. As there are no known users, remove the code now. Link: https://lkml.kernel.org/r/20250129180633.3501650-1-yosry.ahmed@linux.dev Link: https://lkml.kernel.org/r/20250129180633.3501650-2-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)Qi Zheng-0/+1
Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage. The following are a memory usage snapshot of one process which actually happened on our server: VIRT: 55t RES: 590g VmPTE: 110g In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use. As a first step, this commit aims to synchronously free the empty PTE pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED). Once an empty PTE is detected, we first try to hold the pmd lock within the pte lock. If successful, we clear the pmd entry directly (fast path). Otherwise, we wait until the pte lock is released, then re-hold the pmd and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect whether the PTE page is empty and free it (slow path). For other cases such as madvise(MADV_FREE), consider scanning and freeing empty PTE pages asynchronously in the future. The following code snippet can show the effect of optimization: mmap 50G while (1) { for (; i < 1024 * 25; i++) { touch 2M memory madvise MADV_DONTNEED 2M } } As we can see, the memory usage of VmPTE is reduced: before after VIRT 50.0 GB 50.0 GB RES 3.1 MB 3.1 MB VmPTE 102640 KB 240 KB [zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl'] Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/ Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: move the page fragment allocator from page_alloc into its own fileYunsheng Lin-0/+1
Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03mm: introduce numa_emulationMike Rapoport (Microsoft)-0/+1
Move numa_emulation code from arch/x86 to mm/numa_emulation.c This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-20-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: introduce numa_memblksMike Rapoport (Microsoft)-0/+1
Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig options to let x86 select it in its Kconfig. This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-18-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: move kernel/numa.c to mm/Mike Rapoport (Microsoft)-0/+1
Patch series "mm: introduce numa_memblks", v4. Following the discussion about handling of CXL fixed memory windows on arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to the generic code so they will be available on arm64/riscv and maybe on loongarch sometime later. While it could be possible to use memblock to describe CXL memory windows, it currently lacks notion of unpopulated memory ranges and numa_memblks does implement this. Another reason to make numa_memblks generic is that both arch_numa (arm64 and riscv) and loongarch use trimmed copy of x86 code although there is no fundamental reason why the same code cannot be used on all these platforms. Having numa_memblks in mm/ will make it's interaction with ACPI and FDT more consistent and I believe will reduce maintenance burden. And with generic numa_memblks it is (almost) straightforward to enable NUMA emulation on arm64 and riscv. The first 9 commits in this series are cleanups that are not strictly related to numa_memblks. Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks and NUMA emulation to the generic code. Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22 does some aftermath cleanups. Commit 23 updates of_numa_init() to return error of no NUMA nodes were found in the device tree. Commit 24 switches arch_numa to numa_memblks. Commit 25 enables usage of phys_to_target_node() and memory_add_physaddr_to_nid() with numa_memblks. Commit 26 moves the description for numa=fake from x86 to admin-guide. [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/ This patch (of 26): The stub functions in kernel/numa.c belong to mm/ rather than to kernel/ Link: https://lkml.kernel.org/r/20240807064110.1003856-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20240807064110.1003856-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move internal core VMA manipulation functions to own fileLorenzo Stoakes-1/+1
This patch introduces vma.c and moves internal core VMA manipulation functions to this file from mmap.c. This allows us to isolate VMA functionality in a single place such that we can create userspace testing code that invokes this functionality in an environment where we can implement simple unit tests of core functionality. This patch ensures that core VMA functionality is explicitly marked as such by its presence in mm/vma.h. It also places the header includes required by vma.c in vma_internal.h, which is simply imported by vma.c. This makes the VMA functionality testable, as userland testing code can simply stub out functionality as required. Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01shmem_quota: build the object file conditionally to the config optionCarlos Maiolino-1/+2
Initially I added shmem-quota to obj-y, move it to the correct place and remove the unneeded full file #ifdef Link: https://lkml.kernel.org/r/20240717063737.910840-1-cem@kernel.org Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Suggested-by: Aristeu Rozanski <aris@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: put cgroup v1-specific code under a config optionRoman Gushchin-1/+2
Put legacy cgroup v1 memory controller code under a new CONFIG_MEMCG_V1 config option. The option is turned off by default. Nobody except those who are still using cgroup v1 should turn it on. If the option is not set, memory controller can still be mounted under cgroup v1, but none of memcg-specific control files are present. Please note, that not all cgroup v1's memory controller code is guarded yet (but most of it), it's a subject for some follow-up work. Thanks to Michal Hocko for providing a better Kconfig option description. [roman.gushchin@linux.dev: better config option description provided by Michal] Link: https://lkml.kernel.org/r/ZnxXNtvqllc9CDoo@google.com Link: https://lkml.kernel.org/r/20240625005906.106920-14-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: introduce memcontrol-v1.cRoman Gushchin-1/+2
Patch series "mm: memcg: separate legacy cgroup v1 code and put under config option", v2. Cgroups v2 have been around for a while and many users have fully adopted them, so they never use cgroups v1 features and functionality. Yet they have to "pay" for the cgroup v1 support anyway: 1) the kernel binary contains an unused cgroup v1 code, 2) some code paths have additional checks which are not needed, 3) some common structures like task_struct and mem_cgroup contain unused cgroup v1-specific members. Cgroup v1's memory controller has a number of features that are not supported by cgroup v2 and their implementation is pretty much self contained. Most notably, these features are: soft limit reclaim, oom handling in userspace, complicated event notification system, charge migration. Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's intervened with generic and cgroup v2-specific code. It's a burden on developers and maintainers. This patchset aims to solve these problems by: 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the mm/memcontrol-v1.h header, 3) introducing the CONFIG_MEMCG_V1 config option, turned off by default, 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set. If CONFIG_MEMCG_V1 is not set, cgroup v1 memory controller is still available for mounting, however no memory-specific control knobs are present. This patch (of 14): This patch introduces the mm/memcontrol-v1.c source file which will be used for all legacy (cgroup v1) memory cgroup code. It also introduces mm/memcontrol-v1.h to keep declarations shared between mm/memcontrol.c and mm/memcontrol-v1.c. As of now, let's compile it if CONFIG_MEMCG is set, similar to mm/memcontrol.c. Later on it can be switched to use a separate config option, so that the legacy code won't be compiled if not required. Link: https://lkml.kernel.org/r/20240625005906.106920-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20240625005906.106920-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-23mseal: add mseal syscallJeff Xu-0/+4
The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. Following input during RFC are incooperated into this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI. [jeffxu@chromium.org: add branch prediction hint, per Pedro] Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-14mm: introduce execmem_alloc() and execmem_free()Mike Rapoport (IBM)-0/+1
module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-04-25mm/kmemleak: disable KASAN instrumentation in kmemleakWaiman Long-0/+1
Kmemleak ia a memory leak checker. KASAN is also a memory checker but it focuses more on finding out-of-bounds and use-after-free bugs. Since kmemleak is inherently slow especially on systems with large number of CPUs, adding KASAN instrumentation will make it slower even more. As kmemleak is not for production use, the utility of enabling KASAN there is questionable. This patch disables KASAN instrumentation for configurations that enable both of them to slightly reduce performance overhead. Link: https://lkml.kernel.org/r/20240307190548.963626-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-31kbuild: make -Woverride-init warnings more consistentArnd Bergmann-2/+1
The -Woverride-init warn about code that may be intentional or not, but the inintentional ones tend to be real bugs, so there is a bit of disagreement on whether this warning option should be enabled by default and we have multiple settings in scripts/Makefile.extrawarn as well as individual subsystems. Older versions of clang only supported -Wno-initializer-overrides with the same meaning as gcc's -Woverride-init, though all supported versions now work with both. Because of this difference, an earlier cleanup of mine accidentally turned the clang warning off for W=1 builds and only left it on for W=2, while it's still enabled for gcc with W=1. There is also one driver that only turns the warning off for newer versions of gcc but not other compilers, and some but not all the Makefiles still use a cc-disable-warning conditional that is no longer needed with supported compilers here. Address all of the above by removing the special cases for clang and always turning the warning off unconditionally where it got in the way, using the syntax that is supported by both compilers. Fixes: 2cd3271b7a31 ("kbuild: avoid duplicate warning options") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-12-05mm/slab: remove CONFIG_SLAB from all Kconfig and MakefileVlastimil Babka-5/+1
Remove CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_SLAB_DEPRECATED and everything in Kconfig files and mm/Makefile that depends on those. Since SLUB is the only remaining allocator, remove the allocator choice, make CONFIG_SLUB a "def_bool y" for now and remove all explicit dependencies on SLUB or SLAB as it's now always enabled. Make every option's verbose name and description refer to "the slab allocator" without refering to the specific implementation. Do not rename the CONFIG_ option names yet. Everything under #ifdef CONFIG_SLAB, and mm/slab.c is now dead code, all code under #ifdef CONFIG_SLUB is now always compiled. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-10-04mm: vmscan: move shrinker-related code into a separate fileQi Zheng-2/+2
The mm/vmscan.c file is too large, so separate the shrinker-related code from it into a separate file. No functional changes. Link: https://lkml.kernel.org/r/20230911092517.64141-3-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian König <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Price <steven.price@arm.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Sean Paul <sean@poorly.run> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-21mm: kill frontswapJohannes Weiner-1/+0
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection. [hannes@cmpxchg.org: remove obsolete comment, per Yosry] Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org [fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load] Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-09shmem: prepare shmem quota infrastructureCarlos Maiolino-1/+1
Add new shmem quota format, its quota_format_ops together with dquot_operations Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230725144510.253763-5-cem@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-09mm: page_alloc: split out DEBUG_PAGEALLOCKefeng Wang-0/+1
Move DEBUG_PAGEALLOC related functions into a single file to reduce a bit of page_alloc.c. Link: https://lkml.kernel.org/r/20230516063821.121844-9-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm: page_alloc: split out FAIL_PAGE_ALLOCKefeng Wang-0/+1
... to a single file to reduce a bit of page_alloc.c. Link: https://lkml.kernel.org/r/20230516063821.121844-8-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm: page_alloc: collect mem statistic into show_mem.cKefeng Wang-1/+1
Let's move show_mem.c from lib to mm, as it belongs memory subsystem, also split some memory statistic related functions from page_alloc.c to show_mem.c, and we cleanup some unneeded include. There is no functional change. Link: https://lkml.kernel.org/r/20230516063821.121844-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-05dmapool: add alloc/free performance testKeith Busch-0/+1
Patch series "dmapool enhancements", v4. Time spent in dma_pool alloc/free increases linearly with the number of pages backing the pool. We can reduce this to constant time with minor changes to how free pages are tracked. This patch (of 12): Provide a module that allocates and frees many blocks of various sizes and report how long it takes. This is intended to provide a consistent way to measure how changes to the dma_pool_alloc/free routines affect timing. Link: https://lkml.kernel.org/r/20230126215125.4069751-1-kbusch@meta.com Link: https://lkml.kernel.org/r/20230126215125.4069751-2-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-29mm/slob: remove CONFIG_SLOBVlastimil Babka-2/+0
Remove SLOB from Kconfig and Makefile. Everything under #ifdef CONFIG_SLOB, and mm/slob.c is now dead code. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
2022-10-03mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbolJohannes Weiner-1/+3
Since 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config option anymore, it just means CONFIG_MEMCG && CONFIG_SWAP. Update the sites accordingly and drop the symbol. [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM, which hasn't been a user-visible symbol for over half a decade. ] Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03kmsan: add KMSAN runtime coreAlexander Potapenko-0/+1
For each memory location KernelMemorySanitizer maintains two types of metadata: 1. The so-called shadow of that location - а byte:byte mapping describing whether or not individual bits of memory are initialized (shadow is 0) or not (shadow is 1). 2. The origins of that location - а 4-byte:4-byte mapping containing 4-byte IDs of the stack traces where uninitialized values were created. Each struct page now contains pointers to two struct pages holding KMSAN metadata (shadow and origins) for the original struct page. Utility routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata creation, addressing, copying and checking. mm/kmsan/report.c performs error reporting in the cases an uninitialized value is used in a way that leads to undefined behavior. KMSAN compiler instrumentation is responsible for tracking the metadata along with the kernel memory. mm/kmsan/instrumentation.c provides the implementation for instrumentation hooks that are called from files compiled with -fsanitize=kernel-memory. To aid parameter passing (also done at instrumentation level), each task_struct now contains a struct kmsan_task_state used to track the metadata of function parameters and return values for that task. Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The KMSAN_SANITIZE:=n Makefile directive can be used to completely disable KMSAN instrumentation for certain files. Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly created stack memory initialized. Users can also use functions from include/linux/kmsan-checks.h to mark certain memory regions as uninitialized or initialized (this is called "poisoning" and "unpoisoning") or check that a particular region is initialized. Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: remove vmacacheLiam R. Howlett-1/+1
By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm/demotion: add support for explicit memory tiersAneesh Kumar K.V-0/+1
Patch series "mm/demotion: Memory tiers and demotion", v15. The current kernel has the basic memory tiering support: Inactive pages on a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to make room for new allocations on the higher tier NUMA node. Frequently accessed pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA node to improve the performance. In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes. This current memory tier kernel implementation needs to be improved for several important use cases: * The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) and that should be put into a higher tier. * The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier. * Also because the current tier hierarchy always puts CPU nodes into the top tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets changed, even though no memory node is added or removed. This can make the tier hierarchy unstable and make it difficult to support tier-based memory accounting. * A higher tier node can only be demoted to nodes with shortest distance on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), and has resulted in the feature request for an interface to override the system-wide, per-node demotion order from the userspace. This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that. This patch series make the creation of memory tiers explicit under the control of device driver. Memory Tier Initialization ========================== Linux kernel presents memory devices as NUMA nodes and each memory device is of a specific type. The memory type of a device is represented by its abstract distance. A memory tier corresponds to a range of abstract distance. This allows for classifying memory devices with a specific performance range into a memory tier. By default, all memory nodes are assigned to the default tier with abstract distance 512. A device driver can move its memory nodes from the default tier. For example, PMEM can move its memory nodes below the default tier, whereas GPU can move its memory nodes above the default tier. The kernel initialization code makes the decision on which exact tier a memory node should be assigned to based on the requests from the device drivers as well as the memory device hardware information provided by the firmware. Hot-adding/removing CPUs doesn't affect memory tier hierarchy. This patch (of 10): In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and builds the tier hierarchy by establishing the per-node demotion targets based on the distances between nodes. This current memory tier kernel implementation needs to be improved for several important use cases, The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that should be put into a higher tier. The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier. With current kernel higher tier node can only be demoted to nodes with shortest distance on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that. This patch series address the above by defining memory tiers explicitly. Linux kernel presents memory devices as NUMA nodes and each memory device is of a specific type. The memory type of a device is represented by its abstract distance. A memory tier corresponds to a range of abstract distance. This allows for classifying memory devices with a specific performance range into a memory tier. This patch configures the range/chunk size to be 128. The default DRAM abstract distance is 512. We can have 4 memory tiers below the default DRAM with abstract distance range 0 - 127, 127 - 255, 256- 383, 384 - 511. Faster memory devices can be placed in these faster(higher) memory tiers. Slower memory devices like persistent memory will have abstract distance higher than the default DRAM level. [akpm@linux-foundation.org: fix comment, per Aneesh] Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>