2025-09-21  sunrpc: Change ret code of xdr_stream_decode_opaque_fixed  (Sergey Bashirov; 2 files, -3/+3)
Since the opaque is fixed in size, the caller already knows, on success, how many bytes were decoded. Thus, xdr_stream_decode_opaque_fixed() does not need to return that value. And xdr_stream_decode_u32() and _u64() both return zero on success. This patch simplifies the caller's error checking and avoids potential integer promotion issues.

Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21  nfsd: discard nfsd_file_get_local()  (NeilBrown; 4 files, -24/+0)
This interface was deprecated by commit e6f7e1487ab5 ("nfs_localio: simplify interface to nfsd for getting nfsd_file") and is now unused. So let's remove it. Signed-off-by: NeilBrown <neil@brown.name> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21  sunrpc: delay pc_release callback until after the reply is sent  (Jeff Layton; 1 file, -3/+14)
The server-side sunrpc code currently calls pc_release before sending the reply. Change svc_process and svc_process_bc to call pc_release after sending the reply instead. Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21  NFSD: Move the fh_getattr() helper  (Chuck Lever; 3 files, -13/+24)
Clean up: The fh_getattr() function is part of NFSD's file handle API, so relocate it. I've made it an un-inlined function so that trace points and new functionality can easily be introduced. That increases the size of nfsd.ko by about a page on my x86_64 system (out of 26MB; compiled with -O2). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21  NFSD: Relocate the fh_want_write() and fh_drop_write() helpers  (Chuck Lever; 2 files, -20/+37)
Clean up: these helpers are part of the NFSD file handle API. Relocate them to fs/nfsd/nfsfh.h. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-21  sunrpc: fix null pointer dereference on zero-length checksum  (Lei Lu; 1 file, -1/+1)
In xdr_stream_decode_opaque_auth(), a zero-length checksum.len causes checksum.data to be set to NULL. This triggers a NULL pointer dereference when checksum.data is accessed in gss_krb5_verify_mic_v2(). This patch ensures that the value of checksum.len is not less than XDR_UNIT.

Fixes: 0653028e8f1c ("SUNRPC: Convert gss_verify_header() to use xdr_stream")
Cc: stable@kernel.org
Signed-off-by: Lei Lu <llfamsec@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-22  Merge tag 'amd-drm-next-6.18-2025-09-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next  (Dave Airlie; 205 files, -5194/+2093)
amd-drm-next-6.18-2025-09-19:

amdgpu:
- Fence drv clean up fix
- DPC fixes
- Misc display fixes
- Support the MMIO remap page as a ttm pool
- JPEG parser updates
- UserQ updates
- VCN ctx handling fixes
- Documentation updates
- Misc cleanups
- SMU 13.0.x updates
- SI DPM updates
- GC 11.x cleaner shader updates
- DMCUB updates
- DML fixes
- Improve fallback handling for pixel encoding
- VCN reset improvements
- DCE6 DC updates
- DSC fixes
- Use devm for i2c buses
- GPUVM locking updates
- GPUVM documentation improvements
- Drop non-DC DCE11 code
- S0ix fixes
- Backlight fix
- SR-IOV fixes

amdkfd:
- SVM updates

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com
2025-09-21  docs: remove cdomain.py  (Mauro Carvalho Chehab; 1 file, -247/+0)
This is not used anymore; it contained logic needed to support pre-3.x Sphinx versions, as shown in: afde706afde2 ("Make the docs build "work" with Sphinx 3.x"). Remove it.

Fixes: b26717852db7 ("docs: conf.py: drop backward support for old Sphinx versions")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <3b86d236c64af17924e4cfedbbfb8bc60059802f.1758381727.git.mchehab+huawei@kernel.org>
2025-09-21  Documentation/process: submitting-patches: fix typo in "were do"  (Yash Suthar; 1 file, -1/+1)
Fixes a typo in submitting-patches.rst: "were do" -> "where do" Signed-off-by: Yash Suthar <yashsuthar983@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Message-ID: <20250920190856.7394-1-yashsuthar983@gmail.com>
2025-09-21  Merge tag 'v6.18-rockchip-clk1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into clk-rockchip  (Stephen Boyd; 2 files, -1/+2)
Pull Rockchip clk driver updates from Heiko Stuebner: Export the dsi-24MHz clock on the RK3368, which has recently seen some attention aimed at enabling DSI support there.

* tag 'v6.18-rockchip-clk1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
  clk: rockchip: rk3368: use clock ids for SCLK_MIPIDSI_24M
  dt-bindings: clock: rk3368: Add SCLK_MIPIDSI_24M
2025-09-21  docs: dev-tools/lkmm: Fix typo of missing file extension  (Akira Yokosawa; 1 file, -1/+1)
Commit 1e9ddbb2cd34 ("docs: Pull LKMM documentation into dev-tools book") failed to add a file extension in lkmm/docs/herd-representation.rst when referencing its plain-text counterpart. Fix it.

Fixes: 1e9ddbb2cd34 ("docs: Pull LKMM documentation into dev-tools book")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509192138.fx3H6NzG-lkp@intel.com/
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <c3b9de17-7cd8-4968-9872-cbe2607a7143@gmail.com>
2025-09-22  Merge tag 'drm-xe-next-2025-09-19' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next  (Dave Airlie; 133 files, -2330/+6859)
UAPI Changes:
- Drop L3 bank mask reporting from the media GT on Xe3 and later. Only do that for the primary GT. No userspace needs or uses it for media and some platforms may report bogus values.
- Add SLPC power_profile sysfs interface with support for base and power_saving modes (Vinay Belgaumkar, Rodrigo Vivi)
- Add configfs attributes to add post/mid context-switch commands (Lucas De Marchi)

Cross-subsystem Changes:
- Fix hmm_pfn_to_map_order() usage in gpusvm and refactor APIs to align with pieces previously handled by xe_hmm (Matthew Auld)

Core Changes:
- Add MEI driver for Late Binding Firmware Update/Upload (Alexander Usyskin)

Driver Changes:
- Fix GuC CT teardown wrt TLB invalidation (Satyanarayana)
- Fix CCS save/restore on VF (Satyanarayana)
- Increase default GuC crash buffer size (Zhanjun)
- Allow clearing GT stats in debugfs to aid debugging (Matthew Brost)
- Add more SVM GT stats to debugfs (Matthew Brost)
- Fix error handling in VMA attr query (Himal)
- Move sa_info in debugfs to be per tile (Michal Wajdeczko)
- Limit number of retries upon receiving NO_RESPONSE_RETRY from GuC to avoid endless loop (Michal Wajdeczko)
- Fix configfs handling for survivability_mode undoing user choice when unbinding the module (Michal Wajdeczko)
- Refactor configfs attribute visibility to future-proof it and stop exposing survivability_mode if not applicable (Michal Wajdeczko)
- Constify some functions (Harish Chegondi, Michal Wajdeczko)
- Add/extend more HW workarounds for Xe2 and Xe3 (Harish Chegondi, Tangudu Tilak Tirumalesh)
- Replace xe_hmm with gpusvm (Matthew Auld)
- Improve fake pci and WA kunit handling for testing new platforms (Michal Wajdeczko)
- Reduce unnecessary PTE writes when migrating (Sanjay Yadav)
- Cleanup GuC interface definitions and log message (John Harrison)
- Small improvements around VF CCS (Michal Wajdeczko)
- Enable bus mastering for the I2C controller (Raag Jadav)
- Prefer devm_mutex over hand rolling it (Christophe JAILLET)
- Drop sysfs and debugfs attributes not available for VF (Michal Wajdeczko)
- GuC CT devm actions improvements (Michal Wajdeczko)
- Recommend new GuC versions for PTL and BMG (Julia Filipchuk)
- Improve driver handling for exhaustive eviction using new xe_validation wrapper around drm_exec (Thomas Hellström)
- Add and use printk wrappers for tile and device (Michal Wajdeczko)
- Better document workaround handling in Xe (Lucas De Marchi)
- Improvements on ARRAY_SIZE and ERR_CAST usage (Lucas De Marchi, Fushuai Wang)
- Align CSS firmware headers with the GuC APIs (John Harrison)
- Test GuC to GuC (G2G) communication to aid debug in pre-production firmware (John Harrison)
- Bail out driver probing if GuC fails to load (John Harrison)
- Allow error injection in xe_pxp_exec_queue_add() (Daniele Ceraolo Spurio)
- Minor refactors in xe_svm (Shuicheng Lin)
- Fix madvise ioctl error handling (Shuicheng Lin)
- Use attribute groups to simplify sysfs registration (Michal Wajdeczko)
- Add Late Binding Firmware implementation in Xe to work together with the MEI component (Badal Nilawar, Daniele Ceraolo Spurio, Rodrigo Vivi)
- Fix build with CONFIG_MODULES=n (Lucas De Marchi)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/c2et6dnkst2apsgt46dklej4nprqdukjosb55grpaknf3pvcxy@t7gtn3hqtp6n
2025-09-21  Linux 6.17-rc7 (tag: v6.17-rc7)  (Linus Torvalds; 1 file, -1/+1)
2025-09-21  MAINTAINERS, mailmap: Update address for Peter Hilber  (Peter Hilber; 2 files, -1/+2)
Going forward, I will use another Qualcomm address, peter.hilber@oss.qualcomm.com. Map past contributions on behalf of Qualcomm to the new address as well. Signed-off-by: Peter Hilber <peter.hilber@oss.qualcomm.com> Message-Id: <20250826130015.6218-1-peter.hilber@oss.qualcomm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-09-21  virtio_config: clarify output parameters  (Alyssa Ross; 1 file, -5/+6)
This was ambiguous enough for a broken patch (206cc44588f7 ("virtio: reject shm region if length is zero")) to make it into the kernel, so make it clearer. Link: https://lore.kernel.org/r/20250816071600-mutt-send-email-mst@kernel.org/ Signed-off-by: Alyssa Ross <hi@alyssa.is> Message-Id: <20250829150944.233505-1-hi@alyssa.is> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-09-21  uapi: vduse: fix typo in comment  (Ashwini Sahu; 1 file, -1/+1)
Fix a spelling mistake in vduse.h: "regsion" → "region" in the documentation for struct vduse_iova_info. No functional change. Signed-off-by: Ashwini Sahu <ashwini@wisig.com> Message-Id: <20250908095645.610336-1-ashwini@wisig.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-09-21  vhost: Take a reference on the task in struct vhost_task  (Sebastian Andrzej Siewior; 1 file, -1/+2)
vhost_task_create() creates a task and keeps a reference to its task_struct. That task may exit early via a signal, and its task_struct will then be released. A pending vhost_task_wake() will then attempt to wake the task and access a task_struct which is no longer there. Acquire a reference on the task_struct while creating the thread and release the reference when the struct vhost_task itself is removed. If the task exits early due to a signal, vhost_task_wake() will still access a valid task_struct. The wake is safe and will be skipped in this case.

Fixes: f9010dbdce911 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Reported-by: Sean Christopherson <seanjc@google.com>
Closes: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Message-Id: <20250918181144.Ygo8BZ-R@linutronix.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Sean Christopherson <seanjc@google.com>
2025-09-21  mm: drop all references of writable and SCAN_PAGE_RO  (Dev Jain; 2 files, -24/+9)
Now that all actionable outcomes from checking pte_write() are gone, drop the related references.

Link: https://lkml.kernel.org/r/20250908075028.38431-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm: enable khugepaged anonymous collapse on non-writable regions  (Dev Jain; 1 file, -7/+2)
Patch series "Expand scope of khugepaged anonymous collapse", v2.

Currently khugepaged does not collapse an anonymous region which does not have a single writable pte. This is wasteful since a region mapped with non-writable ptes, for example, non-writable VMAs mapped by the application, won't benefit from THP collapse. An additional consequence of this constraint is that MADV_COLLAPSE does not perform a collapse on a non-writable VMA, and this restriction is nowhere to be found on the manpage - the restriction itself sounds wrong to me since the user knows the protection of the memory it has mapped, so collapsing read-only memory via madvise() should be a choice of the user which shouldn't be overridden by the kernel. Therefore, remove this constraint.

On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an average of 5% improvement is seen on some mmtests benchmarks, particularly hackbench, with a maximum improvement of 12%. In the following table, (I) denotes statistically significant improvement, (R) denotes statistically significant regression.

+-------------------------+--------------------------------+-------------+
| mmtests/hackbench       | process-pipes-1 (seconds)      |      -0.06% |
|                         | process-pipes-4 (seconds)      |      -0.27% |
|                         | process-pipes-7 (seconds)      | (I) -12.13% |
|                         | process-pipes-12 (seconds)     |  (I) -5.32% |
|                         | process-pipes-21 (seconds)     |  (I) -2.87% |
|                         | process-pipes-30 (seconds)     |  (I) -3.39% |
|                         | process-pipes-48 (seconds)     |  (I) -5.65% |
|                         | process-pipes-79 (seconds)     |  (I) -6.74% |
|                         | process-pipes-110 (seconds)    |  (I) -6.26% |
|                         | process-pipes-141 (seconds)    |  (I) -4.99% |
|                         | process-pipes-172 (seconds)    |  (I) -4.45% |
|                         | process-pipes-203 (seconds)    |  (I) -3.65% |
|                         | process-pipes-234 (seconds)    |  (I) -3.45% |
|                         | process-pipes-256 (seconds)    |  (I) -3.47% |
|                         | process-sockets-1 (seconds)    |       2.13% |
|                         | process-sockets-4 (seconds)    |       1.02% |
|                         | process-sockets-7 (seconds)    |      -0.26% |
|                         | process-sockets-12 (seconds)   |      -1.24% |
|                         | process-sockets-21 (seconds)   |       0.01% |
|                         | process-sockets-30 (seconds)   |      -0.15% |
|                         | process-sockets-48 (seconds)   |       0.15% |
|                         | process-sockets-79 (seconds)   |       1.45% |
|                         | process-sockets-110 (seconds)  |      -1.64% |
|                         | process-sockets-141 (seconds)  |  (I) -4.27% |
|                         | process-sockets-172 (seconds)  |       0.30% |
|                         | process-sockets-203 (seconds)  |      -1.71% |
|                         | process-sockets-234 (seconds)  |      -1.94% |
|                         | process-sockets-256 (seconds)  |      -0.71% |
|                         | thread-pipes-1 (seconds)       |       0.66% |
|                         | thread-pipes-4 (seconds)       |       1.66% |
|                         | thread-pipes-7 (seconds)       |      -0.17% |
|                         | thread-pipes-12 (seconds)      |  (I) -4.12% |
|                         | thread-pipes-21 (seconds)      |  (I) -2.13% |
|                         | thread-pipes-30 (seconds)      |  (I) -3.78% |
|                         | thread-pipes-48 (seconds)      |  (I) -5.77% |
|                         | thread-pipes-79 (seconds)      |  (I) -5.31% |
|                         | thread-pipes-110 (seconds)     |  (I) -6.12% |
|                         | thread-pipes-141 (seconds)     |  (I) -4.00% |
|                         | thread-pipes-172 (seconds)     |  (I) -3.01% |
|                         | thread-pipes-203 (seconds)     |  (I) -2.62% |
|                         | thread-pipes-234 (seconds)     |  (I) -2.00% |
|                         | thread-pipes-256 (seconds)     |  (I) -2.30% |
|                         | thread-sockets-1 (seconds)     |   (R) 2.39% |
+-------------------------+--------------------------------+-------------+
+-------------------------+--------------------------------+-------------+
| mmtests/sysbench-mutex  | sysbenchmutex-1 (usec)         |      -0.02% |
|                         | sysbenchmutex-4 (usec)         |      -0.02% |
|                         | sysbenchmutex-7 (usec)         |       0.00% |
|                         | sysbenchmutex-12 (usec)        |       0.12% |
|                         | sysbenchmutex-21 (usec)        |      -0.40% |
|                         | sysbenchmutex-30 (usec)        |       0.08% |
|                         | sysbenchmutex-48 (usec)        |       2.59% |
|                         | sysbenchmutex-79 (usec)        |      -0.80% |
|                         | sysbenchmutex-110 (usec)       |      -3.87% |
|                         | sysbenchmutex-128 (usec)       |  (I) -4.46% |
+-------------------------+--------------------------------+-------------+

This patch (of 2):

Currently khugepaged does not collapse an anonymous region which does not have a single writable pte. This is wasteful since a region mapped with non-writable ptes, for example, non-writable VMAs mapped by the application, won't benefit from THP collapse. An additional consequence of this constraint is that MADV_COLLAPSE does not perform a collapse on a non-writable VMA, and this restriction is nowhere to be found on the manpage - the restriction itself sounds wrong to me since the user knows the protection of the memory it has mapped, so collapsing read-only memory via madvise() should be a choice of the user which shouldn't be overridden by the kernel. Therefore, remove this restriction by not honouring SCAN_PAGE_RO.

Link: https://lkml.kernel.org/r/20250908075028.38431-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250908075028.38431-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/stat: expose negative idle time  (SeongJae Park; 1 file, -5/+5)
DAMON_STAT calculates the idle time of a region using the region's age if the region's nr_accesses is zero. If the nr_accesses value is non-zero (positive), the idle time of the region becomes zero. This means the users cannot know how warm and hot data is distributed, using DAMON_STAT's memory_idle_ms_percentiles output. The other stat, namely estimated_memory_bandwidth, can help understanding how the overall access temperature of the system is, but it is still very rough information. On production systems, actually, a significant portion of the system memory is observed with zero idle time, and we cannot break it down based on its internal hotness distribution. Define the idle time of the region using its age, similar to those having zero nr_accesses, but multiples '-1' to distinguish it. And expose that using the same parameter interface, memory_idle_ms_percentiles. Link: https://lkml.kernel.org/r/20250916183127.65708-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/stat: expose the current tuned aggregation interval  (SeongJae Park; 1 file, -0/+6)
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle ages".

DAMON_STAT intentionally provides limited information, for easy consumption. From production fleet level usage, the below limitations were found, though.

The aggregation interval of DAMON_STAT represents the granularity of memory_idle_ms_percentiles. But the interval is auto-tuned and not exposed to users, so users cannot know the granularity.

All memory regions of non-zero (positive) nr_accesses are treated as having zero idle time. A significant portion of production systems have such zero idle time. Hence a breakdown of warm and hot data is nearly impossible.

Make the following changes to overcome the limitations. Expose the auto-tuned aggregation interval with a new parameter named aggr_interval_us. Expose the age of non-zero nr_accesses regions (how long the region retained a >0 access frequency) as a negative idle time.

This patch (of 2):

DAMON_STAT calculates the idle time for a region as the region's age multiplied by the aggregation interval. That is, the aggregation interval is the granularity of the idle time. Since the aggregation interval is auto-tuned and not exposed to users, however, users cannot easily know at what granularity the stat is made. Expose the tuned aggregation interval in microseconds via a new parameter, aggr_interval_us.

Link: https://lkml.kernel.org/r/20250916183127.65708-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916183127.65708-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  samples/damon/mtier: use damon_initialized()  (SeongJae Park; 1 file, -4/+7)
damon_sample_mtier assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  samples/damon/prcl: use damon_initialized()  (SeongJae Park; 1 file, -4/+7)
damon_sample_prcl assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  samples/damon/wsse: use damon_initialized()  (SeongJae Park; 1 file, -6/+9)
damon_sample_wsse assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/lru_sort: use damon_initialized()  (SeongJae Park; 1 file, -2/+7)
DAMON_LRU_SORT assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/reclaim: use damon_initialized()  (SeongJae Park; 1 file, -2/+7)
DAMON_RECLAIM assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/stat: use damon_initialized()  (SeongJae Park; 1 file, -4/+6)
DAMON_STAT assumes DAMON is ready to use at module_init time, and uses its own hack to check whether that is the case. Replace the hack with damon_initialized(), a more reliable and easier-to-maintain way to check whether DAMON is ready to be used.

Link: https://lkml.kernel.org/r/20250916033511.116366-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/core: implement damon_initialized() function  (SeongJae Park; 2 files, -0/+11)
Patch series "mm/damon: define and use DAMON initialization check function".

DAMON is initialized at subsystem initialization time, by damon_init(). If DAMON API functions are called before that initialization, the system could crash. Such issues have actually happened and were fixed [1] in the past. For the fix, DAMON API callers were updated to check whether DAMON is initialized, using their own hacks. The hacks are unnecessarily duplicated in every DAMON API caller and would therefore be difficult to reliably maintain in the long term.

Make it reliable and easy to maintain. For this, implement a new DAMON core layer API function that returns whether DAMON is successfully initialized. If it returns true, DAMON API functions are safe to use. After introducing the new API, update DAMON API callers to use the new function instead of their own hacks.

This patch (of 7):

If DAMON is used before it is successfully initialized, the caller could crash. The DAMON core layer does not provide a reliable way to see whether it is successfully initialized and therefore ready to be used, though. As a result, DAMON API callers implement their own hacks. The hacks simply assume DAMON should be ready at module init time. That is not reliable, since DAMON initialization can indeed fail if KMEM_CACHE() fails, and the duplicates are difficult to maintain. Implement a core layer API function for better reliability and maintainability, to replace the hacks with followup commits.

Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org
Link: https://lore.kernel.org/20250909022238.2989-1-sj@kernel.org [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  MAINTAINERS: rename DAMON section  (SeongJae Park; 1 file, -1/+1)
The DAMON section name is 'DATA ACCESS MONITOR', which implies it is only for data access monitoring. But DAMON has now evolved to cover not only access monitoring but also access-aware system operations (DAMOS). Rename the section to simply DAMON. It might make it harder to understand what it does at a glance, but at least it does not spread more confusion. Readers can further refer to the documentation to better understand what DAMON really does.

Link: https://lkml.kernel.org/r/20250916032339.115817-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  Docs/admin-guide/mm/damon/start: add --target_pid to DAMOS example command  (SeongJae Park; 1 file, -1/+1)
The example command doesn't work [1] with the latest DAMON user-space tool, since the --damos_action option was updated to receive multiple arguments, and hence the tool cannot tell whether the final argument is the deducible monitoring target or an argument for --damos_action. Add the --target_pid option to let damo understand it is the target pid.

Link: https://lkml.kernel.org/r/20250916032339.115817-5-sj@kernel.org
Link: https://github.com/damonitor/damo/pull/32 [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  Docs/mm/damon/maintainer-profile: update community meetup for reservation requirements  (SeongJae Park; 1 file, -11/+6)
The DAMON community meetup used to have two different kinds of meetups: ones requiring reservation and ones not requiring it. The kind not requiring reservation is now gone, but the documentation in the maintainer-profile was not updated. Update it.

Link: https://lkml.kernel.org/r/20250916032339.115817-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/core: set effective quota on first charge window  (SeongJae Park; 1 file, -1/+3)
The effective quota of a scheme is initialized to zero, which means there is no quota. It is later set based on the user-specified time/quota/quota goals, but that is done only from the second charge window onward. As a result, a scheme with a user-specified quota can work as if it had no quota (unexpectedly fast) for the first charge window. In practical and common use cases the quota interval is not too long, and the scheme's target access pattern is restrictive, hence the issue should be modest. That said, it is clearly unintended misbehavior. Fix the problem by setting esz in the first charge window.

Link: https://lkml.kernel.org/r/20250916032339.115817-3-sj@kernel.org
Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") # 5.16.x
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21  mm/damon/core: reset age if nr_accesses changes between non-zero and zero  (SeongJae Park; 1 file, -0/+2)
Patch series "mm/damon: misc fixups and improvements for 6.18", v2.

Misc fixes and improvements for DAMON that are not critical and therefore aim to be merged into Linux 6.18-rc1. The first patch improves DAMON's age counting for nr_accesses changes to or from zero. The second patch fixes an initial DAMOS apply interval delay issue that is not realistic but could still happen on an odd setup. The third and fourth patches update the DAMON community meetup description and the DAMON user-space tool example command for DAMOS usage, respectively. Finally, the fifth patch updates the MAINTAINERS section name for DAMON to just DAMON.

This patch (of 5):

DAMON resets the age of a region if its nr_accesses value has significantly changed. Specifically, the threshold is calculated as 20% of the largest nr_accesses of the current snapshot. This means that regions changing nr_accesses from zero to a small non-zero value, or from a small non-zero value to zero, keep their age. Since many users treat zero-nr_accesses regions as special, this can be confusing. Kernel code, including DAMOS' region priority calculation and DAMON_STAT's idle time calculation, also treats zero-nr_accesses regions as special. Avoid the confusion by resetting the age when nr_accesses changes between zero and a non-zero value.

Link: https://lkml.kernel.org/r/20250916032339.115817-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916032339.115817-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21alloc_tag: mark inaccurate allocation counters in /proc/allocinfo outputSuren Baghdasaryan5-2/+34
While rare, memory allocation profiling can contain inaccurate counters if slab object extension vector allocation fails. That allocation might succeed later, but prior to that, slab allocations that would have used that object extension vector will not be accounted for. To indicate incorrect counters, an "accurate:no" marker is appended to the call site line in the /proc/allocinfo output. Bump up the /proc/allocinfo version to reflect the change in the file format and update the documentation. Example output with invalid counters:
allocinfo - version: 2.0
        0        0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes
        0        0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add
        0        0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no
        0        0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set
        0        0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc
        0        0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale
        0        0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs
    49152       48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no
    32768        1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create
        0        0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device
[surenb@google.com: document new "accurate:no" marker] Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output") [akpm@linux-foundation.org: simplification per Usama, reflow text] [akpm@linux-foundation.org: add newline to prevent docs warning, per Randy] Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
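A minimal userspace sketch of how such a marker could be appended to an allocinfo-style line. The function name and format string here are assumptions for illustration, not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: append " accurate:no" to a /proc/allocinfo-style
 * line when the counters for a call site may have missed allocations
 * (e.g. because the object extension vector allocation once failed). */
static int format_allocinfo_line(char *buf, size_t len,
				 unsigned long bytes, unsigned long calls,
				 const char *site, bool accurate)
{
	return snprintf(buf, len, "%10lu %8lu %s%s",
			bytes, calls, site, accurate ? "" : " accurate:no");
}
```

Readers of the file can then treat any line carrying the marker as a lower bound rather than an exact count.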
2025-09-21mm/oom_kill: the OOM reaper traverses the VMA maple tree in reverse orderzhongjinji1-2/+8
Although the oom_reaper is delayed and gives the oom victim a chance to clean up its address space, this might take a while, especially for processes with a large address space footprint. In those cases the oom_reaper might start racing with the dying task and compete for shared resources - e.g. page table lock contention has been observed. Reduce those races by reaping the oom victim from the other end of the address space. It is also a significant improvement for process_mrelease(). When a process is killed, process_mrelease is used to reap the killed process and often runs concurrently with the dying task. The test data shows that after applying the patch, lock contention is greatly reduced during the procedure of reaping the killed process. The test is conducted on arm64. The following basic perf numbers show that applying this patch significantly reduces pte spin lock contention. Without the patch:
|--99.57%-- oom_reaper
 | |--73.58%-- unmap_page_range
 | | |--8.67%-- [hit in function]
 | | |--41.59%-- __pte_offset_map_lock
 | | |--29.47%-- folio_remove_rmap_ptes
 | | |--16.11%-- tlb_flush_mmu
 | |--19.94%-- tlb_finish_mmu
 | |--3.21%-- folio_remove_rmap_ptes
With the patch:
|--99.53%-- oom_reaper
 | |--55.77%-- unmap_page_range
 | | |--20.49%-- [hit in function]
 | | |--58.30%-- folio_remove_rmap_ptes
 | | |--11.48%-- tlb_flush_mmu
 | | |--3.33%-- folio_mark_accessed
 | |--32.21%-- tlb_finish_mmu
 | |--6.93%-- folio_remove_rmap_ptes
 | |--0.69%-- __pte_offset_map_lock
Detailed breakdowns for both scenarios are provided below. The cumulative time for oom_reaper plus exit_mmap(victim) in both cases is also summarized, making the performance improvements clear.
+----------------------------------------------------------------+
| Category                      | Applying patch | Without patch |
+-------------------------------+----------------+---------------+
| Total running time            | 132.6          | 167.1         |
| (exit_mmap + reaper work)     | 72.4 + 60.2    | 90.7 + 76.4   |
+-------------------------------+----------------+---------------+
| Time waiting for pte spinlock | 1.0            | 33.1          |
| (exit_mmap + reaper work)     | 0.4 + 0.6      | 10.0 + 23.1   |
+-------------------------------+----------------+---------------+
| folio_remove_rmap_ptes time   | 42.0           | 41.3          |
| (exit_mmap + reaper work)     | 18.4 + 23.6    | 22.4 + 18.9   |
+----------------------------------------------------------------+
From this report, we can see that:
1. The reduction in total time comes mainly from the decrease in time spent on pte spinlock and other locks.
2. oom_reaper performs more work in some areas, but at the same time, exit_mmap also handles certain tasks more efficiently, such as folio_remove_rmap_ptes.
Here is a more detailed perf report. [1] Link: https://lkml.kernel.org/r/20250915162946.5515-3-zhongjinji@honor.com Link: https://lore.kernel.org/all/20250915162619.5133-1-zhongjinji@honor.com/ [1] Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
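The traversal-order idea reduces to walking the same set of regions from the opposite end, so the reaper and the exiting task mostly touch different page tables. A toy model, with a plain array standing in for the kernel's reverse maple-tree walk:

```c
#include <assert.h>

#define NR_VMAS 4

/* Conceptual model only, not kernel code: the OOM reaper now walks
 * the victim's VMAs from the top of the address space downward, while
 * exit_mmap() unmaps from the bottom up, so the two stop contending
 * on the same pte spinlocks until they meet in the middle. */
static void reap_in_reverse(const unsigned long vma_starts[NR_VMAS],
			    unsigned long visit_order[NR_VMAS])
{
	int i, n = 0;

	for (i = NR_VMAS - 1; i >= 0; i--)
		visit_order[n++] = vma_starts[i];	/* highest first */
}
```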
2025-09-21mm/oom_kill: thaw the entire OOM victim processzhongjinji3-6/+26
Patch series "Improvements to Victim Process Thawing and OOM Reaper Traversal Order", v10. This patch series focuses on optimizing victim process thawing and refining the traversal order of the OOM reaper. Since __thaw_task() is used to thaw a single thread of the victim, thawing only one thread cannot guarantee the exit of the OOM victim when it is frozen. Patch 1 thaws the entire process of the OOM victim to ensure that OOM victims are able to terminate themselves. Even if the oom_reaper is delayed, patch 2 is still beneficial for reaping processes with a large address space footprint, and it also greatly improves process_mrelease. This patch (of 2): The OOM killer is a mechanism that selects and kills processes when the system runs out of memory to reclaim resources and keep the system stable. But the oom victim cannot terminate on its own when it is frozen, even if the OOM victim task is thawed through __thaw_task(). This is because __thaw_task() can only thaw a single OOM victim thread, not the entire OOM victim process. In addition, freezing_slow_path() determines whether a task is an OOM victim by checking the task's TIF_MEMDIE flag. When a task is identified as an OOM victim, the freezer bypasses both PM freezing and cgroup freezing states to thaw it. Historically, TIF_MEMDIE was a "this is the oom victim & it has access to memory reserves" flag. It had thread vs. process problems, and tsk_is_oom_victim() was introduced later to get rid of them and other issues, as well as to guarantee that the oom victim's mm can be identified reliably, e.g. by the oom_reaper. Therefore, thaw_process() is introduced to unfreeze all threads within the OOM victim process, ensuring that every thread is properly thawed. The freezer now uses tsk_is_oom_victim() to determine OOM victim status, allowing all victim threads to be unfrozen as necessary.
With this change, the entire OOM victim process will be thawed when an OOM event occurs, ensuring that the victim can terminate on its own. Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
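The thaw-every-thread idea can be modeled in a few lines. Toy structures only; the kernel version iterates the victim with for_each_thread() and calls __thaw_task() on each thread:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_THREADS 3

/* Illustrative stand-in for a multi-threaded process. */
struct process {
	bool frozen[NR_THREADS];
};

/* Old behavior thawed a single thread, which can leave a frozen
 * victim unable to exit because its other threads stay frozen.
 * New behavior: thaw every thread of the process. */
static void thaw_process(struct process *p)
{
	int t;

	for (t = 0; t < NR_THREADS; t++)
		p->frozen[t] = false;
}
```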
2025-09-21include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlinesAndrew Morton1-3/+3
For all the usual reasons, plus a new one. Calling (void)arch_enter_lazy_mmu_mode(); deservedly blows up. Cc: Balbir Singh <balbirs@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
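A miniature illustration of the new reason (enter_lazy_mode() is a stand-in name): with an empty function-like macro, "(void)arch_enter_lazy_mmu_mode();" expands to "(void);", which does not compile; a static inline is a real function, so the cast is fine.

```c
#include <assert.h>

/* With "#define enter_lazy_mode()" (empty macro), the cast below
 * would expand to "(void);" and fail to compile. As a static inline,
 * it is an ordinary void function call and the cast is legal. */
static inline void enter_lazy_mode(void)
{
}

static int touch(void)
{
	(void)enter_lazy_mode();
	return 1;
}
```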
2025-09-21mm/damon/lru_sort: use param_ctx for damon_attrs stagingSeongJae Park1-1/+1
damon_lru_sort_apply_parameters() allocates a new DAMON context, stages user-specified DAMON parameters on it, and commits them to the running DAMON context at once, using damon_commit_ctx(). The code, however, directly updates the monitoring attributes of the running context, and those attributes are then overwritten by the later damon_commit_ctx() call. This means that the monitoring attribute parameters do not really take effect. Fix the wrong use of the parameter context. Link: https://lkml.kernel.org/r/20250916031549.115326-1-sj@kernel.org Fixes: a30969436428 ("mm/damon/lru_sort: use damon_commit_ctx()") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: <stable@vger.kernel.org> [6.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
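The staging bug and its fix can be modeled minimally. Types and names are illustrative, not DAMON's real ones; the point is that the commit step overwrites the running context's attributes wholesale, so parameters must be staged on the param context, never written directly into the running one:

```c
#include <assert.h>

struct attrs {
	unsigned long sample_interval_us;
};

struct ctx {
	struct attrs attrs;
};

/* The commit replaces the running context's attributes with the
 * staged ones, so anything written directly to "running" is lost. */
static void commit_ctx(struct ctx *running, const struct ctx *param)
{
	running->attrs = param->attrs;
}

static void apply_parameters(struct ctx *running, struct ctx *param,
			     unsigned long sample_interval_us)
{
	/* Fixed version: stage on the param context, then commit.
	 * The buggy version assigned running->attrs here instead. */
	param->attrs.sample_interval_us = sample_interval_us;
	commit_ctx(running, param);
}
```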
2025-09-21selftests/mm: protection_keys: fix dead codeMuhammad Usama Anjum1-3/+1
The while loop doesn't execute and following warning gets generated: protection_keys.c:561:15: warning: code will never be executed [-Wunreachable-code] int rpkey = alloc_random_pkey(); Let's enable the while loop such that it gets executed nr_iterations times. Simplify the code a bit as well. Link: https://lkml.kernel.org/r/20250912123025.1271051-3-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21selftests/mm: add -Wunreachable-code and fix warningsMuhammad Usama Anjum4-5/+5
Patch series "selftests/mm: Add -Wunreachable-code and fix warnings". Add -Wunreachable-code to selftests and remove dead code from generated warnings. This patch (of 2): Enable -Wunreachable-code flag to catch dead code and fix them. 1. Remove the dead code and write a comment instead: hmm-tests.c:2033:3: warning: code will never be executed [-Wunreachable-code] perror("Should not reach this\n"); ^~~~~~ 2. ksft_exit_fail_msg() calls exit(). So cleanup isn't done. Replace it with ksft_print_msg(). split_huge_page_test.c:301:3: warning: code will never be executed [-Wunreachable-code] goto cleanup; ^~~~~~~~~~~~ 3. Remove duplicate inline. pkey_sighandler_tests.c:44:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier] static inline __always_inline Link: https://lkml.kernel.org/r/20250912123025.1271051-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20250912123025.1271051-2-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21resource: improve child resource handling in release_mem_region_adjustable()Sumanth Korikkar1-5/+45
When a memory block is removed via try_remove_memory(), it eventually reaches release_mem_region_adjustable(). The current implementation assumes that when a busy memory resource is split into two, all child resources remain in the lower address range. This simplification causes problems when child resources actually belong to the upper split. For example:
* Initial memory layout: lsmem
RANGE                                  SIZE STATE  REMOVABLE BLOCK
0x0000000000000000-0x00000002ffffffff  12G  online yes       0-95
* /proc/iomem
00000000-2dfefffff : System RAM
  158834000-1597b3fff : Kernel code
  1597b4000-159f50fff : Kernel data
  15a13c000-15a218fff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM
* After offlining and removing range 0x150000000-0x157ffffff
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED (output according to upcoming lsmem changes with the configured column: s390)
RANGE                                  SIZE STATE   BLOCK CONFIGURED
0x0000000000000000-0x000000014fffffff  5.3G online  0-41  yes
0x0000000150000000-0x0000000157ffffff  128M offline 42    no
0x0000000158000000-0x00000002ffffffff  6.6G online  43-95 yes
The iomem resource gets split into two entries, but kernel code, kernel data, and kernel bss remain attached to the lower resource [0–5376M] instead of the correct upper resource [5504M–12288M].
As a result, WARN_ON() triggers in release_mem_region_adjustable() ("Usecase: split into two entries - we need a new resource"):
------------[ cut here ]------------
WARNING: CPU: 5 PID: 858 at kernel/resource.c:1486 release_mem_region_adjustable+0x210/0x280
Modules linked in:
CPU: 5 UID: 0 PID: 858 Comm: chmem Not tainted 6.17.0-rc2-11707-g2c36aaf3ba4e
Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
Krnl PSW : 0704d00180000000 0000024ec0dae0e4 (release_mem_region_adjustable+0x214/0x280)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 00000002ffffafc0 fffffffffffffff0 0000000000000000
           000000014fffffff 0000024ec2257608 0000000000000000 0000024ec2301758
           0000024ec22680d0 00000000902c9140 0000000150000000 00000002ffffafc0
           000003ffa61d8d18 0000024ec21fb478 0000024ec0dae014 000001cec194fbb0
Krnl Code: 0000024ec0dae0d8: af000000     mc 0,0
           0000024ec0dae0dc: a7f4ffc1     brc 15,0000024ec0dae05e
          #0000024ec0dae0e0: af000000     mc 0,0
          >0000024ec0dae0e4: a5defffd     llilh %r13,65533
           0000024ec0dae0e8: c04000c6064c larl %r4,0000024ec266ed80
           0000024ec0dae0ee: eb1d400000f8 laa %r1,%r13,0(%r4)
           0000024ec0dae0f4: 07e0         bcr 14,%r0
           0000024ec0dae0f6: a7f4ffc0     brc 15,0000024ec0dae076
 [<0000024ec0dae0e4>] release_mem_region_adjustable+0x214/0x280
([<0000024ec0dadf3c>] release_mem_region_adjustable+0x6c/0x280)
 [<0000024ec10a2130>] try_remove_memory+0x100/0x140
 [<0000024ec10a4052>] __remove_memory+0x22/0x40
 [<0000024ec18890f6>] config_mblock_store+0x326/0x3e0
 [<0000024ec11f7056>] kernfs_fop_write_iter+0x136/0x210
 [<0000024ec1121e86>] vfs_write+0x236/0x3c0
 [<0000024ec11221b8>] ksys_write+0x78/0x110
 [<0000024ec1b6bfbe>] __do_syscall+0x12e/0x350
 [<0000024ec1b782ce>] system_call+0x6e/0x90
Last Breaking-Event-Address:
 [<0000024ec0dae014>] release_mem_region_adjustable+0x144/0x280
---[ end trace 0000000000000000 ]---
Also, resource adjustment doesn't happen and stale resources still cover [0-12288M]. Later, memory re-add fails in register_memory_resource() with -EBUSY.
i.e: /proc/iomem is still:
00000000-2dfefffff : System RAM
  158834000-1597b3fff : Kernel code
  1597b4000-159f50fff : Kernel data
  15a13c000-15a218fff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM
Enhance release_mem_region_adjustable() to reassign child resources to the correct parent after a split. Children are now assigned based on their actual range: If they fall within the lower split, keep them in the lower parent. If they fall within the upper split, move them to the upper parent. Kernel code/data/bss regions are not offlined, so they will always reside entirely within one parent and never span across both. Output after the enhancement:
* Initial state /proc/iomem (before removal of memory block):
00000000-2dfefffff : System RAM
  1f94f8000-1fa477fff : Kernel code
  1fa478000-1fac14fff : Kernel data
  1fae00000-1faedcfff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM
* Offline and remove 0x1e8000000-0x1efffffff memory range
* /proc/iomem
00000000-1e7ffffff : System RAM
1f0000000-2dfefffff : System RAM
  1f94f8000-1fa477fff : Kernel code
  1fa478000-1fac14fff : Kernel data
  1fae00000-1faedcfff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM
Link: https://lkml.kernel.org/r/20250912123021.3219980-1-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
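The reassignment rule can be sketched with illustrative types (the kernel operates on struct resource trees in kernel/resource.c; this is only a model of the containment test):

```c
#include <assert.h>

/* Illustrative stand-in for struct resource: an inclusive range. */
struct res {
	unsigned long start, end;
};

static int res_contains(const struct res *parent, const struct res *child)
{
	return child->start >= parent->start && child->end <= parent->end;
}

/* After splitting a busy resource, re-parent each child to whichever
 * half fully contains it, instead of leaving all children on the
 * lower half. Kernel code/data/bss are never offlined, so a child
 * always fits entirely within one of the two halves. */
static const struct res *pick_parent(const struct res *lower,
				     const struct res *upper,
				     const struct res *child)
{
	return res_contains(lower, child) ? lower : upper;
}
```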
2025-09-21selftests/mm: centralize the __always_unused macroMuhammad Usama Anjum3-2/+7
This macro gets used in different tests. Add it to kselftest.h, which is the central location and a header the tests already include. Then use this new macro. Link: https://lkml.kernel.org/r/20250912125102.1309796-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Antonio Quartulli <antonio@openvpn.net> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: "Sabrina Dubroca" <sd@queasysnail.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
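For illustration, a definition of the kind being centralized, mirroring the usual GCC/Clang attribute form (the exact spelling in kselftest.h is not quoted by the commit, so treat this as an assumption):

```c
#include <assert.h>

/* __always_unused silences -Wunused warnings for a deliberately
 * unused variable or parameter; defining it in one shared header
 * avoids per-test copies. */
#ifndef __always_unused
#define __always_unused __attribute__((__unused__))
#endif

static int twice(int used, int __always_unused spare)
{
	return used * 2;
}
```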
2025-09-21mm/damon/reclaim: support addr_unit for DAMON_RECLAIMQuanmin Yan1-0/+40
Implement a sysfs file to expose addr_unit for DAMON_RECLAIM users. During parameter application, use the configured addr_unit parameter to perform the necessary initialization. Similar to the core layer, prevent setting addr_unit to zero. It is worth noting that when monitor_region_start and monitor_region_end are unset (i.e., 0), their values will later be set to biggest_system_ram. At that point, addr_unit may not be the default value 1. Although we could divide the biggest_system_ram value by addr_unit, changing addr_unit without setting monitor_region_start/end should be considered a user misoperation. Also, since biggest_system_ram always lies within the 0~ULONG_MAX range, the system can clearly work correctly with addr_unit=1. Therefore, if monitor_region_start/end are unset, always silently reset addr_unit to 1. Link: https://lkml.kernel.org/r/20250910113221.1065764-3-yanquanmin1@huawei.com Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORTQuanmin Yan1-0/+40
Patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM". In DAMON_LRU_SORT and DAMON_RECLAIM, damon_ctx is independent of the core. Add addr_unit to these modules to support systems like ARM32 with LPAE. This patch (of 2): Implement a sysfs file to expose addr_unit for DAMON_LRU_SORT users. During parameter application, use the configured addr_unit parameter to perform the necessary initialization. Similar to the core layer, prevent setting addr_unit to zero. It is worth noting that when monitor_region_start and monitor_region_end are unset (i.e., 0), their values will later be set to biggest_system_ram. At that point, addr_unit may not be the default value 1. Although we could divide the biggest_system_ram value by addr_unit, changing addr_unit without setting monitor_region_start/end should be considered a user misoperation. Also, since biggest_system_ram always lies within the 0~ULONG_MAX range, the system can clearly work correctly with addr_unit=1. Therefore, if monitor_region_start/end are unset, always silently reset addr_unit to 1. Link: https://lkml.kernel.org/r/20250910113221.1065764-1-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250910113221.1065764-2-yanquanmin1@huawei.com Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
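A simplified model of the addr_unit rule described above. The function is illustrative; in the module a zero write is rejected outright rather than clamped, but the reset-to-1 behavior for an unset region is the same:

```c
#include <assert.h>

/* If the monitoring region is unset (start == end == 0), it later
 * defaults to the biggest System RAM range, which is only meaningful
 * with addr_unit == 1, so addr_unit is silently reset. Zero is never
 * a valid unit. */
static unsigned long effective_addr_unit(unsigned long addr_unit,
					 unsigned long region_start,
					 unsigned long region_end)
{
	if (!addr_unit)
		return 1;	/* zero never accepted */
	if (!region_start && !region_end)
		return 1;	/* region unset: fall back to default */
	return addr_unit;
}
```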
2025-09-21selftests/mm: gup_tests: option to GUP all pages in a single callDavid Hildenbrand2-3/+7
We recently missed detecting an issue during early testing because the default (!all) tests would not trigger it and even when running "all" tests it only would happen sometimes because of races. So let's allow for an easy way to specify "GUP all pages in a single call", extend the test matrix and extend our default (!all) tests. By GUP'ing all pages in a single call, with the default size of 128MiB we'll cover multiple leaf page tables / PMDs on architectures with sane THP sizes. Link: https://lkml.kernel.org/r/20250910093051.1693097-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: remove page->orderMatthew Wilcox (Oracle)2-7/+5
We already use page->private for storing the order of a page while it's in the buddy allocator system; extend that to also storing the order while it's in the pcp_llist. Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
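A toy model of the field reuse (not the real struct page layout): the order of a free page already lives in page->private on the buddy path, so the pcp free-list path can store it the same way and the separate order field becomes redundant.

```c
#include <assert.h>

/* Illustrative stand-in for struct page with only the reused field. */
struct page {
	unsigned long private;	/* holds the order while the page is free */
};

static void set_free_page_order(struct page *page, unsigned int order)
{
	page->private = order;
}

static unsigned int free_page_order(const struct page *page)
{
	return (unsigned int)page->private;
}
```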
2025-09-21mm: remove redundant test in validate_page_before_insert()Matthew Wilcox (Oracle)1-2/+1
The page_has_type() call would have included slab since commit 46df8e73a4a3 and now we don't even get that far because slab pages have a zero refcount since commit 9aec2fb0fd5e. Link: https://lkml.kernel.org/r/20250910142923.2465470-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify compound_order() and page_size()Matthew Wilcox (Oracle)1-3/+3
Patch series "Small cleanups". These small cleanups can be applied now to reduce conflicts during the next merge window. They're all from various efforts to split struct page from other memdescs. Thanks to Vlastimil for the suggestion. This patch (of 3): These functions do not modify their arguments. Telling the compiler this may improve code generation, and allows us to pass const arguments from other functions. Link: https://lkml.kernel.org/r/20250910142923.2465470-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250910142923.2465470-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
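What constification buys, shown with illustrative analogues of the real helpers: a const-qualified parameter documents read-only use and lets const pointers from callers flow through without casts.

```c
#include <assert.h>

/* Simplified stand-ins; the real folio_order()/page_size() live in
 * include/linux/mm.h and read more fields than this. */
struct folio {
	unsigned int order;
};

static unsigned int folio_order(const struct folio *folio)
{
	return folio->order;
}

static unsigned long folio_size_bytes(const struct folio *folio)
{
	return 4096UL << folio_order(folio);	/* assumes 4KiB base pages */
}
```

Because both take const pointers, a caller holding a `const struct folio *` can use them directly.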
2025-09-21mm: lru_add_drain_all() do local lru_add_drain() firstHugh Dickins1-0/+3
No numbers to back this up, but it seemed obvious to me, that if there are competing lru_add_drain_all()ers, the work will be minimized if each flushes its own local queues before locking and doing cross-CPU drains. Link: https://lkml.kernel.org/r/33389bf8-f79d-d4dd-b7a4-680c4aa21b23@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: make folio page count functions return unsignedAristeu Rozanski2-6/+6
As raised by Andrew [1], a folio/compound page never spans a negative number of pages. Consequently, let's use "unsigned long" instead of "long" consistently for folio_nr_pages(), folio_large_nr_pages() and compound_nr(). Using "unsigned long" as return value is fine, because even "(long)-folio_nr_pages()" will keep on working as expected. Using "unsigned int" instead would actually break these use cases. This patch takes the first step changing these to return unsigned long (and making drm_gem_get_pages() use the new types instead of replacing min()). In the future, we might want to make more callers of these functions to consistently use "unsigned long". Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/ Link: https://lkml.kernel.org/r/20250826153721.GA23292@cathedrallabs.org Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/ [1] Signed-off-by: Aristeu Rozanski <aris@ruivo.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
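The negation idiom mentioned above, as a small check (an illustrative stand-in for folio_nr_pages()): negating an unsigned long and casting to long still yields the expected negative count on the usual two's-complement LP64 targets, whereas an "unsigned int" return would zero-extend and break it.

```c
#include <assert.h>

/* Stand-in returning an "unsigned long" page count. */
static unsigned long nr_pages(void)
{
	return 8;
}
```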