|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)
Address the longstanding "dying memcg problem": a situation wherein a
no-longer-used memory control group hangs around for an extended
period, pointlessly consuming memory
- "fix unexpected type conversions and potential overflows" (Qi Zheng)
Fix a couple of potential 32-bit/64-bit issues which were identified
during review of the "Eliminate Dying Memory Cgroup" series
- "kho: history: track previous kernel version and kexec boot count"
(Breno Leitao)
Use Kexec Handover (KHO) to pass the previous kernel's version string
and the number of kexec reboots since the last cold boot to the next
kernel, and print it at boot time
- "liveupdate: prevent double preservation" (Pasha Tatashin)
Teach LUO to avoid managing the same file across different active
sessions
- "liveupdate: Fix module unloading and unregister API" (Pasha
Tatashin)
Address an issue with how LUO handles module reference counting and
unregistration during module unloading
- "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)
Simplify and clean up the zswap crypto compression handling and
improve the lifecycle management of zswap pool's per-CPU acomp_ctx
resources
- "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
(SeongJae Park)
Address unlikely but possible leaks and deadlocks in damon_call() and
damos_walk()
- "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)
Fix a couple of root-only wild pointer dereferences
- "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
(SeongJae Park)
Update the DAMON documentation to warn operators about potential
races which can occur if the commit_inputs parameter is altered at
the wrong time
- "Minor hmm_test fixes and cleanups" (Alistair Popple)
Bugfixes and a cleanup for the HMM kernel selftests
- "Modify memfd_luo code" (Chenghao Duan)
Cleanups, simplifications and speedups to the memfd_luo code
- "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)
Support for userfaultfd in guest_memfd
- "selftests/mm: skip several tests when thp is not available" (Chunyu
Hu)
Fix several issues in the selftests code which were causing breakage
when the tests were run on CONFIG_THP=n kernels
- "mm/mprotect: micro-optimization work" (Pedro Falcato)
A couple of nice speedups for mprotect()
- "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)
Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
kexec, crash, kdump and probably other kexec-based things - they are
being moved out of mm.git and into a new git tree
* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
MAINTAINERS: add page cache reviewer
mm/vmscan: avoid false-positive -Wuninitialized warning
MAINTAINERS: update Dave's kdump reviewer email address
MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
MAINTAINERS: drop include/linux/kho/abi/ from KHO
MAINTAINERS: update KHO and LIVE UPDATE maintainers
MAINTAINERS: update kexec/kdump maintainers entries
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
selftests: mm: skip charge_reserved_hugetlb without killall
userfaultfd: allow registration of ranges below mmap_min_addr
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
mm/hugetlb: fix early boot crash on parameters without '=' separator
zram: reject unrecognized type= values in recompress_store()
docs: proc: document ProtectionKey in smaps
mm/mprotect: special-case small folios when applying permissions
mm/mprotect: move softleaf code out of the main function
mm: remove '!root_reclaim' checking in should_abort_scan()
mm/sparse: fix comment for section map alignment
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
selftests/mm: transhuge_stress: skip the test when thp not available
...
|
|
damos_adjust_quota() uses time_after_eq() to check whether it is time to
start a new quota charge window, comparing the current jiffies with the
scheduled next charge window start time. If it is, the next charge window
start time is updated and the new charge window starts.
The time check and the next window start time update are skipped while the
scheme is deactivated by the watermarks. Suppose the deactivation is kept
for more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than
99 days on 32-bit systems and more than one billion years on 64-bit
systems), resulting in the jiffies being larger than the next charge
window start time + LONG_MAX. Then, the time_after_eq() call can return
false until another LONG_MAX jiffies have passed.
This means the scheme can continue working after being reactivated by the
watermarks. But, soon, the quota will be exceeded and the scheme will
again effectively stop working until the next charge window starts.
Because the current charge window is effectively extended by up to
LONG_MAX jiffies, however, it will look like the scheme stopped
unexpectedly and indefinitely, from the user's perspective.
Fix this by using !time_in_range_open() instead.
The issue was discovered [1] by sashiko.
Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org
Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1]
Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Users can set damos_quota_goal->nid to an arbitrary value for
node_memcg_{used,free}_bp. But DAMON core uses it for NODE_DATA()
without validating the value. This can result in out-of-bounds
memory access. The issue can actually be triggered using the DAMON
user-space tool (damo), like below.
$ sudo mkdir /sys/fs/cgroup/foo
$ sudo ./damo start --damos_action stat --damos_quota_interval 1s \
--damos_quota_goal node_memcg_used_bp 50% -1 /foo
$ sudo dmesg
[...]
[ 524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00
Fix this issue by validating the given node id. If an invalid node id is
given, 0% is returned for the used memory ratio, and 100% for the free
memory ratio.
Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org
Fixes: b74a120bcf50 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.19.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/core: validate damos_quota_goal->nid".
node_mem[cg]_{used,free}_bp DAMOS quota goals receive the node id. The
node id is used for si_meminfo_node() and NODE_DATA() without proper
validation. As a result, privileged users can trigger an out of bounds
memory access using DAMON_SYSFS. Fix the issues.
The issue was originally reported [1] together with a fix by another
author. That author announced [2] that they would stop working on the fix,
which was still in the review stage. Hence I'm restarting this.
This patch (of 2):
Users can set damos_quota_goal->nid to an arbitrary value for
node_mem_{used,free}_bp. But DAMON core uses it for
si_meminfo_node() without validating the value. This can result in
out-of-bounds memory access. The issue can actually be triggered using the
DAMON user-space tool (damo), like below.
$ sudo ./damo start --damos_action stat \
--damos_quota_goal node_mem_used_bp 50% -1 \
--damos_quota_interval 1s
$ sudo dmesg
[...]
[ 65.565986] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098
Fix this issue by validating the given node id. If an invalid node id is
given, 0% is returned for the used memory ratio, and 100% for the free
memory ratio.
Link: https://lore.kernel.org/20260329043902.46163-2-sj@kernel.org
Link: https://lore.kernel.org/20260325073034.140353-1-objecting@objecting.org [1]
Link: https://lore.kernel.org/20260327040924.68553-1-sj@kernel.org [2]
Fixes: 0e1c773b501f ("mm/damon/core: introduce damos quota goal metrics for memory node utilization")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Destroy the DAMON context and reset the global pointer when damon_start()
fails. Otherwise, the context allocated by damon_stat_build_ctx() is
leaked, and the stale damon_stat_context pointer will be overwritten on
the next enable attempt, making the old allocation permanently
unreachable.
Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev
Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When kdamond_fn()'s main loop finishes, the function cancels the remaining
damos_walk() request and unsets damon_ctx->kdamond so that API callers
and the API functions themselves can know the context is terminated.
damos_walk() adds the caller's request to the queue first. After that, it
checks whether the kdamond of the damon_ctx is still running
(damon_ctx->kdamond is set). Only if the kdamond is running does
damos_walk() start waiting for the kdamond's handling of the newly added
request.
The damos_walk() request registration and the damon_ctx->kdamond unset are
protected by different mutexes, though. Hence, damos_walk() can race with
the damon_ctx->kdamond unset, resulting in deadlocks.
For example, suppose the kdamond has just finished cancelling the
damos_walk() request. Right after that, damos_walk() is called for the
context. It registers the new request, and sees the context as still
running, because the damon_ctx->kdamond unset has not yet been done. Hence
the damos_walk() caller starts waiting for the handling of the request.
However, the kdamond is already on its termination path, so it never
handles the new request. As a result, the damos_walk() caller thread waits
forever.
Fix this by introducing another damon_ctx field, namely
walk_control_obsolete. It is protected by
damon_ctx->walk_control_lock, which also protects damos_walk() request
registration. Initialize (unset) it in kdamond_fn() before letting
damon_start() return, and set it just before the cancelling of the
remaining damos_walk() request is executed. damos_walk() reads the
obsolete field under the lock and avoids adding a new request.
After this change, only requests that are guaranteed to be handled or
cancelled are registered, so the after-registration DAMON context
termination check is no longer needed. Remove it as well.
The issue is found by sashiko [1].
Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org
Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1]
Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.14.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit
race".
damon_call() and damos_walk() can leak memory and/or deadlock when they
race with kdamond terminations. Fix those.
This patch (of 2):
When kdamond_fn()'s main loop finishes, the function cancels all
remaining damon_call() requests and unsets damon_ctx->kdamond so that
API callers and the API functions themselves can know the context is
terminated. damon_call() adds the caller's request to the queue first.
After that, it checks whether the kdamond of the damon_ctx is still running
(damon_ctx->kdamond is set). Only if the kdamond is running does
damon_call() start waiting for the kdamond's handling of the newly added
request.
The damon_call() request registration and the damon_ctx->kdamond unset are
protected by different mutexes, though. Hence, damon_call() can race with
the damon_ctx->kdamond unset, resulting in deadlocks.
For example, suppose the kdamond has just finished cancelling the
damon_call() requests. Right after that, damon_call() is called for the
context. It registers the new request, and sees the context as still
running, because the damon_ctx->kdamond unset has not yet been done. Hence
the damon_call() caller starts waiting for the handling of the request.
However, the kdamond is already on its termination path, so it never
handles the new request. As a result, the damon_call() caller thread waits
forever.
Fix this by introducing another damon_ctx field, namely
call_controls_obsolete. It is protected by
damon_ctx->call_controls_lock, which also protects damon_call() request
registration. Initialize (unset) it in kdamond_fn() before letting
damon_start() return, and set it just before the cancelling of the
remaining damon_call() requests is executed. damon_call() reads the
obsolete field under the lock and avoids adding a new request.
After this change, only requests that are guaranteed to be handled or
cancelled are registered, so the after-registration DAMON context
termination check is no longer needed. Remove it as well.
Note that the deadlock will not happen when damon_call() is called with a
repeat mode request. In this case, damon_call() returns instead of waiting
for the handling, once the request registration succeeds and the
kdamond appears to be running. However, if the request also has
dealloc_on_cancel, the request memory would be leaked.
The issue is found by sashiko [1].
Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org
Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org
Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1]
Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.14.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Fix printf format warning for bprintf
sunrpc uses a trace_printk() that triggers a printf warning during
the compile. Move the __printf() attribute so that the warning goes
away when debugging is not enabled
- Remove redundant check for EVENT_FILE_FL_FREED in
event_filter_write()
The FREED flag is checked in the call to event_file_file() and then
checked again right afterward, which is unneeded
- Clean up event_file_file() and event_file_data() helpers
These helper functions played a different role in the past, but now
with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit
and also add a warning to event_file_data() if the file or its data
is not present
- Remove updating file->private_data in tracing open
All access to the file private data is handled by the helper
functions, which do not use file->private_data. Stop updating it on
open
- Show ENUM names in function arguments via BTF in function tracing
When showing the function arguments when func-args option is set for
function tracing, if one of the arguments is found to be an enum,
show the name of the enum instead of its number
- Add new trace_call__##name() API for tracepoints
Tracepoints are enabled via static_branch() blocks, where when not
enabled, there's only a nop that is in the code where the execution
will just skip over it. When tracing is enabled, the nop is converted
to a direct jump to the tracepoint code. Sometimes more calculations
are required to be performed to update the parameters of the
tracepoint. In this case, trace_##name##_enabled() is called which is
a static_branch() that gets enabled only when the tracepoint is
enabled. This allows the extra calculations to also be skipped by the
nop:
if (trace_foo_enabled()) {
x = bar();
trace_foo(x);
}
Where the x=bar() is only performed when foo is enabled. The problem
with this approach is that there's now two static_branch() calls. One
for checking if the tracepoint is enabled, and then again to know if
the tracepoint should be called. The second one is redundant
Introduce trace_call__foo() that will call the foo() tracepoint
directly without doing a static_branch():
if (trace_foo_enabled()) {
x = bar();
trace_call__foo(x);
}
- Update various locations to use the new trace_call__##name() API
- Move snapshot code out of trace.c
Cleaning up trace.c to not be a "dump all", move the snapshot code
out of it and into a new trace_snapshot.c file
- Clean up some "%*.s" to "%*s"
- Allow boot kernel command line options to be called multiple times
Have options like:
ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
Equal to:
ftrace_filter=foo,bar,zoo
- Fix ipi_raise event CPU field to be a CPU field
The ipi_raise target_cpus field is defined as a __bitmask(). There is
now a __cpumask() field definition. Update the field to use that
- Have hist_field_name() use a snprintf() and not a series of strcat()
It's safer to use snprintf() than a series of strcat()
- Fix tracepoint regfunc balancing
A tracepoint can define a "reg" and "unreg" function that gets called
before the tracepoint is enabled, and after it is disabled
respectively. But on error, after the "reg" func is called and the
tracepoint is not enabled, the "unreg" function is not called to tear
down what the "reg" function performed
- Fix output that shows what histograms are enabled
Event variables are displayed incorrectly in the histogram output.
Instead of "sched.sched_wakeup.$var", it is showing
"$sched.sched_wakeup.var", with the '$' in the incorrect location
- Some other simple cleanups
* tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits)
selftests/ftrace: Add test case for fully-qualified variable references
tracing: Fix fully-qualified variable reference printing in histograms
tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
tracing: Rebuild full_name on each hist_field_name() call
tracing: Report ipi_raise target CPUs as cpumask
tracing: Remove duplicate latency_fsnotify() stub
tracing: Preserve repeated trace_trigger boot parameters
tracing: Append repeated boot-time tracing parameters
tracing: Remove spurious default precision from show_event_trigger/filter formats
cpufreq: Use trace_call__##name() at guarded tracepoint call sites
tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
tracing: Move snapshot code out of trace.c and into trace_snapshot.c
mm: damon: Use trace_call__##name() at guarded tracepoint call sites
btrfs: Use trace_call__##name() at guarded tracepoint call sites
spi: Use trace_call__##name() at guarded tracepoint call sites
i2c: Use trace_call__##name() at guarded tracepoint call sites
kernel: Use trace_call__##name() at guarded tracepoint call sites
tracepoint: Add trace_call__##name() API
tracing: trace_mmap.h: fix a kernel-doc warning
tracing: Pretty-print enum parameters in function arguments
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "maple_tree: Replace big node with maple copy" (Liam Howlett)
Mainly preparatory work for ongoing development, but it does reduce
stack usage and is an improvement.
- "mm, swap: swap table phase III: remove swap_map" (Kairui Song)
Offers memory savings by removing the static swap_map. It also yields
some CPU savings and implements several cleanups.
- "mm: memfd_luo: preserve file seals" (Pratyush Yadav)
File seal preservation to LUO's memfd code
- "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
Chen)
Additional userspace stats reporting for zswap
- "arch, mm: consolidate empty_zero_page" (Mike Rapoport)
Some cleanups for our handling of ZERO_PAGE() and zero_pfn
- "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
Han)
A robustness improvement and some cleanups in the kmemleak code
- "Improve khugepaged scan logic" (Vernon Yang)
Improve khugepaged scan logic and reduce CPU consumption by
prioritizing scanning tasks that access memory frequently
- "Make KHO Stateless" (Jason Miu)
Simplify Kexec Handover by transitioning KHO from an xarray-based
metadata tracking system with serialization to a radix tree data
structure that can be passed directly to the next kernel
- "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
Ballasi and Steven Rostedt)
Enhance vmscan's tracepointing
- "mm: arch/shstk: Common shadow stack mapping helper and
VM_NOHUGEPAGE" (Catalin Marinas)
Cleanup for the shadow stack code: remove per-arch code in favour of
a generic implementation
- "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)
Fix a WARN() which can be emitted when KHO restores a vmalloc area
- "mm: Remove stray references to pagevec" (Tal Zussman)
Several cleanups, mainly updating references to "struct pagevec",
which became folio_batch three years ago
- "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
Shutsemau)
Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
pages encode their relationship to the head page
- "mm/damon/core: improve DAMOS quota efficiency for core layer
filters" (SeongJae Park)
Improve two problematic behaviors of DAMOS that make it less
efficient when core layer filters are used
- "mm/damon: strictly respect min_nr_regions" (SeongJae Park)
Improve DAMON usability by extending the treatment of the
min_nr_regions user-settable parameter
- "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)
The proper fix for a previously hotfixed SMP=n issue. Code
simplifications and cleanups ensued
- "mm: cleanups around unmapping / zapping" (David Hildenbrand)
A bunch of cleanups around unmapping and zapping. Mostly
simplifications, code movements, documentation and renaming of
zapping functions
- "support batched checking of the young flag for MGLRU" (Baolin Wang)
Batched checking of the young flag for MGLRU. It's partly cleanups; one
benchmark shows large performance benefits on arm64
- "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)
memcg cleanup and robustness improvements
- "Allow order zero pages in page reporting" (Yuvraj Sakshith)
Enhance free page reporting - it presently and undesirably omits
order-0 pages when reporting free memory.
- "mm: vma flag tweaks" (Lorenzo Stoakes)
Cleanup work following from the recent conversion of the VMA flags to
a bitmap
- "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
Park)
Add some more developer-facing debug checks into DAMON core
- "mm/damon: test and document power-of-2 min_region_sz requirement"
(SeongJae Park)
An additional DAMON kunit test, plus some adjustments to the
addr_unit parameter handling
- "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe" (SeongJae Park)
Fix a hard-to-hit time overflow issue in DAMON core
- "mm/damon: improve/fixup/update ratio calculation, test and
documentation" (SeongJae Park)
A batch of misc/minor improvements and fixups for DAMON
- "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
Hildenbrand)
Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
movement was required.
- "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)
A somewhat random mix of fixups, recompression cleanups and
improvements in the zram code
- "mm/damon: support multiple goal-based quota tuning algorithms"
(SeongJae Park)
Extend DAMOS quotas goal auto-tuning to support multiple tuning
algorithms that users can select
- "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)
Fix the khugepaged sysfs handling so we no longer spam the logs with
reams of junk when starting/stopping khugepaged
- "mm: improve map count checks" (Lorenzo Stoakes)
Provide some cleanups and slight fixes in the mremap, mmap and vma
code
- "mm/damon: support addr_unit on default monitoring targets for
modules" (SeongJae Park)
Extend the use of DAMON core's addr_unit tunable
- "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)
Cleanups to khugepaged, also serving as a base for Nico's planned
khugepaged mTHP support
- "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)
Code movement and cleanups in the memhotplug and sparsemem code
- "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
CONFIG_MIGRATION" (David Hildenbrand)
Rationalize some memhotplug Kconfig support
- "change young flag check functions to return bool" (Baolin Wang)
Cleanups to change all young flag check functions to return bool
- "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
Law and SeongJae Park)
Fix a few potential DAMON bugs
- "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
Stoakes)
Convert a lot of the existing use of the legacy vm_flags_t data type
to the new vma_flags_t type which replaces it. Mainly in the vma
code.
- "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)
Expand the mmap_prepare functionality, which is intended to replace
the deprecated f_op->mmap hook which has been the source of bugs and
security issues for some time. Cleanups, documentation, extension of
mmap_prepare into filesystem drivers
- "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)
Simplify and clean up zap_huge_pmd(). Additional cleanups around
vm_normal_folio_pmd() and the softleaf functionality are performed.
* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm: fix deferred split queue races during migration
mm/khugepaged: fix issue with tracking lock
mm/huge_memory: add and use has_deposited_pgtable()
mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
mm/huge_memory: separate out the folio part of zap_huge_pmd()
mm/huge_memory: use mm instead of tlb->mm
mm/huge_memory: remove unnecessary sanity checks
mm/huge_memory: deduplicate zap deposited table call
mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
mm/huge_memory: add a common exit path to zap_huge_pmd()
mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
mm/huge: avoid big else branch in zap_huge_pmd()
mm/huge_memory: simplify vma_is_specal_huge()
mm: on remap assert that input range within the proposed VMA
mm: add mmap_action_map_kernel_pages[_full]()
uio: replace deprecated mmap hook with mmap_prepare in uio_info
drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
mm: allow handling of stacked mmap_prepare hooks in more drivers
...
|
|
damon_stat_start() always allocates the module's damon_ctx object
(damon_stat_context). But if the damon_call() in the function fails,
the damon_ctx object is not deallocated. Hence, if damon_call() fails
and the user writes Y to "enabled" again, the previously
allocated damon_ctx object is leaked.
This cannot simply be fixed by deallocating the damon_ctx object when
damon_call() fails. That's because damon_call() failure doesn't guarantee
the kdamond main function, which accesses the damon_ctx object, is
completely finished. In other words, if damon_stat_start() deallocates
the damon_ctx object after damon_call() failure, the not-yet-terminated
kdamond could access the freed memory (use-after-free).
Fix the leak while avoiding the use-after-free by having
damon_stat_start() return without deallocating the damon_ctx object after
a damon_call() failure, but deallocating it when the function is invoked
again and the kdamond has completely terminated. If the kdamond has not
yet terminated, simply return -EAGAIN, as the kdamond will soon terminate.
The issue was discovered [1] by sashiko.
Link: https://lkml.kernel.org/r/20260402134418.74121-1-sj@kernel.org
Link: https://lore.kernel.org/20260401012428.86694-1-sj@kernel.org [1]
Fixes: 405f61996d9d ("mm/damon/stat: use damon_call() repeat mode instead of damon_callback")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_call() for the repeat_call_control of DAMON_SYSFS can fail if the
kdamond is somehow stopped before the damon_call(). That can happen, for
example, when the damon context was made for monitoring the virtual
address spaces of processes, and the processes are terminated immediately,
before the damon_call() invocation. In that case, the dynamically
allocated repeat_call_control is not deallocated and is leaked.
Fix the leak by deallocating the repeat_call_control upon the
damon_call() failure.
This issue is discovered by sashiko [1].
Link: https://lkml.kernel.org/r/20260327003224.55752-1-sj@kernel.org
Link: https://lore.kernel.org/20260320020630.962-1-sj@kernel.org [1]
Fixes: 04a06b139ec0 ("mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_repeat_call_fn() calls damon_sysfs_upd_tuned_intervals(),
damon_sysfs_upd_schemes_stats(), and
damon_sysfs_upd_schemes_effective_quotas() without checking contexts->nr.
If nr_contexts is set to 0 via sysfs while DAMON is running, these
functions dereference contexts_arr[0] and cause a NULL pointer
dereference. Add the missing check.
For example, the issue can be reproduced using DAMON sysfs interface and
DAMON user-space tool (damo) [1] like below.
$ sudo damo start --refresh_interval 1s
$ echo 0 | sudo tee \
/sys/kernel/mm/damon/admin/kdamonds/0/contexts/nr_contexts
Link: https://patch.msgid.link/20260320163559.178101-3-objecting@objecting.org
Link: https://lkml.kernel.org/r/20260321175427.86000-4-sj@kernel.org
Link: https://github.com/damonitor/damo [1]
Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Multiple sysfs command paths dereference contexts_arr[0] without first
verifying that kdamond->contexts->nr == 1. A user can set nr_contexts to
0 via sysfs while DAMON is running, causing NULL pointer dereferences.
In more detail, the issue can be triggered by privileged users like
below.
First, start DAMON and make contexts directory empty
(kdamond->contexts->nr == 0).
# damo start
# cd /sys/kernel/mm/damon/admin/kdamonds/0
# echo 0 > contexts/nr_contexts
Then, each of below commands will cause the NULL pointer dereference.
# echo update_schemes_stats > state
# echo update_schemes_tried_regions > state
# echo update_schemes_tried_bytes > state
# echo update_schemes_effective_quotas > state
# echo update_tuned_intervals > state
Guard all commands (except OFF) at the entry point of
damon_sysfs_handle_cmd().
Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org
Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/sysfs: fix memory leak and NULL dereference
issues", v4.
DAMON_SYSFS can leak memory under allocation failure, and can trigger NULL
pointer dereferences when a privileged user makes a wrong sequence of
control operations. Fix those.
This patch (of 3):
When damon_sysfs_new_test_ctx() fails in damon_sysfs_commit_input(),
param_ctx is leaked because the early return skips the cleanup at the out
label. Destroy param_ctx before returning.
Link: https://lkml.kernel.org/r/20260321175427.86000-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260321175427.86000-2-sj@kernel.org
Fixes: f0c5118ebb0e ("mm/damon/sysfs: catch commit test ctx alloc failure")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a kernel-doc-like comment to damos_commit_dests() documenting its
allocation failure contract: on -ENOMEM, the destination structure is left
in a partially torn-down state that is safe to deallocate via
damon_destroy_scheme(), but must not be reused for further commits.
This was unclear from the code alone and led to a separate patch [1]
attempting to reset nr_dests on failure. Make the intended usage explicit
so future readers do not repeat the confusion.
Link: https://lkml.kernel.org/r/20260320143648.91673-1-sj@kernel.org
Link: https://lore.kernel.org/20260318214939.36100-1-objecting@objecting.org [1]
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_LRU_SORT, didn't support addr_unit. Hence DAMON_LRU_SORT was
silently ignoring the user input for addr_unit when the user did not
explicitly set the monitoring target regions and the default target
region was therefore used. No real problem from this has been reported
so far, but the implicit rule only makes things confusing. Also, the
default target region setup function has now been updated to support
addr_unit, so there is no reason to keep ignoring it. Respect the
user-passed addr_unit for the default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the past, damon_set_region_biggest_system_ram_default(), which is the
core function for setting the default monitoring target region of
DAMON_RECLAIM, didn't support addr_unit. Hence DAMON_RECLAIM was
silently ignoring the user input for addr_unit when the user did not
explicitly set the monitoring target regions and the default target
region was therefore used. No real problem from this has been reported
so far, but the implicit rule only makes things confusing. Also, the
default target region setup function has now been updated to support
addr_unit, so there is no reason to keep ignoring it. Respect the
user-passed addr_unit for the default target monitoring region use case.
Link: https://lkml.kernel.org/r/20260311052927.93921-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The third argument should be the length of the second parameter, but
addr_unit is wrongly being passed instead. Fix it.
Link: https://lkml.kernel.org/r/20260314001854.79623-1-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_set_region_biggest_system_ram_default()
damon_find_biggest_system_ram() was not supporting addr_unit in the past.
Hence, its caller, damon_set_region_biggest_system_ram_default(), was also
not supporting addr_unit. The previous commit updated the inner
function to support addr_unit, so there is no longer any reason for
damon_set_region_biggest_system_ram_default() not to support it;
keeping it unsupported only creates unnecessary inconsistency in
addr_unit support. Update it to receive addr_unit and handle it
internally.
Link: https://lkml.kernel.org/r/20260311052927.93921-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_find_biggest_system_ram() sets an 'unsigned long' variable with
'resource_size_t' value. This is fundamentally wrong. On environments
such as ARM 32 bit machines having LPAE (Large Physical Address
Extensions), which DAMON supports, the size of 'unsigned long' may be
smaller than that of 'resource_size_t'. It is safe, though, since we
restrict the walk to be done only up to ULONG_MAX.
DAMON bridges the address size gap using 'addr_unit'. We didn't add
that support to this function, just to keep the initial addr_unit
change small. Now the support has reasonably settled, and this kind of
gap only makes the code inconsistent and confusing. Add 'addr_unit'
support to the function, by letting callers pass 'addr_unit' and
handling it inside the function. All callers pass an 'addr_unit' of 1,
though, to keep the old behavior.
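The core idea of addr_unit can be sketched as follows; the helper name is hypothetical, but it illustrates why scaling a physical address down by addr_unit lets it fit DAMON's 'unsigned long' core address type even when 'resource_size_t' is wider (e.g. LPAE on 32-bit ARM):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's resource_size_t, which can be wider than
 * 'unsigned long' on 32-bit machines with large physical addressing. */
typedef uint64_t resource_size_t;

/* Hypothetical helper: scale a wide physical address down by
 * addr_unit so the result fits DAMON's core address type. */
static unsigned long phys_to_core_addr(resource_size_t phys,
				       unsigned long addr_unit)
{
	return (unsigned long)(phys / addr_unit);
}
```

With addr_unit 1 (what all current callers pass), the behavior is unchanged; a larger addr_unit compresses the address space into the representable range.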
[sj@kernel.org: verify found biggest system ram]
Link: https://lkml.kernel.org/r/20260317144725.88524-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: support addr_unit on default monitoring targets
for modules".
DAMON_RECLAIM and DAMON_LRU_SORT support 'addr_unit' parameters only when
the monitoring target address range is explicitly set. This was
intentional, to keep the initial 'addr_unit' support change small.
Now 'addr_unit' support has largely stabilized, and keeping this
corner case only makes the code inconsistent, governed by implicit
rules. The inconsistency makes it easy to confuse [1] readers. After
all, there is no real reason to keep 'addr_unit' support incomplete.
Add support for this case to improve readability and support
'addr_unit' more completely.
This series is constructed with five patches. The first one (patch 1)
fixes a small bug that mistakenly assigns inclusive end address to open
end address, which was found from this work. The second and third ones
(patches 2 and 3) extend the default monitoring target setting functions
in the core layer one by one, to support the 'addr_unit' while making no
visible changes. The final two patches (patches 4 and 5) update
DAMON_RECLAIM and DAMON_LRU_SORT to support 'addr_unit' for the default
monitoring target address ranges, by passing the user input to the core
functions.
This patch (of 5):
'struct damon_addr_range' and 'struct resource' represent different types
of address ranges. 'damon_addr_range' is for end-open ranges ([start,
end)). 'resource' is for fully-closed ranges ([start, end]). But
walk_system_ram() is assigning resource->end to damon_addr_range->end
without the inclusiveness adjustment. As a result, the function returns
an address range that is missing the last byte.
The function is used to find and set the biggest system RAM as the
default monitoring target for DAMON_RECLAIM and DAMON_LRU_SORT.
Missing the last byte of such a big range shouldn't be a real problem
for real use cases. That said, the loss is definitely unintended
behavior. Do the correct adjustment.
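The conversion can be sketched as below, with simplified standalone structs (not the kernel's 'struct resource' and 'struct damon_addr_range'); the fix is the `+ 1` turning an inclusive end into an exclusive one:

```c
#include <assert.h>

struct res { unsigned long start, end; };          /* [start, end]  */
struct damon_range { unsigned long start, end; };  /* [start, end)  */

/* Converting a fully-closed resource range to an end-open DAMON
 * range requires the inclusiveness adjustment on the end address. */
static void res_to_damon_range(const struct res *r, struct damon_range *d)
{
	d->start = r->start;
	d->end = r->end + 1; /* the previously-missing adjustment */
}
```

Without the `+ 1`, a 4 KiB resource [0x1000, 0x1fff] would become a 4095-byte DAMON range, losing the last byte.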
Link: https://lkml.kernel.org/r/20260311052927.93921-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260311052927.93921-2-sj@kernel.org
Link: https://lore.kernel.org/20260131015643.79158-1-sj@kernel.org [1]
Fixes: 43b0536cb471 ("mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yang yingliang <yangyingliang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Extend damos_commit_quota() kunit test for the newly added goal_tuner
parameter.
Link: https://lkml.kernel.org/r/20260310010529.91162-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a new DAMON sysfs interface file, namely 'goal_tuner' under the DAMOS
quotas directory. It is connected to the damos_quota->goal_tuner field.
Users can therefore select their favorite goal-based quotas tuning
algorithm by writing the name of the tuner to the file. Reading the file
returns the name of the currently selected tuner.
Link: https://lkml.kernel.org/r/20260310010529.91162-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a new goal-based DAMOS quota auto-tuning algorithm, namely
DAMOS_QUOTA_GOAL_TUNER_TEMPORAL ('temporal' in short). The algorithm
aims to trigger the DAMOS action only for a limited time, to achieve
the goal as soon as possible. During that period, it uses as much quota
as allowed. Once the goal is achieved, it sets the quota to zero,
effectively deactivating the scheme.
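The described policy can be sketched as a tiny decision function; the name, parameters, and the "higher current value means achieved" goal direction are illustrative assumptions, not the kernel implementation:

```c
#include <assert.h>

/* Sketch of the 'temporal' tuning policy: while the goal is
 * under-achieved, grant the full allowed quota; once it is achieved,
 * set the quota to zero, deactivating the scheme.
 * (Hypothetical helper; assumes higher current_value == achieved.) */
static unsigned long temporal_tune(unsigned long current_value,
				   unsigned long target_value,
				   unsigned long max_quota)
{
	return current_value >= target_value ? 0 : max_quota;
}
```

This contrasts with the default 'consist' tuner, which feedback-adjusts the quota toward a value it can hold steadily.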
Link: https://lkml.kernel.org/r/20260310010529.91162-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
User-set explicit quotas (the size and time quotas) having a zero
value means the quotas are unset, and the effective size quota is set
to the minimum of the explicit quotas. When quota goals are set, the
goal-based quota tuner can lower it further. But the single existing
tuner never sets the effective size quota to zero. Because of this,
DAMON core assumes a zero effective quota means the user has set no
quota.
Multiple tuners are now allowed, though. In the future, some tuners
might want to set a zero effective size quota, and there is no reason
to restrict that. With the current implementation, however, that would
only deactivate all quotas and make the scheme work at its full speed.
Introduce a dedicated function for checking whether no quota is set.
The function checks whether the user-set explicit quotas are zero and
no goal is installed. It is decoupled from a zero effective quota, and
hence allows future tuners to set a zero effective quota to
intentionally deactivate the scheme.
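The decoupling can be sketched as below, with a simplified standalone struct (not the kernel's 'struct damos_quota'); the point is that "no quota set" is decided from the user-set inputs only, never from the tuner-driven effective quota:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the quota state (illustrative fields). */
struct quota {
	unsigned long ms;   /* user-set time quota; 0 means unset */
	unsigned long sz;   /* user-set size quota; 0 means unset */
	int nr_goals;       /* stand-in for the installed goals list */
	unsigned long esz;  /* effective size quota (tuner output) */
};

/* "No quota set" depends only on the explicit quotas and goals,
 * so a tuner setting esz to zero no longer reads as "unset". */
static bool quota_is_unset(const struct quota *q)
{
	return !q->ms && !q->sz && !q->nr_goals;
}
```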
Link: https://lkml.kernel.org/r/20260310010529.91162-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: support multiple goal-based quota tuning
algorithms".
Aim-oriented DAMOS quota auto-tuning uses a single tuning algorithm.
The algorithm is designed to find a quota value that should be
consistently kept to achieve the aimed goal over the long term. It is
useful and reliable for automatically operating systems that have
dynamic environments over the long term.
As always, however, no single algorithm fits all. When the environment
has static characteristics, or there are control towers not only in
the kernel space but also in the user space, the algorithm shows some
limitations. In such environments, users want the kernel to work in a
more short-term, deterministic way. There have actually been at least
two reports [1,2] of such cases.
Extend the DAMOS quota goal to support multiple quota tuning
algorithms that users can select from. Keep the current algorithm as
the default, to not break old users, and give it a name, "consist", as
it is designed to "consistently" apply the DAMOS action. Also
introduce a new tuning algorithm, namely "temporal". It is designed to
apply the DAMOS action only temporarily, in a deterministic way. In
more detail, as long as the goal is under-achieved, it uses the
maximum quota available. Once the goal is over-achieved, it sets the
quota to zero.
Tests
=====
I confirmed the feature is working as expected using the latest version of
DAMON user-space tool, like below.
$ # start DAMOS for reclaiming memory aiming 30% free memory
$ sudo ./damo/damo start --damos_action pageout \
--damos_quota_goal_tuner temporal \
--damos_quota_goal node_mem_free_bp 30% 0 \
--damos_quota_interval 1s \
--damos_quota_space 100M
Note that version >=3.1.8 of the DAMON user-space tool supports this
feature (--damos_quota_goal_tuner). As expected, DAMOS stops
reclaiming memory as soon as the goal amount of free memory is
reached. When the 'consist' tuner is used, reclamation continues even
after the goal amount of free memory is reached, resulting in more
than the goal amount of free memory, as expected.
Patch Sequence
==============
The first four patches implement the feature. Patch 1 extends the
core API to allow multiple tuners and makes the current tuner the
default and only available tuner, namely 'consist'. Patch 2 allows
future tuners to set a zero effective quota. Patch 3 introduces the
second tuner, namely 'temporal'. Patch 4 further extends the DAMON
sysfs API to let users use it.
Three following patches (patches 5-7) update design, usage, and ABI
documents, respectively.
Final four patches (patches 8-11) are for adding tests. The eighth patch
(patch 8) extends the kunit test for online parameters commit for
validating the goal_tuner. The ninth and the tenth patches (patches 9-10)
extend the testing-purpose DAMON sysfs control helper and DAMON status
dumping tool to support the newly added feature. The final eleventh one
(patch 11) extends the existing online commit selftest to cover the new
feature.
This patch (of 11):
The DAMOS quota goal feature utilizes a single feedback-loop-based
algorithm for automatic tuning of the effective quota. It is useful
for long-term, kernel-only operation of systems in dynamic
environments. But no single algorithm fits all. It is not very easy to
control in environments having more static characteristics and
user-space control towers. We actually received multiple reports [1,2]
of use cases for which the algorithm is not optimal.
Introduce a new field of 'struct damos_quota', namely 'goal_tuner'. It
specifies which tuning algorithm the given scheme should use, and
allows DAMON API callers to set it as they want. Nonetheless, this
commit introduces no new tuning algorithm but only the interface, and
hence makes no behavioral change. A new algorithm will be added by the
following commit.
Link: https://lkml.kernel.org/r/20260310010529.91162-2-sj@kernel.org
Link: https://lore.kernel.org/CALa+Y17__d=ZsM1yX+MXx0ozVdsXnFqF4p0g+kATEitrWyZFfg@mail.gmail.com [1]
Link: https://lore.kernel.org/20260204022537.814-1-yunjeong.mun@sk.com [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_set_attrs() is called for multiple purposes from multiple places.
Calling it in an unsafe context can pollute DAMON's internal state and
result in unexpected behaviors. Clarify when it is safe to call, and
where it is being called from.
Link: https://lkml.kernel.org/r/20260307195356.203753-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Acked-by: wang lian <lianux.mm@gmail.com>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There was a bug [1] in damon_is_last_region(). Add a kunit test to not
reintroduce the bug.
Link: https://lkml.kernel.org/r/20260307195356.203753-3-sj@kernel.org
Link: https://lore.kernel.org/20260114152049.99727-1-sj@kernel.org/ [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: wang lian <lianux.mm@gmail.com>
Reviewed-by: wang lian <lianux.mm@gmail.com>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: improve/fixup/update ratio calculation, test and
documentation".
Yet another batch of misc/minor improvements and fixups. Use
mult_frac() instead of more error-prone open-coded arithmetic for rate
calculations (patch 1). Add a test for a previously found and fixed
bug (patch 2). Improve and update comments and documentation for
easier code review and up-to-date information (patches 3-6). Finally,
fix an obvious typo (patch 7).
This patch (of 7):
Multiple places in the core code open-code rate calculations. Use
mult_frac(), which is designed for doing this in a way that is safer
against overflow and precision loss.
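What the kernel's mult_frac(x, numer, denom) macro computes can be sketched as a portable function: `x * numer / denom` via a quotient/remainder split, so the intermediate `x * numer` product never has to be formed in full:

```c
#include <assert.h>

/* Portable sketch of the mult_frac() idea: computes x * numer / denom
 * exactly (floor semantics) while keeping intermediates small, because
 * quot * numer + rem * numer / denom == floor(x * numer / denom). */
static unsigned long mult_frac_ul(unsigned long x, unsigned long numer,
				  unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + rem * numer / denom;
}
```

A naive `x * numer / denom` with a large `x` can overflow before the division; the split form only multiplies the already-divided quotient and the small remainder.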
Link: https://lkml.kernel.org/r/20260307195356.203753-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260307195356.203753-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Acked-by: wang lian <lianux.mm@gmail.com>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_ctx->passed_sample_intervals and damon_ctx->next_*_sis are
unsigned long. They are compared in kdamond_fn() using normal
comparison operators, which are not overflow-safe. Use
time_after_eq(), which is safe against overflows when correctly used,
instead.
Link: https://lkml.kernel.org/r/20260307194915.203169-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_ctx->passed_sample_intervals and damos->next_apply_sis are
unsigned long, and are compared via normal comparison operators, which
are not overflow-safe. Use time_before(), which is safe against
overflow when correctly used, instead.
Link: https://lkml.kernel.org/r/20260307194915.203169-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe".
DAMON accounts time using its own jiffies-like time counter, namely
damon_ctx->passed_sample_intervals. The counter is incremented on each
iteration of kdamond_fn() main loop, which sleeps at least one sample
interval. Hence the name is like that.
DAMON has time-periodic operations, including monitoring results
aggregation and DAMOS action application. DAMON sets the next time to
do each of these operations in passed_sample_intervals units, does the
operation when the counter becomes equal to or larger than the pre-set
value, and then updates the next time for the operation. Note that the
operation is done not only when the values exactly match but also when
the time has passed, because the values can be updated for
online-committed DAMON parameters.
The counter is 'unsigned long' type, and the comparison is done using
normal comparison operators. It is not safe from overflows. This can
cause rare and limited but odd situations.
Let's suppose there is an operation that should be executed every 20
sampling intervals, and the passed_sample_intervals value for the next
execution of the operation is ULONG_MAX - 3. Once
passed_sample_intervals reaches ULONG_MAX - 3, the operation will be
executed, and the next time value for doing the operation becomes 16,
since ULONG_MAX - 3 + 20 overflows and wraps around. In the next
iteration of the kdamond_fn() main loop, passed_sample_intervals is
larger than the next operation time value, so the operation will be
executed again. It will continue being executed on every iteration,
until passed_sample_intervals also overflows.
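The overflow-safe comparison idiom behind helpers like time_after_eq() can be sketched as below: compare via a signed difference, so a counter that has wrapped past ULONG_MAX still compares as "later" (valid as long as the two values are within LONG_MAX of each other):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Sketch of the signed-difference comparison idiom: "has counter a
 * reached (or passed) counter b?", robust across wraparound. */
static bool counter_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}
```

Replaying the scenario above: after the wraparound, the next execution time is 16 while passed_sample_intervals is still near ULONG_MAX; the naive `>=` fires spuriously, while the signed-difference form correctly says "not yet".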
Note that this will not be common or problematic in the real world.
The sampling interval, which is the time taken for each
passed_sample_intervals increment, is 5 ms by default, and it is
usually [auto-]tuned to hundreds of milliseconds. That means it takes
about 248 days or 4,971 days to overflow on 32-bit machines when the
sampling interval is 5 ms or 100 ms, respectively (2^32 *
sampling_interval_in_seconds / 3600 / 24). On 64-bit machines, the
numbers become roughly 2.9 billion and 58 billion years. So the real
user impact is negligible. But this is still better fixed, as long as
the fix is simple and efficient.
Fix this by simply replacing the overflow-unsafe native comparison
operators with the existing overflow-safe time comparison helpers.
The first patch only cleans up the next DAMOS action application time
setup for consistency and reduced code. The second and the third patches
update DAMOS action application time setup and rest, respectively.
This patch (of 3):
There is a function for damos->next_apply_sis setup. But some places are
open-coding it. Consistently use the helper.
Link: https://lkml.kernel.org/r/20260307194915.203169-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: test and document power-of-2 min_region_sz
requirement".
Since commit c80f46ac228b ("mm/damon/core: disallow non-power of two
min_region_sz"), min_region_sz is always restricted to be a power of two.
Add a kunit test to confirm the functionality. The change also adds a
restriction to the addr_unit parameter; clarify it in the document.
This patch (of 2):
Add a kunit test confirming that the change made in commit
c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz")
functions as expected.
Link: https://lkml.kernel.org/r/20260307194222.202075-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
CONFIG_DAMON_DEBUG_SANITY is recommended for DAMON development and test
setups. Enable it on the default configurations for DAMON kunit test run.
Link: https://lkml.kernel.org/r/20260306152914.86303-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
At the time of damon_reset_aggregated(), aggregation for the interval
should be complete, and hence nr_accesses and nr_accesses_bp should
match. I found a few bugs in the past that caused this to be broken,
from online parameter updates and complicated nr_accesses handling
changes. Add a sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_split_region_at() should be called with the correct address to split
on. Add a sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_merge_regions_of() should be called only after aggregation is
finished and therefore each region's nr_accesses and nr_accesses_bp match.
There were bugs that broke the assumption, during development of online
DAMON parameter updates and monitoring results handling changes. Add a
sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Data corruption could cause damon_merge_two_regions() to create
zero-length DAMON regions. Add a sanity check for that under
CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_target->nr_regions was introduced to get the number of regions
quickly, without always having to iterate over them. Add a sanity
check for that under
CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_del_region() should be called for targets that have one or more
regions. Add a sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_new_region() is supposed to be called with only valid address range
arguments. Do the check under DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: add optional debugging-purpose sanity checks".
DAMON code has a few assumptions that can be critical if violated.
Validating those assumptions in code can be useful for finding such
critical bugs. I was actually adding some such additional sanity
checks in my personal tree, and they were useful for finding bugs I
made during development of new patches. We also found [1] that the
assumptions are sometimes misunderstood; the validation can serve as
good documentation for such cases.
Add some such debugging-purpose sanity checks. Because the additional
checks can impose more overhead, make them optional via a new config,
CONFIG_DAMON_DEBUG_SANITY, which is recommended only for development
and test setups. As recommended, enable it for DAMON kunit tests and
selftests.
Note that the verification only does WARN_ON() for each violation.
The developer or tester may want to also set panic_on_oops, like
damon-tests/corr does [2].
This patch (of 10):
Add a new build config that will enable additional DAMON sanity checks.
It is recommended to be enabled only on development and test setups,
since it can impose additional overhead.
Link: https://lkml.kernel.org/r/20260306152914.86303-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260306152914.86303-2-sj@kernel.org
Link: https://lore.kernel.org/20251231070029.79682-1-sj@kernel.org [1]
Link: https://github.com/damonitor/damon-tests/commit/a80fbee55e272f151b4e5809ee85898aea33e6ff [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a kunit test for the functionality of damon_apply_min_nr_regions().
Link: https://lkml.kernel.org/r/20260228222831.7232-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The previous commit made DAMON core split regions at the beginning for
min_nr_regions. The virtual address space operation set (vaddr) does
similar work on its own, for the case where the user delegates the
entire initial monitoring regions setup to vaddr. That is unnecessary
now, as DAMON core will do similar work in all cases. Remove the
duplicated work in vaddr. Also remove a helper function that was used
only for that work, and the test code for the helper function.
Link: https://lkml.kernel.org/r/20260228222831.7232-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: strictly respect min_nr_regions".
DAMON core respects min_nr_regions only at the merge operation, so
DAMON API callers are responsible for respecting or ignoring it. Only
the vaddr ops set respects it, and only at the initial start time.
The DAMON sysfs interface allows users to set up initial regions that
DAMON core also respects, but again, only at the initial time.
Setting up the regions to satisfy min_nr_regions can be difficult and
inefficient for users when the min_nr_regions value is high. There
was actually a report [1] from a user; the use case was page-granular
access monitoring with a large aggregation interval.
Make the following three changes for resolving the issue. First (patch
1), make DAMON core split regions at the beginning and every aggregation
interval, to respect the min_nr_regions. Second (patch 2), drop the
vaddr's split operations and related code that are no longer needed.
Third (patch 3), add a kunit test for the newly introduced function.
This patch (of 3):
DAMON core layer respects the min_nr_regions parameter by setting the
maximum size of each region as total monitoring region size divided by the
parameter value. And the limit is applied by preventing merge of regions
that result in a region larger than the maximum size. The limit is
updated per ops update interval, because vaddr updates the monitoring
regions on the ops update callback.
It does nothing for the beginning state. That's because the users can set
the initial monitoring regions as they want. That is, if the users really
care about the min_nr_regions, they are supposed to set the initial
monitoring regions to have more than min_nr_regions regions. The virtual
address space operation set, vaddr, has an exceptional case. Users can
ask the ops set to configure the initial regions on its own. For the
case, vaddr sets up the initial regions to meet the min_nr_regions. So,
vaddr has exceptional support, but basically users are required to set the
regions on their own if they want min_nr_regions to be respected.
When 'min_nr_regions' is high, such an initial setup is difficult.
If the DAMON sysfs interface is used for it, the memory for saving
the initial setup is also wasted.
Even if the user forgoes the setup, DAMON will eventually make more
than min_nr_regions regions via its splitting operations. But that
takes time, and if the aggregation interval is long, the delay could
be problematic. There was actually a report [1] of this case: the
reporter wanted to do page-granular monitoring with a large
aggregation interval.
Also, DAMON is doing nothing for online changes on monitoring regions and
min_nr_regions. For example, the user can remove a monitoring region or
increase min_nr_regions while DAMON is running.
Split regions larger than the size at the beginning of the kdamond main
loop, to fix the initial setup issue. Also do the split every aggregation
interval, for online changes. This means the behavior is slightly
changed. It is difficult to imagine a use case that actually depends on
the old behavior, though. So this change is arguably fine.
Note that the size limit is aligned by damon_ctx->min_region_sz and cannot
be zero. That is, if min_nr_regions is larger than the total size of the
monitoring regions divided by ->min_region_sz, the parameter cannot be
respected.
Link: https://lkml.kernel.org/r/20260228222831.7232-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260228222831.7232-2-sj@kernel.org
Link: https://lore.kernel.org/CAC5umyjmJE9SBqjbetZZecpY54bHpn2AvCGNv3aF6J=1cfoPXQ@mail.gmail.com [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kdamond_apply_schemes() is using damon_for_each_region_safe(), which is
safe for deallocation of the region inside the loop. However, the loop
internal logic does not deallocate regions. Hence the safe variant only
wastes the extra next-pointer bookkeeping. It also causes a problem.
When an address filter is applied, and there is a region that intersects
with the filter, the filter splits the region on the filter boundary. The
intention is to let DAMOS apply action to only filtered-in address ranges.
However, it is using damon_for_each_region_safe(), which latches the next
region before the loop body runs. Hence the newly split region, which now
sits right after the current region, is simply skipped. As a result, DAMOS
applies the action to target regions a bit slower than expected when an
address filter is used. This shouldn't be a big problem, but it is
definitely better to fix. damos_skip_charged_region() was working
around the issue using a double pointer hack.
Use damon_for_each_region(), which is safe for this use case, and drop
the workaround in damos_skip_charged_region().
Link: https://lkml.kernel.org/r/20260227170623.95384-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/core: improve DAMOS quota efficiency for core layer
filters".
Improve the two problematic behaviors of DAMOS described below, which
make it less efficient when core layer filters are used.
DAMOS generates the access temperature histogram [1] that it uses for
prioritizing under-quota regions from only the scheme's target access
pattern. DAMOS filters are ignored when building the histogram, and this
can result in the scheme not being applied to eligible regions. To work
around this, users had to use separate DAMON contexts. The memory tiering
approaches are such examples.
DAMOS splits regions that intersect with address filters, so that only
the filtered-out part of the region is skipped. But the implementation
also skips the other part of the region, which is not filtered out. As a
result, DAMOS can work slower than expected.
Improve the two inefficient behaviors with two patches, respectively.
Read the patches for more details about the problem and how those are
fixed.
This patch (of 2):
The histogram for under-quota region prioritization [1] is made for all
regions that are eligible for the DAMOS target access pattern. When there
are DAMOS filters, the prioritization-threshold access temperature that
is generated from the histogram can be inaccurate.
For example, suppose there are three regions. Each region is 1 GiB. The
access temperature of the regions are 100, 50, and 0. And a DAMOS scheme
that targets _any_ access temperature with quota 2 GiB is being used. The
histogram will look like below:
    temperature    size of regions having >= temperature
    0              3 GiB
    50             2 GiB
    100            1 GiB
Based on the histogram and the quota (2 GiB), DAMOS applies the action to
only the regions having >=50 temperature. This is all good.
Let's suppose the region of temperature 50 is excluded by a DAMOS filter.
Regardless of the filter, DAMOS will try to apply the action on only
regions having >=50 temperature. Because the region of temperature 50 is
filtered out, the action is applied to only the region of temperature 100.
Worse yet, suppose the filter is excluding regions of temperature 50 and
100. Then no action is really applied to any region, while the region of
temperature 0 is there.
People used to work around this by utilizing multiple contexts, instead of
the core layer DAMOS filters. For example, DAMON-based memory tiering
approaches including the quota auto-tuning based one [2] are using a DAMON
context per NUMA node. If the above-explained issue is effectively
alleviated, those can be configured again to run with a single context
and DAMOS filters for applying the promotion and demotion to only
specific NUMA nodes.
Alleviate the problem by checking core DAMOS filters when generating the
histogram. The reason to check only core filters is the overhead. While
core filters are usually for coarse-grained filtering (e.g.,
target/address filters for process, NUMA, zone level filtering), operation
layer filters are usually for fine-grained filtering (e.g., for anon
page). Doing this for operation layer filters would cause significant
overhead. There is no known use case that is affected by the operation
layer filters-distorted histogram problem, though. Do this for only core
filters for now. We will revisit this for operation layer filters in the
future. We might be able to apply a sort of sampling-based operation
layer filtering.
After this fix is applied, for the first case, where a DAMOS filter
excludes the region of temperature 50, the histogram will be like below:
    temperature    size of regions having >= temperature
    0              2 GiB
    100            1 GiB
And DAMOS will set the temperature threshold as 0, allowing the action
to be applied to both the regions of temperatures 0 and 100.
For the second case, where a DAMOS filter excludes the regions of
temperatures 50 and 100, the histogram will be like below:
    temperature    size of regions having >= temperature
    0              1 GiB
And DAMOS will set the temperature threshold as 0, allowing the action
to be applied to the region of temperature 0.
[1] 'Prioritization' section of Documentation/mm/damon/design.rst
[2] commit 0e1c773b501f ("mm/damon/core: introduce damos quota goal
metrics for memory node utilization")
Link: https://lkml.kernel.org/r/20260227170623.95384-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260227170623.95384-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_target is not used by the get_scheme_score operations, neither
with virtual nor with physical addresses.
Link: https://lkml.kernel.org/r/20260213145032.1740407-1-gutierrez.asier@huawei-partners.com
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Quanmin Yan <yanquanmin1@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_repeat_call_fn() calls damon_sysfs_upd_tuned_intervals(),
damon_sysfs_upd_schemes_stats(), and
damon_sysfs_upd_schemes_effective_quotas() without checking contexts->nr.
If nr_contexts is set to 0 via sysfs while DAMON is running, these
functions dereference contexts_arr[0] and cause a NULL pointer
dereference. Add the missing check.
For example, the issue can be reproduced using DAMON sysfs interface and
DAMON user-space tool (damo) [1] like below.
$ sudo damo start --refresh_interval 1s
$ echo 0 | sudo tee \
/sys/kernel/mm/damon/admin/kdamonds/0/contexts/nr_contexts
Link: https://patch.msgid.link/20260320163559.178101-3-objecting@objecting.org
Link: https://lkml.kernel.org/r/20260321175427.86000-4-sj@kernel.org
Link: https://github.com/damonitor/damo [1]
Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Multiple sysfs command paths dereference contexts_arr[0] without first
verifying that kdamond->contexts->nr == 1. A user can set nr_contexts to
0 via sysfs while DAMON is running, causing NULL pointer dereferences.
In more detail, the issue can be triggered by privileged users like
below.
First, start DAMON and make contexts directory empty
(kdamond->contexts->nr == 0).
# damo start
# cd /sys/kernel/mm/damon/admin/kdamonds/0
# echo 0 > contexts/nr_contexts
Then, each of below commands will cause the NULL pointer dereference.
# echo update_schemes_stats > state
# echo update_schemes_tried_regions > state
# echo update_schemes_tried_bytes > state
# echo update_schemes_effective_quotas > state
# echo update_tuned_intervals > state
Guard all commands (except OFF) at the entry point of
damon_sysfs_handle_cmd().
Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org
Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|