<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/tools/testing/vma, branch v6.16</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.16</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.16'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-06-13T04:39:02Z</updated>
<entry>
<title>mm: add mmap_prepare() compatibility layer for nested file systems</title>
<updated>2025-06-13T04:39:02Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-06-09T16:57:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bb666b7c27073b986b75699e51a7102910f58060'/>
<id>urn:sha1:bb666b7c27073b986b75699e51a7102910f58060</id>
<content type='text'>
Nested file systems, that is those which invoke call_mmap() within their
own f_op-&gt;mmap() handlers, may encounter underlying file systems which
provide the f_op-&gt;mmap_prepare() hook introduced by commit c84bf6dd2b83
("mm: introduce new .mmap_prepare() file callback").

We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op-&gt;mmap from an .mmap_prepare() hook.

So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.

In order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op-&gt;mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.
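
<![CDATA[
As a rough userland illustration of that round trip (this is not the kernel
implementation), the three steps can be modelled as below.  Every toy_* type,
function and flag value is an assumption made for this sketch; the real VMA
and struct vm_area_desc carry considerably more state.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EINVAL 22	/* stands in for EINVAL; value assumed */

/* Toy stand-in for the descriptor handed to an .mmap_prepare() hook. */
struct toy_desc {
	unsigned long start, end;	/* treated as immutable by hooks */
	unsigned long pgoff;
	unsigned long vm_flags;
	const void *vm_ops;
	void *private_data;
};

/* Toy stand-in for a VMA; only the fields the sketch needs. */
struct toy_vma {
	unsigned long start, end;
	unsigned long pgoff;
	unsigned long vm_flags;
	const void *vm_ops;
	void *private_data;
};

/* Hypothetical hook type, loosely modelled on f_op->mmap_prepare(). */
typedef int (*toy_mmap_prepare_t)(struct toy_desc *desc);

/* Step 1: snapshot the VMA state into a descriptor. */
static struct toy_desc *toy_vma_to_desc(const struct toy_vma *vma,
					struct toy_desc *desc)
{
	desc->start = vma->start;
	desc->end = vma->end;
	desc->pgoff = vma->pgoff;
	desc->vm_flags = vma->vm_flags;
	desc->vm_ops = vma->vm_ops;
	desc->private_data = vma->private_data;
	return desc;
}

/* Step 3: copy back only the fields a hook is permitted to change. */
static void toy_set_vma_from_desc(struct toy_vma *vma,
				  const struct toy_desc *desc)
{
	vma->pgoff = desc->pgoff;
	vma->vm_flags = desc->vm_flags;
	vma->vm_ops = desc->vm_ops;
	vma->private_data = desc->private_data;
}

/* An .mmap()-style entry point that drives an .mmap_prepare()-style hook. */
static int toy_compat_mmap_prepare(toy_mmap_prepare_t hook,
				   struct toy_vma *vma)
{
	struct toy_desc desc;
	int err;

	assert(hook != NULL);
	err = hook(toy_vma_to_desc(vma, &desc));	/* step 2 */
	if (err)
		return err;	/* the VMA itself was never touched */
	toy_set_vma_from_desc(vma, &desc);
	return 0;
}

/* Example hook: sets one arbitrary, illustrative flag bit. */
static int toy_hook_set_flag(struct toy_desc *desc)
{
	desc->vm_flags |= 0x4UL;
	return 0;
}

/* Example hook: scribbles on the descriptor, then fails. */
static int toy_hook_fail(struct toy_desc *desc)
{
	desc->vm_flags = 0;
	return -TOY_EINVAL;
}
```

In this model a failing hook only ever modifies the on-stack descriptor, so
the caller's VMA is left exactly as it was.
]]>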

This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op-&gt;mmap_prepare() is specified in the
passed in file pointer.

We place the fundamental logic into mm/vma.h where VMA manipulation
belongs.  We also update the VMA userland tests to accommodate the
changes.

The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.

We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.

[akpm@linux-foundation.org: export compat_vma_mmap_prepare to fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations]
  Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local
Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback").
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb: unshare page tables during VMA split, not before</title>
<updated>2025-06-06T05:02:24Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2025-05-27T21:23:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=081056dc00a27bccb55ccc3c6f230a3d5fd3f7e0'/>
<id>urn:sha1:081056dc00a27bccb55ccc3c6f230a3d5fd3f7e0</id>
<content type='text'>
Currently, __split_vma() triggers hugetlb page table unsharing through
vm_ops-&gt;may_split().  This happens before the VMA lock and rmap locks are
taken - which is too early, as it allows racing VMA-locked page faults in our
process and racing rmap walks from other processes to cause page tables to
be shared again before we actually perform the split.

Fix it by explicitly calling into the hugetlb unshare logic from
__split_vma() in the same place where THP splitting also happens.  At that
point, both the VMA and the rmap(s) are write-locked.

An annoying detail is that we can now call into the helper
hugetlb_unshare_pmds() from two different locking contexts:

1. from hugetlb_split(), holding:
    - mmap lock (exclusively)
    - VMA lock
    - file rmap lock (exclusively)
2. hugetlb_unshare_all_pmds(), which I think is designed to be able to
   call us with only the mmap lock held (in shared mode), but currently
   only runs while holding mmap lock (exclusively) and VMA lock

Backporting note:
This commit fixes a racy protection that was introduced in commit
b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that
commit claimed to fix an issue introduced in 5.13, but it should actually
also go all the way back.

[jannh@google.com: v2]
  Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs]
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>tools/testing/vma: add missing function stub</title>
<updated>2025-06-01T05:46:14Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-05-28T14:15:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=918850c13608c7b138512c2ecbfd3436b7a51797'/>
<id>urn:sha1:918850c13608c7b138512c2ecbfd3436b7a51797</id>
<content type='text'>
The hugetlb fix introduced in commit ee40c9920ac2 ("mm: fix copy_vma()
error handling for hugetlb mappings") mistakenly did not provide a stub
for the VMA userland testing, which results in a compile error when trying
to build this.

Provide this stub to resolve the issue.

Link: https://lkml.kernel.org/r/20250528-fix-vma-test-v1-1-c8a5f533b38f@oracle.com
Fixes: ee40c9920ac2 ("mm: fix copy_vma() error handling for hugetlb mappings")
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: introduce new .mmap_prepare() file callback</title>
<updated>2025-05-13T23:28:07Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-05-09T12:13:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c84bf6dd2b836b49bb2662668ff1692350d28236'/>
<id>urn:sha1:c84bf6dd2b836b49bb2662668ff1692350d28236</id>
<content type='text'>
Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2.

During the mmap() of a file-backed mapping, we invoke the underlying
driver file's mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.

This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.

However, we have had a fly in the ointment remain - drivers have a great
deal of freedom in the .mmap() hook to manipulate VMA state (as well as
page table state).

This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).

In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and it may do so
and then raise an error, requiring very careful unwinding of state about
which we can make no assumptions.

Rather than providing such an open-ended interface, this series provides
an alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:

struct vm_area_desc {
	/* Immutable state. */
	struct mm_struct *mm;
	unsigned long start;
	unsigned long end;

	/* Mutable fields. Populated with initial state. */
	pgoff_t pgoff;
	struct file *file;
	vm_flags_t vm_flags;
	pgprot_t page_prot;

	/* Write-only fields. */
	const struct vm_operations_struct *vm_ops;
	void *private_data;
};

The mmap logic then updates the state used to either merge with a VMA or
establish a new VMA based upon this logic.

This is achieved via new file hook .mmap_prepare(), which is, importantly,
invoked very early on in the mmap() process.

If an error arises, we can very simply abort the operation with very
little unwinding of state required.
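
<![CDATA[
To make the shape of such a hook concrete, here is a minimal userland sketch
of a hypothetical driver's .mmap_prepare() callback.  The toy_* names, the
flag values and the cut-down descriptor (mm, file and page_prot are omitted)
are assumptions for illustration only and are not taken from any real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits and errno; the values are made up for this sketch. */
#define TOY_VM_SHARED	0x1UL
#define TOY_VM_LOCKED	0x2UL
#define TOY_VM_DONTDUMP	0x4UL
#define TOY_EINVAL	22

struct toy_vm_ops {
	const char *name;
};

/* Cut-down userland mirror of the descriptor layout quoted above. */
struct toy_vm_area_desc {
	/* Immutable state. */
	unsigned long start;
	unsigned long end;

	/* Mutable fields. Populated with initial state. */
	unsigned long pgoff;
	unsigned long vm_flags;

	/* Write-only fields. */
	const struct toy_vm_ops *vm_ops;
	void *private_data;
};

static const struct toy_vm_ops toy_ops = { "toy" };

/* Hypothetical hook: validate first, then adjust only permitted fields. */
static int toy_mmap_prepare(struct toy_vm_area_desc *desc)
{
	unsigned long len;

	assert(desc != NULL);
	len = desc->end - desc->start;

	/* Decide from the initial state; fail before anything is set up. */
	if (len == 0 || (desc->vm_flags & TOY_VM_SHARED) == 0)
		return -TOY_EINVAL;

	desc->vm_flags |= TOY_VM_LOCKED | TOY_VM_DONTDUMP;
	desc->vm_ops = &toy_ops;
	return 0;
}
```

Because the hook rejects bad requests before writing to the descriptor, the
error path in this sketch has nothing to unwind.
]]>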

The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.

Right now the logic will retry a merge like this only if the driver
changes VMA flags, and changes them in such a way that a merge might
succeed (that is, the flags are not 'special', that is do not contain any
of the flags specified in VM_SPECIAL).

This has also been the source of a great deal of pain - it's hard to
reason about an .mmap() callback that might do - anything - but it's also
hard to reason about setting up a VMA and writing to the maple tree, only
to do it again utilising a great deal of shared state.

Since .mmap_prepare() sets fields before the first merge is even
attempted, the use of this callback obviates the need for this retry merge
logic.

A driver may only specify .mmap_prepare() or the deprecated .mmap()
callback.  In future we may add further callbacks beyond .mmap_prepare() to
facilitate all use cases as we convert drivers.

In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that a.  the
VMA flags changed and b.  this would be mergeable.

In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other
unmergeable VM_SPECIAL flags.

Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:

* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()

For this remerge to not occur and to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.

This is a very unlikely usecase and so it does not appear to be worthwhile
to adjust this functionality accordingly.

We can, however, very quickly do so if needed by simply adding an
.mmap_prepare() callback to these as required.

There are two further non-DAX cases I identified:

* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
  VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.

Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.

Finally, we are left with a viable case:

* secretmem_mmap() - Set VM_LOCKED, VM_DONTDUMP.

This is viable enough that the mm selftests trigger the logic as a matter
of course.  Therefore, this series replaces the .secretmem_mmap() hook with
.secret_mmap_prepare().


This patch (of 3):

Provide a means by which drivers can specify which fields of those
permitted to be changed should be altered prior to mmap()'ing a range
(which may either result from a merge or from mapping an entirely new
VMA).

Doing so is substantially safer than the existing .mmap() callback which
provides unrestricted access to the part-constructed VMA and permits
drivers and file systems to do 'creative' things which makes it hard to
reason about the state of the VMA after the function returns.

The existing .mmap() callback's freedom has caused a great deal of issues,
especially in error handling, as unwinding the mmap() state has proven to
be non-trivial and caused significant issues in the past, for instance
those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour").

It also necessitates a second attempt at merge once the .mmap() callback
has completed, which has caused issues in the past, is awkward, adds
overhead and is difficult to reason about.

The .mmap_prepare() callback eliminates this requirement, as we can update
fields prior to even attempting the first merge.  It is safer, as we
heavily restrict what can actually be modified, and being invoked very
early in the mmap() process, error handling can be performed safely with
very little unwinding of state required.

The .mmap_prepare() and deprecated .mmap() callbacks are mutually
exclusive, so we permit only one to be invoked at a time.

Update vma userland test stubs to account for changes.

Link: https://lkml.kernel.org/r/cover.1746792520.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: perform VMA allocation, freeing, duplication in mm</title>
<updated>2025-05-13T06:50:48Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-04-28T15:28:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3e43e260f1e44d21861815faa905a1829027600f'/>
<id>urn:sha1:3e43e260f1e44d21861815faa905a1829027600f</id>
<content type='text'>
Right now these are performed in kernel/fork.c which is odd and a
violation of separation of concerns, as well as preventing us from
integrating this and related logic into userland VMA testing going
forward.

There is a fly in the ointment - nommu - mmap.c is not compiled if
CONFIG_MMU not set, and neither is vma.c.

To square the circle, let's add a new file - vma_init.c.  This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.

This allows us to de-duplicate code, while maintaining separation of
concerns and the ability for us to userland test this logic.

Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change was triggering the assert for this.

[akpm@linux-foundation.org: remove stray newline, per Liam]
Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Kees Cook &lt;kees@kernel.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: abstract initial stack setup to mm subsystem</title>
<updated>2025-05-13T06:50:48Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-04-28T15:28:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dd7a6246f4fd6e8a6dcb08f1f51c899f3e0d3b83'/>
<id>urn:sha1:dd7a6246f4fd6e8a6dcb08f1f51c899f3e0d3b83</id>
<content type='text'>
There are peculiarities within the kernel where what is very clearly mm
code is performed elsewhere arbitrarily.

This violates separation of concerns and makes it harder to refactor code
to make changes to how fundamental initialisation and operation of mm
logic is performed.

One such case is the creation of the VMA containing the initial stack upon
execve()'ing a new process.  This is currently performed in
__bprm_mm_init() in fs/exec.c.

Abstract this operation to create_init_stack_vma().  This allows us to
limit use of vma allocation and free code to fork and mm only.

We previously did the same for the step at which we relocate the initial
stack VMA downwards via relocate_vma_down(), now we move the initial VMA
establishment too.

Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's
no longer needed anywhere outside of mm.

Link: https://lkml.kernel.org/r/118c950ef7a8dd19ab20a23a68c3603751acd30e.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Reviewed-by: Kees Cook &lt;kees@kernel.org&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: establish mm/vma_exec.c for shared exec/mm VMA functionality</title>
<updated>2025-05-13T06:50:48Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-04-28T15:28:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6c36ac1e124f1be97cf0485a220865fce5a2020d'/>
<id>urn:sha1:6c36ac1e124f1be97cf0485a220865fce5a2020d</id>
<content type='text'>
Patch series "move all VMA allocation, freeing and duplication logic to
mm", v3.

Currently VMA allocation, freeing and duplication exist in kernel/fork.c,
which is a violation of separation of concerns, and leaves these functions
exposed to the rest of the kernel when they are in fact internal
implementation details.

Resolve this by moving this logic to mm, and making it internal to vma.c,
vma.h.

This also allows us, in future, to provide userland testing around this
functionality.

We additionally abstract dup_mmap() to mm, being careful to ensure
kernel/fork.c accesses this via the mm internal header so it is not exposed
elsewhere in the kernel.

As part of this change, also abstract initial stack allocation performed
in __bprm_mm_init() out of fs code into mm via the
create_init_stack_vma(), as this code uses vm_area_alloc() and
vm_area_free().

In order to do so sensibly, we introduce a new mm/vma_exec.c file, which
contains the code that is shared by mm and exec.  This file is added to
both memory mapping and exec sections in MAINTAINERS so both sets of
maintainers can maintain oversight.

As part of this change, we also move relocate_vma_down() to mm/vma_exec.c
so all shared mm/exec functionality is kept in one place.

We add code shared between nommu and mmu-enabled configurations in order
to share VMA allocation, freeing and duplication code correctly while also
keeping these functions available in userland VMA testing.

This is achieved by adding a mm/vma_init.c file which is also compiled by
the userland tests.


This patch (of 4):

There is functionality that overlaps the exec and memory mapping
subsystems.  While it properly belongs in mm, it is important that exec
maintainers maintain oversight of this functionality correctly.

We can establish both goals by adding a new mm/vma_exec.c file which
contains these 'glue' functions, and have fs/exec.c import them.

As a part of this change, to ensure that proper oversight is achieved, add
the file to both the MEMORY MAPPING and EXEC &amp; BINFMT API, ELF sections.

scripts/get_maintainer.pl can correctly handle files in multiple entries
and this neatly handles the cross-over.

[akpm@linux-foundation.org: fix comment typo]
  Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local
Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Kees Cook &lt;kees@kernel.org&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vma: fix incorrectly disallowed anonymous VMA merges</title>
<updated>2025-05-12T00:48:26Z</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-04-08T09:29:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=879bca0a2c4f40b08d09a95a2a0c3c6513060b5c'/>
<id>urn:sha1:879bca0a2c4f40b08d09a95a2a0c3c6513060b5c</id>
<content type='text'>
Patch series "fix incorrectly disallowed anonymous VMA merges", v2.

It appears that we have been incorrectly rejecting merge cases for 15
years, apparently by mistake.

Imagine a range of anonymous mapped memory divided into two VMAs like
this, with incompatible protection bits:

              RW         RWX
	  unfaulted    faulted
	|-----------|-----------|
	|    prev   |    vma    |
	|-----------|-----------|
	             mprotect(RW)

Now imagine mprotect()'ing vma so it is RW. This appears as if it should
merge, but it does not.

Neither does this case, again mprotect()'ing vma RW:

              RWX        RW
	   faulted    unfaulted
	|-----------|-----------|
	|    vma    |   next    |
	|-----------|-----------|
	 mprotect(RW)

Nor:

              RW         RWX          RW
	  unfaulted    faulted    unfaulted
	|-----------|-----------|-----------|
	|    prev   |    vma    |    next   |
	|-----------|-----------|-----------|
	             mprotect(RW)

What's going on here?

In commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process
server scalability issue"), from 2010, Rik van Riel took care to
account for these cases - commenting that '[this is] easily overlooked:
when mprotect shifts the boundary, make sure the expanding vma has
anon_vma set if the shrinking vma had, to cover any anon pages imported.'

However, commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs")
introduced a little over a year later, appears to have accidentally
disallowed this.

By adjusting the is_mergeable_anon_vma() function to avoid lock contention
across large trees of forked anon_vma's, this commit wrongly assumed the
VMA being checked (the ostensible merge 'target') should be faulted, that
is, have an anon_vma, and thus an anon_vma_chain list established, but
only of length 1.

This appears to have been unintentional, as disallowing empty target VMAs
like this across the board makes no sense.

We already have logic that accounts for this case, the same logic Rik
introduced in 2010, now via dup_anon_vma() (and ultimately
anon_vma_clone()), so there is no problem permitting this.

This series fixes this mistake and also ensures that scalability concerns
remain addressed by explicitly checking that whatever VMA is being merged
has not been forked.

A full set of self tests which reproduce the issue are provided, as well
as updating userland VMA tests to assert this behaviour.

The self tests additionally assert scalability concerns are addressed.


This patch (of 3):

anon_vma_chain's were introduced by Rik van Riel in commit 5beb49305251
("mm: change anon_vma linking to fix multi-process server scalability
issue").

This patch was introduced in March 2010.  As part of this change, careful
attention was paid to the instance of mprotect() causing a VMA merge, with
one faulted (i.e.  having anon_vma set) and another not:

		/*
		 * Easily overlooked: when mprotect shifts the boundary,
		 * make sure the expanding vma has anon_vma set if the
		 * shrinking vma had, to cover any anon pages imported.
		 */

In the modern VMA code, this is handled in dup_anon_vma() (and ultimately
anon_vma_clone()).

This case is one of the three configurations of adjacent VMA anon_vma
state that we might encounter on merge (where dst is the VMA which will be
merged into and src the one being merged into dst):

1.  dst-&gt;anon_vma,  src-&gt;anon_vma - These must be equal, no-op.
2.  dst-&gt;anon_vma, !src-&gt;anon_vma - We simply use dst-&gt;anon_vma, no-op.
3. !dst-&gt;anon_vma,  src-&gt;anon_vma - The case in question here.

In case 3, the instance addressed here - we duplicate the AVC connections
from src and place into dst.
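
<![CDATA[
The three configurations can be sketched as a small userland classifier.
This is only a model of the case analysis above: the toy_* names and the
enum are assumptions for illustration, and the two extra outcomes (neither
VMA faulted, or two different anon_vma's) are included purely to make the
function total.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct anon_vma; only pointer identity matters here. */
struct toy_anon_vma {
	int id;
};

enum toy_merge_action {
	TOY_NOOP_SHARED,	/* case 1: both set and equal, nothing to do */
	TOY_USE_DST,		/* case 2: only dst set, keep dst's anon_vma */
	TOY_DUP_FROM_SRC,	/* case 3: only src set, duplicate into dst */
	TOY_NOTHING_TO_DO,	/* neither set: not one of the listed cases */
	TOY_INCOMPATIBLE	/* both set but different: cannot merge */
};

/* Classify a proposed merge of src into dst by anon_vma state alone. */
static enum toy_merge_action toy_classify(const struct toy_anon_vma *dst,
					  const struct toy_anon_vma *src)
{
	if (dst && src)
		return dst == src ? TOY_NOOP_SHARED : TOY_INCOMPATIBLE;
	if (dst)
		return TOY_USE_DST;
	if (src)
		return TOY_DUP_FROM_SRC;
	return TOY_NOTHING_TO_DO;
}

/* Usage: an unfaulted dst next to a faulted src is case 3. */
static void toy_example(void)
{
	struct toy_anon_vma av = { 1 };

	assert(toy_classify(NULL, &av) == TOY_DUP_FROM_SRC);
}
```

The sketch deliberately ignores the forked-VMA scalability check discussed
below; it covers only which anon_vma handling a merge would need.
]]>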

However, in practice, we very often do NOT do this.

This appears to be due to an inadvertent consequence of the change
introduced by commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"),
introduced in May 2011.

This implies that this merge case was functional only for a little over a
year, and has since been broken for ~15 years.

Here, lock scalability concerns led to us restricting anonymous merges
only to those VMAs with 1 entry in their vma-&gt;anon_vma_chain, that is, a
VMA that is not connected to any parent process's anon_vma.

The mergeability test looks like this:

static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
		 struct anon_vma *anon_vma2, struct vm_area_struct *vma)
{
	if ((!anon_vma1 || !anon_vma2) &amp;&amp; (!vma ||
		!vma-&gt;anon_vma || list_is_singular(&amp;vma-&gt;anon_vma_chain)))
		return true;
	return anon_vma1 == anon_vma2;
}

However, we have a problem here - typically the vma passed here is the
destination VMA.

For instance in vma_merge_existing_range() we invoke:

can_vma_merge_left()
-&gt; [ check that there is an immediately adjacent prior VMA ]
-&gt; can_vma_merge_after()
  -&gt; is_mergeable_vma() for general attribute check
-&gt; is_mergeable_anon_vma([ proposed anon_vma ], prev-&gt;anon_vma, prev)

So if we were considering a target unfaulted 'prev':

	  unfaulted    faulted
	|-----------|-----------|
	|    prev   |    vma    |
	|-----------|-----------|

This would call is_mergeable_anon_vma(NULL, vma-&gt;anon_vma, prev).

The list_is_singular() check for vma-&gt;anon_vma_chain, an empty list on
fault, would cause this merge to _fail_ even though all else indicates a
merge.
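
<![CDATA[
The detail that matters is that list_is_singular() is false for an empty
list, not only for a list with two or more entries.  A minimal userland
re-implementation, assuming the usual kernel definition (non-empty and
head->next == head->prev), demonstrates this; init_list_head() and
toy_example() are names chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userland copy of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'new' immediately after 'head'. */
static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* True only when the list holds exactly one entry. */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && head->next == head->prev;
}

/* Usage: a freshly initialised (empty) chain is not singular. */
static void toy_example(void)
{
	struct list_head chain;

	init_list_head(&chain);
	assert(!list_is_singular(&chain));
}
```

So a check that demands a singular anon_vma_chain on the target rejects a
target whose chain has no entries at all.
]]>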

Equally a simple merge into a next VMA would hit the same problem:

	   faulted    unfaulted
	|-----------|-----------|
	|    vma    |    next   |
	|-----------|-----------|

can_vma_merge_right()
-&gt; [ check that there is an immediately adjacent succeeding VMA ]
-&gt; can_vma_merge_before()
  -&gt; is_mergeable_vma() for general attribute check
-&gt; is_mergeable_anon_vma([ proposed anon_vma ], next-&gt;anon_vma, next)

For a 3-way merge, we'd also hit the same problem if it was configured like
this for instance:

	  unfaulted    faulted    unfaulted
	|-----------|-----------|-----------|
	|    prev   |    vma    |    next   |
	|-----------|-----------|-----------|

As we'd call can_vma_merge_left() for prev, and can_vma_merge_right() for
next, both of which would fail.

vma_merge_new_range() (and relatedly, vma_expand()) are not impacted, as
the new VMA would never already be faulted (it is a proposed new range).

Because we already handle each of the aforementioned merge cases, and can
absolutely therefore deal with an existing VMA merge with !dst-&gt;anon_vma,
src-&gt;anon_vma, there is absolutely no reason to disallow this kind of
merge.

It seems that the intention of this patch is to ensure that, in the
instance of merging unfaulted VMAs with faulted ones, we never wish to do
so with those with multiple AVCs due to the fact that anon_vma locks are
held across both parent and child anon_vma's (actually, the 'root' parent
anon_vma's lock is used).

In fact, the original commit alludes to this - "find_mergeable_anon_vma()
already considers this case".

In find_mergeable_anon_vma() however, we check the anon_vma which will be
merged from, if it is set, then we check
list_is_singular(vma-&gt;anon_vma_chain).

So to match this logic, update is_mergeable_anon_vma() to perform this
scalability check on the VMA whose anon_vma we ultimately merge into.

This matches existing behaviour with forked VMAs, only we no longer
wrongly disallow ALL empty target merges.

So we both allow merge cases and ensure the scalability check is correctly
applied.

We may wish to revisit these lock scalability concerns at a later date and
ensure they are still valid.

Additionally, correct userland VMA tests which were mistakenly not
asserting these cases correctly previously to now correctly assert this,
and to ensure vmg-&gt;anon_vma state is always consistent to account for
newly introduced asserts.

Link: https://lkml.kernel.org/r/cover.1744104124.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/18c756fc9eaf7ad082a710c91133b8346f8cd9a8.1744104124.git.lorenzo.stoakes@oracle.com
Fixes: 965f55dea0e3 ("mmap: avoid merging cloned VMAs")
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Reviewed-by: Yeoreum Yun &lt;yeoreum.yun@arm.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: make vma cache SLAB_TYPESAFE_BY_RCU</title>
<updated>2025-03-17T05:06:21Z</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2025-02-13T22:46:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3104138517fc66aad21f4a2487bb572e9fc2e3ec'/>
<id>urn:sha1:3104138517fc66aad21f4a2487bb572e9fc2e3ec</id>
<content type='text'>
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected by
lock_vma_under_rcu().

Current checks are sufficient as long as vma is detached before it is
freed.  The only place this is not currently happening is in exit_mmap(). 
Add the missing vma_mark_detached() in exit_mmap().

Another issue which might trick lock_vma_under_rcu() during vma reuse is
vm_area_dup(), which copies the entire content of the vma into a new one,
overriding new vma's vm_refcnt and temporarily making it appear as
attached.  This might trick a racing lock_vma_under_rcu() to operate on a
reused vma if it found the vma before it got reused.  To prevent this
situation, we should ensure that vm_refcnt stays at detached state (0)
when it is copied and advances to attached state only after it is added
into the vma tree.  Introduce vm_area_init_from() which preserves new
vma's vm_refcnt and use it in vm_area_dup().  Since all vmas are in
detached state with no current readers when they are freed,
lock_vma_under_rcu() will not be able to take vm_refcnt after vma got
detached even if vma is reused.  vma_mark_attached() is modified to
include a release fence to ensure all stores to the vma happen before
vm_refcnt gets initialized.
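
As a rough illustration only, the intent of vm_area_init_from() can be
modelled in a single-threaded toy: copy every field except vm_refcnt, so
the duplicate stays detached (0) until it is explicitly attached.  The
names ToyVma, toy_init_from() and toy_dup() are hypothetical and are not
kernel code; the sketch does not model RCU, memory ordering or the
release fence.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyVma:
    # Hypothetical, heavily reduced stand-in for vm_area_struct.
    vm_start: int = 0
    vm_end: int = 0
    vm_flags: int = 0
    vm_refcnt: int = 0  # 0 models the "detached" state


def toy_init_from(src, dest):
    # Copy every field except vm_refcnt, so dest keeps its own
    # (detached) reference count instead of inheriting src's.
    for f in fields(src):
        if f.name != "vm_refcnt":
            setattr(dest, f.name, getattr(src, f.name))


def toy_dup(src):
    # A freshly allocated object starts out detached.
    new = ToyVma()
    toy_init_from(src, new)
    return new


attached = ToyVma(vm_start=0x1000, vm_end=0x2000, vm_flags=3, vm_refcnt=1)
copy = toy_dup(attached)
```

A plain whole-object copy would instead carry vm_refcnt == 1 into the
duplicate, which is the "temporarily appears attached" window described
above.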

Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.

[surenb@google.com: remove atomic_set_release() usage in tools/]
  Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Tested-by: Shivank Garg &lt;shivankg@amd.com&gt;
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Klara Modin &lt;klarasmodin@gmail.com&gt;
Cc: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@google.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: "Paul E . McKenney" &lt;paulmck@kernel.org&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Sourav Panda &lt;souravpanda@google.com&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: move lesser used vma_area_struct members into the last cacheline</title>
<updated>2025-03-17T05:06:20Z</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2025-02-13T22:46:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6bef4c2f97221f3b595d08c8656eb5845ef80fe9'/>
<id>urn:sha1:6bef4c2f97221f3b595d08c8656eb5845ef80fe9</id>
<content type='text'>
Move several vm_area_struct members which are rarely or never used during
page fault handling into the last cacheline to better pack vm_area_struct.
As a result vm_area_struct will fit into 3 as opposed to 4 cachelines.
New typical vm_area_struct layout:

struct vm_area_struct {
    union {
        struct {
            long unsigned int vm_start;              /*     0     8 */
            long unsigned int vm_end;                /*     8     8 */
        };                                           /*     0    16 */
        freeptr_t          vm_freeptr;               /*     0     8 */
    };                                               /*     0    16 */
    struct mm_struct *         vm_mm;                /*    16     8 */
    pgprot_t                   vm_page_prot;         /*    24     8 */
    union {
        const vm_flags_t   vm_flags;                 /*    32     8 */
        vm_flags_t         __vm_flags;               /*    32     8 */
    };                                               /*    32     8 */
    unsigned int               vm_lock_seq;          /*    40     4 */

    /* XXX 4 bytes hole, try to pack */

    struct list_head           anon_vma_chain;       /*    48    16 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct anon_vma *          anon_vma;             /*    64     8 */
    const struct vm_operations_struct  * vm_ops;     /*    72     8 */
    long unsigned int          vm_pgoff;             /*    80     8 */
    struct file *              vm_file;              /*    88     8 */
    void *                     vm_private_data;      /*    96     8 */
    atomic_long_t              swap_readahead_info;  /*   104     8 */
    struct mempolicy *         vm_policy;            /*   112     8 */
    struct vma_numab_state *   numab_state;          /*   120     8 */
    /* --- cacheline 2 boundary (128 bytes) --- */
    refcount_t          vm_refcnt (__aligned__(64)); /*   128     4 */

    /* XXX 4 bytes hole, try to pack */

    struct {
        struct rb_node     rb (__aligned__(8));      /*   136    24 */
        long unsigned int  rb_subtree_last;          /*   160     8 */
    } __attribute__((__aligned__(8))) shared;        /*   136    32 */
    struct anon_vma_name *     anon_name;            /*   168     8 */
    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */

    /* size: 192, cachelines: 3, members: 18 */
    /* sum members: 176, holes: 2, sum holes: 8 */
    /* padding: 8 */
    /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));
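
The summary figures in the layout above can be cross-checked with a short
script.  This is only a sketch: the offsets and sizes are transcribed by
hand from the listing, and the 64-byte cacheline is taken from the
boundary comments in it.  MEMBERS and layout_summary() are illustrative
names, not part of any tool.

```python
# (name, offset, size) in bytes, transcribed from the listing above.
MEMBERS = [
    ("vm_start/vm_end union", 0, 16),
    ("vm_mm", 16, 8),
    ("vm_page_prot", 24, 8),
    ("vm_flags union", 32, 8),
    ("vm_lock_seq", 40, 4),
    ("anon_vma_chain", 48, 16),
    ("anon_vma", 64, 8),
    ("vm_ops", 72, 8),
    ("vm_pgoff", 80, 8),
    ("vm_file", 88, 8),
    ("vm_private_data", 96, 8),
    ("swap_readahead_info", 104, 8),
    ("vm_policy", 112, 8),
    ("numab_state", 120, 8),
    ("vm_refcnt", 128, 4),
    ("shared", 136, 32),
    ("anon_name", 168, 8),
    ("vm_userfaultfd_ctx", 176, 8),
]
CACHELINE = 64      # bytes, per the boundary comments in the listing
STRUCT_SIZE = 192   # bytes, per the "size:" summary line


def layout_summary(members, struct_size):
    # Sum of member sizes, bytes lost to holes between consecutive
    # members, and trailing padding after the last member.
    sum_members = sum(size for _, _, size in members)
    holes = 0
    for (_, off, size), (_, next_off, _) in zip(members, members[1:]):
        holes += next_off - (off + size)
    _, last_off, last_size = members[-1]
    padding = struct_size - (last_off + last_size)
    return sum_members, holes, padding


sum_members, holes, padding = layout_summary(MEMBERS, STRUCT_SIZE)
cachelines = STRUCT_SIZE // CACHELINE
print(len(MEMBERS), sum_members, holes, padding, cachelines)
```

The member sizes, holes and padding add up to the reported struct size
(176 + 8 + 8 = 192), which is exactly three 64-byte cachelines.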

Memory consumption per 1000 VMAs becomes 48 pages:

    slabinfo after vm_area_struct changes:
     &lt;name&gt;           ... &lt;objsize&gt; &lt;objperslab&gt; &lt;pagesperslab&gt; : ...
     vm_area_struct   ...    192   42    2 : ...
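
The 48-page figure follows from the slabinfo values above.  A minimal
sketch of the arithmetic, assuming slabs are filled completely before a
new one is allocated (pages_for_vmas() is an illustrative helper, not a
kernel function):

```python
import math

OBJS_PER_SLAB = 42   # objperslab column from slabinfo above
PAGES_PER_SLAB = 2   # pagesperslab column from slabinfo above


def pages_for_vmas(nr_vmas):
    # Whole slabs needed to hold nr_vmas objects, times pages per slab.
    slabs = math.ceil(nr_vmas / OBJS_PER_SLAB)
    return slabs * PAGES_PER_SLAB


pages = pages_for_vmas(1000)
print(pages)
```

1000 VMAs need 24 slabs of 42 objects each, and 24 slabs of 2 pages each
is 48 pages.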

Link: https://lkml.kernel.org/r/20250213224655.1680278-14-surenb@google.com
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Tested-by: Shivank Garg &lt;shivankg@amd.com&gt;
  Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Klara Modin &lt;klarasmodin@gmail.com&gt;
Cc: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@google.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: "Paul E . McKenney" &lt;paulmck@kernel.org&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Sourav Panda &lt;souravpanda@google.com&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
