path: root/fs/xfs
2026-04-13Merge tag 'xfs-merge-7.1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds-534/+739
Pull xfs updates from Carlos Maiolino: "There aren't any new features. The whole series is just a collection of bug fixes and code refactoring. There is some new information added: a couple of new tracepoints, new data added to mountstats, but no big changes"

* tag 'xfs-merge-7.1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
  xfs: fix number of GC bvecs
  xfs: untangle the open zones reporting in mountinfo
  xfs: expose the number of open zones in sysfs
  xfs: reduce special casing for the open GC zone
  xfs: streamline GC zone selection
  xfs: refactor GC zone selection helpers
  xfs: rename xfs_zone_gc_iter_next to xfs_zone_gc_iter_irec
  xfs: put the open zone later in xfs_open_zone_put
  xfs: add a separate tracepoint for stealing an open zone for GC
  xfs: delay initial open of the GC zone
  xfs: fix a resource leak in xfs_alloc_buftarg()
  xfs: handle too many open zones when mounting
  xfs: refactor xfs_mount_zones
  xfs: fix integer overflow in busy extent sort comparator
  xfs: fix integer overflow in deferred intent sort comparators
  xfs: fold xfs_setattr_size into xfs_vn_setattr_size
  xfs: remove a duplicate assert in xfs_setattr_size
  xfs: return default quota limits for IDs without a dquot
  xfs: start gc on zonegc_low_space attribute updates
  xfs: don't decrement the buffer LRU count for in-use buffers
  ...
2026-04-13Merge tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linuxLinus Torvalds-23/+15
Pull block updates from Jens Axboe:
 - Add shared memory zero-copy I/O support for ublk, bypassing per-I/O copies between kernel and userspace by matching registered buffer PFNs at I/O time. Includes selftests.
 - Refactor bio integrity to support filesystem-initiated integrity operations and arbitrary buffer alignment.
 - Clean up bio allocation, splitting bio_alloc_bioset() into clear fast and slow paths. Add bio_await() and bio_submit_or_kill() helpers, unify synchronous bi_end_io callbacks.
 - Fix zone write plug refcount handling and plug removal races. Add support for serializing zone writes at QD=1 for rotational zoned devices, yielding significant throughput improvements.
 - Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET command.
 - Add io_uring passthrough (uring_cmd) support to the BSG layer.
 - Replace pp_buf in partition scanning with struct seq_buf.
 - zloop improvements and cleanups.
 - drbd genl cleanup, switching to pre_doit/post_doit.
 - NVMe pull request via Keith:
     - Fabrics authentication updates
     - Enhanced block queue limits support
     - Workqueue usage updates
     - A new write zeroes device quirk
     - Tagset cleanup fix for loop device
 - MD pull requests via Yu Kuai:
     - Fix raid5 soft lockup in retry_aligned_read()
     - Fix raid10 deadlock with check operation and nowait requests
     - Fix raid1 overlapping writes on writemostly disks
     - Fix sysfs deadlock on array_state=clear
     - Proactive RAID-5 parity building with llbitmap, with write_zeroes_unmap optimization for initial sync
     - Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops version mismatch fallback
     - Fix bcache use-after-free and uninitialized closure
     - Validate raid5 journal metadata payload size
     - Various cleanups
 - Various other fixes, improvements, and cleanups

* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
  ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
  scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
  block: refactor blkdev_zone_mgmt_ioctl
  MAINTAINERS: update ublk driver maintainer email
  Documentation: ublk: address review comments for SHMEM_ZC docs
  ublk: allow buffer registration before device is started
  ublk: replace xarray with IDA for shmem buffer index allocation
  ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
  ublk: verify all pages in multi-page bvec fall within registered range
  ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
  xfs: use bio_await in xfs_zone_gc_reset_sync
  block: add a bio_submit_or_kill helper
  block: factor out a bio_await helper
  block: unify the synchronous bi_end_io callbacks
  xfs: fix number of GC bvecs
  selftests/ublk: add read-only buffer registration test
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  ...
2026-04-13Merge tag 'vfs-7.1-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds-7/+49
Pull vfs integrity updates from Christian Brauner: "This adds support to generate and verify integrity information (aka T10 PI) in the file system, instead of the automatic below-the-covers support that is currently used. The implementation is based on refactoring the existing block layer PI code to be reusable for this use case, and then adding relatively small wrappers for the file system use case. These are then used in iomap to implement the semantics, and wired up in XFS with a small amount of glue code. Compared to the baseline this does not change performance for writes, but increases read performance up to 15% for 4k I/O, with the benefit decreasing with larger I/O sizes as even the baseline maxes out the device quickly on my older enterprise SSD"

* tag 'vfs-7.1-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: support T10 protection information
  iomap: support T10 protection information
  iomap: support ioends for buffered reads
  iomap: add a bioset pointer to iomap_read_folio_ops
  ntfs3: remove copy and pasted iomap code
  iomap: allow file systems to hook into buffered read bio submission
  iomap: only call into ->submit_read when there is a read_ctx
  iomap: pass the iomap_iter to ->submit_read
  iomap: refactor iomap_bio_read_folio_range
  block: pass a maxlen argument to bio_iov_iter_bounce
  block: add fs_bio_integrity helpers
  block: make max_integrity_io_size public
  block: prepare generation / verification helpers for fs usage
  block: add a bdev_has_integrity_csum helper
  block: factor out a bio_integrity_setup_default helper
  block: factor out a bio_integrity_action helper
2026-04-07xfs: use bio_await in xfs_zone_gc_reset_syncChristoph Hellwig-14/+5
Replace the open-coded bio wait logic with the new bio_await helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260407140538.633364-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07xfs: fix number of GC bvecsChristoph Hellwig-9/+10
GC scratch allocations can wrap around and use the same buffer twice, and the current code fails to account for that. So far this worked due to rounding in the block layer, but changes to the bio allocator drop the over-provisioning, and generic/256 or generic/361 will now usually fail when running against the current block tree. Simplify the allocation to always pass the maximum value, which is easier to verify, as saving up to one bvec per allocation isn't worth the effort of verifying a complicated calculated value. Fixes: 102f444b57b3 ("xfs: rework zone GC buffer management") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://patch.msgid.link/20260407140538.633364-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-07xfs: fix number of GC bvecsChristoph Hellwig-9/+10
GC scratch allocations can wrap around and use the same buffer twice, and the current code fails to account for that. So far this worked due to rounding in the block layer, but changes to the bio allocator drop the over-provisioning, and generic/256 or generic/361 will now usually fail when running against the current block tree. Simplify the allocation to always pass the maximum value, which is easier to verify, as saving up to one bvec per allocation isn't worth the effort of verifying a complicated calculated value. Fixes: 102f444b57b3 ("xfs: rework zone GC buffer management") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: untangle the open zones reporting in mountinfoChristoph Hellwig-2/+6
Keeping a value per line makes parsing much easier, so move the maximum number of open zones into a separate line, and also add a new line for the number of open GC zones. While that has to be either 0 or 1 currently, having a value future-proofs the interface for adding more open GC zones if needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: expose the number of open zones in sysfsChristoph Hellwig-0/+13
Add a sysfs attribute for the current number of open zones so that it can be trivially read from userspace in monitoring or testing software. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: reduce special casing for the open GC zoneChristoph Hellwig-59/+58
Currently the open zone used for garbage collection is a special snowflake, and it has been a bit annoying for some further zoned XFS work I've been doing. Remove the zi_open_gc_field and instead track the open GC zone in the zi_open_zones list together with the normal open zones, and keep an extra pointer and a reference to it in the GC thread's data structure. This means anything iterating over open zones just has to look at zi_open_zones, and the lifetime rules are consistent. It also helps to add support for multiple open GC zones if we ever need them, and removes a bit of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: streamline GC zone selectionChristoph Hellwig-55/+40
Currently, picking the GC target zone is a bit odd, as it is done both in the main "can we start new GC cycles" routine and in the low-level block allocator for GC. This was mostly done to work around the rules for when code in a waitqueue wait loop can sleep. But with a trick that checks whether the process state has been set back to running to discover if the wait loop has to be retried, all this becomes much simpler. We can select a GC zone just before writing, and bail out of starting new work if we can't find a usable zone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: refactor GC zone selection helpersChristoph Hellwig-23/+22
Merge xfs_zone_gc_ensure_target into xfs_zone_gc_select_target to keep all zone selection code together. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: rename xfs_zone_gc_iter_next to xfs_zone_gc_iter_irecChristoph Hellwig-2/+2
This function returns the current iterator position, which makes the _next postfix a bit misleading. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: put the open zone later in xfs_open_zone_putChristoph Hellwig-1/+1
The open zone is what holds the rtg reference for us. This doesn't matter until we support shrinking, and even then is rather theoretical because we can't shrink away a just-filled zone in a tiny race window, but let's play it safe here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: add a separate tracepoint for stealing an open zone for GCChristoph Hellwig-1/+2
The case where we have to reuse an already open zone warrants a different trace point vs the normal opening of a GC zone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: delay initial open of the GC zoneChristoph Hellwig-25/+20
The code currently used to select the new GC target zone when the previous one is full also handles the case where there is no current GC target zone at all. Make use of that to simplify the logic in xfs_zone_gc_mount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: fix a resource leak in xfs_alloc_buftarg()Haoxiang Li-0/+1
In the error path, call fs_put_dax() to drop the DAX device reference. Fixes: 6f643c57d57c ("xfs: implement ->notify_failure() for XFS") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: handle too many open zones when mountingChristoph Hellwig-0/+76
When running on conventional zones or devices, the zoned allocator does not have a real write pointer, but instead fakes it up at mount time based on the last block recorded in the rmap. This can create spurious "open" zones when the last written blocks in a conventional zone are invalidated. Add a loop to the mount code to find the conventional zone with the highest used block in the rmap tree and "finish" it until we are below the open zones limit. While we're at it, also error out if there are too many open sequential zones, which can only happen when the user overrode the max open zones limit (or with really buggy hardware reducing the limit, but not much we can do about that). Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: refactor xfs_mount_zonesChristoph Hellwig-20/+34
xfs_mount_zones has grown a bit too big and unorganized. Split the zone reporting loop into a separate helper, hiding the rtg variable there. Print the mount message last, and also keep the VFS writeback chunk size last instead of in the middle of the logic to calculate the free/available blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: fix integer overflow in busy extent sort comparatorYuto Ohnuki-2/+2
xfs_extent_busy_ag_cmp() subtracts two uint32_t values (group numbers and block numbers) and returns the result as s32. When the difference exceeds INT_MAX, the result overflows and the sort order is corrupted. Use cmp_int() instead, as was done in commit 362c49098086 ("xfs: fix integer overflow in bmap intent sort comparator"). Fixes: 4a137e09151e ("xfs: keep a reference to the pag for busy extents") Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: fix integer overflow in deferred intent sort comparatorsYuto Ohnuki-3/+3
xfs_extent_free_diff_items(), xfs_refcount_update_diff_items(), and xfs_rmap_update_diff_items() subtract two uint32_t group numbers and return the result as int, which can overflow when the difference exceeds INT_MAX. Use cmp_int() instead, as was done in commit 362c49098086 ("xfs: fix integer overflow in bmap intent sort comparator"). Fixes: c13418e8eb37 ("xfs: give xfs_rmap_intent its own perag reference") Fixes: f6b384631e1e ("xfs: give xfs_extfree_intent its own perag reference") Fixes: 00e7b3bac1dc ("xfs: give xfs_refcount_intent its own perag reference") Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: fold xfs_setattr_size into xfs_vn_setattr_sizeChristoph Hellwig-27/+11
xfs_vn_setattr_size is the only caller of xfs_setattr_size, so merge the two functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-07xfs: remove a duplicate assert in xfs_setattr_sizeChristoph Hellwig-1/+0
There already is an assert that checks for uid and gid changes besides a lot of others at the beginning of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-31xfs: return default quota limits for IDs without a dquotRavi Singh-1/+42
When an ID has no dquot on disk, Q_XGETQUOTA returns -ENOENT even though default quota limits are configured and enforced against that ID. This means unprivileged users who have never used any resources cannot see the limits that apply to them. When xfs_qm_dqget() returns -ENOENT for a non-zero ID, return a zero-usage response with the default limits filled in from m_quotainfo rather than propagating the error. This is consistent with the enforcement behavior in xfs_qm_adjust_dqlimits(), which pushes the same default limits into a dquot when it is first allocated. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ravi Singh <ravising@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: start gc on zonegc_low_space attribute updatesHans Holmberg-1/+27
Start GC if the aggressiveness of zone garbage collection is changed by the user (if the file system is not read-only). Without this change, the new setting will not be taken into account until the GC thread is woken up by e.g. a write. Cc: stable@vger.kernel.org # v6.15 Fixes: 845abeb1f06a8a ("xfs: add tunable threshold parameter for triggering zone GC") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: don't decrement the buffer LRU count for in-use buffersChristoph Hellwig-10/+12
XFS buffers are added to the LRU when they are unused, but are only removed from the LRU lazily when the LRU list scan finds a used buffer. So far this only happens when the LRU counter hits 0, which is suboptimal as buffers that were added to the LRU but are in use again still consume LRU scanning resources and are aged while actually in use. Fix this by checking for in-use buffers and removing them from the LRU before decrementing the LRU counter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: switch (back) to a per-buftarg buffer hashChristoph Hellwig-69/+18
The per-AG buffer hashes were added when all buffer lookups took a per-hash lock. Since then we've made lookups entirely lockless and removed the need for a hash-wide lock for inserts and removals as well. With this there is no need to shard the hash, so reduce the used resources by using a per-buftarg hash for all buftargs. Long after writing this initially, syzbot found a problem in the buffer cache teardown order, which this happens to fix as well by doing the entire buffer cache teardown in one place instead of splitting it between destroying the buftarg and the perag structures. Link: https://lore.kernel.org/linux-xfs/aLeUdemAZ5wmtZel@dread.disaster.area/ Reported-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: use a lockref for the buffer reference countChristoph Hellwig-55/+39
The lockref structure allows incrementing/decrementing counters like an atomic_t for the fast path, while still allowing complex slow path operations as if the counter was protected by a lock. The only slow path operations that actually need to take the lock are the final put, LRU evictions and marking a buffer stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: don't keep a reference for buffers on the LRUChristoph Hellwig-92/+50
Currently the buffer cache adds a reference to b_hold for buffers that are on the LRU. This seems to go all the way back, and allows releasing buffers from the LRU using xfs_buf_rele. But it makes xfs_buf_rele really complicated and differs from how other LRUs are implemented in Linux. Switch to not having a reference for buffers in the LRU, and use a separate negative hold value to mark buffers as dead. This simplifies xfs_buf_rele, which now just deals with the last "real" reference, and prepares for using the lockref primitive. This also removes the b_lock protection for removing buffers from the buffer hash. This is the desired outcome because the rhashtable is fully internally synchronized, and previously the lock was mostly held out of ordering constraints in xfs_buf_rele_cached. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-26Merge branch 'xfs-7.0-fixes' into for-nextCarlos Maiolino-20/+4
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-26xfs: remove file_path tracepoint dataDarrick J. Wong-19/+4
The xfile/xmbuf shmem file descriptions are no longer as detailed as they were when online fsck was first merged, because moving to static strings in commit 60382993a2e180 ("xfs: get rid of the xchk_xfile_*_descr calls") removed a memory allocation and hence a source of failure. However this makes encoding the description in the tracepoints sort of a waste of memory. David Laight also points out that file_path doesn't zero the whole buffer which causes exposure of stale trace bytes, and Steven Rostedt wonders why we're not using a dynamic array for the file path. I don't think this is worth fixing, so let's just rip it out. Cc: rostedt@goodmis.org Cc: david.laight.linux@gmail.com Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/ Cc: stable@vger.kernel.org # v6.11 Fixes: 19ebc8f84ea12e ("xfs: fix file_path handling in tracepoints") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-26xfs: don't irele after failing to iget in xfs_attri_recover_workDarrick J. Wong-1/+0
xlog_recovery_iget* never set @ip to a valid pointer if they return an error, so this irele will walk off a dangling pointer. Fix that. Cc: stable@vger.kernel.org # v6.10 Fixes: ae673f534a3097 ("xfs: record inode generation in xattr update log intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23Merge branch 'xfs-7.1-merge' into for-nextCarlos Maiolino-73/+168
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23Merge branch 'xfs-7.0-fixes' into for-nextCarlos Maiolino-98/+130
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: report cow mappings with dirty pagecache for iomap zero rangeBrian Foster-4/+22
XFS has long supported the case where it is possible to have dirty data in pagecache backed by COW fork blocks and a hole in the data fork. This occurs for two reasons. First, on reflink-enabled files, COW fork blocks are allocated with preallocation to help avoid fragmentation. Second, if a mapping lookup for a write finds blocks in the COW fork, it consumes those blocks unconditionally. This might mean that COW fork blocks are backed by non-shared blocks or even a hole in the data fork, both of which are perfectly fine. This leaves an odd corner case for zero range, however, because it needs to distinguish between ranges that are sparse and thus do not require zeroing and those that are not. A range backed by COW fork blocks and a data fork hole might either be a legitimate hole in the file or a range with pending buffered writes that will be written back (which will remap COW fork blocks into the data fork). This "COW fork blocks over data fork hole" situation has historically been reported as a hole to iomap, which then has grown a flush hack as a workaround to ensure zeroing occurs correctly. Now that this has been lifted into the filesystem and replaced by the dirty folio lookup mechanism, we can do better and use the pagecache state to decide how to report the mapping. If a COW fork range exists with dirty folios in cache, then report a typical shared mapping. If the range is clean in cache, then we can consider the COW blocks preallocation and call it a hole. This doesn't fundamentally change behavior, but makes mapping reporting more accurate. Note that this does require splitting across the EOF boundary (similar to normal zero range) to ensure we don't spuriously perform post-eof zeroing. iomap will warn about zeroing beyond EOF because folios beyond i_size may not be written back. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: replace zero range flush with folio batchBrian Foster-14/+6
Now that the zero range pagecache flush is purely isolated to providing zeroing correctness in this case, we can remove it and replace it with the folio batch mechanism that is used for handling unwritten extents. This is still slightly odd in that XFS reports a hole vs. a mapping that reflects the COW fork extents, but that has always been the case in this situation and so is a separate issue. We drop the iomap warning that assumes the folio batch is always associated with unwritten mappings, but this is mainly a development assertion, as otherwise the core iomap fbatch code doesn't care much about the mapping type if it's handed the set of folios to process. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: only flush when COW fork blocks overlap data fork holesBrian Foster-6/+30
The zero range hole mapping flush case has been lifted from iomap into XFS. Now that we have more mapping context available from the ->iomap_begin() handler, we can isolate the flush further to when we know a hole is fronted by COW blocks. Rather than purely rely on pagecache dirty state, explicitly check for the case where a range is a hole in both forks. Otherwise trim to the range where there does happen to be overlap and use that for the pagecache writeback check. This might prevent some spurious zeroing, but more importantly makes it easier to remove the flush entirely. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: look up cow fork extent earlier for buffered iomap_beginBrian Foster-21/+25
To further isolate the need for flushing for zero range, we need to know whether a hole in the data fork is fronted by blocks in the COW fork or not. COW fork lookup currently occurs further down in the function, after the zero range case is handled. As a preparation step, lift the COW fork extent lookup to earlier in the function, at the same time as the data fork lookup. Only the lookup logic is lifted. The COW fork branch/reporting logic remains as is to avoid any observable behavior change from an iomap reporting perspective. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: flush eof folio before insert range size updateBrian Foster-0/+17
The flush in xfs_buffered_write_iomap_begin() for zero range over a data fork hole fronted by COW fork prealloc is primarily designed to provide correct zeroing behavior in particular pagecache conditions. As it turns out, this also partially masks some odd behavior in insert range (via zero range via setattr). Insert range bumps i_size the length of the new range, flushes, unmaps pagecache and cancels COW prealloc, and then right shifts extents from the end of the file back to the target offset of the insert. Since the i_size update occurs before the pagecache flush, this creates a transient situation where writeback around EOF can behave differently. This appears to be a corner case situation, but if it happens to be fronted by COW fork speculative preallocation and a large, dirty folio that contains at least one full COW block beyond EOF, the writeback after i_size is bumped may remap that COW fork block into the data fork within EOF. The block is zeroed and then shifted back out to post-eof, but this is unexpected in that it leads to a written post-eof data fork block. This can cause a zero range warning on a subsequent size extension, because we should never find blocks that require physical zeroing beyond i_size. To avoid this quirk, flush the EOF folio before the i_size update during insert range. The entire range will be flushed, unmapped and invalidated anyways, so this should be relatively unnoticeable. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23iomap, xfs: lift zero range hole mapping flush into xfsBrian Foster-3/+22
iomap zero range has a wart in that it also flushes dirty pagecache over hole mappings (rather than only unwritten mappings). This was included to accommodate a quirk in XFS where COW fork preallocation can exist over a hole in the data fork, and the associated range is reported as a hole. This is because the range actually is a hole, but XFS also has an optimization where if COW fork blocks exist for a range being written to, those blocks are used regardless of whether the data fork blocks are shared or not. For zeroing, COW fork blocks over a data fork hole are only relevant if the range is dirty in pagecache, otherwise the range is already considered zeroed. The easiest way to deal with this corner case is to flush the pagecache to trigger COW remapping into the data fork, and then operate on the updated on-disk state. The problem is that ext4 cannot accommodate a flush from this context due to being a transaction deadlock vector. Outside of the hole quirk, ext4 can avoid the flush for zero range by using the recently introduced folio batch lookup mechanism for unwritten mappings. Therefore, take the next logical step and lift the hole handling logic into the XFS iomap_begin handler. iomap will still flush on unwritten mappings without a folio batch, and XFS will flush and retry mapping lookups in the case where it would otherwise report a hole with dirty pagecache during a zero range. Note that this is intended to be a fairly straightforward lift and otherwise not change behavior. Now that the flush exists within XFS, follow on patches can further optimize it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: flush dirty pagecache over hole in zoned mode zero rangeBrian Foster-0/+19
For zoned filesystems a window exists between the first write to a sparse range (i.e. data fork hole) and writeback completion where we might spuriously observe holes in both the COW and data forks. This occurs because a buffered write populates the COW fork with delalloc, writeback submission removes the COW fork delalloc blocks and unlocks the inode, and then writeback completion remaps the physically allocated blocks into the data fork. If a zero range operation does a lookup during this window where both forks show a hole, it incorrectly reports a hole mapping for a range that contains data.

This currently works because iomap checks for dirty pagecache over holes and unwritten mappings. If found, it flushes and retries the lookup. We plan to remove the hole flush logic from iomap, however, so lift the flush into xfs_zoned_buffered_write_iomap_begin() to preserve behavior and document the purpose for it.

Zoned XFS filesystems don't support unwritten extents, so if zoned mode can come up with a way to close this transient hole window in the future, this flush can likely be removed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: fix iomap hole map reporting for zoned zero rangeBrian Foster-8/+10
The hole mapping logic for zero range in zoned mode is not quite correct. It currently reports a hole whenever one exists in the data fork. If the first write to a sparse range has completed and not yet written back, the blocks exist in the COW fork as delalloc until writeback completes, at which point they are allocated and mapped into the data fork. If a zero range occurs on a range that has not yet populated the data fork, we will incorrectly report it as a hole.

Note that this currently functions correctly because we are bailed out by the pagecache flush in iomap_zero_range(). If a hole or unwritten mapping is reported with dirty pagecache, it assumes there is pending data, flushes to induce any pending block allocations/remaps, and retries the lookup. We want to remove this hack from iomap, however, so update the iomap_begin handler to only report a hole for zeroing when one exists in both forks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: remove redundant validation in xlog_recover_attri_commit_pass2Long Li-46/+0
Remove the redundant post-parse validation switch. By the time that block is reached, xfs_attri_validate() has already guaranteed all name lengths are non-zero via xfs_attri_validate_namelen(), and xfs_attri_validate_name_iovec() has already returned -EFSCORRUPTED for NULL names. For the REMOVE case, attr_value and value_len are structurally guaranteed to be NULL/zero because the parsing loop only populates them when value_len != 0. All checks in that switch are therefore dead code.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: fix ri_total validation in xlog_recover_attri_commit_pass2Long Li-2/+2
The ri_total checks for SET/REPLACE operations are hardcoded to 3, but xfs_attri_item_size() only emits a value iovec when value_len > 0, so ri_total is 2 when value_len == 0. For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and is never zero, so their hardcoded checks remain correct.

This problem may cause log recovery failures. The following script can be used to reproduce the problem:

  #!/bin/bash
  mkfs.xfs -f /dev/sda
  mount /dev/sda /mnt/test/
  touch /mnt/test/file
  for i in {1..200}; do
          attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null
  done
  echo 1 > /sys/fs/xfs/debug/larp
  echo 1 > /sys/fs/xfs/sda/errortag/larp
  attr -s "user.zero" -V "" /mnt/test/file
  echo 0 > /sys/fs/xfs/sda/errortag/larp
  umount /mnt/test
  mount /dev/sda /mnt/test/   # mount failed

Fix this by deriving the expected count dynamically as "2 + !!value_len" for SET/REPLACE operations.

Cc: stable@vger.kernel.org # v6.9
Fixes: ad206ae50eca ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: close crash window in attr dabtree inactivationLong Li-38/+57
When inactivating an inode with node-format extended attributes, xfs_attr3_node_inactive() invalidates all child leaf/node blocks via xfs_trans_binval(), but intentionally does not remove the corresponding entries from their parent node blocks. The implicit assumption is that xfs_attr_inactive() will truncate the entire attr fork to zero extents afterwards, so log recovery will never reach the root node and follow those stale pointers.

However, if a log shutdown occurs after the leaf/node block cancellations commit but before the attr bmap truncation commits, this assumption breaks. Recovery replays the attr bmap intact (the inode still has attr fork extents), but suppresses replay of all cancelled leaf/node blocks, possibly leaving them as stale data on disk.

On the next mount, xlog_recover_process_iunlinks() retries inactivation and attempts to read the root node via the attr bmap. If the root node was not replayed, reading the unreplayed root block triggers a metadata verification failure immediately; if it was replayed, following its child pointers to unreplayed child blocks triggers the same failure:

  XFS (pmem0): Metadata corruption detected at xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
  XFS (pmem0): Unmount and run xfs_repair
  XFS (pmem0): First 128 bytes of corrupted metadata buffer:
  00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117

Fix this in two places:

In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a child block, immediately remove the entry that references it from the parent node in the same transaction. This eliminates the window where the parent holds a pointer to a cancelled block. Once all children are removed, the now-empty root node is converted to a leaf block within the same transaction. This node-to-leaf conversion is necessary for crash safety: if the system shuts down after the empty node is written to the log but before the second-phase bmap truncation commits, log recovery will attempt to verify the root block on disk. xfs_da3_node_verify() does not permit a node block with count == 0; such a block will fail verification and trigger a metadata corruption shutdown. Leaf blocks, on the other hand, are allowed to have this transient state.

In xfs_attr_inactive(), split the attr fork truncation into two explicit phases. First, truncate all extents beyond the root block (the child extents whose parent references have already been removed above). Second, invalidate the root block and truncate the attr bmap to zero in a single transaction. The two operations in the second phase must be atomic: as long as the attr bmap has any non-zero length, recovery can follow it to the root block, so the root block invalidation must commit together with the bmap-to-zero truncation.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: factor out xfs_attr3_leaf_initLong Li-0/+25
Factor out a wrapper function, xfs_attr3_leaf_init(), which is exported for external use.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: factor out xfs_attr3_node_entry_removeLong Li-11/+44
Factor out a wrapper function, xfs_attr3_node_entry_remove(), which is exported for external use.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: only assert new size for datafork during truncate extentsLong Li-1/+2
The assertion functions properly because we currently only truncate the attr fork to a zero size; any other new attr fork size is not expected. Make this assertion specific to the data fork, preparing for subsequent patches that truncate the attribute fork to a non-zero size.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: Use xarray to track SB UUIDs instead of plain array.Lukas Herbolt-39/+39
Remove the plain array used to track SB UUIDs and switch to an xarray, which makes the code more readable.

Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
[cem: remove unneeded return from xfs_uuid_unmount]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-18Merge branch 'xfs-7.1-merge' into for-nextCarlos Maiolino-57/+137
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-18xfs: avoid unnecessary calculations in xfs_zoned_need_gc()Damien Le Moal-6/+18
If zonegc_low_space is set to zero (which is the default), the second condition in xfs_zoned_need_gc() that triggers GC never evaluates to true because the calculated threshold will always be 0. So there is no need to calculate the threshold and to evaluate that condition. Return early when zonegc_low_space is zero.

While at it, add comments to document the intent of each of the 3 tests used to determine the return value to control the execution of garbage collection.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>