2026-04-07btrfs: fix btrfs_ioctl_space_info() slot_count TOCTOU which can lead to info-leakYochai Eisenrich-2/+3
btrfs_ioctl_space_info() has a TOCTOU race between two passes over the block group RAID type lists. The first pass counts entries to determine the allocation size, then the second pass fills the buffer. The groups_sem rwlock is released between passes, allowing concurrent block group removal to reduce the entry count. When the second pass fills fewer entries than the first pass counted, copy_to_user() copies the full alloc_size bytes including trailing uninitialized kmalloc bytes to userspace. Fix by copying only total_spaces entries (the actually-filled count from the second pass) instead of alloc_size bytes, and switch to kzalloc so any future copy size mismatch cannot leak heap data. Fixes: 7fde62bffb57 ("Btrfs: buffer results in the space_info ioctl") CC: stable@vger.kernel.org # 3.0 Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
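The core of the fix above is a copy-size calculation: the bytes handed to copy_to_user() must be derived from the count actually filled in the second pass, never from the allocation size computed in the first pass. The following is a minimal userspace sketch of that rule; the struct layouts and names are simplified stand-ins, not the kernel's real ioctl structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the ioctl structures (illustrative layout). */
struct space_args { unsigned long long space_slots, total_spaces; };
struct space_info { unsigned long long flags, total_bytes, used_bytes; };

/* Pass 1 counted `counted` slots and sized the buffer from it; pass 2
 * filled `filled` entries, which may be fewer if block groups were
 * removed in between.  Return how many bytes are safe to copy out:
 * the header plus only the entries that were actually filled. */
size_t safe_copy_size(size_t counted, size_t filled)
{
	(void)counted; /* alloc_size was based on this; it must not drive the copy */
	return sizeof(struct space_args) + filled * sizeof(struct space_info);
}
```

With counted == 10 but filled == 7, the old code would copy 3 entries' worth of uninitialized heap; the filled-based size avoids that, and kzalloc makes any remaining mismatch leak only zeroes.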
2026-04-07btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats()Filipe Manana-0/+30
btrfs_run_dev_stats() is called during the critical section of a transaction commit and it takes the device_list_mutex, which is also acquired by fitrim, which does discard operations while holding that mutex. Most of the time, if we are on a healthy filesystem, we don't have new stat updates to persist in the device tree, so blocking on the device_list_mutex is just wasting time and making any tasks that need to start a new transaction wait longer than necessary. Since the device list is RCU safe/protected, make btrfs_run_dev_stats() do an initial check for device stat updates using RCU and quit without taking the device_list_mutex in case there are no new device stats that need to be persisted in the device tree. Also note that adding/removing devices also requires starting a transaction, and since btrfs_run_dev_stats() is called from the critical section of a transaction commit, no one can be concurrently adding or removing a device while btrfs_run_dev_stats() is called. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
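The shape of the optimization above is a lock-free pre-check before an expensive mutex: scan the (stable, RCU-protected) device list for pending stat updates and bail out early when there are none. Here is a toy userspace model of that pattern; the `stats_pending` counter stands in for the kernel's per-device dev_stats_ccnt, and all names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy device: a counter of stat updates not yet persisted. */
struct toy_device { atomic_int stats_pending; };

/* Lock-free pre-check standing in for the RCU list walk: if no device
 * has pending stat updates, the caller can skip taking the mutex and
 * return immediately. */
bool any_stats_pending(struct toy_device *devs, int ndev)
{
	for (int i = 0; i < ndev; i++)
		if (atomic_load(&devs[i].stats_pending) != 0)
			return true;
	return false;
}
```

The early-out is safe here for the same reason as in the commit: the set of devices cannot change while the check runs, so a "nothing pending" answer cannot race with a device add/remove.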
2026-04-07btrfs: avoid GFP_ATOMIC allocations in qgroup free pathsLeo Martins-3/+28
When qgroups are enabled, __btrfs_qgroup_release_data() and qgroup_free_reserved_data() pass an extent_changeset to btrfs_clear_record_extent_bits() to track how many bytes had their EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock, add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each changed range. If this allocation fails, it hits a BUG_ON and panics the kernel. However, both of these callers only read changeset.bytes_changed afterwards — the range_changed ulist is populated and immediately freed without ever being iterated. The GFP_ATOMIC allocation is entirely unnecessary for these paths. Introduce extent_changeset_init_bytes_only() which uses a sentinel value (EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal that only bytes_changed should be tracked. add_extent_changeset() checks for this sentinel and returns early after updating bytes_changed, skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC allocation and makes the BUG_ON unreachable for these paths. Callers that need range tracking (qgroup_reserve_data, qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to use extent_changeset_init() and are unaffected. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
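The sentinel trick described above can be modelled in a few lines: a changeset initialized in "bytes only" mode carries a sentinel in its prealloc field, and the add path returns before any range-tracking allocation would happen. The sentinel value, field names, and helpers below are illustrative stand-ins, not the kernel's exact definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sentinel, mirroring the idea of EXTENT_CHANGESET_BYTES_ONLY. */
#define CHANGESET_BYTES_ONLY ((void *)1UL)

struct toy_changeset {
	uint64_t bytes_changed;
	void *prealloc;	/* CHANGESET_BYTES_ONLY => skip range tracking */
};

void changeset_init_bytes_only(struct toy_changeset *cs)
{
	cs->bytes_changed = 0;
	cs->prealloc = CHANGESET_BYTES_ONLY;
}

/* Returns 0 on success.  In bytes-only mode no allocation can happen,
 * so the GFP_ATOMIC failure path (and its BUG_ON) is unreachable. */
int add_changeset(struct toy_changeset *cs, uint64_t start, uint64_t end)
{
	cs->bytes_changed += end - start + 1;
	if (cs->prealloc == CHANGESET_BYTES_ONLY)
		return 0;	/* skip the ulist_add() entirely */
	/* ... a full changeset would record [start, end] in a list here ... */
	return 0;
}
```

Callers that only read bytes_changed afterwards get the same answer with no allocation at all, which is exactly why the atomic allocation was unnecessary for those paths.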
2026-04-07btrfs: decrease indentation of find_free_extent_update_loopJohannes Thumshirn-54/+55
Decrease the indentation of find_free_extent_update_loop() by inverting the check of whether the loop state is smaller than LOOP_NO_EMPTY_SIZE. This also allows an early return from find_free_extent_update_loop() in case LOOP_NO_EMPTY_SIZE is already set at this point. While at it, change an if/else-if/else chain to use curly braces on all branches, to be consistent with the rest of the btrfs code. Also change 'int exists' to 'bool have_trans', giving it a more meaningful name and type. No functional changes intended. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: unexport btrfs_qgroup_reserve_meta()Filipe Manana-6/+3
There's only one caller outside qgroup.c of btrfs_qgroup_reserve_meta(), and btrfs_qgroup_reserve_meta_prealloc() is a wrapper around that function. Make that caller use btrfs_qgroup_reserve_meta_prealloc() and unexport btrfs_qgroup_reserve_meta(), simplifying the external API. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: collapse __btrfs_qgroup_reserve_meta() into btrfs_qgroup_reserve_meta_prealloc()Filipe Manana-18/+8
Since __btrfs_qgroup_reserve_meta() is only called by btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper, get rid of the latter and rename __btrfs_qgroup_reserve_meta() to btrfs_qgroup_reserve_meta_prealloc(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: collapse __btrfs_qgroup_free_meta() into btrfs_qgroup_free_meta_prealloc()Filipe Manana-14/+8
Since __btrfs_qgroup_free_meta() is only called by btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get rid of the latter and rename __btrfs_qgroup_free_meta() to btrfs_qgroup_free_meta_prealloc(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove unused qgroup functions for pertrans reservation and freeingFilipe Manana-16/+1
They have no more users since commit a6496849671a ("btrfs: fix start transaction qgroup rsv double free"), so remove them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: optimize clearing all bits from first extent record in an io treeFilipe Manana-2/+42
When we are clearing all the bits from the first record that contains the target range and that record ends at or before our target range but starts before our target range, we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjusting that record's start offset to the start of our range and making the prealloc state have a range going from the original start offset of that first record to the start offset of our target range, with the same bits as that first record. Then we insert the prealloc extent in the rbtree - this is done in split_state();

3) Removing our adjusted first state from the rbtree since all the bits were cleared - this is done in clear_state_bit().

This is only wasting time when we can simply trim that first record, so that it represents the range from its start offset to the start offset of our target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree.

This patch is the last patch of a patchset comprised of the following patches (in descending order):

  btrfs: optimize clearing all bits from first extent record in an io tree
  btrfs: panic instead of warn when splitting extent state not in the tree
  btrfs: free cached state outside critical section in wait_extent_bit()
  btrfs: avoid unnecessary wake ups on io trees when there are no waiters
  btrfs: remove wake parameter from clear_state_bit()
  btrfs: change last argument of add_extent_changeset() to boolean
  btrfs: use extent_io_tree_panic() instead of BUG_ON()
  btrfs: make add_extent_changeset() only return errors or success
  btrfs: tag as unlikely branches that call extent_io_tree_panic()
  btrfs: turn extent_io_tree_panic() into a macro for better error reporting
  btrfs: optimize clearing all bits from the last extent record in an io tree

The following fio script was used to measure performance before and after applying all the patches:

$ cat ./fio-io-uring-2.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS=""

if [ $# -ne 3 ]; then
    echo "Use $0 NUM_JOBS FILE_SIZE RUN_TIME"
    exit 1
fi

NUM_JOBS=$1
FILE_SIZE=$2
RUN_TIME=$3

cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randwrite
fsync=0
fallocate=none
group_reporting=1
direct=1
ioengine=io_uring
fixedbufs=1
iodepth=64
bs=4K
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF

echo performance | \
    tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo

umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT

fio /tmp/fio-job.ini

umount $MNT

When running this script on a 12 cores machine using a 16G null block device the results were the following:

Before patchset:

$ ./fio-io-uring-2.sh 12 8G 60
(...)
WRITE: bw=74.8MiB/s (78.5MB/s), 74.8MiB/s-74.8MiB/s (78.5MB/s-78.5MB/s), io=4504MiB (4723MB), run=60197-60197msec

After patchset:

$ ./fio-io-uring-2.sh 12 8G 60
(...)
WRITE: bw=82.2MiB/s (86.2MB/s), 82.2MiB/s-82.2MiB/s (86.2MB/s-86.2MB/s), io=4937MiB (5176MB), run=60027-60027msec

Also, using bpftrace to collect the duration (in nanoseconds) of all the btrfs_clear_extent_bit_changeset() calls done during that fio test and then making a histogram from that data yielded the following results:

Before patchset:

Count: 6304804
Range: 0.000 - 7587172.000; Mean: 2011.308; Median: 1219.000; Stddev: 17117.533
Percentiles: 90th: 1888.000; 95th: 2189.000; 99th: 16104.000
      0.000 -       8.098:       7 |
      8.098 -      40.385:      20 |
     40.385 -     187.254:     146 |
    187.254 -     855.347:  742048 #######
    855.347 -    3894.426: 5462542 #####################################################
   3894.426 -   17718.848:   41489 |
  17718.848 -   80604.558:   46085 |
  80604.558 -  366664.449:   11285 |
 366664.449 - 1667918.122:     961 |
1667918.122 - 7587172.000:     113 |

After patchset:

Count: 6282879
Range: 0.000 - 6029290.000; Mean: 1896.482; Median: 1126.000; Stddev: 15276.691
Percentiles: 90th: 1741.000; 95th: 2026.000; 99th: 15713.000
      0.000 -      60.014:      12 |
     60.014 -     217.984:      63 |
    217.984 -     784.949:  517515 #####
    784.949 -    2819.823: 5632335 #####################################################
   2819.823 -   10123.127:   55716 #
  10123.127 -   36335.184:   46034 |
  36335.184 -  130412.049:   25708 |
 130412.049 -  468060.350:    4824 |
 468060.350 - 1679903.189:     549 |
1679903.189 - 6029290.000:      84 |

Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
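The optimization in the commit above replaces a split-then-delete dance with an in-place trim of the first record. A small userspace model of the decision makes the invariant clear; the struct and helper below are illustrative, not the kernel's extent_state handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy extent state record: an inclusive [start, end] range. */
struct toy_state { unsigned long long start, end; };

/* Clearing ALL bits of `st` over the target range [cs, ce].  When the
 * record starts before cs but does not extend past ce, the old code
 * would split the record and then delete the second half; trimming it
 * in place to [st->start, cs - 1] reaches the same end state with no
 * allocation, insertion, or deletion.  Returns true when the cheap
 * trim applies.  (Illustrative model only.) */
bool try_trim_first_state(struct toy_state *st,
			  unsigned long long cs, unsigned long long ce)
{
	if (st->start >= cs || st->end > ce)
		return false;	/* needs the generic split path */
	st->end = cs - 1;	/* keep only the untouched front part */
	return true;
}
```

A record [0, 8191] with a clear range starting at 4096 simply becomes [0, 4095]; a record extending past the clear range still needs the split.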
2026-04-07btrfs: panic instead of warn when splitting extent state not in the treeFilipe Manana-7/+6
We are not expected ever to split an extent state record that is not in the rbtree, as every record we pass to split_state() was found by iterating the rbtree, so if that ever happens it means we are not holding the extent io tree's spinlock or we have some memory corruption. Instead of simply warning in case the extent state record passed to split_state() is not in the rbtree, panic as this is a serious problem. Also tag as unlikely the case where the record is not in the rbtree. This also makes a tiny reduction in the btrfs module's text size.

Before:

$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
2000080  174328   15592 2190000  216ab0 fs/btrfs/btrfs.ko

After:

$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
2000064  174328   15592 2189984  216aa0 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: free cached state outside critical section in wait_extent_bit()Filipe Manana-1/+1
There's no need to free the cached extent state record while holding the io tree's spinlock, it's just making the critical section longer than it needs to be. So just do it after unlocking the io tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: avoid unnecessary wake ups on io trees when there are no waitersFilipe Manana-8/+21
Whenever clearing the extent lock bits of an extent state record, we unconditionally call wake_up() on the state's waitqueue. Most of the time there are no waiters on the queue, so we are just wasting time calling wake_up(), since that requires locking and unlocking the queue's spinlock, disabling and re-enabling interrupts, function calls, and other minor overhead, all while we are holding a critical section delimited by the extent io tree's spinlock. So call wake_up() only if there are waiters on an extent state's wait queue. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
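The pattern above, check for waiters before paying for a wake-up, can be sketched with a toy queue; the counters below stand in for a waitqueue_active()-style check and the real wake_up(), and all names are illustrative.

```c
#include <assert.h>

/* Toy wait queue: how many tasks are waiting, and how many (expensive)
 * wake-ups we have actually issued. */
struct toy_waitqueue { int nr_waiters; int wakeups_issued; };

/* Only issue the wake-up when the queue actually has waiters; the
 * common no-waiter case returns without touching any lock. */
void maybe_wake_up(struct toy_waitqueue *wq)
{
	if (wq->nr_waiters == 0)
		return;		/* common case: skip the spinlock/IRQ overhead */
	wq->wakeups_issued++;	/* stands in for the real wake_up() */
}
```

Note the kernel version of this check needs a memory barrier or must run under the same lock the waiters take, so a waiter cannot be missed between enqueueing and the emptiness check; the toy model elides that.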
2026-04-07btrfs: remove wake parameter from clear_state_bit()Filipe Manana-10/+9
There's no need to pass the 'wake' parameter, we can determine if we have to wake up waiters by checking if EXTENT_LOCK_BITS is set in the bits to clear. So simplify things and remove the parameter. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: change last argument of add_extent_changeset() to booleanFilipe Manana-4/+4
The argument is used as a boolean but it's defined as an integer. Switch it to a boolean for better readability. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: use extent_io_tree_panic() instead of BUG_ON()Filipe Manana-2/+4
There's no need to call BUG_ON(), instead call extent_io_tree_panic(), which also calls BUG(), but it prints an additional error message with some useful information before hitting BUG(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: make add_extent_changeset() only return errors or successFilipe Manana-2/+6
Currently add_extent_changeset() always returns the return value from its call to ulist_add(), which can return an error, 0 or 1. There are no callers that care about the difference between 0 and 1; all but one check for negative values and ignore anything else, while the remaining caller (btrfs_clear_extent_bit_changeset()) must set its 'ret' variable to 0 after calling add_extent_changeset(), so that it does not return an unexpected value of 1 to its caller. So change add_extent_changeset() to only return errors or 0, avoiding that caller (and any future callers) having to deal with a return value of 1. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
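The normalization described above is the classic "errors or 0" wrapper: fold a tri-state return (error / already present / newly added) down to the two states callers actually care about. A minimal sketch, with a toy stand-in for ulist_add():

```c
#include <assert.h>

/* Toy ulist_add(): <0 on error, 0 if the element was already present,
 * 1 if it was newly added (mirroring the real function's contract). */
int toy_ulist_add(int already_present)
{
	return already_present ? 0 : 1;
}

/* Normalized wrapper: callers only ever see 0 or a negative errno, so
 * no caller needs to reset its 'ret' variable after a successful add.
 * (Illustrative model of the change, not the kernel function.) */
int add_changeset_normalized(int already_present)
{
	int ret = toy_ulist_add(already_present);
	if (ret < 0)
		return ret;
	return 0;
}
```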
2026-04-07btrfs: tag as unlikely branches that call extent_io_tree_panic()Filipe Manana-6/+6
It's unexpected to ever call extent_io_tree_panic(), so surround with 'unlikely' every if statement condition that leads to it, making it explicit to a reader and hinting the compiler to potentially generate better code. On x86_64, using gcc 14.2.0-19 from Debian, this resulted in a slight decrease of the btrfs module's text size.

Before:

$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1999832  174320   15592 2189744  2169b0 fs/btrfs/btrfs.ko

After:

$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1999768  174320   15592 2189680  216970 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: turn extent_io_tree_panic() into a macro for better error reportingFilipe Manana-9/+4
When extent_io_tree_panic() is called we get a stack trace that is not very useful, since the error message reports the location inside the extent_io_tree_panic() function and not the caller of the function. Example:

[ 7830.424291] BTRFS critical (device sdb): panic in extent_io_tree_panic:334: extent io tree error on add_extent_changeset state start 4083712 end 4112383 (errno=1 unknown)
[ 7830.426816] ------------[ cut here ]------------
[ 7830.427581] kernel BUG at fs/btrfs/extent-io-tree.c:334!
[ 7830.428495] Oops: invalid opcode: 0000 [#1] SMP PTI
[ 7830.429318] CPU: 5 UID: 0 PID: 1451600 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
[ 7830.430899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 7830.432771] RIP: 0010:extent_io_tree_panic+0x41/0x43 [btrfs]
[ 7830.433815] Code: 75 0a 48 8b (...)
[ 7830.436849] RSP: 0018:ffffd2334f4a3b68 EFLAGS: 00010246
[ 7830.437668] RAX: 0000000000000000 RBX: 00000000003ebfff RCX: 0000000000000000
[ 7830.438801] RDX: ffffffffc08d4368 RSI: ffffffffbb6ce475 RDI: ffff896501d6b780
[ 7830.439671] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
[ 7830.440575] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
[ 7830.441458] R13: ffff896547374c08 R14: 00000000003effff R15: ffff896547374c08
[ 7830.442333] FS: 00007f3e252af0c0(0000) GS:ffff896c6185d000(0000) knlGS:0000000000000000
[ 7830.443326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7830.444047] CR2: 00007f3e252ad000 CR3: 0000000113b0a004 CR4: 0000000000370ef0
[ 7830.444905] Call Trace:
[ 7830.445229]  <TASK>
[ 7830.445557]  btrfs_clear_extent_bit_changeset.cold+0x43/0x80 [btrfs]
[ 7830.446543]  btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
[ 7830.447308]  qgroup_free_reserved_data+0xf9/0x170 [btrfs]
[ 7830.448040]  btrfs_buffered_write+0x368/0x8e0 [btrfs]
[ 7830.448707]  btrfs_direct_write+0x1a5/0x480 [btrfs]
[ 7830.449396]  btrfs_do_write_iter+0x18c/0x210 [btrfs]
[ 7830.450167]  vfs_write+0x21f/0x450
[ 7830.450662]  ksys_write+0x5f/0xd0
[ 7830.451092]  do_syscall_64+0xe9/0xf20
[ 7830.451610]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Change extent_io_tree_panic() to a macro so that we get a report that gives the exact place where the error happens. Example after this change:

[63677.406061] BTRFS critical (device sdc): panic in btrfs_clear_extent_bit_changeset:744: extent io tree error on add_extent_changeset state start 1818624 end 1830911 (errno=1 unknown)
[63677.410055] ------------[ cut here ]------------
[63677.410910] kernel BUG at fs/btrfs/extent-io-tree.c:744!
[63677.411918] Oops: invalid opcode: 0000 [#1] SMP PTI
[63677.413032] CPU: 0 UID: 0 PID: 13028 Comm: fsstress Not tainted 7.0.0-rc2-btrfs-next-227+ #1 PREEMPT(full)
[63677.415139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[63677.417283] RIP: 0010:btrfs_clear_extent_bit_changeset.cold+0xcd/0x10c [btrfs]
[63677.418676] Code: 8b 37 48 8b (...)
[63677.421917] RSP: 0018:ffffd2290a417b30 EFLAGS: 00010246
[63677.422824] RAX: 0000000000000000 RBX: 00000000001befff RCX: 0000000000000000
[63677.424320] RDX: ffffffffc0970348 RSI: ffffffffa92ce475 RDI: ffff8897ded9dc80
[63677.429772] RBP: 0000000000001000 R08: 0000000000000000 R09: 00000000ffefffff
[63677.430787] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
[63677.431818] R13: ffff8897966655d8 R14: 00000000001bffff R15: ffff8897966655d8
[63677.432764] FS: 00007f5c074c50c0(0000) GS:ffff889ef3b1d000(0000) knlGS:0000000000000000
[63677.433940] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[63677.434787] CR2: 00007f5c074c3000 CR3: 000000014b9de002 CR4: 0000000000370ef0
[63677.435960] Call Trace:
[63677.436432]  <TASK>
[63677.436838]  btrfs_clear_record_extent_bits+0x19/0x20 [btrfs]
[63677.437980]  qgroup_free_reserved_data+0xf9/0x170 [btrfs]
[63677.439070]  btrfs_buffered_write+0x368/0x8e0 [btrfs]
[63677.439889]  btrfs_do_write_iter+0x1a8/0x210 [btrfs]
[63677.441460]  do_iter_readv_writev+0x145/0x240
[63677.446309]  vfs_writev+0x120/0x3b0
[63677.446878]  ? __do_sys_newfstat+0x33/0x60
[63677.447759]  ? do_pwritev+0x8a/0xd0
[63677.449119]  do_pwritev+0x8a/0xd0
[63677.452342]  do_syscall_64+0xe9/0xf20
[63677.452961]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
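The reason the macro version reports the caller is a general C property: `__func__` and `__LINE__` are resolved where the macro expands, not inside a shared helper function. A small userspace sketch of that mechanism (the names and report format are made up for illustration):

```c
#include <stdio.h>
#include <string.h>

static char last_report[256];

/* Macro version of the panic report: because __func__ and __LINE__
 * expand at the call site, the message names the caller, not a shared
 * helper function.  (Illustrative model of the change.) */
#define tree_panic_report(msg)					\
	snprintf(last_report, sizeof(last_report),		\
		 "panic in %s:%d: %s", __func__, __LINE__, (msg))

static void caller_of_interest(void)
{
	tree_panic_report("extent io tree error");
}
```

Had `tree_panic_report` been a function, every report would name that function and its one fixed line number, which is exactly the problem the commit fixes.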
2026-04-07btrfs: optimize clearing all bits from the last extent record in an io treeFilipe Manana-0/+39
When we are clearing all the bits from the last record that contains the target range (i.e. the record starts before our target range and ends beyond it), we are doing a lot of unnecessary work:

1) Allocating a prealloc state if we don't have one already;

2) Adjusting that last record's start offset to the end of our range and making the prealloc state have a range going from the original start offset of that last record to the end offset of our target range, with the same bits as the last record. Then we insert the prealloc extent in the rbtree - this is done in split_state();

3) Removing our prealloc state from the rbtree since all the bits were cleared - this is done in clear_state_bit().

This is only wasting time when we can simply trim the last record, so that its start offset is adjusted to the end of the target range. So optimize for that case and avoid the prealloc state allocation, insertion and deletion from the rbtree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove atomic parameter from btrfs_buffer_uptodate()Qu Wenruo-12/+9
That parameter was introduced by commit b9fab919b748 ("Btrfs: avoid sleeping in verify_parent_transid while atomic"). At that time we needed to lock the extent buffer range inside the io tree to avoid content changes, thus it could sleep. But that behavior is no longer there, as later commit 9e2aff90fc2a ("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the io tree lock. We can remove the @atomic parameter safely now. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: output more info when duplicated ordered extent is foundQu Wenruo-3/+8
During development of a new feature, I triggered that btrfs_panic() inside insert_ordered_extent() and spent quite some unnecessary time before noticing I was passing incorrect flags when creating a new ordered extent. Unfortunately the existing error message is not providing much help. Enhance the output to provide file offset, num bytes and flags of both the existing and the new ordered extents. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: check type flags in alloc_ordered_extent()Qu Wenruo-24/+44
Unlike other flags used in btrfs, the BTRFS_ORDERED_* macros are different as they cannot be directly used as flags. They are defined as bit numbers, thus they should be used with bit operations, not directly with logical operations. Unfortunately sometimes I forgot this and passed the incorrect flags to alloc_ordered_extent() and hit weird bugs. Enhance the type checks in alloc_ordered_extent():

- Make sure there is one and only one bit set for exclusive type flags. There are four exclusive type flags: REGULAR, NOCOW, PREALLOC and COMPRESSED. So introduce a new macro, BTRFS_ORDERED_EXCLUSIVE_FLAGS, to cover the above flags, and add an ASSERT() to check that one and only one of those exclusive flags is set for alloc_ordered_extent().

- Re-order the type bit numbers to the end of the enum. This makes it much harder to get a false negative. E.g., with the old code, where BTRFS_ORDERED_REGULAR starts at zero, the following flags would pass the bit uniqueness check:

  * BTRFS_ORDERED_NOCOW treated as BTRFS_ORDERED_REGULAR (1 == 1UL << 0).
  * BTRFS_ORDERED_PREALLOC treated as BTRFS_ORDERED_NOCOW (2 == 1UL << 1).
  * BTRFS_ORDERED_DIRECT treated as BTRFS_ORDERED_PREALLOC (4 == 1UL << 2).

  Now all those types start at 8, so passing any of those bit numbers directly as flags will not pass the ASSERT().

- Add a static assert to avoid overflow, making sure all BTRFS_ORDERED_* flags fit into an unsigned long.

Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
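The exclusivity check above boils down to a population count over a mask of the exclusive type bits. The sketch below models both ideas, a popcount-based "exactly one bit" check and bit numbers placed high so that raw masks passed by mistake cannot alias a valid type; the enum values and macro names are illustrative, not the kernel's.

```c
#include <assert.h>

/* Illustrative bit numbers placed at 8+, so accidental raw masks like
 * 1, 2 or 4 (the old aliasing bug) can no longer look like a single
 * valid type bit. */
enum { T_REGULAR = 8, T_NOCOW, T_PREALLOC, T_COMPRESSED };

#define EXCLUSIVE_MASK ((1UL << T_REGULAR) | (1UL << T_NOCOW) | \
			(1UL << T_PREALLOC) | (1UL << T_COMPRESSED))

/* One and only one exclusive type bit must be set (models the ASSERT()). */
int valid_type_flags(unsigned long flags)
{
	return __builtin_popcountl(flags & EXCLUSIVE_MASK) == 1;
}
```

Note that `2` (the old value of `1UL << BTRFS_ORDERED_NOCOW` misused as a bit number) now has zero bits inside the mask and is rejected, which is exactly the false negative the re-ordering closes.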
2026-04-07btrfs: revalidate cached tree blocks on the uptodate pathZhengYuan Huang-10/+39
read_extent_buffer_pages_nowait() returns immediately when an extent buffer is already marked uptodate. On that cache-hit path, the caller supplied btrfs_tree_parent_check is not re-run. This can let read_tree_root_path() accept a cached tree block whose actual header level/owner does not match the expected value derived from the parent. E.g. a corrupted root item that points to a tree block which doesn't even belong to that root and has mismatching level/owner: that tree block is already read and cached, and later the corrupted tree root is read from disk and hits the cached tree block. Fix this by re-validating cached extent buffers against the supplied btrfs_tree_parent_check on the uptodate path, and make read_tree_root_path() pass its check to btrfs_buffer_uptodate(). This makes cache hits and fresh reads follow the same tree-parent verification rules, and turns the corruption into a read failure instead of constructing an inconsistent root object. Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Resolve the conflict with extent_buffer_uptodate() helper, handle transid mismatch case ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: prefer IS_ERR_OR_NULL() over manual NULL checkPhilipp Hahn-4/+4
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL check. IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does not like nesting it: > WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses > unlikely() internally Remove the explicit use of likely(). Change generated with coccinelle. Signed-off-by: Philipp Hahn <phahn-oss@avm.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_BITMAPZhengYuan Huang-0/+43
Introduce checks for FREE_SPACE_BITMAP item, which include: - Key alignment check Same as FREE_SPACE_EXTENT, the objectid is the logical bytenr of the free space, and offset is the length of the free space, so both should be aligned to the fs block size. - Non-zero range check A zero key->offset would describe an empty bitmap, which is invalid. - Item size check The item must hold exactly DIV_ROUND_UP(key->offset >> sectorsize_bits, BITS_PER_BYTE) bytes. A mismatch indicates a truncated or otherwise corrupt bitmap item; without this check, the bitmap loading path would walk past the end of the leaf and trigger a NULL dereference in assert_eb_folio_uptodate(). Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
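The item size rule above is plain arithmetic: a bitmap covering `key->offset` bytes of space at one bit per block needs exactly ceil(blocks / 8) bytes. A userspace sketch of that expected-size computation (the helper name is made up; DIV_ROUND_UP mirrors the kernel macro):

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BITS_PER_BYTE 8

/* Expected byte size of a FREE_SPACE_BITMAP item covering `length`
 * bytes of space, with `sectorsize_bits` the block-size shift
 * (e.g. 12 for 4K blocks).  A tree-checker would reject any item whose
 * on-disk size differs from this.  (Illustrative helper.) */
unsigned long long bitmap_item_size(unsigned long long length,
				    unsigned int sectorsize_bits)
{
	return DIV_ROUND_UP(length >> sectorsize_bits, BITS_PER_BYTE);
}
```

For a 128MiB range with 4K blocks that is 32768 bits, i.e. 4096 bytes; a shorter item would make the bitmap loader walk past the end of the leaf, which is the crash the check prevents.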
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_EXTENTQu Wenruo-0/+29
Introduce FREE_SPACE_EXTENT checks, which include: - The key alignment check The objectid is the logical bytenr of the free space, and offset is the length of the free space, thus they should all be aligned to the fs block size. - The item size check The FREE_SPACE_EXTENT item should have a size of zero. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_INFOQu Wenruo-1/+52
Introduce checks for the FREE_SPACE_INFO item, which include:

- Key alignment check. The objectid is the logical bytenr of the chunk/bg, and offset is the length of the chunk/bg, thus they should all be aligned to the fs block size.

- Item size check. The FREE_SPACE_INFO item should have a fixed size.

- Flags check. The flags member should have no flags other than BTRFS_FREE_SPACE_USING_BITMAPS. For future expansion, introduce a new macro, BTRFS_FREE_SPACE_FLAGS_MASK, for such checks. And since we're here, BTRFS_FREE_SPACE_USING_BITMAPS should not use unsigned long long, as the flags field is only 32 bits wide. So fix that to use unsigned long.

- Extent count check. That member shows how many free space bitmap/extent items there are inside the chunk/bg. We know the chunk size (from key->offset), thus there should be at most (key->offset >> sectorsize_bits) blocks inside the chunk. Use that value as the upper limit; if that counter is larger than that, there is a high chance it's a bit flip in the high bits.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: zoned: limit number of zones reclaimed in flush_space()Johannes Thumshirn-6/+14
Limit the number of zones reclaimed in flush_space()'s RECLAIM_ZONES state. This prevents possibly long-running reclaim sweeps from blocking other tasks in the system while the system is under pressure anyway, causing those tasks to hang. An example of this can be seen here, triggered by fstests generic/551:

generic/551
[   27.042349] run fstests generic/551 at 2026-02-27 11:05:30
BTRFS: device fsid 78c16e29-20d9-4c8e-bc04-7ba431be38ff devid 1 transid 8 /dev/vdb (254:16) scanned by mount (806)
BTRFS info (device vdb): first mount of filesystem 78c16e29-20d9-4c8e-bc04-7ba431be38ff
BTRFS info (device vdb): using crc32c checksum algorithm
BTRFS info (device vdb): host-managed zoned block device /dev/vdb, 64 zones of 268435456 bytes
BTRFS info (device vdb): zoned mode enabled with zone size 268435456
BTRFS info (device vdb): checking UUID tree
BTRFS info (device vdb): enabling free space tree
INFO: task kworker/u38:1:90 blocked for more than 120 seconds.
      Not tainted 7.0.0-rc1+ #345
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u38:1 state:D stack:0 pid:90 tgid:90 ppid:2 task_flags:0x4208060 flags:0x00080000
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
 <TASK>
 __schedule+0x34f/0xe70
 schedule+0x41/0x140
 schedule_timeout+0xa3/0x110
 ? mark_held_locks+0x40/0x70
 ? lockdep_hardirqs_on_prepare+0xd8/0x1c0
 ? trace_hardirqs_on+0x18/0x100
 ? lockdep_hardirqs_on+0x84/0x130
 ? _raw_spin_unlock_irq+0x33/0x50
 wait_for_completion+0xa4/0x150
 ? __flush_work+0x24c/0x550
 __flush_work+0x339/0x550
 ? __pfx_wq_barrier_func+0x10/0x10
 ? wait_for_completion+0x39/0x150
 flush_space+0x243/0x660
 ? find_held_lock+0x2b/0x80
 ? kvm_sched_clock_read+0x11/0x20
 ? local_clock_noinstr+0x17/0x110
 ? local_clock+0x15/0x30
 ? lock_release+0x1b7/0x4b0
 do_async_reclaim_data_space+0xe8/0x160
 btrfs_async_reclaim_data_space+0x19/0x30
 process_one_work+0x20a/0x5f0
 ? lock_is_held_type+0xcd/0x130
 worker_thread+0x1e2/0x3c0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x103/0x150
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x20d/0x320
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Showing all locks held in the system:
1 lock held by khungtaskd/67:
 #0: ffffffff824d58e0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x3d/0x194
2 locks held by kworker/u38:1/90:
 #0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
 #1: ffffc90000c17e58 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
5 locks held by kworker/u39:1/191:
 #0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
 #1: ffffc90000dfbe58 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
 #2: ffff888101da0420 (sb_writers#9){.+.+}-{0:0}, at: process_one_work+0x20a/0x5f0
 #3: ffff88811040a648 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: btrfs_reclaim_bgs_work+0x1de/0x770
 #4: ffff888110408a18 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x95a/0x20f0
1 lock held by aio-dio-write-v/980:
 #0: ffff888110093008 (&sb->s_type->i_mutex_key#15){++++}-{4:4}, at: btrfs_inode_lock+0x51/0xb0
=============================================

To prevent these long running reclaims from blocking the system, only reclaim 5 block groups in the RECLAIM_ZONES state of flush_space(). Also, as these reclaims are now constrained, this opens up the use of a synchronous call to btrfs_reclaim_block_groups(), eliminating the need to place the reclaim task on a workqueue and then flush the workqueue again. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
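The constraint introduced above is a simple batched sweep: process at most N queued block groups per pass and leave the rest for a later pass, bounding how long any one flush can run. A toy userspace model (the cap constant and names are illustrative, not the kernel's):

```c
#include <assert.h>

#define MAX_RECLAIM_PER_FLUSH 5	/* cap per flush pass, for illustration */

/* Toy sweep: reclaim at most `max` of the `*queued` block groups in one
 * pass; the remainder stays queued for a subsequent pass instead of
 * blocking the current one for an unbounded time.  Returns how many
 * were reclaimed in this pass. */
int reclaim_batch(int *queued, int max)
{
	int done = 0;

	while (*queued > 0 && done < max) {
		(*queued)--;	/* stands in for reclaiming one block group */
		done++;
	}
	return done;
}
```

Bounding each pass also makes it reasonable to call the sweep synchronously, which is what lets the commit drop the workqueue round-trip.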
2026-04-07btrfs: create btrfs_reclaim_block_groups()Johannes Thumshirn-3/+9
Create a function btrfs_reclaim_block_groups() that gets called from the block-group reclaim worker. This allows creating synchronous block_group reclaim later on. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: move reclaiming of a single block group into its own functionJohannes Thumshirn-123/+133
The main work of reclaiming a single block-group in btrfs_reclaim_bgs_work() is done inside the loop iterating over all the block_groups in the fs_info->reclaim_bgs list. Factor out reclaim of a single block group from the loop to improve readability. No functional change intended. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: extract inlined creation into a dedicated delalloc helperQu Wenruo-110/+110
Currently we call cow_file_range_inline() in different situations, from regular cow_file_range() to compress_file_range(). This is because inline extent creation has different conditions based on whether it's a compressed one or not. But on the other hand, inline extent creation shouldn't be so scattered; we can just have a dedicated branch in btrfs_run_delalloc_range(). This is most obvious for compressed inline cases: it makes no sense to go through the whole complex async extent mechanism just to inline a single block. So here we introduce a dedicated run_delalloc_inline() helper, and remove all inline-related handling from cow_file_range() and compress_file_range(). One special update to inode_need_compress() is that a new @check_inline parameter is introduced. This allows inline-specific checks to be done inside run_delalloc_inline(), which permits single block compression, while all other call sites always reject single block compression. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: move the mapping_set_error() out of the loop in end_bbio_data_write()Qu Wenruo-3/+3
Previously we had to call mapping_set_error() inside the for_each_folio_all() loop, because we did not have a better way to grab an inode other than through folio->mapping. But nowadays every btrfs_bio has its inode member populated, thus we can easily grab the inode and its i_mapping without help from a folio. Now we can move that mapping_set_error() out of the loop, and use bbio->inode to grab the i_mapping. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove the alignment check in end_bbio_data_write()Qu Wenruo-11/+0
The check is not necessary because: - There is already assert_bbio_alignment() at btrfs_submit_bbio() - There is also btrfs_subpage_assert() for all btrfs_folio_*() helpers - The original commit mentions the check may go away in the future Commit 17a5adccf3fd01 ("btrfs: do away with non-whole_page extent I/O") introduced the check first, and in the commit message: I've replaced the whole_page computations with warnings, just to be sure that we're not issuing partial page reads or writes. The warnings should probably just go away some time. - No similar check exists in any other endio function No matter if it's a data read, compressed read or write. - There has been no such report for a very long time I do not even remember if there ever was any such report. Thus the need for such a check in end_bbio_data_write() is very weak, and we can just get rid of it. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: add tracepoint for search slot restart trackingLeo Martins-2/+32
Add a btrfs_search_slot_restart tracepoint that fires at each restart site in btrfs_search_slot(), recording the root, tree level, and reason for the restart. This enables tracking search slot restarts which contribute to COW amplification under memory pressure. The four restart reasons are: - write_lock: insufficient write lock level, need to restart with higher lock - setup_nodes: node setup returned -EAGAIN - slot_zero: insertion at slot 0 requires higher write lock level - read_block: read_block_for_search returned -EAGAIN (block not cached or lock contention) COW counts are already tracked by the existing trace_btrfs_cow_block() tracepoint. The per-restart-site tracepoint avoids counter overhead in the critical path when tracepoints are disabled, and provides richer per-event information that bpftrace scripts can aggregate into counts, histograms, and per-root breakdowns. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: inhibit extent buffer writeback to prevent COW amplificationLeo Martins-3/+98
Inhibit writeback on COW'd extent buffers for the lifetime of the transaction handle, preventing background writeback from setting BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW. COW amplification occurs when background writeback flushes an extent buffer that a transaction handle is still actively modifying. When lock_extent_buffer_for_io() transitions a buffer from dirty to writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as having been persisted to disk at its current bytenr. Once WRITTEN is set, should_cow_block() must either COW the block again or overwrite it in place, both of which are unnecessary overhead when the buffer is still being modified by the same handle that allocated it. By inhibiting background writeback on actively-used buffers, WRITTEN is never set while a transaction handle holds a reference to the buffer, avoiding this overhead entirely. Add an atomic_t writeback_inhibitors counter to struct extent_buffer, which fits in an existing 6-byte hole without increasing struct size. When a buffer is COW'd in btrfs_force_cow_block(), call btrfs_inhibit_eb_writeback() to store the eb in the transaction handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a reference, and increment writeback_inhibitors. The function handles dedup (same eb inhibited twice by the same handle) and replacement (different eb at the same logical address). Allocation failure is graceful: the buffer simply falls back to the pre-existing behavior where it may be written back and re-COW'd. Also inhibit writeback in should_cow_block() when COW is skipped, so that every transaction handle that reuses an already-COW'd buffer also inhibits its writeback. Without this, if handle A COWs a block and inhibits it, and handle B later reuses the same block without inhibiting, handle A's uninhibit on end_transaction leaves the buffer unprotected while handle B is still using it. 
This ensures all handles that access a COW'd buffer contribute to the inhibitor count, and the buffer remains protected until the last handle releases it. In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE is used by the VM flusher threads for background and periodic writeback, which are the only paths that cause COW amplification by opportunistically writing out dirty extent buffers mid-transaction. Skipping these is safe because the buffers remain dirty in the page cache and will be written out at transaction commit time. WB_SYNC_ALL must always proceed regardless of writeback_inhibitors. This is required for correctness in the fsync path: btrfs_sync_log() writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL) while the transaction handle that inhibited those same blocks is still active. Without the WB_SYNC_ALL bypass, those inhibited log tree blocks would be silently skipped, resulting in an incomplete log on disk and corruption on replay. btrfs_write_and_wait_transaction() also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path, inhibitors are already cleared beforehand, but the bypass ensures correctness regardless. Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers) to prevent a race where the committer proceeds while buffers are still inhibited. Also uninhibit in btrfs_commit_transaction() before writing and in cleanup_transaction() for the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
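The inhibitor-count semantics described above (counted inhibits per buffer, with a WB_SYNC_ALL bypass) can be modeled in a few lines of plain C. This is a minimal sketch with atomics and locking elided; the struct and function names echo the commit message but are illustrative, not the kernel's:

```c
#include <assert.h>

/* Minimal model of per-extent-buffer writeback inhibition. */
struct eb_model { int writeback_inhibitors; };

enum wb_mode { WB_SYNC_NONE, WB_SYNC_ALL };

static void inhibit(struct eb_model *eb)   { eb->writeback_inhibitors++; }
static void uninhibit(struct eb_model *eb) { eb->writeback_inhibitors--; }

/* Background (WB_SYNC_NONE) writeback skips inhibited buffers;
 * WB_SYNC_ALL (fsync/log/commit paths) must always proceed. */
static int may_write_back(const struct eb_model *eb, enum wb_mode mode)
{
    if (mode == WB_SYNC_ALL)
        return 1;
    return eb->writeback_inhibitors == 0;
}
```

The key property is that a skipped buffer stays dirty, so nothing is lost: it is simply written later, at commit time, after the last handle has dropped its inhibit.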
2026-04-07btrfs: extract the max compression chunk size into a macroQu Wenruo-2/+5
We have two locations using open-coded 512K size, as the async chunk size. For compression we have not only the max size a compressed extent can represent (128K), but also how large an async chunk can be (512K). Although we have a macro for the maximum compressed extent size, we do not have any macro for the async chunk size. Add such a macro and replace the two open-coded SZ_512K. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: remove pointless error check in btrfs_check_dir_item_collision()Filipe Manana-3/+1
We're under the IS_ERR() branch so we know that 'ret', which got assigned the value of PTR_ERR(di) is always negative, so there's no point in checking if it's negative. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
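Why the removed check was redundant follows directly from the kernel's error-pointer convention: inside an IS_ERR() branch, PTR_ERR() always yields a value in [-MAX_ERRNO, -1]. A userspace re-creation of the helpers (for illustration only; the real definitions live in the kernel's include/linux/err.h) makes this concrete:

```c
#include <assert.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)      { return (void *)error; }
static inline long  PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int   IS_ERR(const void *ptr)  { return IS_ERR_VALUE((unsigned long)ptr); }

/* Hypothetical caller mirroring the pattern in the commit. */
static long dir_item_err(void *di)
{
    if (IS_ERR(di)) {
        long ret = PTR_ERR(di);
        /* 'ret' is guaranteed negative here, so a follow-up
         * 'if (ret < 0)' check, as the removed code had, is
         * redundant. */
        return ret;
    }
    return 0;
}
```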
2026-04-07btrfs: remove duplicated uuid tree existence check in btrfs_uuid_tree_add()Filipe Manana-4/+1
There's no point in checking if the uuid root exists in btrfs_uuid_tree_add(), since we already do it in btrfs_uuid_tree_lookup(). We can just remove the check from btrfs_uuid_tree_add() and make btrfs_uuid_tree_lookup() return -EINVAL instead of -ENOENT in case the uuid tree does not exist. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: stop checking for -EEXIST return value from btrfs_uuid_tree_add()Filipe Manana-2/+2
We never return -EEXIST from btrfs_uuid_tree_add(), if the item already exists we extend it, so it's pointless to check for such return value. Furthermore, in create_pending_snapshot(), the logic is completely broken. The goal was to not error out and abort the transaction in case of -EEXIST but we left 'ret' with the -EEXIST value, so we end up setting pending->error to -EEXIST and return that error up the call chain up to btrfs_commit_transaction(), which will abort the transaction. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: report filesystem shutdown via fserrorMiquel Sabaté Solà-1/+4
Commit 347b7042fb26 ("Merge patch series "fs: generic file IO error reporting"") has introduced a common framework for reporting errors to fsnotify in a standard way. One of the functions introduced is fserror_report_shutdown(), which, combined with the experimental support for shutdown in btrfs, means that user-space can now easily detect whenever a btrfs filesystem has been marked as shut down. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: replace kcalloc() calls with kzalloc_objs()Miquel Sabaté Solà-13/+10
Commit 2932ba8d9c99 ("slab: Introduce kmalloc_obj() and family") introduced, among many others, the kzalloc_objs() helper, which has some benefits over kcalloc(). Namely, internal introspection of the allocated type now becomes possible, allowing for future alignment-aware choices to be made by the allocator and future hardening work that can be type sensitive. Dropping the 'sizeof' also comes as a nice side effect. Moreover, this also allows us to be in line with the recent tree-wide migration to the kmalloc_obj() family of helpers. See commit 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types"). Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: do compressed bio size roundup and zeroing in one goQu Wenruo-22/+5
Currently we zero out all the remaining bytes of the last folio of the compressed bio, then round the bio size to the fs block boundary. But that is done in two different functions: zero_last_folio() to zero the remaining bytes of the last folio, and round_up_last_block() to round up the bio to the fs block boundary. There are some minor problems: - zero_last_folio() is zeroing ranges we won't submit This mostly affects block size < page size cases, where we can have a large folio (e.g. 64K) but the fs block size is only 4K. In that case, we may only want to submit the first 4K of the folio; the remaining range won't matter, but we still zero it all. This causes unnecessary CPU usage just to zero out some bytes we won't utilize. - compressed_bio_last_folio() is called twice in two different functions While in theory we only need to call it once. Enhance the situation by: - Only zero out bytes up to the fs block boundary This reduces some overhead for bs < ps cases. - Move the folio_zero_range() call into round_up_last_block() So that we can reuse the same folio returned by compressed_bio_last_folio(). Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
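The "round up and zero in one go" idea amounts to a single helper that pads only up to the fs block boundary and leaves everything past it untouched. A minimal userspace sketch (4K block size and the helper name are assumptions for illustration, not the kernel code):

```c
#include <assert.h>
#include <string.h>

#define FS_BLOCK_SIZE 4096u

/* Round 'data_len' up to the fs block boundary and zero only the
 * tail of that last block.  Bytes at and beyond the rounded size
 * are never submitted, so they are deliberately left untouched --
 * this is the saving for block size < page size setups. */
static unsigned int round_up_and_zero(unsigned char *buf,
                                      unsigned int data_len)
{
    unsigned int rounded =
        (data_len + FS_BLOCK_SIZE - 1) & ~(FS_BLOCK_SIZE - 1);

    memset(buf + data_len, 0, rounded - data_len);
    return rounded;
}
```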
2026-04-07btrfs: reduce the size of compressed_bioQu Wenruo-10/+6
The member compressed_bio::compressed_len can be replaced by the bio size, as we always submit the full compressed data without any partial read/write. Furthermore we already have enough ASSERT()s making sure the bio size matches the ordered extent or the extent map. This saves 8 bytes from compressed_bio: Before: struct compressed_bio { u64 start; /* 0 8 */ unsigned int len; /* 8 4 */ unsigned int compressed_len; /* 12 4 */ u8 compress_type; /* 16 1 */ bool writeback; /* 17 1 */ /* XXX 6 bytes hole, try to pack */ struct btrfs_bio * orig_bbio; /* 24 8 */ struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 32 304 */ /* XXX last struct has 1 bit hole */ /* size: 336, cachelines: 6, members: 7 */ /* sum members: 330, holes: 1, sum holes: 6 */ /* member types with bit holes: 1, total: 1 */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); After: struct compressed_bio { u64 start; /* 0 8 */ unsigned int len; /* 8 4 */ u8 compress_type; /* 12 1 */ bool writeback; /* 13 1 */ /* XXX 2 bytes hole, try to pack */ struct btrfs_bio * orig_bbio; /* 16 8 */ struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 24 304 */ /* XXX last struct has 1 bit hole */ /* size: 328, cachelines: 6, members: 6 */ /* sum members: 326, holes: 1, sum holes: 2 */ /* member types with bit holes: 1, total: 1 */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: introduce a common helper to calculate the size of a bioQu Wenruo-29/+17
We have several call sites doing the same work to calculate the size of a bio: struct bio_vec *bvec; u32 bio_size = 0; int i; bio_for_each_bvec_all(bvec, bio, i) bio_size += bvec->bv_len; We can use a common helper instead of open-coding it everywhere. This also allows us to constify the @bio_size variables used in all the call sites. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
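The helper factored out above is just a fold over the segment lengths. A standalone sketch with a simplified bio_vec stand-in (real code iterates the bio with bio_for_each_bvec_all(); names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a bio_vec: only the length matters here. */
struct bvec { unsigned int bv_len; };

/* Common helper replacing the open-coded loops: sum all segment
 * lengths to get the total bio size. */
static unsigned int bio_size(const struct bvec *bvecs, size_t nr)
{
    unsigned int size = 0;

    for (size_t i = 0; i < nr; i++)
        size += bvecs[i].bv_len;
    return size;
}
```

With the loop behind one helper, call sites can compute the size once into a const variable, which is the constification the commit mentions.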
2026-04-07btrfs: remove redundant nowait check in lock_extent_direct()Alexey Velichayshiy-1/+1
The nowait flag is always false in this context, making the conditional check unnecessary. Simplify the code by directly assigning -ENOTBLK. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: fix placement of unlikely() in btrfs_insert_one_raid_extent()Mark Harmstone-1/+1
Fix the unlikely added to btrfs_insert_one_raid_extent() by commit a929904cf73b65 ("btrfs: add unlikely annotations to branches leading to transaction abort"): the exclamation point is in the wrong place, so we are telling the compiler that allocation failure is actually expected. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
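The bug class fixed here is easy to reproduce in a few lines: the negation must live inside the hint. A self-contained sketch using __builtin_expect stand-ins, as the kernel defines them (the checking function is hypothetical):

```c
#include <assert.h>

/* The kernel's branch-prediction hints, defined via a GCC/Clang
 * builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int check_alloc(void *p)
{
    /* Correct: 'if (unlikely(!p))' tells the compiler that
     * allocation failure is the cold path.  The buggy form,
     * 'if (!unlikely(p))', hints the opposite -- that a valid
     * pointer is the rare case -- while computing the same truth
     * value, which is why the bug is silent functionally. */
    if (unlikely(!p))
        return -12; /* -ENOMEM */
    return 0;
}
```

Both spellings behave identically at runtime; only the layout of hot and cold code paths differs, so such mistakes are only visible by reading the source or the generated assembly.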
2026-04-07btrfs: pass a btrfs inode to tree-log.c:fill_inode_item()Filipe Manana-25/+23
All internal functions should be given a btrfs_inode for consistency and not a VFS inode. So pass a btrfs_inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: stop printing condition result in assertion failure messagesFilipe Manana-4/+4
It's useless to print the result of the condition: it's always 0 if the assertion is triggered, so it doesn't provide any useful information. Examples: assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991 assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476 So stop printing that, it's always ":: 0" for any assertion triggered (except for conditions that are just an identifier). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
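The message change boils down to dropping the ":: <result>" suffix from the formatted string. A small sketch of the resulting format (the helper is hypothetical; the real code stringifies the condition with the preprocessor's # operator inside an assertion macro):

```c
#include <stdio.h>
#include <string.h>

/* Format an assertion-failure message without the old ":: 0"
 * suffix -- the result was always 0 when the assertion fired, so
 * it carried no information. */
static void format_assert_msg(char *buf, size_t len, const char *cond,
                              const char *file, int line)
{
    snprintf(buf, len, "assertion failed: %s, in %s:%d",
             cond, file, line);
}
```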
2026-04-07btrfs: constify arguments of some functionsFilipe Manana-12/+13
There are several functions that take pointer arguments but don't need to modify the objects they point to, so add the const qualifiers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: avoid unnecessary root node COW during snapshottingFilipe Manana-15/+5
There's no need to COW the root node of the subvolume we are snapshotting because we then call btrfs_copy_root(), which creates a copy of the root node and sets its generation to the current transaction. So remove this redundant COW right before calling btrfs_copy_root(), saving one extent allocation, memory allocation, copying things, etc, and making the code less confusing. Also rename the extent buffer variable from "old" to "root_eb" since that name no longer makes any sense after removing the unnecessary COW operation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>