summaryrefslogtreecommitdiffstats
path: root/fs/f2fs/node.c
AgeCommit message (Collapse)AuthorLines
2026-04-21Merge tag 'f2fs-for-7.1-rc1' of ↵Linus Torvalds-53/+59
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, the changes primarily focus on resolving race conditions, memory safety issues (UAF), and improving the robustness of garbage collection (GC), and folio management. Enhancements: - add page-order information for large folio reads in iostat - add defrag_blocks sysfs node Bug fixes: - fix uninitialized kobject put in f2fs_init_sysfs() - disallow setting an extension to both cold and hot - fix node_cnt race between extent node destroy and writeback - preserve previous reserve_{blocks,node} value when remount - freeze GC and discard threads quickly - fix false alarm of lockdep on cp_global_sem lock - fix data loss caused by incorrect use of nat_entry flag - skip empty sections in f2fs_get_victim - fix inline data not being written to disk in writeback path - fix fsck inconsistency caused by FGGC of node block - fix fsck inconsistency caused by incorrect nat_entry flag usage - call f2fs_handle_critical_error() to set cp_error flag - fix fiemap boundary handling when read extent cache is incomplete - fix use-after-free of sbi in f2fs_compress_write_end_io() - fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io() - fix incorrect file address mapping when inline inode is unwritten - fix incomplete search range in f2fs_get_victim when f2fs_need_rand_seg is enabled - avoid memory leak in f2fs_rename()" * tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (35 commits) f2fs: add page-order information for large folio reads in iostat f2fs: do not support mmap write for large folio f2fs: fix uninitialized kobject put in f2fs_init_sysfs() f2fs: protect extension_list reading with sb_lock in f2fs_sbi_show() f2fs: disallow setting an extension to both cold and hot f2fs: fix node_cnt race between extent node destroy and writeback f2fs: allow empty mount string for Opt_usr|grp|projjquota f2fs: fix to preserve previous reserve_{blocks,node} value when remount f2fs: invalidate block device page cache on umount f2fs: fix to freeze GC and discard threads quickly f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footer f2fs: fix false alarm of lockdep on cp_global_sem lock f2fs: fix data loss caused by incorrect use of nat_entry flag f2fs: fix to skip empty sections in f2fs_get_victim f2fs: fix inline data not being written to disk in writeback path f2fs: fix fsck inconsistency caused by FGGC of node block f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usage f2fs: fix to do sanity check on dcc->discard_cmd_cnt conditionally f2fs: refactor node footer flag setting related code f2fs: refactor f2fs_move_node_folio function ...
2026-04-15Merge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
2026-04-13Merge tag 'vfs-7.1-rc1.kino' of ↵Linus Torvalds-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64
2026-04-05folio_batch: rename pagevec.h to folio_batch.hTal Zussman-1/+1
struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct pagevec"). Rename include/linux/pagevec.h to reflect reality and update includes tree-wide. Add the new filename to MAINTAINERS explicitly, as it no longer matches the "include/linux/page[-_]*" pattern in MEMORY MANAGEMENT - CORE. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-02f2fs: fix data loss caused by incorrect use of nat_entry flagYongpeng Yang-0/+3
Data loss can occur when fsync is performed on a newly created file (before any checkpoint has been written) concurrently with a checkpoint operation. The scenario is as follows: create & write & fsync 'file A' write checkpoint - f2fs_do_sync_file // inline inode - f2fs_write_inode // inode folio is dirty - f2fs_write_checkpoint - f2fs_flush_merged_writes - f2fs_sync_node_pages - f2fs_flush_nat_entries - f2fs_fsync_node_pages // no dirty node - f2fs_need_inode_block_update // return false SPO and lost 'file A' f2fs_flush_nat_entries() sets the IS_CHECKPOINTED and HAS_LAST_FSYNC flags for the nat_entry, but this does not mean that the checkpoint has actually completed successfully. However, f2fs_need_inode_block_update() checks these flags and incorrectly assumes that the checkpoint has finished. The root cause is that the semantics of IS_CHECKPOINTED and HAS_LAST_FSYNC are only guaranteed after the checkpoint write fully completes. This patch modifies f2fs_need_inode_block_update() to acquire the sbi->node_write lock before reading the nat_entry flags, ensuring that once IS_CHECKPOINTED and HAS_LAST_FSYNC are observed to be set, the checkpoint operation has already completed. Fixes: e05df3b115e7 ("f2fs: add node operations") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02f2fs: fix inline data not being written to disk in writeback pathYongpeng Yang-1/+1
When f2fs_fiemap() is called with `fileinfo->fi_flags` containing the FIEMAP_FLAG_SYNC flag, it attempts to write data to disk before retrieving file mappings via filemap_write_and_wait(). However, there is an issue where the file does not get mapped as expected. The following scenario can occur: root@vm:/mnt/f2fs# dd if=/dev/zero of=data.3k bs=3k count=1 root@vm:/mnt/f2fs# xfs_io data.3k -c "fiemap -v 0 4096" data.3k: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..5]: 0..5 6 0x307 The root cause of this issue is that f2fs_write_single_data_page() only calls f2fs_write_inline_data() to copy data from the data folio to the inode folio, and it clears the dirty flag on the data folio. However, it does not mark the data folio as writeback. When __filemap_fdatawait_range() checks for folios with the writeback flag, it returns early, causing f2fs_fiemap() to report that the file has no mapping. To fix this issue, the solution is to call f2fs_write_single_node_folio() in f2fs_inline_data_fiemap() when getting fiemap with FIEMAP_FLAG_SYNC flags. This patch ensures that the inode folio is written back and the writeback process completes before proceeding. Cc: stable@kernel.org Fixes: 9ffe0fb5f3bb ("f2fs: handle inline data operations") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02f2fs: fix fsck inconsistency caused by FGGC of node blockYongpeng Yang-14/+13
During FGGC node block migration, fsck may incorrectly treat the migrated node block as fsync-written data. The reproduction scenario: root@vm:/mnt/f2fs# seq 1 2048 | xargs -n 1 ./test_sync // write inline inode and sync root@vm:/mnt/f2fs# rm -f 1 root@vm:/mnt/f2fs# sync root@vm:/mnt/f2fs# f2fs_io gc_range // move data block in sync mode and not write CP SPO, "fsck --dry-run" find inode has already checkpointed but still with DENT_BIT_SHIFT set The root cause is that GC does not clear the dentry mark and fsync mark during node block migration, leading fsck to misinterpret them as user-issued fsync writes. In BGGC mode, node block migration is handled by f2fs_sync_node_pages(), which guarantees the dentry and fsync marks are cleared before writing. This patch move the set/clear of the fsync|dentry marks into __write_node_folio to make the logic clearer, and ensures the fsync|dentry mark is cleared in FGGC. Cc: stable@kernel.org Fixes: da011cc0da8c ("f2fs: move node pages only in victim section during GC") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usageYongpeng Yang-9/+5
f2fs_need_dentry_mark() reads nat_entry flags without mutual exclusion with the checkpoint path, which can result in an incorrect inode block marking state. The scenario is as follows: create & write & fsync 'file A' write checkpoint - f2fs_do_sync_file // inline inode - f2fs_write_inode // inode folio is dirty - f2fs_write_checkpoint - f2fs_flush_merged_writes - f2fs_sync_node_pages - f2fs_fsync_node_pages // no dirty node - f2fs_need_inode_block_update // return true - f2fs_fsync_node_pages // inode dirtied - f2fs_need_dentry_mark //return true - f2fs_flush_nat_entries - f2fs_write_checkpoint end - __write_node_folio // inode with DENT_BIT_SHIFT set SPO, "fsck --dry-run" find inode has already checkpointed but still with DENT_BIT_SHIFT set The state observed by f2fs_need_dentry_mark() can differ from the state observed in __write_node_folio() after acquiring sbi->node_write. The root cause is that the semantics of IS_CHECKPOINTED and HAS_FSYNCED_INODE are only guaranteed after the checkpoint write has fully completed. This patch moves set_dentry_mark() into __write_node_folio() and protects it with the sbi->node_write lock. Cc: stable@kernel.org Fixes: 88bd02c9472a ("f2fs: fix conditions to remain recovery information in f2fs_sync_file") Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02f2fs: refactor node footer flag setting related codeYongpeng Yang-1/+1
This patch refactors the node footer flag setting code to simplify redundant logic and adjust function parameters and return types. No logical changes. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-04-02f2fs: refactor f2fs_move_node_folio functionYongpeng Yang-22/+32
This patch refactor the f2fs_move_node_folio() function. No logical changes. Cc: stable@kernel.org Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24f2fs: use more generic f2fs_stop_checkpoint()Chao Yu-1/+1
Let's use more generic f2fs_stop_checkpoint() instead of f2fs_handle_critical_error() to handle critical error in f2fs. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24f2fs: drop unused ri parameter from truncate_partial_nodes()Yongpeng Yang-5/+3
The ri parameter in truncate_partial_nodes() is unused. Remove it along with the related code. No logical changes. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-24f2fs: drop unused sbi parameter from f2fs_in_warm_node_list()Yongpeng Yang-2/+2
The sbi parameter in f2fs_in_warm_node_list() is not used. Remove it to simplify the function. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-03-06treewide: change inode->i_ino from unsigned long to u64Jeff Layton-5/+5
On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06vfs: widen inode hash/lookup functions to u64Jeff Layton-1/+1
Change the inode hash/lookup VFS API functions to accept u64 parameters instead of unsigned long for inode numbers and hash values. This is preparation for widening i_ino itself to u64, which will allow filesystems to store full 64-bit inode numbers on 32-bit architectures. Since unsigned long implicitly widens to u64 on all architectures, this change is backward-compatible with all existing callers. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-1-2257ad83d372@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-17f2fs: stop using writeback internals for dirty_exceeded checksKundan Kumar-2/+2
Replace direct dereferences of dirty_exceeded with the core helper bdi_wb_dirty_exceeded(), removing f2fs dependencies on writeback internals. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260213054634.79785-3-kundan.kumar@samsung.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-27f2fs: optimize NAT block loading during checkpoint writeYongpeng Yang-1/+13
Under stress tests with frequent metadata operations, checkpoint write time can become excessively long. Analysis shows that the slowdown is caused by synchronous, one-by-one reads of NAT blocks during checkpoint processing. The issue can be reproduced with the following workload: 1. seq 1 650000 | xargs -P 16 -n 1 touch 2. sync # avoid checkpoint write during deleting 3. delete 1 file every 455 files 4. echo 3 > /proc/sys/vm/drop_caches 5. sync # trigger checkpoint write This patch submits read I/O for all NAT blocks required in the __flush_nat_entry_set() phase in advance, reducing the overhead of synchronous waiting for individual NAT block reads. The NAT block flush latency before and after the change is as below: | |NAT blocks accessed|NAT blocks read|Flush time (ms)| |-------------|-------------------|---------------|---------------| |Before change|1205 |1191 |158 | |After change |1264 |1242 |11 | With a similar number of NAT blocks accessed and read from disk, adding NAT block readahead reduces the total NAT block flush time by more than 90%. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17f2fs: detect more inconsistent cases in sanity_check_node_footer()Chao Yu-3/+12
Let's enhance sanity_check_node_footer() to detect more inconsistent cases as below: Node Type Node Footer Info =================== ============================= NODE_TYPE_REGULAR inode = true and xnode = true NODE_TYPE_INODE inode = false or xnode = true NODE_TYPE_XATTR inode = true or xnode = false NODE_TYPE_NON_INODE inode = false Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17f2fs: fix to do sanity check on node footer in {read,write}_end_ioChao Yu-9/+11
-----------[ cut here ]------------ kernel BUG at fs/f2fs/data.c:358! Call Trace: <IRQ> blk_update_request+0x5eb/0xe70 block/blk-mq.c:987 blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149 blk_complete_reqs block/blk-mq.c:1224 [inline] blk_done_softirq+0x107/0x160 block/blk-mq.c:1229 handle_softirqs+0x283/0x870 kernel/softirq.c:579 __do_softirq kernel/softirq.c:613 [inline] invoke_softirq kernel/softirq.c:453 [inline] __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050 </IRQ> In f2fs_write_end_io(), it detects there is inconsistency in between node page index (nid) and footer.nid of node page. If footer of node page is corrupted in fuzzed image, then we load corrupted node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(), in where we won't do sanity check on node footer, once node page becomes dirty, we will encounter this bug after node page writeback. Cc: stable@kernel.org Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17f2fs: fix to do sanity check on node footer in __write_node_folio()Chao Yu-1/+5
Add node footer sanity check during node folio's writeback, if sanity check fails, let's shutdown filesystem to avoid looping to redirty and writeback in .writepages. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17f2fs: support non-4KB block size without packed_ssa featureDaeho Jeong-6/+6
Currently, F2FS requires the packed_ssa feature to be enabled when utilizing non-4KB block sizes (e.g., 16KB). This restriction limits the flexibility of filesystem formatting options. This patch allows F2FS to support non-4KB block sizes even when the packed_ssa feature is disabled. It adjusts the SSA calculation logic to correctly handle summary entries in larger blocks without the packed layout. Cc: stable@kernel.org Fixes: 7ee8bc3942f2 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07f2fs: fix IS_CHECKPOINTED flag inconsistency issue caused by concurrent ↵Yongpeng Yang-4/+10
atomic commit and checkpoint writes During SPO tests, when mounting F2FS, an -EINVAL error was returned from f2fs_recover_inode_page. The issue occurred under the following scenario Thread A Thread B f2fs_ioc_commit_atomic_write - f2fs_do_sync_file // atomic = true - f2fs_fsync_node_pages : last_folio = inode folio : schedule before folio_lock(last_folio) f2fs_write_checkpoint - block_operations// writeback last_folio - schedule before f2fs_flush_nat_entries : set_fsync_mark(last_folio, 1) : set_dentry_mark(last_folio, 1) : folio_mark_dirty(last_folio) - __write_node_folio(last_folio) : f2fs_down_read(&sbi->node_write)//block - f2fs_flush_nat_entries : {struct nat_entry}->flag |= BIT(IS_CHECKPOINTED) - unblock_operations : f2fs_up_write(&sbi->node_write) f2fs_write_checkpoint//return : f2fs_do_write_node_page() f2fs_ioc_commit_atomic_write//return SPO Thread A calls f2fs_need_dentry_mark(sbi, ino), and the last_folio has already been written once. However, the {struct nat_entry}->flag did not have the IS_CHECKPOINTED set, causing set_dentry_mark(last_folio, 1) and write last_folio again after Thread B finishes f2fs_write_checkpoint. After SPO and reboot, it was detected that {struct node_info}->blk_addr was not NULL_ADDR because Thread B successfully write the checkpoint. This issue only occurs in atomic write scenarios. For regular file fsync operations, the folio must be dirty. If block_operations->f2fs_sync_node_pages successfully submit the folio write, this path will not be executed. Otherwise, the f2fs_write_checkpoint will need to wait for the folio write submission to complete, as sbi->nr_pages[F2FS_DIRTY_NODES] > 0. Therefore, the situation where f2fs_need_dentry_mark checks that the {struct nat_entry}->flag /wo the IS_CHECKPOINTED flag, but the folio write has already been submitted, will not occur. Therefore, for atomic file fsync, sbi->node_write should be acquired through __write_node_folio to ensure that the IS_CHECKPOINTED flag correctly indicates that the checkpoint write has been completed. Fixes: 608514deba38 ("f2fs: set fsync mark only for the last dnode") Cc: stable@kernel.org Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07f2fs: trace elapsed time for node_write lockChao Yu-4/+5
Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01f2fs: fix to do sanity check on nat entry of quota inodeChao Yu-0/+11
As Zhiguo reported, nat entry of quota inode could be corrupted: "ino/block_addr=NULL_ADDR in nid=4 entry" We'd better to do sanity check on quota inode to detect and record nat.blk_addr inconsistency, so that we can have a chance to repair it w/ later fsck. Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-09-09f2fs: avoid unnecessary folio_clear_uptodate() for cleanupChao Yu-1/+1
In error path of __get_node_folio(), if the folio is not uptodate, let's avoid unnecessary folio_clear_uptodate() for cleanup. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-28f2fs: fix to do sanity check on node footer for non inode dnodeChao Yu-19/+39
As syzbot reported below: ------------[ cut here ]------------ kernel BUG at fs/f2fs/file.c:1243! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5354 Comm: syz.0.0 Not tainted 6.17.0-rc1-syzkaller-00211-g90d970cade8e #0 PREEMPT(full) RIP: 0010:f2fs_truncate_hole+0x69e/0x6c0 fs/f2fs/file.c:1243 Call Trace: <TASK> f2fs_punch_hole+0x2db/0x330 fs/f2fs/file.c:1306 f2fs_fallocate+0x546/0x990 fs/f2fs/file.c:2018 vfs_fallocate+0x666/0x7e0 fs/open.c:342 ksys_fallocate fs/open.c:366 [inline] __do_sys_fallocate fs/open.c:371 [inline] __se_sys_fallocate fs/open.c:369 [inline] __x64_sys_fallocate+0xc0/0x110 fs/open.c:369 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1e65f8ebe9 w/ a fuzzed image, f2fs may encounter panic due to it detects inconsistent truncation range in direct node in f2fs_truncate_hole(). The root cause is: a non-inode dnode may has the same footer.ino and footer.nid, so the dnode will be parsed as an inode, then ADDRS_PER_PAGE() may return wrong blkaddr count which may be 923 typically, by chance, dn.ofs_in_node is equal to 923, then count can be calculated to 0 in below statement, later it will trigger panic w/ f2fs_bug_on(, count == 0 || ...). count = min(end_offset - dn.ofs_in_node, pg_end - pg_start); This patch introduces a new node_type NODE_TYPE_NON_INODE, then allowing passing the new_type to sanity_check_node_footer in f2fs_get_node_folio() to detect corruption that a non-inode dnode has the same footer.ino and footer.nid. Scripts to reproduce: mkfs.f2fs -f /dev/vdb mount /dev/vdb /mnt/f2fs touch /mnt/f2fs/foo touch /mnt/f2fs/bar dd if=/dev/zero of=/mnt/f2fs/foo bs=1M count=8 umount /mnt/f2fs inject.f2fs --node --mb i_nid --nid 4 --idx 0 --val 5 /dev/vdb mount /dev/vdb /mnt/f2fs xfs_io /mnt/f2fs/foo -c "fpunch 6984k 4k" Cc: stable@kernel.org Reported-by: syzbot+b9c7ffd609c3f09416ab@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/68a68e27.050a0220.1a3988.0002.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-08-20f2fs: fix to detect potential corrupted nid in free_nid_listChao Yu-1/+16
As reported, on-disk footer.ino and footer.nid is the same and out-of-range, let's add sanity check on f2fs_alloc_nid() to detect any potential corruption in free_nid_list. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-28f2fs: directly add newly allocated pre-dirty nat entry to dirty set listwangzijie-8/+21
When we need to alloc nat entry and set it dirty, we can directly add it to dirty set list(or initialize its list_head for new_ne) instead of adding it to clean list and make a move. Introduce init_dirty flag to do it. Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-28f2fs: avoid redundant clean nat entry move in lru listwangzijie-12/+15
__lookup_nat_cache follows LRU manner to move clean nat entry, when nat entries are going to be dirty, no need to move them to tail of lru list. Introduce a parameter 'for_dirty' to avoid it. Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: fix to avoid out-of-boundary access in dnode pageChao Yu-0/+10
As Jiaming Zhang reported: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x1c1/0x2a0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x17e/0x800 mm/kasan/report.c:480 kasan_report+0x147/0x180 mm/kasan/report.c:593 data_blkaddr fs/f2fs/f2fs.h:3053 [inline] f2fs_data_blkaddr fs/f2fs/f2fs.h:3058 [inline] f2fs_get_dnode_of_data+0x1a09/0x1c40 fs/f2fs/node.c:855 f2fs_reserve_block+0x53/0x310 fs/f2fs/data.c:1195 prepare_write_begin fs/f2fs/data.c:3395 [inline] f2fs_write_begin+0xf39/0x2190 fs/f2fs/data.c:3594 generic_perform_write+0x2c7/0x910 mm/filemap.c:4112 f2fs_buffered_write_iter fs/f2fs/file.c:4988 [inline] f2fs_file_write_iter+0x1ec8/0x2410 fs/f2fs/file.c:5216 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x546/0xa90 fs/read_write.c:686 ksys_write+0x149/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x3d0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is in the corrupted image, there is a dnode has the same node id w/ its inode, so during f2fs_get_dnode_of_data(), it tries to access block address in dnode at offset 934, however it parses the dnode as inode node, so that get_dnode_addr() returns 360, then it tries to access page address from 360 + 934 * 4 = 4096 w/ 4 bytes. To fix this issue, let's add sanity check for node id of all direct nodes during f2fs_get_dnode_of_data(). Cc: stable@kernel.org Reported-by: Jiaming Zhang <r772577952@gmail.com> Closes: https://groups.google.com/g/syzkaller/c/-ZnaaOOfO3M Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to F2FS_NODE()Matthew Wilcox (Oracle)-3/+3
All callers now have a folio so pass it in Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass the nat_blk to __update_nat_bits()Matthew Wilcox (Oracle)-3/+2
The page argument is only used to look up the address of the nat_blk. Since the caller already has it, pass it in instead. Also mark it const as the nat_blk isn't modified by this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Convert get_next_nat_page() to get_next_nat_folio()Matthew Wilcox (Oracle)-10/+10
Return a folio from this function and convert its one caller. Removes a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Add folio counterparts to page_private_flags functionsMatthew Wilcox (Oracle)-5/+5
Name these new functions folio_test_f2fs_*(), folio_set_f2fs_*() and folio_clear_f2fs_*(). Convert all callers which currently have a folio and cast back to a page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to IS_INODE()Matthew Wilcox (Oracle)-8/+6
All callers now have a folio so pass it in. Also make it const to help the compiler. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to ofs_of_node()Matthew Wilcox (Oracle)-2/+2
All callers now have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to IS_DNODE()Matthew Wilcox (Oracle)-8/+7
All callers now have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to is_cold_node()Matthew Wilcox (Oracle)-6/+6
All callers now have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Add fio->folioMatthew Wilcox (Oracle)-2/+2
Put fio->page insto a union with fio->folio. This lets us remove a lot of folio->page and page->folio conversions. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to is_fsync_dnode()Matthew Wilcox (Oracle)-1/+1
Both callers have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to f2fs_recover_xattr_data()Matthew Wilcox (Oracle)-3/+3
One caller passes NULL and the other caller already has a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to cpver_of_node()Matthew Wilcox (Oracle)-1/+1
All callers have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to fill_node_footer()Matthew Wilcox (Oracle)-2/+2
All callers have a folio so pass it in. Also mark it as const to help the compiler. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to set_cold_node()Matthew Wilcox (Oracle)-2/+2
All callers have a folio so pass it in. Also mark it as const to help the compiler. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to get_nid()Matthew Wilcox (Oracle)-9/+9
All callers have a folio so pass it in. Also mark it as const to help the compiler. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to f2fs_inode_chksum_set()Matthew Wilcox (Oracle)-1/+1
All callers have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to set_fsync_mark()Matthew Wilcox (Oracle)-3/+3
All callers have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to set_dentry_mark()Matthew Wilcox (Oracle)-3/+3
All callers have a folio so pass it in. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to nid_of_node()Matthew Wilcox (Oracle)-3/+3
All callers have a folio so pass it in. Also make the argument const as the function does not modify it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-22f2fs: Pass a folio to ino_of_node()Matthew Wilcox (Oracle)-12/+12
All callers have a folio so pass it in. Also make the argument const as the function does not modify it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>