summaryrefslogtreecommitdiffstats
path: root/fs/inode.c
AgeCommit message (Collapse)AuthorLines
2026-04-21Merge tag 'pull-dcache-busy-wait' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache busy loop updates from Al Viro: "Fix livelocks in shrink_dcache_tree() If shrink_dcache_tree() finds a dentry in the middle of being killed by another thread, it has to wait until the victim finishes dying, gets detached from the tree and ceases to pin its parent. The way we used to deal with that amounted to busy-wait; unfortunately, it's not just inefficient but can lead to reliably reproducible hard livelocks. Solved by having shrink_dentry_tree() attach a completion to such dentry, with dentry_unlist() calling complete() on all objects attached to it. With a bit of care it can be done without growing struct dentry or adding overhead in normal case" * tag 'pull-dcache-busy-wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: get rid of busy-waiting in shrink_dcache_tree() dcache.c: more idiomatic "positives are not allowed" sanity checks struct dentry: make ->d_u anonymous for_each_alias(): helper macro for iterating through dentries of given inode
2026-04-13Merge tag 'vfs-7.1-rc1.bh.metadata' of ↵Linus Torvalds-15/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs buffer_head updates from Christian Brauner: "This cleans up the mess that has accumulated over the years in metadata buffer_head tracking for inodes. It moves the tracking into dedicated structure in filesystem-private part of the inode (so that we don't use private_list, private_data, and private_lock in struct address_space), and also moves couple other users of private_data and private_list so these are removed from struct address_space saving 3 longs in struct inode for 99% of inodes" * tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) fs: Drop i_private_list from address_space fs: Drop mapping_metadata_bhs from address space ext4: Track metadata bhs in fs-private inode part minix: Track metadata bhs in fs-private inode part udf: Track metadata bhs in fs-private inode part fat: Track metadata bhs in fs-private inode part bfs: Track metadata bhs in fs-private inode part affs: Track metadata bhs in fs-private inode part ext2: Track metadata bhs in fs-private inode part fs: Provide functions for handling mapping_metadata_bhs directly fs: Switch inode_has_buffers() to take mapping_metadata_bhs fs: Make bhs point to mapping_metadata_bhs fs: Move metadata bhs tracking to a separate struct fs: Fold fsync_buffers_list() into sync_mapping_buffers() fs: Drop osync_buffers_list() kvm: Use private inode list instead of i_private_list fs: Remove i_private_data aio: Stop using i_private_data and i_private_lock hugetlbfs: Stop using i_private_data fs: Stop using i_private_data for metadata bh tracking ...
2026-04-02struct dentry: make ->d_u anonymousAl Viro-1/+1
Making ->d_rcu and (then) ->d_child overlapping dates back to 2006; anon unions support had been added to gcc only in 4.6 (2011) and the minimal gcc version hadn't been bumped to that until 4.19 (2018). These days there's no reason not to keep that union named. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-03-26fs: Drop i_private_list from address_spaceJan Kara-2/+0
Nobody is using i_private_list anymore. Remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-84-jack@suse.cz Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Drop mapping_metadata_bhs from address spaceJan Kara-3/+0
Nobody uses mapping_metadata_bhs in struct address_space anymore. Just remove it and with it all helper functions using it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-83-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Make bhs point to mapping_metadata_bhsJan Kara-0/+1
Make buffer heads point to mapping_metadata_bhs instead of struct address_space. This makes the code more self contained. For the (only) case of IO error handling where we really need to reach struct address_space add a pointer to the mapping from mapping_metadata_bhs. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-73-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Move metadata bhs tracking to a separate structJan Kara-0/+2
Instead of tracking metadata bhs for a mapping using i_private_list and i_private_lock create a dedicated mapping_metadata_bhs struct for it. So far this struct is embedded in address_space but that will be switched for per-fs private inode parts later in the series. This also changes the locking from bdev mapping's i_private_lock to a new lock embedded in mapping_metadata_bhs to untangle the i_private_lock locking for maintaining lists of metadata bhs and the locking for looking up / reclaiming bdev's buffer heads. The locking in remove_assoc_map() gets more complex due to this but overall this looks like a reasonable tradeoff. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-72-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Remove i_private_dataJan Kara-1/+0
Nobody is using it anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-68-jack@suse.cz Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Ignore inode metadata buffers in inode_lru_isolate()Jan Kara-12/+9
There are only a few filesystems that use generic tracking of inode metadata buffer heads. As such the logic to reclaim tracked metadata buffer heads in inode_lru_isolate() doesn't bring a benefit big enough to justify intertwining of inode reclaim and metadata buffer head tracking. Just treat tracked metadata buffer heads as any other metadata filesystem has to properly clean up on inode eviction and stop handling it in inode_lru_isolate(). As a result filesystems using generic tracking of metadata buffer heads may now see dirty metadata buffers in their .evict methods more often which can slow down inode reclaim but given these filesystems aren't used in performance demanding setups we should be fine. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-64-jack@suse.cz Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06treewide: change inode->i_ino from unsigned long to u64Jeff Layton-7/+6
On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06vfs: widen inode hash/lookup functions to u64Jeff Layton-18/+18
Change the inode hash/lookup VFS API functions to accept u64 parameters instead of unsigned long for inode numbers and hash values. This is preparation for widening i_ino itself to u64, which will allow filesystems to store full 64-bit inode numbers on 32-bit architectures. Since unsigned long implicitly widens to u64 on all architectures, this change is backward-compatible with all existing callers. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-1-2257ad83d372@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-12Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds-0/+9
Pull fsverity updates from Eric Biggers: "fsverity cleanups, speedup, and memory usage optimization from Christoph Hellwig: - Move some logic into common code - Fix btrfs to reject truncates of fsverity files - Improve the readahead implementation - Store each inode's fsverity_info in a hash table instead of using a pointer in the filesystem-specific part of the inode. This optimizes for memory usage in the usual case where most files don't have fsverity enabled. - Look up the fsverity_info fewer times during verification, to amortize the hash table overhead" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: remove inode from fsverity_verification_ctx fsverity: use a hashtable to find the fsverity_info btrfs: consolidate fsverity_info lookup f2fs: consolidate fsverity_info lookup ext4: consolidate fsverity_info lookup fs: consolidate fsverity_info lookup in buffer.c fsverity: push out fsverity_info lookup fsverity: deconstify the inode pointer in struct fsverity_info fsverity: kick off hash readahead at data I/O submission time ext4: move ->read_folio and ->readahead to readpage.c readahead: push invalidate_lock out of page_cache_ra_unbounded fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio fsverity: start consolidating pagecache code fsverity: pass struct file to ->write_merkle_tree_block f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY fs,fsverity: clear out fsverity_info from common code fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-09Merge tag 'vfs-7.0-rc1.misc' of ↵Linus Torvalds-34/+59
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains a mix of VFS cleanups, performance improvements, API fixes, documentation, and a deprecation notice. Scalability and performance: - Rework pid allocation to only take pidmap_lock once instead of twice during alloc_pid(), improving thread creation/teardown throughput by 10-16% depending on false-sharing luck. Pad the namespace refcount to reduce false-sharing - Track file lock presence via a flag in ->i_opflags instead of reading ->i_flctx, avoiding false-sharing with ->i_readcount on open/close hot paths. Measured 4-16% improvement on 24-core open-in-a-loop benchmarks - Use a consume fence in locks_inode_context() to match the store-release/load-consume idiom, eliminating a hardware fence on some architectures - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent false-sharing - Remove a redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu() that never fires since the caller already verifies it, eliminating a 100% mispredicted branch - Fix a 100% mispredicted likely() in devcgroup_inode_permission() that became wrong after a prior code reorder Bug fixes and correctness: - Make insert_inode_locked() wait for inode destruction instead of skipping, fixing a corner case where two matching inodes could exist in the hash - Move f_mode initialization before file_ref_init() in alloc_file() to respect the SLAB_TYPESAFE_BY_RCU ordering contract - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with no buffers attached, preventing a null pointer dereference when AS_RELEASE_ALWAYS is set but no release_folio op exists - Fix select restart_block to store end_time as timespec64, avoiding truncation of tv_sec on 32-bit architectures - Make dump_inode() use get_kernel_nofault() to safely access inode and superblock fields, matching the dump_mapping() pattern API modernization: - Make posix_acl_to_xattr() allocate the buffer internally since every single caller was doing it anyway. Reduces boilerplate and unnecessary error checking across ~15 filesystems - Replace deprecated simple_strtoul() with kstrtoul() for the ihash_entries, dhash_entries, mhash_entries, and mphash_entries boot parameters, adding proper error handling - Convert chardev code to use guard(mutex) and __free(kfree) cleanup patterns - Replace min_t() with min() or umin() in VFS code to avoid silently truncating unsigned long to unsigned int - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers already check the flag Deprecation: - Begin deprecating legacy BSD process accounting (acct(2)). The interface has numerous footguns and better alternatives exist (eBPF) Documentation: - Fix and complete kernel-doc for struct export_operations, removing duplicated documentation between ReST and source - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait() Testing: - Add a kunit test for initramfs cpio handling of entries with filesize > PATH_MAX Misc: - Add missing <linux/init_task.h> include in fs_struct.c" * tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) posix_acl: make posix_acl_to_xattr() alloc the buffer fs: make insert_inode_locked() wait for inode destruction initramfs_test: kunit test for cpio.filesize > PATH_MAX fs: improve dump_inode() to safely access inode fields fs: add <linux/init_task.h> for 'init_fs' docs: exportfs: Use source code struct documentation fs: move initializing f_mode before file_ref_init() exportfs: Complete kernel-doc for struct export_operations exportfs: Mark struct export_operations functions at kernel-doc exportfs: Fix kernel-doc output for get_name() acct(2): begin the deprecation of legacy BSD process accounting device_cgroup: remove branch hint after code refactor VFS: fix __start_dirop() kernel-doc warnings fs: Describe @isnew parameter in ilookup5_nowait() fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS select: store end_time as timespec64 in restart block chardev: Switch to guard(mutex) and __free(kfree) namespace: Replace simple_strtoul with kstrtoul to parse boot params dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries ...
2026-02-09Merge tag 'vfs-7.0-rc1.nonblocking_timestamps' of ↵Linus Torvalds-92/+110
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This contains the changes to support non-blocking timestamp updates. Since commit 66fa3cedf16a ("fs: Add async write file modification handling") file_update_time_flags() unconditionally returns -EAGAIN when any timestamp needs updating and IOCB_NOWAIT is set. This makes non-blocking direct writes impossible on file systems with granular enough timestamps, which in practice means all of them. This reworks the timestamp update path to propagate IOCB_NOWAIT through ->update_time so that file systems which can update timestamps without blocking are no longer penalized. With that groundwork in place, the core change passes IOCB_NOWAIT into ->update_time and returns -EAGAIN only when the file system indicates it would block. XFS implements non-blocking timestamp updates by using the new ->sync_lazytime and open-coding generic_update_time without the S_NOWAIT check, since the lazytime path through the generic helpers can never block in XFS" * tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: enable non-blocking timestamp updates xfs: implement ->sync_lazytime fs: refactor file_update_time_flags fs: add support for non-blocking timestamp updates fs: add a ->sync_lazytime method fs: factor out a sync_lazytime helper fs: refactor ->update_time handling fat: cleanup the flags for fat_truncate_time nfs: split nfs_update_timestamps fs: allow error returns from generic_update_time fs: remove inode_update_time
2026-01-29fs,fsverity: clear out fsverity_info from common codeChristoph Hellwig-0/+9
Free the fsverity_info directly in clear_inode instead of requiring file systems to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260128152630.627409-3-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-14fs: make insert_inode_locked() wait for inode destructionMateusz Guzik-17/+24
This is the only routine which instead skipped instead of waiting. The current behavior is arguably a bug as it results in a corner case where the inode hash can have *two* matching inodes, one of which is on its way out. Ironing out this difference is an incremental step towards sanitizing the API. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20260114094717.236202-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14fs: improve dump_inode() to safely access inode fieldsYuto Ohnuki-13/+34
Use get_kernel_nofault() to safely access inode and related structures (superblock, file_system_type) to avoid crashing when the inode pointer is invalid. This allows the same pattern as dump_mapping(). Note: The original access method for i_state and i_count is preserved, as get_kernel_nofault() is unnecessary once the inode structure is verified accessible. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Link: https://patch.msgid.link/20260112181443.81286-1-ytohnuki@amazon.com Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: refactor file_update_time_flagsChristoph Hellwig-16/+15
Split all the inode timestamp flags into a helper. This not only makes the code a bit more readable, but also optimizes away the further checks as soon as know we need an update. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-10-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: add support for non-blocking timestamp updatesChristoph Hellwig-11/+34
Currently file_update_time_flags unconditionally returns -EAGAIN if any timestamp needs to be updated and IOCB_NOWAIT is passed. This makes non-blocking direct writes impossible on file systems with granular enough timestamps. Pass IOCB_NOWAIT to ->update_time and return -EAGAIN if it could block. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-9-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: factor out a sync_lazytime helperChristoph Hellwig-4/+1
Centralize how we synchronize a lazytime update into the actual on-disk timestamp into a single helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-7-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: refactor ->update_time handlingChristoph Hellwig-64/+70
Pass the type of update (atime vs c/mtime plus version) as an enum instead of a set of flags that caused all kinds of confusion. Because inode_update_timestamps now can't return a modified version of those flags, return the I_DIRTY_* flags needed to persist the update, which is what the main caller in generic_update_time wants anyway, and which is suitable for the other callers that only want to know if an update happened. The whole update_time path keeps the flags argument, which will be used to support non-blocking updates soon even if it is unused, and (the slightly renamed) inode_update_time also gains the possibility to return a negative errno to support this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-6-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: allow error returns from generic_update_timeChristoph Hellwig-2/+2
Now that no caller looks at the updated flags, switch generic_update_time to the same calling convention as the ->update_time method and return 0 or a negative errno. This prepares for adding non-blocking timestamp updates that could return -EAGAIN. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: remove inode_update_timeChristoph Hellwig-15/+8
The only external user is gone now, open code it in the two VFS callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-06fs: Describe @isnew parameter in ilookup5_nowait()Bagas Sanjaya-0/+3
Sphinx reports kernel-doc warning: WARNING: ./fs/inode.c:1607 function parameter 'isnew' not described in 'ilookup5_nowait' Describe the parameter. Fixes: a27628f4363435 ("fs: rework I_NEW handling to operate without fences") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20251219024620.22880-2-bagasdotme@gmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-24fs: Replace simple_strtoul with kstrtoul in set_ihash_entriesThorsten Blum-4/+1
Replace simple_strtoul() with the recommended kstrtoul() for parsing the 'ihash_entries=' boot parameter. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing valid values, and removes use of the deprecated simple_strtoul() helper. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20251218112144.225301-2-thorsten.blum@linux.dev Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-24fs: Describe @isnew parameter in ilookup5_nowait()Bagas Sanjaya-0/+3
Sphinx reports kernel-doc warning: WARNING: ./fs/inode.c:1607 function parameter 'isnew' not described in 'ilookup5_nowait' Describe the parameter. Fixes: a27628f4363435 ("fs: rework I_NEW handling to operate without fences") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20251219024620.22880-2-bagasdotme@gmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-03fs: assert on I_FREEING not being set in iput() and iput_not_last()Mateusz Guzik-1/+2
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251201132037.22835-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-01Merge tag 'vfs-6.19-rc1.inode' of ↵Linus Torvalds-105/+142
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "Features: - Hide inode->i_state behind accessors. Open-coded accesses prevent asserting they are done correctly. One obvious aspect is locking, but significantly more can be checked. For example it can be detected when the code is clearing flags which are already missing, or is setting flags when it is illegal (e.g., I_FREEING when ->i_count > 0) - Provide accessors for ->i_state, converts all filesystems using coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2, overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to compile - Rework I_NEW handling to operate without fences, simplifying the code after the accessor infrastructure is in place Cleanups: - Move wait_on_inode() from writeback.h to fs.h - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb for clarity - Cosmetic fixes to LRU handling - Push list presence check into inode_io_list_del() - Touch up predicts in __d_lookup_rcu() - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage - Assert on ->i_count in iput_final() - Assert ->i_lock held in __iget() Fixes: - Add missing fences to I_NEW handling" * tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) dcache: touch up predicts in __d_lookup_rcu() fs: push list presence check into inode_io_list_del() fs: cosmetic fixes to lru handling fs: rework I_NEW handling to operate without fences fs: make plain ->i_state access fail to compile xfs: use the new ->i_state accessors nilfs2: use the new ->i_state accessors overlayfs: use the new ->i_state accessors gfs2: use the new ->i_state accessors f2fs: use the new ->i_state accessors smb: use the new ->i_state accessors ceph: use the new ->i_state accessors btrfs: use the new ->i_state accessors Manual conversion to use ->i_state accessors of all places not covered by coccinelle Coccinelle-based conversion to use ->i_state accessors fs: provide accessors for ->i_state fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb fs: move wait_on_inode() from writeback.h to fs.h fs: add missing fences to I_NEW handling ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage ...
2025-12-01Merge tag 'vfs-6.19-rc1.misc' of ↵Linus Torvalds-39/+19
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...
2025-11-26fs: lift the FMODE_NOCMTIME check into file_update_time_flagsChristoph Hellwig-2/+2
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based "invisible I/O", but commit e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") started using it from generic callers. I'm not sure other file systems are actually read for this in general, so the above commit should get a closer look, but for it to make any sense, file_update_time needs to respect the flag. Lift the check from file_modified_flags to file_update_time so that users of file_update_time inherit the behavior and so that all the checks are done in one place. Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26fs: refactor file timestamp update logicChristoph Hellwig-37/+17
Currently the two high-level APIs use two helper functions to implement almost all of the logic. Refactor the two helpers and the common logic into a new file_update_time_flags routine that gets the iocb flags or 0 in case of file_update_time passed so that the entire logic is contained in a single function and can be easily understood and modified. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25fs: push list presence check into inode_io_list_del()Mateusz Guzik-3/+1
For consistency with sb routines. ext4 is the only consumer outside of evict(). Damage-controlling it is outside of the scope of this cleanup. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251103230911.516866-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25fs: cosmetic fixes to lru handlingMateusz Guzik-24/+26
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and inode_add_lru(). move it up 2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is needed 3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself (inode_lru_list_del()) and similar routines for sb and io list management 4. push list presence check into inode_lru_list_del(), just like sb and io list Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25fs: rework I_NEW handling to operate without fencesMateusz Guzik-37/+61
In the inode hash code grab the state while ->i_lock is held. If found to be set, synchronize the sleep once more with the lock held. In the real world the flag is not set most of the time. Apart from being simpler to reason about, it comes with a minor speed up as now clearing the flag does not require the smp_mb() fence. While here rename wait_on_inode() to wait_on_new_inode() to line it up with __wait_on_freeing_inode(). Christian Brauner <brauner@kernel.org> says: As per the discussion in [1] I folded in the diff sent in [2]. Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1] Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12fs: add iput_not_last()Mateusz Guzik-0/+12
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251105212025.807549-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Manual conversion to use ->i_state accessors of all places not covered by ↵Mateusz Guzik-10/+8
coccinelle Nothing to look at apart from iput_final(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Coccinelle-based conversion to use ->i_state accessorsMateusz Guzik-46/+46
All places were patched by coccinelle with the default expecting that ->i_lock is held, afterwards entries got fixed up by hand to use unlocked variants as needed. The script: @@ expression inode, flags; @@ - inode->i_state & flags + inode_state_read(inode) & flags @@ expression inode, flags; @@ - inode->i_state &= ~flags + inode_state_clear(inode, flags) @@ expression inode, flag1, flag2; @@ - inode->i_state &= ~flag1 & ~flag2 + inode_state_clear(inode, flag1 | flag2) @@ expression inode, flags; @@ - inode->i_state |= flags + inode_state_set(inode, flags) @@ expression inode, flags; @@ - inode->i_state = flags + inode_state_assign(inode, flags) @@ expression inode, flags; @@ - flags = inode->i_state + flags = inode_state_read(inode) @@ expression inode, flags; @@ - READ_ONCE(inode->i_state) & flags + inode_state_read(inode) & flags Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20fs: add missing fences to I_NEW handlingMateusz Guzik-0/+8
Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5()) and unlock_new_inode(). In principle the latter can clear the I_NEW flag before prior stores into the inode were made visible. The former can in turn observe I_NEW is cleared and proceed to use the inode, while possibly reading from not-yet-published areas. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20fs: assert on ->i_count in iput_final()Mateusz Guzik-0/+7
Notably make sure the count is 0 after the return from ->drop_inode(), provided we are going to drop. Inspired by suspicious games played by f2fs. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds-20/+70
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-19fs: add might_sleep() annotation to iput() and moreMax Kellermann-0/+18
When iput() drops the reference counter to zero, it may sleep via inode_wait_for_writeback(). This happens rarely because it's usually the dcache which evicts inodes, but really iput() should only ever be called in contexts where sleeping is allowed. This annotation allows finding buggy callers. Additionally, this patch annotates a few low-level functions that can call iput() conditionally. Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/20250917153632.2228828-1-max.kellermann@ionos.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik-3/+3
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15fs: expand dump_inode()Mateusz Guzik-1/+9
This adds fs name and few fields from struct inode: i_mode, i_opflags, i_flags, i_state and i_count. All values printed raw, no attempt to pretty-print anything. Compile tested on i386 and runtime tested on amd64. Sample output: [ 23.121281] VFS_WARN_ON_INODE("crap") encountered for inode ffff9a1a83ce3660 fs pipefs mode 10600 opflags 0x4 flags 0x0 state 0x38 count 0 Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
2025-09-15fs: use the switch statement in init_special_inode()Mateusz Guzik-6/+13
Similar to may_open(). No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01inode: fix whitespace issuesChristian Brauner-5/+5
Fix two minor whitespace issues. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: add an icount_read helperJosef Bacik-4/+4
Instead of doing direct access to ->i_count, add a helper to handle this. This will make it easier to convert i_count to a refcount later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: rework iput logicJosef Bacik-11/+35
Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible. We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference. Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above. Co-developed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/be208b89bdb650202e712ce2bcfc407ac7044c7a.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-11vfs: show filesystem name at dump_inode()Tetsuo Handa-1/+1
Commit 8b17e540969a ("vfs: add initial support for CONFIG_DEBUG_VFS") added dump_inode(), but dump_inode() currently reports only raw pointer address. Comment says that adding a proper inode dumping routine is a TODO. However, syzkaller concurrently tests multiple filesystems, and several filesystems started calling dump_inode() due to hitting VFS_BUG_ON_INODE() added by commit af153bb63a33 ("vfs: catch invalid modes in may_open()") before a proper inode dumping routine is implemented. Show filesystem name at dump_inode() so that we can find which filesystem has passed an invalid mode to may_open() from syzkaller's crash reports. Link: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/ceaf4021-65cc-422e-9d0e-6afa18dd8276@I-love.SAKURA.ne.jp Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-11fs: mark file_remove_privs_flags staticChristoph Hellwig-2/+1
file_remove_privs_flags is only used inside of inode.c, mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250724074854.3316911-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-10vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()Jan Kara-2/+2
evict_inodes() uses list_for_each_entry_safe() to iterate sb->s_inodes list. However, since we use i_lru list entry for our local temporary list of inodes to destroy, the inode is guaranteed to stay in sb->s_inodes list while we hold sb->s_inode_list_lock. So there is no real need for safe iteration variant and we can use list_for_each_entry() just fine. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/20250709090635.26319-2-jack@suse.cz Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>