summaryrefslogtreecommitdiffstats
path: root/fs/open.c
AgeCommit message (Collapse)AuthorLines
2026-03-23fs: remove do_sys_truncateChristoph Hellwig-4/+4
do_sys_truncate ist only used to implement ksys_truncate and the native truncate syscalls. Merge do_sys_truncate into ksys_truncate and return int from it as it only returns 0 or negative errnos. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-4-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-23fs: pass on FTRUNCATE_* flags to do_truncateChristoph Hellwig-12/+10
Pass the flags one level down to replace the somewhat confusing small argument, and clean up do_truncate as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-23fs: fix archiecture-specific compat_ftruncate64Christoph Hellwig-6/+6
The "small" argument to do_sys_ftruncate indicates if > 32-bit size should be reject, but all the arch-specific compat ftruncate64 implementations get this wrong. Merge do_sys_ftruncate and ksys_ftruncate, replace the integer as boolean small flag with a descriptive one about LFS semantics, and use it correctly in the architecture-specific ftruncate64 implementations. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: 3dd681d944f6 ("arm64: 32-bit (compat) applications support") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-09Merge tag 'pull-filename' of ↵Linus Torvalds-72/+47
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'struct filename' updates from Al Viro: "[Mostly] sanitize struct filename handling" * tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits) sysfs(2): fs_index() argument is _not_ a pathname alpha: switch osf_mount() to strndup_user() ksmbd: use CLASS(filename_kernel) mqueue: switch to CLASS(filename) user_statfs(): switch to CLASS(filename) statx: switch to CLASS(filename_maybe_null) quotactl_block(): switch to CLASS(filename) chroot(2): switch to CLASS(filename) move_mount(2): switch to CLASS(filename_maybe_null) namei.c: switch user pathname imports to CLASS(filename{,_flags}) namei.c: convert getname_kernel() callers to CLASS(filename_kernel) do_f{chmod,chown,access}at(): use CLASS(filename_uflags) do_readlinkat(): switch to CLASS(filename_flags) do_sys_truncate(): switch to CLASS(filename) do_utimes_path(): switch to CLASS(filename_uflags) chdir(2): unspaghettify a bit... do_fchownat(): unspaghettify a bit... fspick(2): use CLASS(filename_flags) name_to_handle_at(): use CLASS(filename_uflags) vfs_open_tree(): use CLASS(filename_uflags) ...
2026-01-16chroot(2): switch to CLASS(filename)Al Viro-9/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16do_f{chmod,chown,access}at(): use CLASS(filename_uflags)Al Viro-6/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16do_sys_truncate(): switch to CLASS(filename)Al Viro-7/+5
Note that failures from filename_lookup() are final - ESTALE returned by it means that retry had been done by filename_lookup() and it failed there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16chdir(2): unspaghettify a bit...Al Viro-17/+10
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16do_fchownat(): unspaghettify a bit...Al Viro-16/+12
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13simplify the callers of file_open_name()Al Viro-8/+2
It accepts ERR_PTR() for name and does the right thing in that case. That allows to simplify the logics in callers, making them trivial to switch to CLASS(filename). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13do_sys_openat2(): get rid of useless check, switch to CLASS(filename)Al Viro-9/+3
do_file_open() will do the right thing when given ERR_PTR() as name... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13rename do_filp_open() to do_file_open()Al Viro-2/+2
"filp" thing never made sense; seeing that there are exactly 4 callers in the entire tree (and it's neither exported nor even declared in linux/*/*.h), there's no point keeping that ugliness. FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum decided to call the object representing an opened file 'struct filp', the last letter standing for 'position'. In all Unices, Linux included, the corresponding object had always been 'struct file'... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13do_sys_truncate(): import pathname only onceAl Viro-1/+4
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. In this case we never pass LOOKUP_EMPTY, so getname_flags() is equivalent to plain getname(). The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13chroot(2): import pathname only onceAl Viro-1/+3
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. In this case we never pass LOOKUP_EMPTY, so getname_flags() is equivalent to plain getname(). The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13chdir(2): import pathname only onceAl Viro-1/+3
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. In this case we never pass LOOKUP_EMPTY, so getname_flags() is equivalent to plain getname(). The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13do_fchownat(): import pathname only onceAl Viro-5/+6
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. Since we have the default logics for use of LOOKUP_EMPTY (passed iff AT_EMPTY_PATH is present in flags), just use getname_uflags() and don't bother with setting LOOKUP_EMPTY in lookup_flags - getname_uflags() will pass the right thing to getname_flags() and filename_lookup() doesn't care about LOOKUP_EMPTY at all. The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13do_fchmodat(): import pathname only onceAl Viro-4/+4
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. Since we have the default logics for use of LOOKUP_EMPTY (passed iff AT_EMPTY_PATH is present in flags), just use getname_uflags() and don't bother with setting LOOKUP_EMPTY in lookup_flags - getname_uflags() will pass the right thing to getname_flags() and filename_lookup() doesn't care about LOOKUP_EMPTY at all. The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13do_faccessat(): import pathname only onceAl Viro-3/+4
Convert the user_path_at() call inside a retry loop into getname_flags() + filename_lookup() + putname() and leave only filename_lookup() inside the loop. Since we have the default logics for use of LOOKUP_EMPTY (passed iff AT_EMPTY_PATH is present in flags), just use getname_uflags() and don't bother with setting LOOKUP_EMPTY in lookup_flags - getname_uflags() will pass the right thing to getname_flags() and filename_lookup() doesn't care about LOOKUP_EMPTY at all. The things could be further simplified by use of cleanup.h stuff, but let's not clutter the patch with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-12-15VFS: move dentry_create() from fs/open.c to fs/namei.cBenjamin Coddington-39/+0
To prepare knfsd's helper dentry_create(), move it to namei.c so that it can access static functions within. Callers of dentry_create() can be viewed as being mostly done with lookup, but still need to perform a few final checks. In order to use atomic_open() we want dentry_create() to be able to access: - vfs_prepare_mode - may_o_create - atomic_open .. all of which have static declarations. Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com> Link: https://patch.msgid.link/42deec53a50e1676e5501f8f1e17967d47b83681.1764259052.git.bcodding@hammerspace.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-01Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of ↵Linus Torvalds-14/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fd prepare updates from Christian Brauner: "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the common pattern of get_unused_fd_flags() + create file + fd_install() that is used extensively throughout the kernel and currently requires cumbersome cleanup paths. FD_ADD() - For simple cases where a file is installed immediately: fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device)); if (fd < 0) vfio_device_put_registration(device); return fd; FD_PREPARE() - For cases requiring access to the fd or file, or additional work before publishing: FD_PREPARE(fdf, O_CLOEXEC, sync_file->file); if (fdf.err) { fput(sync_file->file); return fdf.err; } data.fence = fd_prepare_fd(fdf); if (copy_to_user((void __user *)arg, &data, sizeof(data))) return -EFAULT; return fd_publish(fdf); The primitives are centered around struct fd_prepare. FD_PREPARE() encapsulates all allocation and cleanup logic and must be followed by a call to fd_publish() which associates the fd with the file and installs it into the caller's fdtable. If fd_publish() isn't called, both are deallocated automatically. FD_ADD() is a shorthand that does fd_publish() immediately and never exposes the struct to the caller. I've implemented this in a way that it's compatible with the cleanup infrastructure while also being usable separately. IOW, it's centered around struct fd_prepare which is aliased to class_fd_prepare_t and so we can make use of all the basica guard infrastructure" * tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) io_uring: convert io_create_mock_file() to FD_PREPARE() file: convert replace_fd() to FD_PREPARE() vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD() tty: convert ptm_open_peer() to FD_ADD() ntsync: convert ntsync_obj_get_fd() to FD_PREPARE() media: convert media_request_alloc() to FD_PREPARE() hv: convert mshv_ioctl_create_partition() to FD_ADD() gpio: convert linehandle_create() to FD_PREPARE() pseries: port papr_rtas_setup_file_interface() to FD_ADD() pseries: convert papr_platform_dump_create_handle() to FD_ADD() spufs: convert spufs_gang_open() to FD_PREPARE() papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE() spufs: convert spufs_context_open() to FD_PREPARE() net/socket: convert __sys_accept4_file() to FD_ADD() net/socket: convert sock_map_fd() to FD_ADD() net/kcm: convert kcm_ioctl() to FD_PREPARE() net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE() secretmem: convert memfd_secret() to FD_ADD() memfd: convert memfd_create() to FD_ADD() bpf: convert bpf_token_create() to FD_PREPARE() ...
2025-12-01Merge tag 'vfs-6.19-rc1.directory.delegations' of ↵Linus Torvalds-7/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull directory delegations update from Christian Brauner: "This contains the work for recall-only directory delegations for knfsd. Add support for simple, recallable-only directory delegations. This was decided at the fall NFS Bakeathon where the NFS client and server maintainers discussed how to merge directory delegation support. The approach starts with recallable-only delegations for several reasons: 1. RFC8881 has gaps that are being addressed in RFC8881bis. In particular, it requires directory position information for CB_NOTIFY callbacks, which is difficult to implement properly under Linux. The spec is being extended to allow that information to be omitted. 2. Client-side support for CB_NOTIFY still lags. The client side involves heuristics about when to request a delegation. 3. Early indication shows simple, recallable-only delegations can help performance. Anna Schumaker mentioned seeing a multi-minute speedup in xfstests runs with them enabled. With these changes, userspace can also request a read lease on a directory that will be recalled on conflicting accesses. This may be useful for applications like Samba. Users can disable leases altogether via the fs.leases-enable sysctl if needed. VFS changes: - Dedicated Type for Delegations Introduce struct delegated_inode to track inodes that may have delegations that need to be broken. This replaces the previous approach of passing raw inode pointers through the delegation breaking code paths, providing better type safety and clearer semantics for the delegation machinery. - Break parent directory delegations in open(..., O_CREAT) codepath - Allow mkdir to wait for delegation break on parent - Allow rmdir to wait for delegation break on parent - Add try_break_deleg calls for parents to vfs_link(), vfs_rename(), and vfs_unlink() - Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations on parent directory - Clean up argument list for vfs_create() - Expose delegation support to userland Filelock changes: - Make lease_alloc() take a flags argument - Rework the __break_lease API to use flags - Add struct delegated_inode - Push the S_ISREG check down to ->setlease handlers - Lift the ban on directory leases in generic_setlease NFSD changes: - Allow filecache to hold S_IFDIR files - Allow DELEGRETURN on directories - Wire up GET_DIR_DELEGATION handling Fixes: - Fix kernel-doc warnings in __fcntl_getlease - Add needed headers for new struct delegation definition" * tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: add needed headers for new struct delegation definition filelock: __fcntl_getlease: fix kernel-doc warnings vfs: expose delegation support to userland nfsd: wire up GET_DIR_DELEGATION handling nfsd: allow DELEGRETURN on directories nfsd: allow filecache to hold S_IFDIR files filelock: lift the ban on directory leases in generic_setlease vfs: make vfs_symlink break delegations on parent dir vfs: make vfs_mknod break delegations on parent directory vfs: make vfs_create break delegations on parent directory vfs: clean up argument list for vfs_create() vfs: break parent dir delegations in open(..., O_CREAT) codepath vfs: allow rmdir to wait for delegation break on parent vfs: allow mkdir to wait for delegation break on parent vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink} filelock: push the S_ISREG check down to ->setlease handlers filelock: add struct delegated_inode filelock: rework the __break_lease API to use flags filelock: make lease_alloc() take a flags argument
2025-12-01Merge tag 'vfs-6.19-rc1.guards' of ↵Linus Torvalds-6/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock lock guard updates from Christian Brauner: "This starts the work of introducing guards for superblock related locks. Introduce super_write_guard for scoped superblock write protection. This provides a guard-based alternative to the manual sb_start_write() and sb_end_write() pattern, allowing the compiler to automatically handle the cleanup" * tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: use super write guard in xfs_file_ioctl() open: use super write guard in do_ftruncate() btrfs: use super write guard in relocating_repair_kthread() ext4: use super write guard in write_mmp_block() btrfs: use super write guard in sb_start_write() btrfs: use super write guard btrfs_run_defrag_inode() btrfs: use super write guard in btrfs_reclaim_bgs_work() fs: add super_write_guard
2025-11-28open: convert do_sys_openat2() to FD_ADD()Christian Brauner-14/+3
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-13-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12fs: touch predicts in do_dentry_open()Mateusz Guzik-3/+3
Helps out some of the asm, the routine is still a mess. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251109125254.1288882-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12vfs: make vfs_create break delegations on parent directoryJeff Layton-1/+1
In order to add directory delegation support, we need to break delegations on the parent whenever there is going to be a change in the directory. Add a delegated_inode parameter to vfs_create. Most callers are converted to pass in NULL, but do_mknodat() is changed to wait for a delegation break if there is one. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-10-52f3feebb2f2@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12vfs: clean up argument list for vfs_create()Jeff Layton-3/+1
As Neil points out: "I would be in favour of dropping the "dir" arg because it is always d_inode(dentry->d_parent) which is stable." ...and... "Also *every* caller of vfs_create() passes ".excl = true". So maybe we don't need that arg at all." Drop both arguments from vfs_create() and fix up the callers. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-9-52f3feebb2f2@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12filelock: add struct delegated_inodeJeff Layton-4/+4
The current API requires a pointer to an inode pointer. It's easy for callers to get this wrong. Add a new delegated_inode structure and use that to pass back any inode that needs to be waited on. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-3-52f3feebb2f2@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05open: use super write guard in do_ftruncate()Christian Brauner-6/+3
Link: https://patch.msgid.link/20251104-work-guards-v1-7-5108ac78a171@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-03Merge tag 'pull-f_path' of ↵Linus Torvalds-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file->f_path constification from Al Viro: "Only one thing was modifying ->f_path of an opened file - acct(2). Massaging that away and constifying a bunch of struct path * arguments in functions that might be given &file->f_path ends up with the situation where we can turn ->f_path into an anon union of const struct path f_path and struct path __f_path, the latter modified only in a few places in fs/{file_table,open,namei}.c, all for struct file instances that are yet to be opened" * tag 'pull-f_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) Have cc(1) catch attempts to modify ->f_path kernel/acct.c: saner struct file treatment configfs:get_target() - release path as soon as we grab configfs_item reference apparmor/af_unix: constify struct path * arguments ovl_is_real_file: constify realpath argument ovl_sync_file(): constify path argument ovl_lower_dir(): constify path argument ovl_get_verity_digest(): constify path argument ovl_validate_verity(): constify {meta,data}path arguments ovl_ensure_verity_loaded(): constify datapath argument ksmbd_vfs_set_init_posix_acl(): constify path argument ksmbd_vfs_inherit_posix_acl(): constify path argument ksmbd_vfs_kern_path_unlock(): constify path argument ksmbd_vfs_path_lookup_locked(): root_share_path can be const struct path * check_export(): constify path argument export_operations->open(): constify path argument rqst_exp_get_by_name(): constify path argument nfs: constify path argument of __vfs_getattr() bpf...d_path(): constify path argument done_path_create(): constify path argument ...
2025-09-27Have cc(1) catch attempts to modify ->f_pathAl Viro-5/+5
There are very few places that have cause to do that - all in core VFS now, and all done to files that are not yet opened (or visible to anybody else, for that matter). Let's turn f_path into a union of struct path __f_path and const struct path f_path. It's C, not C++ - 6.5.2.3[4] in C99 and later explicitly allows that kind of type-punning. That way any attempts to bypass these checks will be either very easy to catch, or (if the bastards get sufficiently creative to make it hard to spot with grep alone) very clearly malicious - and still catchable with a bit of instrumentation for sparse. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-16allow finish_no_open(file, ERR_PTR(-E...))Al Viro-4/+6
... allowing any ->lookup() return value to be passed to it. Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-31Merge tag 'fsnotify_for_v6.17-rc1' of ↵Linus Torvalds-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A couple of small improvements for fsnotify subsystem. The most interesting is probably Amir's change modifying the meaning of fsnotify fmode bits (and I spell it out specifically because I know you care about those). There's no change for the common cases of no fsnotify watches or no permission event watches. But when there are permission watches (either for open or for pre-content events) but no FAN_ACCESS_PERM watch (which nobody uses in practice) we are now able optimize away unnecessary cache loads from the read path" * tag 'fsnotify_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: optimize FMODE_NONOTIFY_PERM for the common cases fsnotify: merge file_set_fsnotify_mode_from_watchers() with open perm hook samples: fix building fs-monitor on musl systems fanotify: sanitize handle_type values when reporting fid
2025-07-28Merge tag 'vfs-6.17-rc1.fallocate' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
2025-07-28fsnotify: merge file_set_fsnotify_mode_from_watchers() with open perm hookAmir Goldstein-3/+3
Create helper fsnotify_open_perm_and_set_mode() that moves the fsnotify_open_perm() hook into file_set_fsnotify_mode_from_watchers(). This will allow some more optimizations. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708143641.418603-2-amir73il@gmail.com
2025-06-23fs: introduce FALLOC_FL_WRITE_ZEROES to fallocateZhang Yi-0/+1
With the development of flash-based storage devices, we can quickly write zeros to SSDs using the WRITE_ZERO command if the devices do not actually write physical zeroes to the media. Therefore, we can use this command to quickly preallocate a real all-zero file with written extents. This approach should be beneficial for subsequent pure overwriting within this file, as it can save on block allocation and, consequently, significant metadata changes, which should greatly improve overwrite performance on certain filesystems. Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to fallocate. This flag is used to convert a specified range of a file to zeros by issuing a zeroing operation. Blocks should be allocated for the regions that span holes in the file, and the entire range is converted to written extents. If the underlying device supports the actual offload write zeroes command, the process of zeroing out operation can be accelerated. If it does not, we currently don't prevent the file system from writing actual zeros to the device. This provides users with a new method to quickly generate a zeroed file, users no longer need to write zero data to create a file with written extents. Users can determine whether a disk supports the unmap write zeroes feature through querying this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes Users can also enable or disable the unmap write zeroes operation through this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes Finally, this flag cannot be specified in conjunction with the FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is not permitted. In addition, filesystems that always require out-of-place writes should not support this flag since they still need to allocated new blocks during subsequent overwrites. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16don't duplicate vfs_open() in kernel_file_open()Al Viro-4/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250615003216.GB3011112@ZenIV Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15fs/open: make do_truncate() killableMax Kellermann-1/+4
Allows killing processes that are waiting for the inode lock. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/20250513150327.1373061-3-max.kellermann@ionos.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15fs/open: make chmod_common() and chown_common() killableMax Kellermann-2/+7
Allows killing processes that are waiting for the inode lock. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/20250513150327.1373061-2-max.kellermann@ionos.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-24Merge tag 'vfs-6.15-rc1.file' of ↵Linus Torvalds-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains performance improvements for struct file's new refcount mechanism and various other performance work: - The stock kernel transitioning the file to no refs held penalizes the caller with an extra atomic to block any increments. For cases where the file is highly likely to be going away this is easily avoidable. Add file_ref_put_close() to better handle the common case where closing a file descriptor also operates on the last reference and build fput_close_sync() and fput_close() on top of it. This brings about 1% performance improvement by eliding one atomic in the common case. - Predict no error in close() since the vast majority of the time system call returns 0. - Reduce the work done in fdget_pos() by predicting that the file was found and by explicitly comparing the reference count to one and ignoring the dead zone" * tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce work in fdget_pos() fs: use fput_close() in path_openat() fs: use fput_close() in filp_close() fs: use fput_close_sync() in close() file: add fput and file_ref_put routines optimized for use when closing a fd fs: predict no error in close()
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds-14/+15
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-20fs: tidy up do_sys_openat2() with likely/unlikelyMateusz Guzik-4/+5
Otherwise gcc 13 generates conditional forward jumps (aka branch mispredict by default) for build_open_flags() being succesfull. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250320092331.1921700-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: use fput_close() in filp_close()Mateusz Guzik-1/+1
When tracing a kernel build over refcounts seen this is a wash: @[kprobe:filp_close]: [0] 32195 |@@@@@@@@@@ | [1] 164567 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| I verified vast majority of the skew comes from do_close_on_exec() which could be changed to use a different variant instead. Even without changing that, the 19.5% of calls which got here still can save the extra atomic. Calls here are borderline non-existent compared to fput (over 3.2 mln!), so they should not negatively affect scalability. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-4-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: use fput_close_sync() in close()Mateusz Guzik-1/+1
This bumps open+close rate by 1% on Sapphire Rapids by eliding one atomic. It would be higher if it was not for several other slowdowns of the same nature. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05fs: predict no error in close()Mateusz Guzik-4/+7
Vast majority of the time the system call returns 0. Letting the compiler know shortens the routine (119 -> 116) and the fast path. Disasm starting at the call to __fput_sync(): before: <+55>: call 0xffffffff816b0da0 <__fput_sync> <+60>: lea 0x201(%rbx),%eax <+66>: cmp $0x1,%eax <+69>: jbe 0xffffffff816ab707 <__x64_sys_close+103> <+71>: mov %ebx,%edx <+73>: movslq %ebx,%rax <+76>: and $0xfffffffd,%edx <+79>: cmp $0xfffffdfc,%edx <+85>: mov $0xfffffffffffffffc,%rdx <+92>: cmove %rdx,%rax <+96>: pop %rbx <+97>: pop %rbp <+98>: jmp 0xffffffff82242fa0 <__x86_return_thunk> <+103>: mov $0xfffffffffffffffc,%rax <+110>: jmp 0xffffffff816ab700 <__x64_sys_close+96> <+112>: mov $0xfffffffffffffff7,%rax <+119>: jmp 0xffffffff816ab700 <__x64_sys_close+96> after: <+56>: call 0xffffffff816b0da0 <__fput_sync> <+61>: xor %eax,%eax <+63>: test %ebp,%ebp <+65>: jne 0xffffffff816ab6ea <__x64_sys_close+74> <+67>: pop %rbx <+68>: pop %rbp <+69>: jmp 0xffffffff82242fa0 <__x86_return_thunk> # the jmp out <+74>: lea 0x201(%rbp),%edx <+80>: mov $0xfffffffffffffffc,%rax <+87>: cmp $0x1,%edx <+90>: jbe 0xffffffff816ab6e3 <__x64_sys_close+67> <+92>: mov %ebp,%edx <+94>: and $0xfffffffd,%edx <+97>: cmp $0xfffffdfc,%edx <+103>: cmovne %rbp,%rax <+107>: jmp 0xffffffff816ab6e3 <__x64_sys_close+67> <+109>: mov $0xfffffffffffffff7,%rax <+116>: jmp 0xffffffff816ab6e3 <__x64_sys_close+67> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250301104356.246031-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21open: Fix return type of several functions from long to intYuichiro Tsuji-10/+10
Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-2-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07fsnotify: disable notification by default for all pseudo filesAmir Goldstein-2/+2
Most pseudo files are not applicable for fsnotify events at all, let alone to the new pre-content events. Disable notifications to all files allocated with alloc_file_pseudo() and enable legacy inotify events for the specific cases of pipe and socket, which have known users of inotify events. Pre-content events are also kept disabled for sockets and pipes. Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches") Reported-by: Alex Williamson <alex.williamson@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wi2pThSVY=zhO=ZKxViBj5QCRX-=AS2+rVknQgJnHXDFg@mail.gmail.com/ Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07fsnotify: use accessor to set FMODE_NONOTIFY_*Amir Goldstein-3/+4
The FMODE_NONOTIFY_* bits are a 2-bits mode. Open coding manipulation of those bits is risky. Use an accessor file_set_fsnotify_mode() to set the mode. Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers() to make way for the simple accessor name. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-25lib/list_debug.c: add object information in case of invalid objectManinder Singh-1/+1
As of now during link list corruption it prints about cluprit address and its wrong value, but sometime it is not enough to catch the actual issue point. If it prints allocation and free path of that corrupted node, it will be a lot easier to find and fix the issues. Adding the same information when data mismatch is found in link list debug data: [ 14.243055] slab kmalloc-32 start ffff0000cda19320 data offset 32 pointer offset 8 size 32 allocated at add_to_list+0x28/0xb0 [ 14.245259] __kmalloc_cache_noprof+0x1c4/0x358 [ 14.245572] add_to_list+0x28/0xb0 ... [ 14.248632] do_el0_svc_compat+0x1c/0x34 [ 14.249018] el0_svc_compat+0x2c/0x80 [ 14.249244] Free path: [ 14.249410] kfree+0x24c/0x2f0 [ 14.249724] do_force_corruption+0xbc/0x100 ... [ 14.252266] el0_svc_common.constprop.0+0x40/0xe0 [ 14.252540] do_el0_svc_compat+0x1c/0x34 [ 14.252763] el0_svc_compat+0x2c/0x80 [ 14.253071] ------------[ cut here ]------------ [ 14.253303] list_del corruption. next->prev should be ffff0000cda192a8, but was 6b6b6b6b6b6b6b6b. (next=ffff0000cda19348) [ 14.254255] WARNING: CPU: 3 PID: 84 at lib/list_debug.c:65 __list_del_entry_valid_or_report+0x158/0x164 Moved prototype of mem_dump_obj() to bug.h, as mm.h can not be included in bug.h. Link: https://lkml.kernel.org/r/20241230101043.53773-1-maninder1.s@samsung.com Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-23Merge tag 'fsnotify_hsm_for_v6.14-rc1' of ↵Linus Torvalds-15/+47
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...