| Age | Commit message (Collapse) | Author | Lines |
|
Pull fsverity fixes from Eric Biggers:
- Fix a build error on parisc
- Remove the non-large-folio-aware function fsverity_verify_page()
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: fix build error by adding fsverity_readahead() stub
fsverity: remove fsverity_verify_page()
f2fs: make f2fs_verify_cluster() partially large-folio-aware
f2fs: remove unnecessary ClearPageUptodate in f2fs_verify_cluster()
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
f2fs_verify_cluster() is the only remaining caller of the
non-large-folio-aware function fsverity_verify_page(). To unblock the
removal of that function, change f2fs_verify_cluster() to verify the
entire folio of each page and mark it up-to-date.
Note that this doesn't actually make f2fs_verify_cluster()
large-folio-aware, as it is still passed an array of pages. Currently,
it's never called with large folios.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Remove the unnecessary clearing of PG_uptodate. It's guaranteed to
already be clear.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260218010630.7407-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull more misc vfs updates from Christian Brauner:
"Features:
- Optimize close_range() from O(range size) to O(active FDs) by using
find_next_bit() on the open_fds bitmap instead of linearly scanning
the entire requested range. This is a significant improvement for
large-range close operations on sparse file descriptor tables.
- Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable
via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only.
Add tracepoints for fs-verity enable and verify operations,
replacing the previously removed debug printk's.
- Prevent nfsd from exporting special kernel filesystems like pidfs
and nsfs. These filesystems have custom ->open() and ->permission()
export methods that are designed for open_by_handle_at(2) only and
are incompatible with nfsd. Update the exportfs documentation
accordingly.
Fixes:
- Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used
on a non-null-terminated decrypted directory entry name from
fscrypt. This triggered on encrypted lower layers when the
decrypted name buffer contained uninitialized tail data.
The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and
name_is_dot_dotdot() helpers, replacing various open-coded "." and
".." checks across the tree.
- Fix read-only fsflags not being reset together with xflags in
vfs_fileattr_set(). Currently harmless since no read-only xflags
overlap with flags, but this would cause inconsistencies for any
future shared read-only flag
- Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the
target process is in a different pid namespace. This lets userspace
distinguish "process exited" from "process in another namespace",
matching glibc's pidfd_getpid() behavior
Cleanups:
- Use C-string literals in the Rust seq_file bindings, replacing the
kernel::c_str!() macro (available since Rust 1.77)
- Fix typo in d_walk_ret enum comment, add porting notes for the
readlink_copy() calling convention change"
* tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add porting notes about readlink_copy()
pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns
nfsd: do not allow exporting of special kernel filesystems
exportfs: clarify the documentation of open()/permission() expotrfs ops
fsverity: add tracepoints
fs: add FS_XFLAG_VERITY for fs-verity files
rust: seq_file: replace `kernel::c_str!` with C-Strings
fs: dcache: fix typo in enum d_walk_ret comment
ovl: use name_is_dot* helpers in readdir code
fs: add helpers name_is_dot{,dot,_dotdot}
ovl: Fix uninit-value in ovl_fill_real
fs: reset read-only fsflags together with xflags
fs/file: optimize close_range() complexity from O(N) to O(Sparse)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this development cycle, we focused on several key performance
optimizations:
- introducing large folio support to enhance read speeds for
immutable files
- reducing checkpoint=enable latency by flushing only committed dirty
pages
- implementing tracepoints to diagnose and resolve lock priority
inversion.
Additionally, we introduced the packed_ssa feature to optimize the SSA
footprint when utilizing large block sizes.
Detail summary:
Enhancements:
- support large folio for immutable non-compressed case
- support non-4KB block size without packed_ssa feature
- optimize f2fs_enable_checkpoint() to avoid long delay
- optimize f2fs_overwrite_io() for f2fs_iomap_begin
- optimize NAT block loading during checkpoint write
- add write latency stats for NAT and SIT blocks in
f2fs_write_checkpoint
- pin files do not require sbi->writepages lock for ordering
- avoid f2fs_map_blocks() for consecutive holes in readpages
- flush plug periodically during GC to maximize readahead effect
- add tracepoints to catch lock overheads
- add several sysfs entries to tune internal lock priorities
Fixes:
- fix lock priority inversion issue
- fix incomplete block usage in compact SSA summaries
- fix to show simulate_lock_timeout correctly
- fix to avoid mapping wrong physical block for swapfile
- fix IS_CHECKPOINTED flag inconsistency issue caused by
concurrent atomic commit and checkpoint writes
- fix to avoid UAF in f2fs_write_end_io()"
* tag 'f2fs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (61 commits)
f2fs: sysfs: introduce critical_task_priority
f2fs: introduce trace_f2fs_priority_update
f2fs: fix lock priority inversion issue
f2fs: optimize f2fs_overwrite_io() for f2fs_iomap_begin
f2fs: fix incomplete block usage in compact SSA summaries
f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()
f2fs: optimize NAT block loading during checkpoint write
f2fs: change size parameter of __has_cursum_space() to unsigned int
f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpoint
f2fs: pin files do not require sbi->writepages lock for ordering
f2fs: fix to show simulate_lock_timeout correctly
f2fs: introduce FAULT_SKIP_WRITE
f2fs: check skipped write in f2fs_enable_checkpoint()
Revert "f2fs: add timeout in f2fs_enable_checkpoint()"
f2fs: fix to unlock folio in f2fs_read_data_large_folio()
f2fs: fix error path handling in f2fs_read_data_large_folio()
f2fs: use folio_end_read
f2fs: fix to avoid mapping wrong physical block for swapfile
f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages
f2fs: advance index and offset after zeroing in large folio read
...
|
|
Pull fsverity updates from Eric Biggers:
"fsverity cleanups, speedup, and memory usage optimization from
Christoph Hellwig:
- Move some logic into common code
- Fix btrfs to reject truncates of fsverity files
- Improve the readahead implementation
- Store each inode's fsverity_info in a hash table instead of using a
pointer in the filesystem-specific part of the inode.
This optimizes for memory usage in the usual case where most files
don't have fsverity enabled.
- Look up the fsverity_info fewer times during verification, to
amortize the hash table overhead"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: remove inode from fsverity_verification_ctx
fsverity: use a hashtable to find the fsverity_info
btrfs: consolidate fsverity_info lookup
f2fs: consolidate fsverity_info lookup
ext4: consolidate fsverity_info lookup
fs: consolidate fsverity_info lookup in buffer.c
fsverity: push out fsverity_info lookup
fsverity: deconstify the inode pointer in struct fsverity_info
fsverity: kick off hash readahead at data I/O submission time
ext4: move ->read_folio and ->readahead to readpage.c
readahead: push invalidate_lock out of page_cache_ra_unbounded
fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
fsverity: start consolidating pagecache code
fsverity: pass struct file to ->write_merkle_tree_block
f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
fs,fsverity: clear out fsverity_info from common code
fs,fsverity: reject size changes on fsverity files in setattr_prepare
|
|
This patch introduces /sys/fs/f2fs/<disk>/critical_task_priority, w/
this new sysfs interface, we can tune priority of f2fs_ckpt thread and
f2fs_gc thread.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs error reporting updates from Christian Brauner:
"This contains the changes to support generic I/O error reporting.
Filesystems currently have no standard mechanism for reporting
metadata corruption and file I/O errors to userspace via fsnotify.
Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines
EFSCORRUPTED, and error reporting to fanotify is inconsistent or
absent entirely.
This introduces a generic fserror infrastructure built around struct
super_block that gives filesystems a standard way to queue metadata
and file I/O error reports for delivery to fsnotify.
Errors are queued via mempools and queue_work to avoid holding
filesystem locks in the notification path; unmount waits for pending
events to drain. A new super_operations::report_error callback lets
filesystem drivers respond to file I/O errors themselves (to be used
by an upcoming XFS self-healing patchset).
On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private
per-filesystem definitions to canonical errno.h values across all
architectures"
* tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ext4: convert to new fserror helpers
xfs: translate fsdax media errors into file "data lost" errors when convenient
xfs: report fs metadata errors via fsnotify
iomap: report file I/O errors to the VFS
fs: report filesystem and file I/O errors to fsnotify
uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
|
|
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info. This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer. The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.
The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation. Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state. The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure. FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.
Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation. For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Look up the fsverity_info once in f2fs_mpage_readpages, and then use it
for the readahead, local verification of holes and pass it along to the
I/O completion workqueue in struct bio_post_read_ctx. Do the same
thing in f2fs_get_read_data_folio for reads that come from garbage
collection and other background activities.
This amortizes the lookup better once it becomes less efficient.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Pass a struct fsverity_info to the verification and readahead helpers,
and push the lookup into the callers. Right now this is a very dumb
almost mechanic move that open codes a lot of fsverity_info_addr() calls
in the file systems. The subsequent patches will clean this up.
This prepares for reducing the number of fsverity_info lookups, which
will allow to amortize them better when using a more expensive lookup
method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Currently all reads of the fsverity hashes are kicked off from the data
I/O completion handler, leading to needlessly dependent I/O. This is
worked around a bit by performing readahead on the level 0 nodes, but
still fairly ineffective.
Switch to a model where the ->read_folio and ->readahead methods instead
kick off explicit readahead of the fsverity hashed so they are usually
available at I/O completion time.
For 64k sequential reads on my test VM this improves read performance
from 2.4GB/s - 2.6GB/s to 3.5GB/s - 3.9GB/s. The improvements for
random reads are likely to be even bigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260202060754.270269-5-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Require the invalidate_lock to be held over calls to
page_cache_ra_unbounded instead of acquiring it in this function.
This prepares for calling page_cache_ra_unbounded from ->readahead for
fsverity read-ahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20260202060754.270269-3-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
This patch introduces two new tracepoints for debug purpose.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If userspace thread has held f2fs rw semaphore, due to its low priority,
it could be runnable or preempted state for long time, during the time,
it will block high priority thread which is trying to grab the same rw
semaphore, e.g. cp_rwsem, io_rwsem...
To fix such issue, let's detect thread's priority when it tries to grab
f2fs_rwsem lock, if the priority is lower than a priority threshold, let's
uplift the priority before it enters into critical region of lock, and
restore the priority after it leaves from critical region.
Meanwhile, introducing two new sysfs nodes:
- /sys/fs/f2fs/<disk>/adjust_lock_priority, it is used to control whether
the functionality is enable or not.
========== ==================
Flag_Value Flag_Description
========== ==================
0x00000000 Disabled (default)
0x00000001 cp_rwsem
0x00000002 node_change
0x00000004 node_write
0x00000008 gc_lock
0x00000010 cp_global
0x00000020 io_rwsem
========== ==================
- /sys/fs/f2fs/<disk>/lock_duration_priority, it is used to control
priority threshold.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When overwriting already allocated blocks, f2fs_iomap_begin() calls
f2fs_overwrite_io() to check block mappings. However,
f2fs_overwrite_io() iterates through all mapped blocks in the range,
which can be inefficient for fragmented files with large I/O requests.
This patch optimizes f2fs_overwrite_io() by adding a 'check_first'
parameter and introducing __f2fs_overwrite_io() helper. When called from
f2fs_iomap_begin(), we only check the first mapping to determine if the
range is already allocated, which is sufficient for setting
map.m_may_create.
This optimization significantly reduces the number of f2fs_map_blocks()
calls in f2fs_overwrite_io() when called from f2fs_iomap_begin(),
especially for fragmented files with large I/O requests.
Cc: stable@kernel.org
Fixes: 351bc761338d ("f2fs: optimize f2fs DIO overwrites")
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Sunmin Jeong <s_min.jeong@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In a previous commit, a bug was introduced where compact SSA summaries
failed to utilize the entire block space in non-4KB block size
configurations, leading to inefficient space management.
This patch fixes the calculation logic to ensure that compact SSA
summaries can fully occupy the block regardless of the block size.
Reported-by: Chris Mason <clm@meta.com>
Fixes: e48e16f3e37f ("f2fs: support non-4KB block size without packed_ssa feature")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
ext4 and f2fs are largely using the same code to read a page full
of Merkle tree blocks from the page cache, and the upcoming xfs
fsverity support would add another copy.
Move the ext4 code to fs/verity/ and use it in f2fs as well. For f2fs
this removes the previous f2fs-specific error injection, but otherwise
the behavior remains unchanged.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260128152630.627409-7-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
This will make an iomap implementation of the method easier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260128152630.627409-6-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Use IS_ENABLED to disable this code, leading to a slight size reduction:
text data bss dec hex filename
25709 2412 24 28145 6df1 fs/f2fs/compress.o.old
25198 2252 24 27474 6b52 fs/f2fs/compress.o
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260128152630.627409-5-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Free the fsverity_info directly in clear_inode instead of requiring file
systems to handle it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Link: https://lore.kernel.org/r/20260128152630.627409-3-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Add the check to reject truncates of fsverity files directly to
setattr_prepare instead of requiring the file system to handle it.
Besides removing boilerplate code, this also fixes the complete lack of
such check in btrfs.
Fixes: 146054090b08 ("btrfs: initial fsverity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260128152630.627409-2-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Rename the helper is_dot_dotdot() into the name_ namespace
and add complementary helpers to check for dot and dotdot
names individually.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20260128132406.23768-3-amir73il@gmail.com
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It's rare case that sync_inodes_sb() always skips to flush some drity
datas, so it's enough to give extra three more chances to flush data.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Under stress tests with frequent metadata operations, checkpoint write
time can become excessively long. Analysis shows that the slowdown is
caused by synchronous, one-by-one reads of NAT blocks during checkpoint
processing.
The issue can be reproduced with the following workload:
1. seq 1 650000 | xargs -P 16 -n 1 touch
2. sync # avoid checkpoint write during deleting
3. delete 1 file every 455 files
4. echo 3 > /proc/sys/vm/drop_caches
5. sync # trigger checkpoint write
This patch submits read I/O for all NAT blocks required in the
__flush_nat_entry_set() phase in advance, reducing the overhead of
synchronous waiting for individual NAT block reads.
The NAT block flush latency before and after the change is as below:
| |NAT blocks accessed|NAT blocks read|Flush time (ms)|
|-------------|-------------------|---------------|---------------|
|Before change|1205 |1191 |158 |
|After change |1264 |1242 |11 |
With a similar number of NAT blocks accessed and read from disk, adding
NAT block readahead reduces the total NAT block flush time by more than
90%.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
All callers of __has_cursum_space() pass an unsigned int value as the
size parameter. Change the parameter type to unsigned int accordingly.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch adds separate write latency accounting for NAT and SIT blocks
in f2fs_write_checkpoint().
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For pinned files, the file mapping is already established before
writing, and since the writes are in IPU, there is no need to acquire
the sbi->writepages lock to guarantee write ordering.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Commit d36de29f4bb5 ("f2fs: sysfs: introduce inject_lock_timeout")
introduces a bug as below, fix it.
cat /sys/fs/f2fs/vdx/inject_lock_timeout
s/fs/f2fs/vdx/inject_lock_timeout: Invalid argument
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to simulate skipped write during enable_checkpoint().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().
So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This reverts commit 4bc347779698b5e67e1514bab105c2c083e55502.
Let's apply a better approach to flush the only dirty pages committed by user
to avoid the delay caused by unncessary incoming ones.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We missed to unlock folio in error path of f2fs_read_data_large_folio(),
fix it.
With below testcase, it can reproduce the bug.
touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
f2fs_io clearflags immutable /mnt/f2fs/file
echo 1 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In error path of f2fs_read_data_large_folio(), if bio is valid, it
may submit bio twice, fix it.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
No logic change.
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Xiaolong Guo reported a f2fs bug in bugzilla [1]
[1] https://bugzilla.kernel.org/show_bug.cgi?id=220951
Quoted:
"When using stress-ng's swap stress test on F2FS filesystem with kernel 6.6+,
the system experiences data corruption leading to either:
1 dm-verity corruption errors and device reboot
2 F2FS node corruption errors and boot hangs
The issue occurs specifically when:
1 Using F2FS filesystem (ext4 is unaffected)
2 Swapfile size is less than F2FS section size (2MB)
3 Swapfile has fragmented physical layout (multiple non-contiguous extents)
4 Kernel version is 6.6+ (6.1 is unaffected)
The root cause is in check_swap_activate() function in fs/f2fs/data.c. When the
first extent of a small swapfile (< 2MB) is not aligned to section boundaries,
the function incorrectly treats it as the last extent, failing to map
subsequent extents. This results in incorrect swap_extent creation where only
the first extent is mapped, causing subsequent swap writes to overwrite wrong
physical locations (other files' data).
Steps to Reproduce
1 Setup a device with F2FS-formatted userdata partition
2 Compile stress-ng from https://github.com/ColinIanKing/stress-ng
3 Run swap stress test: (Android devices)
adb shell "cd /data/stressng; ./stress-ng-64 --metrics-brief --timeout 60
--swap 0"
Log:
1 Ftrace shows in kernel 6.6, only first extent is mapped during second
f2fs_map_blocks call in check_swap_activate():
stress-ng-swap-8990: f2fs_map_blocks: ino=11002, file offset=0, start
blkaddr=0x43143, len=0x1
(Only 4KB mapped, not the full swapfile)
2 in kernel 6.1, both extents are correctly mapped:
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=0, start
blkaddr=0x13cd4, len=0x1
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=1, start
blkaddr=0x60c84b, len=0xff
The problematic code is in check_swap_activate():
if ((pblock - SM_I(sbi)->main_blkaddr) % blks_per_sec ||
nr_pblocks % blks_per_sec ||
!f2fs_valid_pinned_area(sbi, pblock)) {
bool last_extent = false;
not_aligned++;
nr_pblocks = roundup(nr_pblocks, blks_per_sec);
if (cur_lblock + nr_pblocks > sis->max)
nr_pblocks -= blks_per_sec;
/* this extent is last one */
if (!nr_pblocks) {
nr_pblocks = last_lblock - cur_lblock;
last_extent = true;
}
ret = f2fs_migrate_blocks(inode, cur_lblock, nr_pblocks);
if (ret) {
if (ret == -ENOENT)
ret = -EINVAL;
goto out;
}
if (!last_extent)
goto retry;
}
When the first extent is unaligned and roundup(nr_pblocks, blks_per_sec)
exceeds sis->max, we subtract blks_per_sec resulting in nr_pblocks = 0. The
code then incorrectly assumes this is the last extent, sets nr_pblocks =
last_lblock - cur_lblock (entire swapfile), and performs migration. After
migration, it doesn't retry mapping, so subsequent extents are never processed.
"
In order to fix this issue, we need to lookup block mapping info after
we migrate all blocks in the tail of swapfile.
Cc: stable@kernel.org
Fixes: 9703d69d9d15 ("f2fs: support file pinning for zoned devices")
Cc: Daeho Jeong <daehojeong@google.com>
Reported-and-tested-by: Xiaolong Guo <guoxiaolong2008@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220951
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For consecutive large hole mapping across {d,id,did}nodes , we don't
need to call f2fs_map_blocks() to check one hole block per one time,
instead, we can use map.m_next_pgofs as a hint of next potential valid
block, so that we can skip calling f2fs_map_blocks the range of
[cur_pgofs + 1, .m_next_pgofs).
1) regular case
touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
Before:
real 0m0.706s
user 0m0.000s
sys 0m0.706s
After:
real 0m0.620s
user 0m0.008s
sys 0m0.611s
2) large folio case
touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
Before:
real 0m0.438s
user 0m0.004s
sys 0m0.433s
After:
real 0m0.368s
user 0m0.004s
sys 0m0.364s
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In f2fs_read_data_large_folio(), the block zeroing path calls
folio_zero_range() and then continues the loop. However, it fails to
advance index and offset before continuing.
This can cause the loop to repeatedly process the same subpage of the
folio, leading to stalls/hangs and incorrect progress when reading large
folios with holes/zeroed blocks.
Fix it by advancing index and offset unconditionally in the loop
iteration, so they are updated even when the zeroing path continues.
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs_read_data_large_folio() can build a single read BIO across multiple
folios during readahead. If a folio ends up having none of its subpages
added to the BIO (e.g. all subpages are zeroed / treated as holes), it
will never be seen by f2fs_finish_read_bio(), so folio_end_read() is
never called. This leaves the folio locked and not marked uptodate.
Track whether the current folio has been added to a BIO via a local
'folio_in_bio' bool flag, and when iterating readahead folios, explicitly
mark the folio uptodate (on success) and unlock it when nothing was added.
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In the second call to f2fs_map_blocks within f2fs_read_data_large_folio,
map.m_len exceeds the logical address space to be read. This patch
ensures map.m_len does not exceed the required address space.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Let's enhance sanity_check_node_footer() to detect more inconsistent
cases as below:
Node Type Node Footer Info
=================== =============================
NODE_TYPE_REGULAR inode = true and xnode = true
NODE_TYPE_INODE inode = false or xnode = true
NODE_TYPE_XATTR inode = true or xnode = false
NODE_TYPE_NON_INODE inode = false
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
-----------[ cut here ]------------
kernel BUG at fs/f2fs/data.c:358!
Call Trace:
<IRQ>
blk_update_request+0x5eb/0xe70 block/blk-mq.c:987
blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149
blk_complete_reqs block/blk-mq.c:1224 [inline]
blk_done_softirq+0x107/0x160 block/blk-mq.c:1229
handle_softirqs+0x283/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
</IRQ>
In f2fs_write_end_io(), it detects there is inconsistency in between
node page index (nid) and footer.nid of node page.
If footer of node page is corrupted in fuzzed image, then we load corrupted
node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(),
in where we won't do sanity check on node footer, once node page becomes
dirty, we will encounter this bug after node page writeback.
Cc: stable@kernel.org
Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add node footer sanity check during node folio's writeback, if sanity
check fails, let's shutdown filesystem to avoid looping to redirty
and writeback in .writepages.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Clean up code to improve readability, no logic changes.
Signed-off-by: Yangyang Zang <zangyangyang1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Currently, F2FS requires the packed_ssa feature to be enabled when
utilizing non-4KB block sizes (e.g., 16KB). This restriction limits
the flexibility of filesystem formatting options.
This patch allows F2FS to support non-4KB block sizes even when the
packed_ssa feature is disabled. It adjusts the SSA calculation logic to
correctly handle summary entries in larger blocks without the packed
layout.
Cc: stable@kernel.org
Fixes: 7ee8bc3942f2 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
__blkdev_issue_discard() in __submit_discard_cmd() will never fail, so
let's make FAULT_DISCARD fault injection obsolete.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
As syzbot reported an use-after-free issue in f2fs_write_end_io().
It is caused by below race condition:
loop device umount
- worker_thread
- loop_process_work
- do_req_filebacked
- lo_rw_aio
- lo_rw_aio_complete
- blk_mq_end_request
- blk_update_request
- f2fs_write_end_io
- dec_page_count
- folio_end_writeback
- kill_f2fs_super
- kill_block_super
- f2fs_put_super
: free(sbi)
: get_pages(, F2FS_WB_CP_DATA)
accessed sbi which is freed
In kill_f2fs_super(), we will drop all page caches of f2fs inodes before
call free(sbi), it guarantee that all folios should end its writeback, so
it should be safe to access sbi before last folio_end_writeback().
Let's relocate ckpt thread wakeup flow before folio_end_writeback() to
resolve this issue.
Cc: stable@kernel.org
Fixes: e234088758fc ("f2fs: avoid wait if IO end up when do_checkpoint for better performance")
Reported-by: syzbot+b4444e3c972a7a124187@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4444e3c972a7a124187
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|