diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-10-03 18:48:02 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-10-03 18:48:02 -0700 |
| commit | 7dbec0bbc3b468310be172f1ce6ddc9411c84952 (patch) | |
| tree | d7fb08a5e40b3e916136fc426236715c425c55ed /Documentation/admin-guide | |
| parent | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma (diff) | |
| parent | dm raid: use proper md_ro_state enumerators (diff) | |
| download | linux-7dbec0bbc3b468310be172f1ce6ddc9411c84952.tar.gz linux-7dbec0bbc3b468310be172f1ce6ddc9411c84952.zip | |
Merge tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- a new dm-pcache target for read/write caching on persistent memory
- fix typos in docs
- misc small refactoring
- mark dm-error with DM_TARGET_PASSES_INTEGRITY
- dm-request-based: fix NULL pointer dereference and quiesce_depth out of sync
- dm-linear: optimize REQ_PREFLUSH
- dm-vdo: return error on corrupted metadata
- dm-integrity: support asynchronous hash interface
* tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits)
dm raid: use proper md_ro_state enumerators
dm-integrity: prefer synchronous hash interface
dm-integrity: enable asynchronous hash interface
dm-integrity: rename internal_hash
dm-integrity: add the "offset" argument
dm-integrity: allocate the recalculate buffer with kmalloc
dm-integrity: introduce integrity_kmap and integrity_kunmap
dm-integrity: replace bvec_kmap_local with kmap_local_page
dm-integrity: use internal variable for digestsize
dm vdo: return error on corrupted metadata in start_restoring_volume functions
dm vdo: Update code to use mem_is_zero
dm: optimize REQ_PREFLUSH with data when using the linear target
dm-pcache: use int type to store negative error codes
dm: fix "writen"->"written"
dm-pcache: cleanup: fix coding style report by checkpatch.pl
dm-pcache: remove ctrl_lock for pcache_cache_segment
dm: fix NULL pointer dereference in __dm_suspend()
dm: fix queue start/stop imbalance under suspend/load/resume races
dm-pcache: add persistent cache target in device-mapper
dm error: mark as DM_TARGET_PASSES_INTEGRITY
...
Diffstat (limited to 'Documentation/admin-guide')
4 files changed, 208 insertions, 4 deletions
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 4d667228e744..a1e673c0e782 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,7 +3,7 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and/or flushs and optionally maps them to different devices. +and/or flushes and optionally maps them to different devices. Arguments:: @@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments: to write and flush operations on optionally different write_device with optionally different sector offset -9: same as 6 arguments plus define flush_offset and flush_delay explicitely +9: same as 6 arguments plus define flush_offset and flush_delay explicitly on/with optionally different flush_device/flush_offset. Offsets are specified in sectors. @@ -40,7 +40,7 @@ Example scripts #!/bin/sh # # Create mapped device delaying write and flush operations for 400ms and - # splitting reads to device $1 but writes and flushs to different device $2 + # splitting reads to device $1 but writes and flushes to different device $2 # to different offsets of 2048 and 4096 sectors respectively. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" @@ -48,7 +48,7 @@ Example scripts :: #!/bin/sh # - # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms # onto the same backing device at offset 0 sectors. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-pcache.rst b/Documentation/admin-guide/device-mapper/dm-pcache.rst new file mode 100644 index 000000000000..09d327ef4b14 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-pcache.rst @@ -0,0 +1,202 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= +dm-pcache — Persistent Cache +================================= + +*Author: Dongsheng Yang <dongsheng.yang@linux.dev>* + +This document describes *dm-pcache*, a Device-Mapper target that lets a +byte-addressable *DAX* (persistent-memory, “pmem”) region act as a +high-performance, crash-persistent cache in front of a slower block +device. The code lives in `drivers/md/dm-pcache/`. + +Quick feature summary +===================== + +* *Write-back* caching (only mode currently supported). +* *16 MiB segments* allocated on the pmem device. +* *Data CRC32* verification (optional, per cache). +* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX + == 2`) and protected with CRC+sequence numbers. +* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism +* Pure *DAX path* I/O – no extra BIO round-trips +* *Log-structured write-back* that preserves backend crash-consistency + + +Constructor +=========== + +:: + + pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>] + +========================= ==================================================== +``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…). + All metadata *and* cached blocks are stored here. + +``backing_dev`` The slow block device to be cached. + +``cache_mode`` Optional, Only ``writeback`` is accepted at the + moment. + +``data_crc`` Optional, default to ``false`` + + * ``true`` – store CRC32 for every cached entry + and verify on reads + * ``false`` – skip CRC (faster) +========================= ==================================================== + +Example +------- + +.. code-block:: shell + + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + +The first time a pmem device is used, dm-pcache formats it automatically +(super-block, cache_info, etc.). + + +Status line +=========== + +``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints: + +:: + + <sb_flags> <seg_total> <cache_segs> <segs_used> \ + <gc_percent> <cache_flags> \ + <key_head_seg>:<key_head_off> \ + <dirty_tail_seg>:<dirty_tail_off> \ + <key_tail_seg>:<key_tail_off> + +Field meanings +-------------- + +=============================== ============================================= +``sb_flags`` Super-block flags (e.g. endian marker). + +``seg_total`` Number of physical *pmem* segments. + +``cache_segs`` Number of segments used for cache. + +``segs_used`` Segments currently allocated (bitmap weight). + +``gc_percent`` Current GC high-water mark (0-90). + +``cache_flags`` Bit 0 – DATA_CRC enabled + Bit 1 – INIT_DONE (cache initialised) + Bits 2-5 – cache mode (0 == WB). + +``key_head`` Where new key-sets are being written. + +``dirty_tail`` First dirty key-set that still needs + write-back to the backing device. + +``key_tail`` First key-set that may be reclaimed by GC. +=============================== ============================================= + + +Messages +======== + +*Change GC trigger* + +:: + + dmsetup message <dev> 0 gc_percent <0-90> + + +Theory of operation +=================== + +Sub-devices +----------- + +==================== ========================================================= +backing_dev Any block device (SSD/HDD/loop/LVM, etc.). +cache_dev DAX device; must expose direct-access memory. +==================== ========================================================= + +Segments and key-sets +--------------------- + +* The pmem space is divided into *16 MiB segments*. +* Each write allocates space from a per-CPU *data_head* inside a segment. +* A *cache-key* records a logical range on the origin and where it lives + inside pmem (segment + offset + generation). +* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem + and are themselves crash-safe (CRC). +* The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets. + +Write-back +---------- + +Dirty keys are queued into a tree; a background worker copies data +back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the +upper layers forces an immediate metadata commit. + +Garbage collection +------------------ + +GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks +from *key_tail*, frees segments whose every key has been invalidated, and +advances *key_tail*. + +CRC verification +---------------- + +If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data +range when it is inserted and stores it in the on-media key. Reads +validate the CRC before copying to the caller. + + +Failure handling +================ + +* *pmem media errors* – all metadata copies are read with + ``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation. +* *Cache full* – if no free segment can be found, writes return ``-EBUSY``; + dm-pcache retries internally (request deferral). +* *System crash* – on attach, the driver replays ksets from *key_tail* to + rebuild the in-core trees; every segment’s generation guards against + use-after-free keys. + + +Limitations & TODO +================== + +* Only *write-back* mode; other modes planned. +* Only FIFO cache invalidate; other (LRU, ARC...) planned. +* Table reload is not supported currently. +* Discard planned. + + +Example workflow +================ + +.. code-block:: shell + + # 1. Create devices + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + + # 2. Put a filesystem on top + mkfs.ext4 /dev/mapper/pcache_sdb + mount /dev/mapper/pcache_sdb /mnt + + # 3. Tune GC threshold to 80 % + dmsetup message pcache_sdb 0 gc_percent 80 + + # 4. Observe status + watch -n1 'dmsetup status pcache_sdb' + + # 5. Shutdown + umount /mnt + dmsetup remove pcache_sdb + + +``dm-pcache`` is under active development; feedback, bug reports and patches +are very welcome! diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index cc5aec861576..f1c1f4b824ba 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -18,6 +18,7 @@ Device Mapper dm-integrity dm-io dm-log + dm-pcache dm-queue-length dm-raid dm-service-time diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index a14e6d3e787c..8a67b320a97b 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -1,5 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0-only +====== dm-vdo ====== |
