summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorLines
2026-02-02ipv6: pass proto by value to ipv6_push_nfrag_opts() and ipv6_push_frag_opts()Eric Dumazet-5/+5
With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing a pointer to an automatic variable. Change these exported functions to return 'u8 proto' instead of void. - ipv6_push_nfrag_opts() - ipv6_push_frag_opts() For instance, replace ipv6_push_frag_opts(skb, opt, &proto); with: proto = ipv6_push_frag_opts(skb, opt, proto); Note that even after this change, ip6_xmit() has to use a stack canary because of @first_hop variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02net: add a debug check in __skb_push()Eric Dumazet-0/+1
Add the following check, to detect bugs sooner for CONFIG_DEBUG_NET=y builds. DEBUG_NET_WARN_ON_ONCE(skb->data < skb->head); Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130160253.2936789-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02fsverity: push out fsverity_info lookupChristoph Hellwig-9/+21
Pass a struct fsverity_info to the verification and readahead helpers, and push the lookup into the callers. Right now this is a very dumb almost mechanic move that open codes a lot of fsverity_info_addr() calls in the file systems. The subsequent patches will clean this up. This prepares for reducing the number of fsverity_info lookups, which will allow to amortize them better when using a more expensive lookup method. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02fsverity: kick off hash readahead at data I/O submission timeChristoph Hellwig-8/+22
Currently all reads of the fsverity hashes are kicked off from the data I/O completion handler, leading to needlessly dependent I/O. This is worked around a bit by performing readahead on the level 0 nodes, but still fairly ineffective. Switch to a model where the ->read_folio and ->readahead methods instead kick off explicit readahead of the fsverity hashed so they are usually available at I/O completion time. For 64k sequential reads on my test VM this improves read performance from 2.4GB/s - 2.6GB/s to 3.5GB/s - 3.9GB/s. The improvements for random reads are likely to be even bigger. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260202060754.270269-5-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02net: l3mdev: use skb_dst_dev_rcu() in l3mdev_l3_out()Eric Dumazet-3/+4
Extend the RCU section a bit so that we can use the safer skb_dst_dev_rcu() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130191906.3781856-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-03pinctrl: core: Remove unused devm_pinctrl_unregister()Andy Shevchenko-3/+0
There are no users, drop it for good. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Signed-off-by: Linus Walleij <linusw@kernel.org>
2026-02-02Anbernic RG-DS AW87391 Speaker AmpsMark Brown-10/+54
Merge series from Chris Morgan <macroalpha82@gmail.com>: Add support for the Anbernic RG-DS Speaker Amplifiers. The Anbernic RG-DS uses two AW87391 ICs at 0x58 and 0x5B on i2c2. However, the manufacturer did not provide a firmware file, only a sequence of register writes to each device to enable and disable them. Add support for this *specific* configuration in the AW87390 driver. Since we are relying on a device specific sequence I am using a device specific compatible string. This driver does not currently support the aw87391 for any other device as I have none to test with valid firmware. Attempts to create firmware with the AwinicSCPv4 have not been successful.
2026-02-02spi: add multi-lane supportMark Brown-0/+30
Merge series from David Lechner <dlechner@baylibre.com>: This series is adding support for SPI controllers and peripherals that have multiple SPI data lanes (data lanes being independent sets of SDI/SDO lines, each with their own serializer/deserializer). This series covers this specific use case: +--------------+ +---------+ | SPI | | SPI | | Controller | | ADC | | | | | | CS0 |--->| CS | | SCLK |--->| SCLK | | SDO |--->| SDI | | SDI0 |<---| SDOA | | SDI1 |<---| SDOB | | SDI2 |<---| SDOC | | SDI3 |<---| SDOD | +--------------+ +--------+ The ADC is a simultaneous sampling ADC that can convert 4 samples at the same time. It has 4 data output lines (SDOA-D) that each contain the data of one of the 4 channels. So it requires a SPI controller with 4 separate deserializers in order to receive all of the information at the same time. This should also work for the use case in [1] as well. (Some of the patches in this series were already submitted there). In that case the SPI controller is used kind of like it is two separate SPI controllers, each with its own chip select, clock, and data lines. [1]: https://lore.kernel.org/linux-spi/20250616220054.3968946-1-sean.anderson@linux.dev/ The DT bindings are a fairly straight-forward mapping of which pins on the peripheral are connected to which pins on the controller. The SPI core code parses this and makes the information available to drivers. When a peripheral driver sees that multiple data lanes are wired up, it can chose to use them when sending messages. The SPI message API is a bit higher-level than just specifying the number of data lines for a SPI transfer though. I did some research on other SPI controllers that have this feature. They tend to be the kind meant for connecting to two flash memory chips at the same time but can be used more generically as well. They generally have the option to either use one lane at a time (Sean's use case), or can mirror the same data on multiple lanes (no users of this yet) or can perform striping of a single data FIFO/DMA stream to/from the two lanes (our use case). For now, the API assumes that if you want to do mirror/striping, then you want to use all available data lanes. Otherwise, it just uses the first data lane for "normal" SPI transfers.
2026-02-02rcu: Mark lockdep_assert_rcu_helper() __always_inlineArnd Bergmann-1/+1
There are some configurations in which lockdep_assert_rcu_helper() ends up not being inlined, for some reason. This leads to a link failure because now the caller tries to pass a nonexistant __ctx_lock_RCU structure: ld: lib/test_context-analysis.o: in function `test_rcu_assert_variants': test_context-analysis.c:(.text+0x275c): undefined reference to `RCU' ld: test_context-analysis.c:(.text+0x276c): undefined reference to `RCU_BH' ld: test_context-analysis.c:(.text+0x2774): undefined reference to `RCU_SCHED' I saw this in one out of many 32-bit arm builds using gcc-15.2, but it probably happens in others as well. Mark this function as __always_inline to fix the build. Fixes: fe00f6e84621 ("rcu: Support Clang's context analysis") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://patch.msgid.link/20260202095507.1237440-1-arnd@kernel.org
2026-02-02Merge tag 'iio-for-7.0a' of ↵Greg Kroah-Hartman-36/+192
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-next Jonathan writes: IIO: New device support, features and cleanup for the 6.20/7.0 cycle. Slightly messier than normal unfortunately due to some conflicts and build config bugs related to I3C drivers. One last minute Kconfig fix right at the top after a linux-next report. I've simplified the Kconfig and made it match other instances in the kernel so that should be safe enough despite short soak time in front of build bots. Merge of an immutable branch from I3C to get some stubs that were missing and caused build issues with dual I2C / I3C drivers. This also brought in a drop of some deprecated interfaces so there is also one patch to update a new driver to not use those. We are having another go at using cleanup.h magic with the IIO mode claim functions after backing out last try at this. This time we have wrappers around the new ACQUIRE() and ACQUIRE_ERR() macros. Having been burnt once, we will be taking it a bit more slowly this time wrt to wide adoption of these! Thanks in particular to Kurt for taking on this core IIO work. New Device Support ================== adi,ad18113 - New driver to support the AD18113 amplifier - an interesting device due to the external bypass paths where we need to describe what gain those paths have in DT. Longer term it will be interesting to see if this simplistic description is enough for real deployments. adi,ad4062 - New driver for the AD4060 and AD4052 SAR ADCs including trigger, event and GPIO controller support. Follow up patch replaced use of some deprecated I3C interfaces prior to the I3C immutable branch merge as that includes dropping them. adi,ad4134 - New driver for the AD4134 24bit 4 channel simultaneous sampling ADC. adi,ad7768-1, - Add support for the ADAQ767-1, ADAQ7768-1 and ADAQ7769-1 ADCs after some rework to enable the driver to support multiple device types. adi,ad9467 - Add support for the similar ad9211 ADC to this existing driver. - Make the selection of 2s comp mode explicit for normal operation and switch to offset binary when entering calibration mode. honeywell,abp2 - New driver to support this huge family (100+) of board mount pressure and temperature sensors. maxim,max22007 - New drier for this 4 channel DAC. memsic,mmc5633 - New driver for this I2C/I3C magnetometer. Follow on patches fixed up issues related to single driver supporting both bus types. microchip,mcp747feb02 - New driver for the Microchip MCP47F(E/V)B(0/1/2)1, MCP47F(E/V)B(0/1/2)2, MCP47F(E/V)B(0/1/2)4 and MCP47F(E/V)B(0/1/2)8 buffered voltage output DACs. nxp,sar-adc - New driver support ADCs found on s32g2 and s32g3 platforms. ti,ads1018 - New drier for the ADS1018 and ADS1118 SPI ADCs. ti,ads131m02 - New driver supporting ADS131M(02/03/04/06/08)24-bit simultaneous sampling ADCs. Features ======== iio-core - New IIO_DEV_ACQUIRE_DIRECT_MODE() / IIO_DEV_ACQUIRE_FAILED() + equivalents for the much rarer case where the mode needs pinning whether or not it is in direct mode. These use the ACQUIRE() / ACQUIRE_ERR() infrastructure underneath to provide both simple checks on whether we got the requested mode and to provide scope based release. Applied in a few initial drivers. adi,ad9467 - Support calibbias control adi,adf4377 - Add support to act as a clock provider. adi,adxl380 - Support low power 1KHz sampling frequency mode. Required rework of how events and filters were configured, plus applying of constraints when in this mode. rf-digital,rfd77402 - Add interrupt support as alternative to polling for completion. st,lsm6dsx - Tap event detection (after considerable driver rework) Cleanup and Minor Fixes ======================= More minor cleanup such as typos, white space etc not called out except where they were applied to a lot of drivers. Various drivers. - Use of dev_err_probe() to cleanup error handling. - Introduce local struct device and struct device_node variables to reduce duplication of getting them from containing structs. - Ensure uses of iio_trigger_generic_data_rdy_poll() set IRQF_NO_THREAD as that function calls non threaded child interrupt handlers. - Replace IRQF_ONESHOT in not thread interrupt handlers with IRQF_NO_THREAD to ensure they run as intended. Drop one unnecessary case. iio-sw-device/trigger. - Constify configs_group_operations structures. iio-buffer-dma / buffer-dma-engine - Use lockdep_assert_held() to replace WARN_ON() to check lock is correctly held. - Make use of cleanup.h magic to simplify various code paths. - Make iio_dma_buffer_init() return void rather than always success. adi,ad7766 - Replace custom interrupt handler with iio_trigger_generic_data_rdy_poll() adi,ad9832 - Drop legacy platform_data support. adi,ade9000 - Add a maintainer entry. adi,adt7316 - Move to EXPORT_GPL_SIMPLE_DEV_PM_OPS() and pm_sleep_ptr() so the compiler can cleanly drop unused pm structures and callbacks. adi,adxl345 - Relax build constraint vs the driver that is in input so both may be built as modules and selection made at runtime. adi,adxl380 - Make sure we don't read tail entries in the hardware fifo if a partial new scan has been written. - Move to a single larger regmap_noinc_read() to read the hardware fifo. aspeed,ast2600 - Add missing interrupts property to DT binding. bosch,bmi270_i2c - Add missing MODULE_DEVICE_TABLE() macros so auto probing of modules can work. bosch,smi330 - Drop duplicate assignment of IIO_TYPE in smi330_read_avail() - Use new common field_get() and field_prep() helpers to replace local version. honeywell,mprls0025pa Fixes delayed to merge window as late in cycle and we didn't want to delay the rest of the series. - Allow Kconfig selection of specific bus sub-drivers rather than tying that to the buses themselves being supported. - Zero spi_transfer structure to avoid chance of unintentionally set fields effecting transfer. - Fix a potential timing violation wrt to the chip select to first clock edge timing. - As recent driver, take risk inherent in dropping interrupt direction from driver as that should be set by firmware. - Fix wrong reported number of data bits for channel. - Fix a pressure channel calculation bug. - Rework to allow embedding the tx buffer in the iio_priv() structure rather than requiring separate allocation. - Move the buffer clearing to the shared core bringing it into affect for SPI as well as I2C. - Stricter checks for status byte. - Greatly simplify the measurement sequence. - Add a copyright entry to reflect Petre's continued work on this driver. intersil,isl29018 - Switch from spritnf to sysfs_emit_at() to make it clear overflow can't occur. invensense,icm42600 - Allow sysfs access to temperature when buffered capture in use as it does not impact other sensor data paths. invensense,itg3200 - Check unused return value in read_raw() callback. men,z188 - Drop now duplicated module alias. rf-digital,rfd77402 - Add DT binding doc and explicit of_device_id table. - Poll for timeout with times as on datasheet, then replace opencoded version with read_poll_timeout(). sensiron,scd4x - Add missing timestamp channel. The code to push it to the buffer was there but there was no way to turn it on. vti,sca3000 - Fix resource leak if iio_device_register() fails. * tag 'iio-for-7.0a' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (144 commits) iio: magn: mmc5633: Fix Kconfig for combination of I3C as module and driver builtin iio: sca3000: Fix a resource leak in sca3000_probe() iio: proximity: rfd77402: Add interrupt handling support iio: proximity: rfd77402: Document device private data structure iio: proximity: rfd77402: Use devm-managed mutex initialization iio: proximity: rfd77402: Use kernel helper for result polling iio: proximity: rfd77402: Align polling timeout with datasheet iio: cros_ec: Allow enabling/disabling calibration mode iio: frequency: ad9523: correct kernel-doc bad line warning iio: buffer: buffer_impl.h: fix kernel-doc warnings iio: gyro: itg3200: Fix unchecked return value in read_raw MAINTAINERS: add entry for ADE9000 driver iio: accel: sca3000: remove unused last_timestamp field iio: accel: adxl372: remove unused int2_bitmask field iio: adc: ad7766: Use iio_trigger_generic_data_rdy_poll() iio: magnetometer: Remove IRQF_ONESHOT iio: Replace IRQF_ONESHOT with IRQF_NO_THREAD iio: Use IRQF_NO_THREAD iio: dac: Add MAX22007 DAC driver support dt-bindings: iio: dac: Add max22007 ...
2026-02-02RDMA/bnxt_re: Report packet pacing capabilities when querying deviceKalesh AP-0/+16
Enable the support to report packet pacing capabilities from kernel to user space. Packet pacing allows to limit the rate to any number between the maximum and minimum. The capabilities are exposed to user space through query_device. The following capabilities are reported: 1. The maximum and minimum rate limit in kbps. 2. Bitmap showing which QP types support rate limit. Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20260202133413.3182578-3-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-02-02spi: add multi_lane_mode field to struct spi_transferDavid Lechner-0/+8
Add a new multi_lane_mode field to struct spi_transfer to allow peripherals that support multiple SPI lanes to be used with a single SPI controller. This requires both the peripheral and the controller to have multiple serializers connected to separate data lanes. It could also be used with a single controller and multiple peripherals that are functioning as a single logical device (similar to parallel memories). Acked-by: Nuno Sá <nuno.sa@analog.com> Acked-by: Marcelo Schmitt <marcelo.schmitt@analog.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20260123-spi-add-multi-bus-support-v6-4-12af183c06eb@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-02-02spi: support controllers with multiple data lanesDavid Lechner-0/+22
Add support for SPI controllers with multiple physical SPI data lanes. (A data lane in this context means lines connected to a serializer, so a controller with two data lanes would have two serializers in a single controller). This is common in the type of controller that can be used with parallel flash memories, but can be used for general purpose SPI as well. To indicate support, a controller just needs to set ctlr->num_data_lanes to something greater than 1. Peripherals indicate which lane they are connected to via device tree (ACPI support can be added if needed). The spi-{tx,rx}-bus-width DT properties can now be arrays. The length of the array indicates the number of data lanes, and each element indicates the bus width of that lane. For now, we restrict all lanes to have the same bus width to keep things simple. Support for an optional controller lane mapping property is also implemented. Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20260123-spi-add-multi-bus-support-v6-3-12af183c06eb@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-02-02KVM: arm64: Use standard seq_file iterator for vgic-debug debugfsFuad Tabba-3/+0
The current implementation uses `vgic_state_iter` in `struct vgic_dist` to track the sequence position. This effectively makes the iterator shared across all open file descriptors for the VM. This approach has significant drawbacks: - It enforces mutual exclusion, preventing concurrent reads of the debugfs file (returning -EBUSY). - It relies on storing transient iterator state in the long-lived VM structure (`vgic_dist`). Refactor the implementation to use the standard `seq_file` iterator. Instead of storing state in `kvm_arch`, rely on the `pos` argument passed to the `start` and `next` callbacks, which tracks the logical index specific to the file descriptor. This change enables concurrent access and eliminates the `vgic_state_iter` field from `struct vgic_dist`. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202085721.3954942-4-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-02KVM: arm64: Reimplement vgic-debug XArray iterationFuad Tabba-1/+0
The vgic-debug interface implementation uses XArray marks (`LPI_XA_MARK_DEBUG_ITER`) to "snapshot" LPIs at the start of iteration. This modifies global state for a read-only operation and complicates reference counting, leading to leaks if iteration is aborted or fails. Reimplement the iterator to use dynamic iteration logic: - Remove `lpi_idx` from `struct vgic_state_iter`. - Replace the XArray marking mechanism with dynamic iteration using `xa_find_after(..., XA_PRESENT)`. - Wrap XArray traversals in `rcu_read_lock()`/`rcu_read_unlock()` to ensure safety against concurrent modifications (e.g., LPI unmapping). - Handle potential races where an LPI is removed during iteration by gracefully skipping it in `show()`, rather than warning. - Remove the unused `LPI_XA_MARK_DEBUG_ITER` definition. This simplifies the lifecycle management of the iterator and prevents resource leaks associated with the marking mechanism, and paves the way for using a standard seq_file iterator. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202085721.3954942-3-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-02wifi: mac80211: Add eMLSR/eMLMR action frame parsing supportLorenzo Bianconi-0/+49
Introduce support in AP mode for parsing of the Operating Mode Notification frame sent by the client to enable/disable MLO eMLSR or eMLMR if supported by both the AP and the client. Add drv_set_eml_op_mode mac80211 callback in order to configure underlay driver with eMLSR/eMLMR info. Tested-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260129-mac80211-emlsr-v4-1-14bdadf57380@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02wifi: mac80211: add initial UHR supportJohannes Berg-2/+33
Add support for making UHR connections and accepting AP stations with UHR support. Link: https://patch.msgid.link/20260130164259.7185980484eb.Ieec940b58dbf8115dab7e1e24cb5513f52c8cb2f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02wifi: cfg80211: add initial UHR supportJohannes Berg-3/+85
Add initial support for making UHR connections (or suppressing that), adding UHR capable stations on the AP side, encoding and decoding UHR MCSes (except rate calculation for the new MCSes 17, 19, 20 and 23) as well as regulatory support. Link: https://patch.msgid.link/20260130164259.54cc12fbb307.I26126bebd83c7ab17e99827489f946ceabb3521f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02wifi: ieee80211: add some initial UHR definitionsJohannes Berg-2/+251
This is based on Draft P802.11bn_D1.2, but that's still very incomplete, so don't handle a number of things and make some local decisions such as using 40 bits for MAC capabilities and 8 bits for PHY capabilities. Link: https://patch.msgid.link/20260130164259.b28c9456ff94.I5b11fb0345a933bf497fd802aecc72932d58dd68@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02wifi: mac80211: correct ieee80211-{s1g/eht}.h include guard commentsLachlan Hodges-2/+2
After the split of ieee80211.h some include guard comments weren't updated, update them to their new file names. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20260130005319.70019-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflowDeepanshu Kartikey-6/+19
The dma_map_sg tracepoint can trigger a perf buffer overflow when tracing large scatter-gather lists. With devices like virtio-gpu creating large DRM buffers, nents can exceed 1000 entries, resulting in: phys_addrs: 1000 * 8 bytes = 8,000 bytes dma_addrs: 1000 * 8 bytes = 8,000 bytes lengths: 1000 * 4 bytes = 4,000 bytes Total: ~20,000 bytes This exceeds PERF_MAX_TRACE_SIZE (8192 bytes), causing: WARNING: CPU: 0 PID: 5497 at kernel/trace/trace_event_perf.c:405 perf buffer not large enough, wanted 24620, have 8192 Cap all three dynamic arrays at 128 entries using min() in the array size calculation. This ensures arrays are only as large as needed (up to the cap), avoiding unnecessary memory allocation for small operations while preventing overflow for large ones. The tracepoint now records the full nents/ents counts and a truncated flag so users can see when data has been capped. Changes in v2: - Use min(nents, DMA_TRACE_MAX_ENTRIES) for dynamic array sizing instead of fixed DMA_TRACE_MAX_ENTRIES allocation (feedback from Steven Rostedt) - This allocates only what's needed up to the cap, avoiding waste for small operations Reported-by: syzbot+28cea38c382fd15e751a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=28cea38c382fd15e751a Tested-by: syzbot+28cea38c382fd15e751a@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Reviwed-by: Sean Anderson <sean.anderson@linux.dev> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260130155215.69737-1-kartikey406@gmail.com
2026-02-02Partial revert "x86/xen: fix balloon target initialization for PVH dom0"Roger Pau Monne-0/+2
This partially reverts commit 87af633689ce16ddb166c80f32b120e50b1295de so the current memory target for PV guests is still fetched from start_info->nr_pages, which matches exactly what the toolstack sets the initial memory target to. Using get_num_physpages() is possible on PV also, but needs adjusting to take into account the ISA hole and the PFN at 0 not considered usable memory despite being populated, and hence would need extra adjustments. Instead of carrying those extra adjustments switch back to the previous code. That leaves Linux with a difference in how current memory target is obtained for HVM vs PV, but that's better than adding extra logic just for PV. However if switching to start_info->nr_pages for PV domains we need to differentiate between released pages (freed back to the hypervisor) as opposed to pages in the physmap which are not populated to start with. Introduce a new xen_unpopulated_pages to account for papges that have never been populated, and hence in the PV case don't need subtracting. Fixes: 87af633689ce ("x86/xen: fix balloon target initialization for PVH dom0") Reported-by: James Dingwall <james@dingwall.me.uk> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260128110510.46425-2-roger.pau@citrix.com>
2026-02-02Merge tag 'amd-drm-next-6.20-2026-01-30' of ↵Dave Airlie-0/+2
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-30: amdgpu: - Misc cleanups - SMU 13 fixes - SMU 14 fixes - GPUVM fault filter fix - USB4 fixes - DC FP guard fixes - Powergating fix - JPEG ring reset fix - RAS fixes - Xclk fix for soc21 APUs - Fix COND_EXEC handling for GC 11 - UserQ fixes - MQD size alignment fixes - SMU feature interface cleanup - GC 10-12 KGQ init fixes - GC 11-12 KGQ reset fixes amdkfd: - Fix device snapshot reporting - GC 12.1 trap handler fixes - MQD size alignment fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260130183257.28879-1-alexander.deucher@amd.com
2026-02-01Merge tag 'perf-urgent-2026-02-01' of ↵Linus Torvalds-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fix from Ingo Molnar: "Fix a race in the user-callchains code" * tag 'perf-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: sched: Fix perf crash with new is_user_task() helper
2026-02-01mfd: wm8350-core: Use IRQF_ONESHOTSebastian Andrzej Siewior-1/+1
Using a threaded interrupt without a dedicated primary handler mandates the IRQF_ONESHOT flag to mask the interrupt source while the threaded handler is active. Otherwise the interrupt can fire again before the threaded handler had a chance to run. Mark explained that this should not happen with this hardware since it is a slow irqchip which is behind an I2C/ SPI bus but the IRQ-core will refuse to accept such a handler. Set IRQF_ONESHOT so the interrupt source is masked until the secondary handler is done. Fixes: 1c6c69525b40e ("genirq: Reject bogus threaded irq requests") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20260128095540.863589-16-bigeasy@linutronix.de
2026-02-01genirq: Set IRQF_COND_ONESHOT in devm_request_irq().Sebastian Andrzej Siewior-1/+1
The flag IRQF_COND_ONESHOT was already force-added to request_irq() because the ACPI SCI interrupt handler is using the IRQF_ONESHOT flag which breaks all shared handlers. devm_request_irq() needs the same change since some users, such as int0002_vgpio, are using this function instead. Add IRQF_COND_ONESHOT to the flags passed to devm_request_irq(). Fixes: c37927a203fa2 ("genirq: Set IRQF_COND_ONESHOT in request_irq()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128095540.863589-2-bigeasy@linutronix.de
2026-02-01cgroup: increase maximum subsystem count from 16 to 32Chen Ridong-5/+5
The current cgroup subsystem limit of 16 is insufficient, as the number of existing subsystems has already reached this limit. When adding a new subsystem that is not yet in the mainline kernel, building with `make allmodconfig` requires first bypassing the `BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16)` restriction to allow compilation to succeed. However, the kernel still fails to boot afterward. This patch increases the maximum number of supported cgroup subsystems from 16 to 32, providing enough room for future subsystem additions. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Tested-by: JP Kobryn <inwardvessel@gmail.com> Acked-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-31ipc: don't audit capability check in ipc_permissions()Ondrej Mosnacek-0/+6
The IPC sysctls implement the ctl_table_root::permissions hook and they override the file access mode based on the CAP_CHECKPOINT_RESTORE capability, which is being checked regardless of whether any access is actually denied or not, so if an LSM denies the capability, an audit record may be logged even when access is in fact granted. It wouldn't be viable to restructure the sysctl permission logic to only check the capability when the access would be actually denied if it's not granted. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) - switch from ns_capable() to ns_capable_noaudit(), so that the check never emits an audit record. Link: https://lkml.kernel.org/r/20260122141303.241133-1-omosnace@redhat.com Fixes: 0889f44e2810 ("ipc: Check permissions for checkpoint_restart sysctls at open time") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31delayacct: add timestamp of delay maxWang Yaxin-1/+34
Problem ======= Commit 658eb5ab916d ("delayacct: add delay max to record delay peak") introduced the delay max for getdelays, which records abnormal latency peaks and helps us understand the magnitude of such delays. However, the peak latency value alone is insufficient for effective root cause analysis. Without the precise timestamp of when the peak occurred, we still lack the critical context needed to correlate it with other system events. Solution ======== To address this, we need to additionally record a precise timestamp when the maximum latency occurs. By correlating this timestamp with system logs and monitoring metrics, we can identify processes with abnormal resource usage at the same moment, which can help us to pinpoint root causes. Use Case ======== bash-4.4# ./getdelays -d -t 227 print delayacct stats ON TGID 227 CPU count real total virtual total delay total delay average delay max delay min delay max timestamp 46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58 IO count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A SWAP count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A RECLAIM count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A THRAS HING count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A COMPACT count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A WPCOPY count delay total delay average delay max delay min delay max timestamp 182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24 IRQ count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31tracing: move tracing declarations from kernel.h to a dedicated headerYury Norov-195/+205
Tracing is a half of the kernel.h in terms of LOCs, although it's a self-consistent part. It is intended for quick debugging purposes and isn't used by the normal tracing utilities. Move it to a separate header. If someone needs to just throw a trace_printk() in their driver, they will not have to pull all the heavy tracing machinery. This is a pure move. Link: https://lkml.kernel.org/r/20260116042510.241009-7-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31tracing: remove size parameter in __trace_puts()Steven Rostedt-2/+2
The __trace_puts() function takes a string pointer and the size of the string itself. All users currently simply pass in the strlen() of the string it is also passing in. There's no reason to pass in the size. Instead have the __trace_puts() function do the strlen() within the function itself. This fixes a header recursion issue where using strlen() in the macro calling __trace_puts() requires adding #include <linux/string.h> in order to use strlen(). Removing the use of strlen() from the header fixes the recursion issue. Link: https://lore.kernel.org/all/aUN8Hm377C5A0ILX@yury/ Link: https://lkml.kernel.org/r/20260116042510.241009-6-ynorov@nvidia.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31kernel.h: include linux/instruction_pointer.h explicitlyYury Norov-0/+1
In preparation for decoupling linux/instruction_pointer.h and linux/kernel.h, include instruction_pointer.h explicitly where needed. Link: https://lkml.kernel.org/r/20260116042510.241009-5-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31kernel.h: move VERIFY_OCTAL_PERMISSIONS() to sysfs.hYury Norov-13/+14
The macro is related to sysfs, but is defined in kernel.h. Move it to the proper header, and unload the generic kernel.h. Now that the macro is removed from kernel.h, linux/moduleparam.h is decoupled, and kernel.h inclusion can be removed. Link: https://lkml.kernel.org/r/20260116042510.241009-4-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31moduleparam: include required headers explicitlyYury Norov-0/+5
The following patch drops moduleparam.h dependency on kernel.h. In preparation to it, list all the required headers explicitly. Link: https://lkml.kernel.org/r/20260116042510.241009-3-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Suggested-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31kernel.h: drop STACK_MAGIC macroYury Norov-2/+0
Patch series "Unload linux/kernel.h", v5. kernel.h hosts declarations that can be placed better. This series decouples kernel.h with some explicit and implicit dependencies; also, moves tracing functionality to a new independent header. This patch (of 6): The macro was introduced in 1994, v1.0.4, for stacks protection. Since that, people found better ways to protect stacks, and now the macro is only used by i915 selftests. Move it to a local header and drop from the kernel.h. Link: https://lkml.kernel.org/r/20260116042510.241009-1-ynorov@nvidia.com Link: https://lkml.kernel.org/r/20260116042510.241009-2-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31compiler-clang.h: require LLVM 19.1.0 or higher for __typeof_unqual__Nathan Chancellor-1/+1
When building the kernel using a version of LLVM between llvmorg-19-init (the first commit of the LLVM 19 development cycle) and the change in LLVM that actually added __typeof_unqual__ for all C modes [1], which might happen during a bisect of LLVM, there is a build failure: In file included from arch/x86/kernel/asm-offsets.c:9: In file included from include/linux/crypto.h:15: In file included from include/linux/completion.h:12: In file included from include/linux/swait.h:7: In file included from include/linux/spinlock.h:56: In file included from include/linux/preempt.h:79: arch/x86/include/asm/preempt.h:61:2: error: call to undeclared function '__typeof_unqual__'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 61 | raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED); | ^ arch/x86/include/asm/percpu.h:478:36: note: expanded from macro 'raw_cpu_and_4' 478 | #define raw_cpu_and_4(pcp, val) percpu_binary_op(4, , "and", (pcp), val) | ^ arch/x86/include/asm/percpu.h:210:3: note: expanded from macro 'percpu_binary_op' 210 | TYPEOF_UNQUAL(_var) pto_tmp__; \ | ^ include/linux/compiler.h:248:29: note: expanded from macro 'TYPEOF_UNQUAL' 248 | # define TYPEOF_UNQUAL(exp) __typeof_unqual__(exp) | ^ The current logic of CC_HAS_TYPEOF_UNQUAL just checks for a major version of 19 but half of the 19 development cycle did not have support for __typeof_unqual__. Harden the logic of CC_HAS_TYPEOF_UNQUAL to avoid this error by only using __typeof_unqual__ with a released version of LLVM 19, which is greater than or equal to 19.1.0 with LLVM's versioning scheme that matches GCC's [2]. Link: https://github.com/llvm/llvm-project/commit/cc308f60d41744b5920ec2e2e5b25e1273c8704b [1] Link: https://github.com/llvm/llvm-project/commit/4532617ae420056bf32f6403dde07fb99d276a49 [2] Link: https://lkml.kernel.org/r/20260116-require-llvm-19-1-for-typeof_unqual-v1-1-3b9a4a4b212b@kernel.org Fixes: ac053946f5c4 ("compiler.h: introduce TYPEOF_UNQUAL() macro") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31kho: use unsigned long for nr_pagesPratyush Yadav-3/+3
Patch series "kho: clean up page initialization logic", v2. This series simplifies the page initialization logic in kho_restore_page(). It was originally only a single patch [0], but on Pasha's suggestion, I added another patch to use unsigned long for nr_pages. Technically speaking, the patches aren't related and can be applied independently, but bundling them together since patch 2 relies on 1 and it is easier to manage them this way. This patch (of 2): With 4k pages, a 32-bit nr_pages can span up to 16 TiB. While it is a lot, there exist systems with terabytes of RAM. gup is also moving to using long for nr_pages. Use unsigned long and make KHO future-proof. Link: https://lkml.kernel.org/r/20260116112217.915803-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260116112217.915803-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31Merge branch 'mm-hotfixes-stable' into mm-nonmm-stable to pick up changesAndrew Morton-3/+26
required to merge "kho: use unsigned long for nr_pages".
2026-01-31mm, swap: drop the SWAP_HAS_CACHE flagKairui Song-1/+0
Now, the swap cache is managed by the swap table. All swap cache users are checking the swap table directly to check the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase from 0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation (folio_alloc_swap), or before the final free of slots pinned by folio in swap cache (put_swap_folio). Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached". We have just simplified that and defined that the first dup is always done with folio locked in swap cache (folio_dup_swap), so stop checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table) directly, and add a WARN if a swap entry's count is being increased for the first time while the folio is not in swap cache. As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. We have also just cleaned up batch freeing to check the swap cache usage using the swap table: a slot with swap cache in the swap table will not be freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And besides, the removal of a folio and freeing of the slots are being done in the same critical section now, which should improve the performance. After these two changes, SWAP_HAS_CACHE no longer has any users. Swap cache synchronization is also done by the swap table directly, so using SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer needed. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, ignoring the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit. The idea of dropping SWAP_HAS_CACHE and using the swap table directly was initially from Chris's idea of merging all the metadata usage of all swaps into one place. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm, swap: add folio to swap cache directly on allocationKairui Song-5/+0
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm, swap: cleanup swap entry management workflowKairui Song-33/+25
The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with new added sanity checks. Also making more optimization possible. Swap entry will be mostly freed and free with a folio bound. The folio lock will be useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count start with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm, swap: use swap cache as the swap in synchronize layerKairui Song-6/+0
Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so it has to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step to remove this bit completely. Now all swap in paths are using the swap cache, and both the swap cache and swap map are protected by the cluster lock. So we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock and folio lock. Whoever inserts a folio in the swap cache first does the swap in work. And because folios are locked during swap operations, other raced swap operations will just wait on the folio lock. The SWAP_HAS_CACHE will be removed in later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. [kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/shmem, swap: remove SWAP_MAP_SHMEMNhat Pham-8/+7
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having swap count == SWAP_MAP_SHMEM value is basically the same as having swap count == 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will stil be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never gets re-duplicated). Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: add and use vma_assert_stabilised()Lorenzo Stoakes-4/+53
Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot be changed underneath us. This will be the case if EITHER the VMA lock or the mmap lock is held. In order to do so, we introduce a new assert vma_assert_stabilised() - this will make a lockdep assert if lockdep is enabled AND the VMA is read-locked. Currently lockdep tracking for VMA write locks is not implemented, so it suffices to check in this case that we have either an mmap read or write semaphore held. Note that because the VMA lock uses the non-standard vmlock_dep_map naming convention, we cannot use lockdep_assert_is_write_held() so have to open code this ourselves via lockdep-asserting that lock_is_held_type(&vma->vmlock_dep_map, 0). We have to be careful here - for instance when merging a VMA, we use the mmap write lock to stabilise the examination of adjacent VMAs which might be simultaneously VMA read-locked whilst being faulted in. If we were to assert VMA read lock using lockdep we would encounter an incorrect lockdep assert. Also, we have to be careful about asserting mmap locks are held - if we try to address the above issue by first checking whether mmap lock is held and if so asserting it via lockdep, we may find that we were raced by another thread acquiring an mmap read lock simultaneously that either we don't own (and thus can be released any time - so we are not stable) or was indeed released since we last checked. So to deal with these complexities we end up with either a precise (if lockdep is enabled) or imprecise (if not) approach - in the first instance we assert the lock is held using lockdep and thus whether we own it. If we do own it, then the check is complete, otherwise we must check for the VMA read lock being held (VMA write lock implies mmap write lock so the mmap lock suffices for this). If lockdep is not enabled we simply check if the mmap lock is held and risk a false negative (i.e. not asserting when we should do). There are a couple places in the kernel where we already do this stabliisation check - the anon_vma_name() helper in mm/madvise.c and vma_flag_set_atomic() in include/linux/mm.h, which we update to use vma_assert_stabilised(). This change abstracts these into vma_assert_stabilised(), uses lockdep if possible, and avoids a duplicate check of whether the mmap lock is held. This is also self-documenting and lays the foundations for further VMA stability checks in the code. The only functional change here is adding the lockdep check. Link: https://lkml.kernel.org/r/6c9e64bb2b56ddb6f806fde9237f8a00cb3a776b.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: update vma_assert_locked() to use lockdepLorenzo Stoakes-8/+48
We can use lockdep to avoid unnecessary work here, otherwise update the code to logically evaluate all pertinent cases and share code with vma_assert_write_locked(). Make it clear here that we treat the VMA being detached at this point as a bug, this was only implicit before. Additionally, abstract references to vma->vmlock_dep_map by introducing a macro helper __vma_lockdep_map() which accesses this field if lockdep is enabled. Since lock_is_held() is specified as an extern function if lockdep is disabled, we can simply have __vma_lockdep_map() defined as NULL in this case, and then use IS_ENABLED(CONFIG_LOCKDEP) to avoid ugly ifdeffery. [lorenzo.stoakes@oracle.com: add helper macro __vma_lockdep_map(), per Vlastimil] Link: https://lkml.kernel.org/r/7c4b722e-604b-4b20-8e33-03d2f8d55407@lucifer.local Link: https://lkml.kernel.org/r/538762f079cc4fa76ff8bf30a8a9525a09961451.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: improve and document __is_vma_write_locked()Lorenzo Stoakes-21/+24
We don't actually need to return an output parameter providing mm sequence number, rather we can separate that out into another function - __vma_raw_mm_seqnum() - and have any callers which need to obtain that invoke that instead. The access to the raw sequence number requires that we hold the exclusive mmap lock such that we know we can't race vma_end_write_all(), so move the assert to __vma_raw_mm_seqnum() to make this requirement clear. Also while we're here, convert all of the VM_BUG_ON_VMA()'s to VM_WARN_ON_ONCE_VMA()'s in line with the convention that we do not invoke oopses when we can avoid it. [lorenzo.stoakes@oracle.com: minor tweaks, per Vlastimil] Link: https://lkml.kernel.org/r/3fa89c13-232d-4eee-86cc-96caa75c2c67@lucifer.local Link: https://lkml.kernel.org/r/ef6c415c2d2c03f529dca124ccaed66bc2f60edc.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: introduce helper struct + thread through exclusive lock fnsLorenzo Stoakes-8/+29
It is confusing to have __vma_start_exclude_readers() return 0, 1 or an error (but only when waiting for readers in TASK_KILLABLE state), and having the return value be stored in a stack variable called 'locked' is further confusion. More generally, we are doing a lot of rather finnicky things during the acquisition of a state in which readers are excluded and moving out of this state, including tracking whether we are detached or not or whether an error occurred. We are implementing logic in __vma_start_exclude_readers() that effectively acts as if 'if one caller calls us do X, if another then do Y', which is very confusing from a control flow perspective. Introducing the shared helper object state helps us avoid this, as we can now handle the 'an error arose but we're detached' condition correctly in both callers - a warning if not detaching, and treating the situation as if no error arose in the case of a VMA detaching. This also acts to help document what's going on and allows us to add some more logical debug asserts. Also update vma_mark_detached() to add a guard clause for the likely 'already detached' state (given we hold the mmap write lock), and add a comment about ephemeral VMA read lock reference count increments to clarify why we are entering/exiting an exclusive locked state here. Finally, separate vma_mark_detached() into its fast-path component and make it inline, then place the slow path for excluding readers in mmap_lock.c. No functional change intended. [akpm@linux-foundation.org: fix function naming in comments, add comment per Vlastimil per Lorenzo] Link: https://lkml.kernel.org/r/7d3084d596c84da10dd374130a5055deba6439c0.1769198904.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/7d3084d596c84da10dd374130a5055deba6439c0.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: clean up __vma_enter/exit_locked()Lorenzo Stoakes-2/+2
These functions are very confusing indeed. 'Entering' a lock could be interpreted as acquiring it, but this is not what these functions are interacting with. Equally they don't indicate at all what kind of lock we are 'entering' or 'exiting'. Finally they are misleading as we invoke these functions when we already hold a write lock to detach a VMA. These functions are explicitly simply 'entering' and 'exiting' a state in which we hold the EXCLUSIVE lock in order that we can either mark the VMA as being write-locked, or mark the VMA detached. Rename the functions accordingly, and also update __vma_end_exclude_readers() to return detached state with a __must_check directive, as it is simply clumsy to pass an output pointer here to detached state and inconsistent vs. __vma_start_exclude_readers(). Finally, remove the unnecessary 'inline' directives. No functional change intended. Link: https://lkml.kernel.org/r/33273be9389712347d69987c408ca7436f0c1b22.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: add+use vma lockdep acquire/release definesLorenzo Stoakes-3/+34
The code is littered with inscrutable and duplicative lockdep incantations, replace these with defines which explain what is going on and add commentary to explain what we're doing. If lockdep is disabled these become no-ops. We must use defines so _RET_IP_ remains meaningful. These are self-documenting and aid readability of the code. Additionally, instead of using the confusing rwsem_*() form for something that is emphatically not an rwsem, we instead explicitly use lock_[acquired, release]_shared/exclusive() lockdep invocations since we are doing something rather custom here and these make more sense to use. No functional change intended. Link: https://lkml.kernel.org/r/fdae72441949ecf3b4a0ed3510da803e881bb153.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31mm/vma: rename is_vma_write_only(), separate out shared refcount putLorenzo Stoakes-13/+53
The is_vma_writer_only() function is misnamed - this isn't determining if there is only a write lock, as it checks for the presence of the VM_REFCNT_EXCLUDE_READERS_FLAG. Really, it is checking to see whether readers are excluded, with a possibility of a false positive in the case of a detachment (there we expect the vma->vm_refcnt to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1). Rename the function accordingly. Relatedly, we use a __refcount_dec_and_test() primitive directly in vma_refcount_put(), using the old value to determine what the reference count ought to be after the operation is complete (ignoring racing reference count adjustments). Wrap this into a __vma_refcount_put_return() function, which we can then utilise in vma_mark_detached() and thus keep the refcount primitive usage abstracted. This function, as the name implies, returns the value after the reference count has been updated. This reduces duplication in the two invocations of this function. Also adjust comments, removing duplicative comments covered elsewhere and adding more to aid understanding. No functional change intended. Link: https://lkml.kernel.org/r/32053580bff460eb1092ef780b526cefeb748bad.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>