linux/kernel/rcu/tree.c, branch v6.3

rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period

2023-04-06T17:04:23Z

Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU kfree_rcu_cpu structure, which in turn contains pointers to kvfree_rcu_bulk_data structures that contain pointers to memory that has not yet been handed to RCU, along with an kfree_rcu_cpu_work structure that tracks the memory that has already been handed to RCU. These structures track three categories of memory: (1) Memory for kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived during an OOM episode. The first two categories are tracked in a cache-friendly manner involving a dynamically allocated page of pointers (the aforementioned kvfree_rcu_bulk_data structures), while the third uses a simple (but decidedly cache-unfriendly) linked list through the rcu_head structures in each block of memory. On a given CPU, these three categories are handled as a unit, with that CPU's kfree_rcu_cpu_work structure having one pointer for each of the three categories. Clearly, new memory for a given category cannot be placed in the corresponding kfree_rcu_cpu_work structure until any old memory has had its grace period elapse and thus has been removed. And the kfree_rcu_monitor() function does in fact check for this. Except that the kfree_rcu_monitor() function checks these pointers one at a time. This means that if the previous kfree_rcu() memory passed to RCU had only category 1 and the current one has only category 2, the kfree_rcu_monitor() function will send that current category-2 memory along immediately. This can result in memory being freed too soon, that is, out from under unsuspecting RCU readers. To see this, consider the following sequence of events, in which: o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is preempted. o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset" after a later grace period. Except that "from_cset" is freed right after the previous grace period ended, so that "from_cset" is immediately freed. Task A resumes and references "from_cset"'s member, after which nothing good happens. In full detail: CPU 0 CPU 1 ---------------------- ---------------------- count_memcg_event_mm() |rcu_read_lock() <--- |mem_cgroup_from_task() |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq comes, current task is scheduled out. cgroup_attach_task() |cgroup_migrate() |cgroup_migrate_execute() |css_set_move_task(task, from_cset, to_cset, true) |cgroup_move_task(task, to_cset) |rcu_assign_pointer(.., to_cset) |... |cgroup_migrate_finish() |put_css_set_locked(from_cset) |from_cset->refcount return 0 |kfree_rcu(cset, rcu_head) // free from_cset after new gp |add_ptr_to_bulk_krc_lock() |schedule_delayed_work(&krcp->monitor_work, ..) kfree_rcu_monitor() |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] |queue_rcu_work(system_wq, &krwp->rcu_work) |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp // There is a perious call_rcu(.., rcu_work_rcufn) // gp end, rcu_work_rcufn() is called. rcu_work_rcufn() |__queue_work(.., rwork->wq, &rwork->work); |kfree_rcu_work() |krwp->bulk_head_free[0] bulk is freed before new gp end!!! |The "from_cset" is freed before new gp end. // the task resumes some time later. |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. This commit therefore causes kfree_rcu_monitor() to refrain from moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU grace period has completed for all three categories. v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()") Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") Reported-by: Mukesh Ojha Signed-off-by: Ziwei Dai Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney

Merge branch 'stall.2023.01.09a' into HEAD

2023-02-03T00:40:07Z

stall.2023.01.09a: RCU CPU stall-warning updates.

Merge branches 'doc.2023.01.05a', 'fixes.2023.01.23a', 'kvfree.2023.01.03a', 'srcu.2023.01.03a', 'srcu-always.2023.02.02a', 'tasks.2023.01.03a', 'torture.2023.01.05a' and 'torturescript.2023.01.03a' into HEAD

2023-02-03T00:33:43Z

doc.2023.01.05a: Documentation update. fixes.2023.01.23a: Miscellaneous fixes. kvfree.2023.01.03a: kvfree_rcu() updates. srcu.2023.01.03a: SRCU updates. srcu-always.2023.02.02a: Finish making SRCU be unconditionally available. tasks.2023.01.03a: Tasks-RCU updates. torture.2023.01.05a: Torture-test updates. torturescript.2023.01.03a: Torture-test scripting updates.

rcu: Disable laziness if lazy-tracking says so

2023-01-24T00:51:29Z

During suspend, we see failures to suspend 1 in 300-500 suspends. Looking closer, it appears that asynchronous RCU callbacks are being queued as lazy even though synchronous callbacks are expedited. These delays appear to not be very welcome by the suspend/resume code as evidenced by these occasional suspend failures. This commit modifies call_rcu() to check if rcu_async_should_hurry(), which will return true if we are in suspend or in-kernel boot. [ paulmck: Alphabetize local variables. ] Ignoring the lazy hint makes the 3000 suspend/resume cycles pass reliably on a 12th gen 12-core Intel CPU, and there is some evidence that it also slightly speeds up boot performance. Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney

rcu: Track laziness during boot and suspend

2023-01-18T04:20:11Z

Boot and suspend/resume should not be slowed down in kernels built with CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such kernels. This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and rcu_async_should_hurry() functions that track whether or not either a boot or a suspend/resume operation is in progress. This will enable a later commit to refrain from laziness during those times. Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax() for later use by rcutorture. [ paulmck: Apply feedback from Steve Rostedt. ] Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney

rcu: Remove redundant call to rcu_boost_kthread_setaffinity()

2023-01-12T19:30:11Z

The rcu_boost_kthread_setaffinity() function is invoked at rcutree_online_cpu() and rcutree_offline_cpu() time, early in the online timeline and late in the offline timeline, respectively. It is also invoked from rcutree_dead_cpu(), however, in the absence of userspace manipulations (for which userspace must take responsibility), this call is redundant with that from rcutree_offline_cpu(). This redundancy can be demonstrated by printing out the relevant cpumasks This commit therefore removes the call to rcu_boost_kthread_setaffinity() from rcutree_dead_cpu(). Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney Reviewed-by: Frederic Weisbecker

rcu: Add RCU stall diagnosis information

2023-01-05T20:21:11Z

Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei Reviewed-by: Mukesh Ojha Reviewed-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney

rcu/kvfree: Split ready for reclaim objects from a batch

2023-01-04T01:48:41Z

This patch splits the lists of objects so as to avoid sending any through RCU that have already been queued for more than one grace period. These long-term-resident objects are immediately freed. The remaining short-term-resident objects are queued for later freeing using queue_rcu_work(). This change avoids delaying workqueue handlers with synchronize_rcu() invocations. Yes, workqueue handlers are designed to handle blocking, but avoiding blocking when unnecessary improves performance during low-memory situations. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney

rcu/kvfree: Carefully reset number of objects in krcp

2023-01-04T01:48:41Z

The schedule_delayed_monitor_work() function relies on the count of objects queued into any given kfree_rcu_cpu structure. This count is used to determine how quickly to schedule passing these objects to RCU. There are three pipes where pointers can be placed. When any pipe is offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero, which is wrong because the other pipes might still be non-empty. This commit therefore maintains per-pipe counters, and introduces a krc_count() helper to access the aggregate value of those counters. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney

rcu/kvfree: Use READ_ONCE() when access to krcp->head

2023-01-04T01:48:41Z

The need_offload_krc() function is now lock-free, which gives the compiler freedom to load old values from plain C-language loads from the kfree_rcu_cpu struture's ->head pointer. This commit therefore applied READ_ONCE() to these loads. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney