<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched/rt.c, branch v5.17</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.17</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.17'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2021-12-07T14:14:10Z</updated>
<entry>
<title>sched/rt: Try to restart rt period timer when rt runtime exceeded</title>
<updated>2021-12-07T14:14:10Z</updated>
<author>
<name>Li Hua</name>
<email>hucool.lihua@huawei.com</email>
</author>
<published>2021-12-03T03:36:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9b58e976b3b391c0cf02e038d53dd0478ed3013c'/>
<id>urn:sha1:9b58e976b3b391c0cf02e038d53dd0478ed3013c</id>
<content type='text'>
When rt_runtime is modified from -1 to a valid control value, it may
cause the task to be throttled all the time. Operations like the following
will trigger the bug. E.g:

  1. echo -1 &gt; /proc/sys/kernel/sched_rt_runtime_us
  2. Run a FIFO task named A that executes while(1)
  3. echo 950000 &gt; /proc/sys/kernel/sched_rt_runtime_us

When rt_runtime is -1, The rt period timer will not be activated when task
A enqueued. And then the task will be throttled after setting rt_runtime to
950,000. The task will always be throttled because the rt period timer is
not activated.

Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period")
Reported-by: Hulk Robot &lt;hulkci@huawei.com&gt;
Signed-off-by: Li Hua &lt;hucool.lihua@huawei.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com
</content>
</entry>
<entry>
<title>sched/fair: Prevent dead task groups from regaining cfs_rq's</title>
<updated>2021-11-11T12:09:33Z</updated>
<author>
<name>Mathias Krause</name>
<email>minipli@grsecurity.net</email>
</author>
<published>2021-11-03T19:06:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b027789e5e50494c2325cc70c8642e7fd6059479'/>
<id>urn:sha1:b027789e5e50494c2325cc70c8642e7fd6059479</id>
<content type='text'>
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&amp;tg-&gt;list);                           :
 (1)  :                                                  list_for_each_entry_rcu(child, &amp;parent-&gt;children, siblings)
      :                                                    :
 (2)  list_del_rcu(&amp;tg-&gt;siblings);                         :
      :                                                    tg_unthrottle_up():
      unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg-&gt;cfs_rq[cpu_of(rq)];
        :                                                    :
        list_del_leaf_cfs_rq(tg-&gt;cfs_rq[cpu]);               :
        :                                                    :
        :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq-&gt;nr_running)
 (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
      :                                                      :
      :                                                    :
      :                                                  :
      :                                                :
      :                                              :
 (4)  :                                              rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

  [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
  [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy &lt;kevin.tanguy@corp.ovh.com&gt;
Signed-off-by: Mathias Krause &lt;minipli@grsecurity.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
</content>
</entry>
<entry>
<title>sched/rt: Support schedstats for RT sched class</title>
<updated>2021-10-05T13:51:51Z</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2021-09-05T14:35:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=57a5c2dafca8e3ce4f70e975a9c7727b66b5071f'/>
<id>urn:sha1:57a5c2dafca8e3ce4f70e975a9c7727b66b5071f</id>
<content type='text'>
We want to measure the latency of RT tasks in our production
environment with schedstats facility, but currently schedstats is only
supported for fair sched class. This patch enable it for RT sched class
as well.

After we make the struct sched_statistics and the helpers of it
independent of fair sched class, we can easily use the schedstats
facility for RT sched class.

The schedstat usage in RT sched class is similar with fair sched class,
for example,
                fair                        RT
enqueue         update_stats_enqueue_fair   update_stats_enqueue_rt
dequeue         update_stats_dequeue_fair   update_stats_dequeue_rt
put_prev_task   update_stats_wait_start     update_stats_wait_start_rt
set_next_task   update_stats_wait_end       update_stats_wait_end_rt

The user can get the schedstats information in the same way in fair sched
class. For example,
       fair                            RT
       /proc/[pid]/sched               /proc/[pid]/sched

schedstats is not supported for RT group.

The output of a RT task's schedstats as follows,
$ cat /proc/10349/sched
...
sum_sleep_runtime                            :           972.434535
sum_block_runtime                            :           960.433522
wait_start                                   :        188510.871584
sleep_start                                  :             0.000000
block_start                                  :             0.000000
sleep_max                                    :            12.001013
block_max                                    :           952.660622
exec_max                                     :             0.049629
slice_max                                    :             0.000000
wait_max                                     :             0.018538
wait_sum                                     :             0.424340
wait_count                                   :                   49
iowait_sum                                   :           956.495640
iowait_count                                 :                   24
nr_migrations_cold                           :                    0
nr_failed_migrations_affine                  :                    0
nr_failed_migrations_running                 :                    0
nr_failed_migrations_hot                     :                    0
nr_forced_migrations                         :                    0
nr_wakeups                                   :                   49
nr_wakeups_sync                              :                    0
nr_wakeups_migrate                           :                    0
nr_wakeups_local                             :                   49
nr_wakeups_remote                            :                    0
nr_wakeups_affine                            :                    0
nr_wakeups_affine_attempts                   :                    0
nr_wakeups_passive                           :                    0
nr_wakeups_idle                              :                    0
...

The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. The output of these tracepoints for a
RT tasks as follows,

- runtime
          stress-10352   [004] d.h.  1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns]
          [vruntime=0 means it is a RT task]

- wait
          &lt;idle&gt;-0       [004] dN..  1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns]

- blocked
     kworker/4:1-465     [004] dN..  1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns]

- iowait
     kworker/4:1-465     [004] dN..  1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns]

- sleep
           sleep-18194   [023] dN..  1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns]
           sleep-18196   [023] dN..  1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns]
           sleep-18197   [023] dN..  1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns]
           [ In sleep.sh, it sleeps 1 sec each time. ]

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com
</content>
</entry>
<entry>
<title>sched/rt: Support sched_stat_runtime tracepoint for RT sched class</title>
<updated>2021-10-05T13:51:49Z</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2021-09-05T14:35:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ed7b564cfdd0668efbd739d0b4e2d67797293f32'/>
<id>urn:sha1:ed7b564cfdd0668efbd739d0b4e2d67797293f32</id>
<content type='text'>
The runtime of a RT task has already been there, so we only need to
add a tracepoint.

One difference between fair task and RT task is that there is no vruntime
in RT task. To reuse the sched_stat_runtime tracepoint, '0' is passed as
vruntime for RT task.

The output of this tracepoint for RT task as follows,
          stress-9748    [039] d.h.   113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]

Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com
</content>
</entry>
<entry>
<title>sched: Make struct sched_statistics independent of fair sched class</title>
<updated>2021-10-05T13:51:45Z</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2021-09-05T14:35:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ceeadb83aea28372e54857bf88ab7e17af48ab7b'/>
<id>urn:sha1:ceeadb83aea28372e54857bf88ab7e17af48ab7b</id>
<content type='text'>
If we want to use the schedstats facility to trace other sched classes, we
should make it independent of fair sched class. The struct sched_statistics
is the schedular statistics of a task_struct or a task_group. So we can
move it into struct task_struct and struct task_group to achieve the goal.

After the patch, schestats are orgnized as follows,

    struct task_struct {
       ...
       struct sched_entity se;
       struct sched_rt_entity rt;
       struct sched_dl_entity dl;
       ...
       struct sched_statistics stats;
       ...
   };

Regarding the task group, schedstats is only supported for fair group
sched, and a new struct sched_entity_stats is introduced, suggested by
Peter -

    struct sched_entity_stats {
        struct sched_entity     se;
        struct sched_statistics stats;
    } __no_randomize_layout;

Then with the se in a task_group, we can easily get the stats.

The sched_statistics members may be frequently modified when schedstats is
enabled, in order to avoid impacting on random data which may in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.

As this patch changes the core struct of scheduler, so I verified the
performance it may impact on the scheduler with 'perf bench sched
pipe', suggested by Mel. Below is the result, in which all the values
are in usecs/op.
                                  Before               After
      kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
      kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
[These data is a little difference with the earlier version, that is
 because my old test machine is destroyed so I have to use a new
 different test machine.]

Almost no impact on the sched performance.

No functional change.

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
</content>
</entry>
<entry>
<title>sched/rt: Fix RT utilization tracking during policy change</title>
<updated>2021-06-22T14:41:59Z</updated>
<author>
<name>Vincent Donnefort</name>
<email>vincent.donnefort@arm.com</email>
</author>
<published>2021-06-21T10:37:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fecfcbc288e9f4923f40fd23ca78a6acdc7fdf6c'/>
<id>urn:sha1:fecfcbc288e9f4923f40fd23ca78a6acdc7fdf6c</id>
<content type='text'>
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the current
running task changes its policy, set_next_task_rt() which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see a such change, leaving the avg_rt structure outdated. When
that very same task will be dequeued later, put_prev_task_rt() will then
update the utilization, based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.

The signal would eventually recover from this issue after few ms. Even if
no RT tasks are run, avg_rt is also updated in __update_blocked_others().
But as the CPU capacity depends partly on the avg_rt, this issue has
nonetheless a significant impact on the scheduler.

Fix this issue by ensuring a load update when a running task changes
its policy to RT.

Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort &lt;vincent.donnefort@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
</content>
</entry>
<entry>
<title>sched: Introduce sched_class::pick_task()</title>
<updated>2021-05-12T09:43:28Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-11-17T23:19:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=21f56ffe4482e501b9e83737612493eeaac21f5a'/>
<id>urn:sha1:21f56ffe4482e501b9e83737612493eeaac21f5a</id>
<content type='text'>
Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
[Vineeth: folded fixes]
Signed-off-by: Vineeth Remanan Pillai &lt;viremana@linux.microsoft.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Don Hiatt &lt;dhiatt@digitalocean.com&gt;
Tested-by: Hongyu Ning &lt;hongyu.ning@linux.intel.com&gt;
Tested-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
</content>
</entry>
<entry>
<title>sched: Wrap rq::lock access</title>
<updated>2021-05-12T09:43:26Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-11-17T23:19:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5cb9eaa3d274f75539077a28cf01e3563195fa53'/>
<id>urn:sha1:5cb9eaa3d274f75539077a28cf01e3563195fa53</id>
<content type='text'>
In preparation of playing games with rq-&gt;lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Don Hiatt &lt;dhiatt@digitalocean.com&gt;
Tested-by: Hongyu Ning &lt;hongyu.ning@linux.intel.com&gt;
Tested-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
</content>
</entry>
<entry>
<title>sched: Fix various typos</title>
<updated>2021-03-21T23:11:52Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2021-03-18T12:38:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3b03706fa621ce31a3e9ef6307020fde4e6aae16'/>
<id>urn:sha1:3b03706fa621ce31a3e9ef6307020fde4e6aae16</id>
<content type='text'>
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Cc: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: linux-kernel@vger.kernel.org
</content>
</entry>
<entry>
<title>sched: Use task_current() instead of 'rq-&gt;curr == p'</title>
<updated>2021-01-14T10:20:11Z</updated>
<author>
<name>Hui Su</name>
<email>sh_def@163.com</email>
</author>
<published>2020-10-30T17:32:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=65bcf072e20ed7597caa902f170f293662b0af3c'/>
<id>urn:sha1:65bcf072e20ed7597caa902f170f293662b0af3c</id>
<content type='text'>
Use the task_current() function where appropriate.

No functional change.

Signed-off-by: Hui Su &lt;sh_def@163.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
</content>
</entry>
</feed>
