<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/time, branch v6.8</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.8</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.8'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2024-02-06T09:56:35Z</updated>
<entry>
<title>hrtimer: Report offline hrtimer enqueue</title>
<updated>2024-02-06T09:56:35Z</updated>
<author>
<name>Frederic Weisbecker</name>
<email>frederic@kernel.org</email>
</author>
<published>2024-01-29T23:56:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dad6a09f3148257ac1773cd90934d721d68ab595'/>
<id>urn:sha1:dad6a09f3148257ac1773cd90934d721d68ab595</id>
<content type='text'>
The hrtimer migration in the CPU-down hotplug process has been moved
earlier, before the CPU actually dies. This leaves a small window in
which an hrtimer can be queued in a blind spot and then be ignored.

For example, a practical case has been reported in which RCU wakes up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, thereby
queuing a sched/rt timer on the local offline CPU.

Make sure such situations never go unnoticed and warn when that happens.

Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
</content>
</entry>
<entry>
<title>tick/sched: Preserve number of idle sleeps across CPU hotplug events</title>
<updated>2024-01-25T08:52:40Z</updated>
<author>
<name>Tim Chen</name>
<email>tim.c.chen@linux.intel.com</email>
</author>
<published>2024-01-22T23:35:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9a574ea9069be30b835a3da772c039993c43369b'/>
<id>urn:sha1:9a574ea9069be30b835a3da772c039993c43369b</id>
<content type='text'>
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.

Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.

Preserve those fields too.
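
A minimal sketch of why the counters matter, using made-up numbers and
hypothetical names (avg_sleep_ns, sleeptime_ns, sleeps) rather than the
kernel's actual code: if the accumulated sleep time survives a hotplug
cycle but the sleep count is reset, the computed average is inflated.

```python
def avg_sleep_ns(idle_sleeptime_ns, idle_sleeps):
    # Average sleep time per idle sleep; 0.0 if nothing was counted yet.
    return idle_sleeptime_ns / idle_sleeps if idle_sleeps else 0.0


# Hypothetical counters before the CPU goes offline: 4 sleeps, 8 ms total.
sleeptime_ns, sleeps = 8_000_000, 4
before = avg_sleep_ns(sleeptime_ns, sleeps)

# After re-onlining, one more 2 ms sleep happens.
# Count reset to zero on hotplug, time preserved:
skewed = avg_sleep_ns(sleeptime_ns + 2_000_000, 0 + 1)
# Count preserved as well:
consistent = avg_sleep_ns(sleeptime_ns + 2_000_000, sleeps + 1)

print(before, skewed, consistent)
```

With these numbers the preserved count keeps the average at 2 ms, while
the reset count reports 10 ms for a single 2 ms sleep.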

Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
</content>
</entry>
<entry>
<title>clocksource: Skip watchdog check for large watchdog intervals</title>
<updated>2024-01-25T08:13:16Z</updated>
<author>
<name>Jiri Wiesner</name>
<email>jwiesner@suse.de</email>
</author>
<published>2024-01-22T17:23:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=644649553508b9bacf0fc7a5bdc4f9e0165576a5'/>
<id>urn:sha1:644649553508b9bacf0fc7a5bdc4f9e0165576a5</id>
<content type='text'>
There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:

  clocksource: timekeeping watchdog on CPU373:
  Marking clocksource 'tsc' as unstable because the skew is too large:
  clocksource:   'hpet' wd_nsec: 14523447520
  clocksource:   'tsc'  cs_nsec: 14524115132

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:

  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds.  The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportional
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ &gt;&gt; 1, which is 0.5 second.

The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use the default uncertainty_margin. When the
readout interval gets stretched, the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.

The root causes of the skew being directly proportional to the length of
the readout interval are:

  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
  * the imprecision of the conversion to nanoseconds for large readout
    intervals

Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second).  To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.
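
To put the reported numbers in perspective, the measured skew can be
expressed as a rate. The following is only an illustrative calculation
based on the values quoted above, not kernel code; the variable names
are made up for the example.

```python
# Values from the watchdog report quoted above, in nanoseconds.
WD_NSEC = 14523447520
CS_NSEC = 14524115132

# Limit attributed to NTP in the text: 500 microseconds of skew per second.
NTP_LIMIT_PPM = 500

skew_ns = abs(CS_NSEC - WD_NSEC)
interval_s = WD_NSEC / 1e9

# 1 ppm corresponds to 1 microsecond of skew per second of interval.
skew_ppm = (skew_ns / 1000) / interval_s

print(f"skew: {skew_ns} ns over {interval_s:.1f} s = {skew_ppm:.1f} ppm")
print(f"fraction of the NTP limit: {skew_ppm / NTP_LIMIT_PPM:.2f}")
```

The rate comes out at about 46 ppm, well below the 500 ppm that NTP is
said to tolerate, even though the absolute skew exceeded the fixed
400 microsecond threshold.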

As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well, and the scheduling issue should be looked into, but the
clocksource watchdog should be able to deal with it accordingly.

Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang &lt;feng.tang@intel.com&gt;
Signed-off-by: Jiri Wiesner &lt;jwiesner@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Tested-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Reviewed-by: Feng Tang &lt;feng.tang@intel.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
</content>
</entry>
<entry>
<title>Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2024-01-21T19:14:40Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-01-21T19:14:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4fbbed7872677b0a28ba8237169968171a61efbd'/>
<id>urn:sha1:4fbbed7872677b0a28ba8237169968171a61efbd</id>
<content type='text'>
Pull timer updates from Thomas Gleixner:
 "Updates for time and clocksources:

   - A fix for the idle and iowait time accounting vs CPU hotplug.

     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.

   - Assorted fixes and improvements for clocksource/event drivers"

* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
  clocksource/drivers/ep93xx: Fix error handling during probe
  clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
  clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
  clocksource/timer-riscv: Add riscv_clock_shutdown callback
  dt-bindings: timer: Add StarFive JH8100 clint
  dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
</content>
</entry>
<entry>
<title>tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug</title>
<updated>2024-01-19T15:40:38Z</updated>
<author>
<name>Heiko Carstens</name>
<email>hca@linux.ibm.com</email>
</author>
<published>2024-01-15T16:35:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=71fee48fb772ac4f6cfa63dbebc5629de8b4cc09'/>
<id>urn:sha1:71fee48fb772ac4f6cfa63dbebc5629de8b4cc09</id>
<content type='text'>
When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:

cpu  132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0

cpu  133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0

cpu  133 0 178 114431 33 6 6 21 0 0 &lt;---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0        &lt;---- idle + iowait start with 0

cpu  133 0 178 228956 47 6 6 21 0 0 &lt;---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0

The reason is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on whether a CPU is online or offline:

- if a CPU is online the values may be taken from its per cpu
  tick_cpu_sched structure

- if a CPU is offline the values are taken from its per cpu cpustat
  structure

The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.

Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.

It looks like this behavior was introduced with commit 4b0c0f294f60
("tick: Cleanup NOHZ per cpu data on cpu down").

This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").

Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.
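
A simplified model of the effect, with made-up numbers and hypothetical
names (Cpu, offline, run); it only mimics the source selection described
above and is not the kernel implementation. Time does not advance in the
model, so any change in the total is purely an accounting artifact.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    online: bool
    tick_sched_idle: int  # idle time kept in the per-CPU tick_sched data
    cpustat_idle: int     # idle time kept in the per-CPU cpustat data


def get_idle_time(cpu):
    # Online CPUs report the tick_sched value, offline CPUs the cpustat one.
    return cpu.tick_sched_idle if cpu.online else cpu.cpustat_idle


def offline(cpu, preserve):
    cpu.online = False
    if not preserve:
        # Old behaviour: the tick_sched data is zeroed on CPU offline.
        cpu.tick_sched_idle = 0


def total_idle(cpus):
    return sum(get_idle_time(c) for c in cpus)


def run(preserve):
    cpus = [Cpu(True, 1000, 1000), Cpu(True, 900, 900)]
    totals = [total_idle(cpus)]      # both CPUs online
    offline(cpus[1], preserve)
    totals.append(total_idle(cpus))  # cpu1 offline
    cpus[1].online = True
    totals.append(total_idle(cpus))  # cpu1 online again
    return totals


print(run(preserve=False))
print(run(preserve=True))
```

Without preservation the total drops once the CPU comes back online;
with the idle value preserved the total stays constant.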

Fixes: 4b0c0f294f60 ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Signed-off-by: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com

</content>
</entry>
<entry>
<title>Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2024-01-09T02:44:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-01-09T02:44:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f24dc33f8e0a765bf9bdf1c190ae5b9a23343d65'/>
<id>urn:sha1:f24dc33f8e0a765bf9bdf1c190ae5b9a23343d65</id>
<content type='text'>
Pull timer subsystem updates from Ingo Molnar:

 - Various preparatory cleanups &amp; enhancements of the timer-wheel code,
   in preparation for the WIP 'pull timers at expiry' timer migration
   model series (which will replace the current 'push timers at enqueue'
   migration model), by Anna-Maria Behnsen:

      - Update comments and clean up confusing variable names

      - Add debug check to warn about time travel

      - Improve/expand timer-wheel tracepoints

      - Optimize away unnecessary IPIs for deferrable timers

      - Restructure &amp; clean up next_expiry_recalc()

      - Clean up forward_timer_base()

      - Introduce __forward_timer_base() and use it to simplify and
        micro-optimize get_next_timer_interrupt()

 - Restructure the get_next_timer_interrupt()'s idle logic for better
   readability and to enable a minor optimization.

 - Fix the nextevt calculation when no timers are pending

 - Fix the sysfs_get_uname() prototype declaration

* tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix nextevt calculation when no timers are pending
  timers: Rework idle logic
  timers: Use already existing function for forwarding timer base
  timers: Split out forward timer base functionality
  timers: Clarify check in forward_timer_base()
  timers: Move store of next event into __next_timer_interrupt()
  timers: Do not IPI for deferrable timers
  tracing/timers: Add tracepoint for tracking timer base is_idle flag
  tracing/timers: Enhance timer_start tracepoint
  tick-sched: Warn when next tick seems to be in the past
  tick/sched: Cleanup confusing variables
  tick-sched: Fix function names in comments
  time: Make sysfs_get_uname() function visible in header
</content>
</entry>
<entry>
<title>posix-timers: Get rid of [COMPAT_]SYS_NI() uses</title>
<updated>2023-12-21T05:30:27Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-12-19T23:26:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a4aebe936554dac6a91e5d091179c934f8325708'/>
<id>urn:sha1:a4aebe936554dac6a91e5d091179c934f8325708</id>
<content type='text'>
Only the posix timer system calls use this (when the posix timer support
is disabled, which does not actually happen in any normal case), because
they had debug code to print out a warning about missing system calls.

Get rid of that special case, and just use the standard COND_SYSCALL
interface that creates weak system call stubs that return -ENOSYS
when the system call does not exist.

This fixes a kCFI issue with the SYS_NI() hackery:

  CFI failure at int80_emulation+0x67/0xb0 (target: sys_ni_posix_timers+0x0/0x70; expected type: 0xb02b34d9)
  WARNING: CPU: 0 PID: 48 at int80_emulation+0x67/0xb0

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Reviewed-by: Sami Tolvanen &lt;samitolvanen@google.com&gt;
Tested-by: Sami Tolvanen &lt;samitolvanen@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>timers: Fix nextevt calculation when no timers are pending</title>
<updated>2023-12-20T15:49:39Z</updated>
<author>
<name>Anna-Maria Behnsen</name>
<email>anna-maria@linutronix.de</email>
</author>
<published>2023-12-01T09:26:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5'/>
<id>urn:sha1:da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5</id>
<content type='text'>
When no timer is queued into an empty timer base, the next_expiry will not
be updated. It was originally calculated as

  base-&gt;clk + NEXT_TIMER_MAX_DELTA

When the timer base stays empty long enough (&gt; NEXT_TIMER_MAX_DELTA), the
next_expiry value of the empty base suggests that there is a timer pending
soon. This may be more of a theoretical problem, but the fix doesn't hurt.

Use only base-&gt;next_expiry value as nextevt when timers are
pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
information is in place, update base-&gt;next_expiry value of the empty timer
base as well.
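
The selection rule can be sketched as follows. This is a simplified
model, not the kernel code: TimerBase and next_event are hypothetical
names, and the value of NEXT_TIMER_MAX_DELTA is an assumption made for
the illustration.

```python
from dataclasses import dataclass

NEXT_TIMER_MAX_DELTA = 2**30 - 1  # assumed value, for illustration only


@dataclass
class TimerBase:
    next_expiry: int
    timers_pending: bool


def next_event(base, jiffies):
    # Trust the stored expiry only while timers are actually pending.
    if base.timers_pending:
        return base.next_expiry
    # Empty base: compute relative to the current time and refresh the
    # stored value so that it cannot become stale.
    base.next_expiry = jiffies + NEXT_TIMER_MAX_DELTA
    return base.next_expiry


stale = TimerBase(next_expiry=100, timers_pending=False)
print(next_event(stale, jiffies=5000))
```

For the empty base the result is derived from the current jiffies value
instead of the stale stored expiry.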

Signed-off-by: Anna-Maria Behnsen &lt;anna-maria@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/r/20231201092654.34614-13-anna-maria@linutronix.de

</content>
</entry>
<entry>
<title>timers: Rework idle logic</title>
<updated>2023-12-20T15:49:39Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2023-12-01T09:26:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bb8caad5083f8fbba70faf41f1d3bab7cf09da6d'/>
<id>urn:sha1:bb8caad5083f8fbba70faf41f1d3bab7cf09da6d</id>
<content type='text'>
To improve readability of the code, split base-&gt;idle calculation and
expires calculation into separate parts. While at it, update the comment
about timer base idle marking.

Thereby the following subtle change happens if the next event is just one
jiffy ahead and the tick was already stopped: Originally base-&gt;is_idle
remains true in this situation. Now base-&gt;is_idle turns to false. This may
spare an IPI if a timer is enqueued remotely to an idle CPU that is going
to tick on the next jiffy.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Anna-Maria Behnsen &lt;anna-maria@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/r/20231201092654.34614-12-anna-maria@linutronix.de

</content>
</entry>
<entry>
<title>timers: Use already existing function for forwarding timer base</title>
<updated>2023-12-20T15:49:38Z</updated>
<author>
<name>Anna-Maria Behnsen</name>
<email>anna-maria@linutronix.de</email>
</author>
<published>2023-12-01T09:26:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7a39a5080ef0e3cf233d92165f6a778f08a08244'/>
<id>urn:sha1:7a39a5080ef0e3cf233d92165f6a778f08a08244</id>
<content type='text'>
There is an already existing function for forwarding the timer
base. Forwarding the timer base is implemented directly in
get_next_timer_interrupt() as well.

Remove the code duplication and invoke __forward_timer_base() instead.

Signed-off-by: Anna-Maria Behnsen &lt;anna-maria@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/r/20231201092654.34614-11-anna-maria@linutronix.de

</content>
</entry>
</feed>
