<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sys.c, branch v5.7</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.7</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.7'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-03-03T18:34:32Z</updated>
<entry>
<title>sys/sysinfo: Respect boottime inside time namespace</title>
<updated>2020-03-03T18:34:32Z</updated>
<author>
<name>Cyril Hrubis</name>
<email>chrubis@suse.cz</email>
</author>
<published>2020-03-03T15:06:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ecc421e05bab97cf3ff4fe456ade47ef84dba8c2'/>
<id>urn:sha1:ecc421e05bab97cf3ff4fe456ade47ef84dba8c2</id>
<content type='text'>
The sysinfo() syscall includes uptime in seconds but has no correction for
time namespaces which makes it inconsistent with the /proc/uptime inside of
a time namespace.

Add the missing time namespace adjustment call.

Signed-off-by: Cyril Hrubis &lt;chrubis@suse.cz&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Dmitry Safonov &lt;dima@arista.com&gt;
Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz

</content>
</entry>
<entry>
<title>prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim</title>
<updated>2020-01-28T09:09:51Z</updated>
<author>
<name>Mike Christie</name>
<email>mchristi@redhat.com</email>
</author>
<published>2019-11-12T00:19:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8d19f1c8e1937baf74e1962aae9f90fa3aeab463'/>
<id>urn:sha1:8d19f1c8e1937baf74e1962aae9f90fa3aeab463</id>
<content type='text'>
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
amd nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace deamons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want execute IO to the device.

Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory.  Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.

[ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195]  ? __schedule+0x29b/0x630
[ 1626.609197]  ? wait_for_completion+0xe0/0x170
[ 1626.609198]  schedule+0x30/0xb0
[ 1626.609200]  schedule_timeout+0x1f6/0x2f0
[ 1626.609202]  ? blk_finish_plug+0x21/0x2e
[ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206]  ? wait_for_completion+0xe0/0x170
[ 1626.609208]  wait_for_completion+0x108/0x170
[ 1626.609210]  ? wake_up_q+0x70/0x70
[ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214]  ? xfs_bwrite+0x25/0x60
[ 1626.609215]  xfs_buf_iowait+0x22/0xf0
[ 1626.609218]  __xfs_buf_submit+0x12e/0x250
[ 1626.609220]  xfs_bwrite+0x25/0x60
[ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228]  super_cache_scan+0x152/0x1a0
[ 1626.609231]  do_shrink_slab+0x12c/0x2d0
[ 1626.609233]  shrink_slab+0x9c/0x2a0
[ 1626.609235]  shrink_node+0xd7/0x470
[ 1626.609237]  do_try_to_free_pages+0xbf/0x380
[ 1626.609240]  try_to_free_pages+0xd9/0x1f0
[ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251]  ? ___slab_alloc+0x238/0x560
[ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259]  skb_page_frag_refill+0x97/0xd0
[ 1626.609274]  sk_page_frag_refill+0x1d/0x80
[ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304]  tcp_sendmsg+0x27/0x40
[ 1626.609307]  sock_sendmsg+0x54/0x60
[ 1626.609308]  ___sys_sendmsg+0x29f/0x320
[ 1626.609313]  ? sock_poll+0x66/0xb0
[ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
[ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
[ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334]  ? ep_poll+0x26c/0x4a0
[ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339]  ? release_sock+0x43/0x90
[ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342]  __sys_sendmsg+0x47/0x80
[ 1626.609347]  do_syscall_64+0x5f/0x1c0
[ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to prevent from being throttled while
writing out data to free up memory.

Signed-off-by: Mike Christie &lt;mchristi@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Tested-by: Masato Suzuki &lt;masato.suzuki@wdc.com&gt;
Reviewed-by: Damien Le Moal &lt;damien.lemoal@wdc.com&gt;
Reviewed-by: Bart Van Assche &lt;bvanassche@acm.org&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>kernel/sys.c: avoid copying possible padding bytes in copy_to_user</title>
<updated>2019-12-05T03:44:12Z</updated>
<author>
<name>Joe Perches</name>
<email>joe@perches.com</email>
</author>
<published>2019-12-05T00:50:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5e1aada08cd19ea652b2d32a250501d09b02ff2e'/>
<id>urn:sha1:5e1aada08cd19ea652b2d32a250501d09b02ff2e</id>
<content type='text'>
Initialization is not guaranteed to zero padding bytes so use an
explicit memset instead to avoid leaking any kernel content in any
possible padding bytes.

Link: http://lkml.kernel.org/r/dfa331c00881d61c8ee51577a082d8bebd61805c.camel@perches.com
Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Cc: Dan Carpenter &lt;error27@gmail.com&gt;
Cc: Julia Lawall &lt;julia.lawall@lip6.fr&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>y2038: rusage: use __kernel_old_timeval</title>
<updated>2019-11-15T13:38:29Z</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2019-10-25T20:46:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bdd565f817a74b9e30edec108f7cb1dbc762b8a6'/>
<id>urn:sha1:bdd565f817a74b9e30edec108f7cb1dbc762b8a6</id>
<content type='text'>
There are two 'struct timeval' fields in 'struct rusage'.

Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.

While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.

In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.

Reviewed-by: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
</content>
</entry>
<entry>
<title>Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2019-09-17T19:35:15Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-09-17T19:35:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7f2444d38f6bbfa12bc15e2533d8f9daa85ca02b'/>
<id>urn:sha1:7f2444d38f6bbfa12bc15e2533d8f9daa85ca02b</id>
<content type='text'>
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
</content>
</entry>
<entry>
<title>Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2019-09-17T01:47:53Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-09-17T01:47:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=22331f895298bd23ca9f99f6a237aae883c9e1c7'/>
<id>urn:sha1:22331f895298bd23ca9f99f6a237aae883c9e1c7</id>
<content type='text'>
Pull x86 cpu-feature updates from Ingo Molnar:

 - Rework the Intel model names symbols/macros, which were decades of
   ad-hoc extensions and added random noise. It's now a coherent, easy
   to follow nomenclature.

 - Add new Intel CPU model IDs:
    - "Tiger Lake" desktop and mobile models
    - "Elkhart Lake" model ID
    - and the "Lightning Mountain" variant of Airmont, plus support code

 - Add the new AVX512_VP2INTERSECT instruction to cpufeatures

 - Remove Intel MPX user-visible APIs and the self-tests, because the
   toolchain (gcc) is not supporting it going forward. This is the
   first, lowest-risk phase of MPX removal.

 - Remove X86_FEATURE_MFENCE_RDTSC

 - Various smaller cleanups and fixes

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/cpu: Update init data for new Airmont CPU model
  x86/cpu: Add new Airmont variant to Intel family
  x86/cpu: Add Elkhart Lake to Intel family
  x86/cpu: Add Tiger Lake to Intel family
  x86: Correct misc typos
  x86/intel: Add common OPTDIFFs
  x86/intel: Aggregate microserver naming
  x86/intel: Aggregate big core graphics naming
  x86/intel: Aggregate big core mobile naming
  x86/intel: Aggregate big core client naming
  x86/cpufeature: Explain the macro duplication
  x86/ftrace: Remove mcount() declaration
  x86/PCI: Remove superfluous returns from void functions
  x86/msr-index: Move AMD MSRs where they belong
  x86/cpu: Use constant definitions for CPU models
  lib: Remove redundant ftrace flag removal
  x86/crash: Remove unnecessary comparison
  x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
  x86: Remove X86_FEATURE_MFENCE_RDTSC
  x86/mpx: Remove MPX APIs
  ...
</content>
</entry>
<entry>
<title>posix-cpu-timers: Get rid of zero checks</title>
<updated>2019-08-28T09:50:40Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2019-08-21T19:09:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2bbdbdae05167c688b6d3499a7dab74208b80a22'/>
<id>urn:sha1:2bbdbdae05167c688b6d3499a7dab74208b80a22</id>
<content type='text'>
Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:

	if (cache == 0 || new &lt; cache)
		cache = new;

Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:

	if (new &lt; cache)
		cache = new;

This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de

</content>
</entry>
<entry>
<title>rlimit: Rewrite non-sensical RLIMIT_CPU comment</title>
<updated>2019-08-28T09:50:40Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2019-08-21T19:09:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=24db4dd90dd53ad6e3331b6f01cb985e466cface'/>
<id>urn:sha1:24db4dd90dd53ad6e3331b6f01cb985e466cface</id>
<content type='text'>
The comment above the function which arms RLIMIT_CPU in the posix CPU timer
code makes no sense at all. It claims that the kernel does not return an
error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
bogus as the code does an error check and the rlimit is only set and
activated when the permission checks are ok. In case of a rejection an
appropriate error code is returned.

This is a historical and outdated comment which got dragged along even when
the rlimit handling code was rewritten.

Replace it with an explanation why the setup function is not called when
the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de

</content>
</entry>
<entry>
<title>arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused arguments</title>
<updated>2019-08-20T17:17:55Z</updated>
<author>
<name>Catalin Marinas</name>
<email>catalin.marinas@arm.com</email>
</author>
<published>2019-08-15T15:44:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3e91ec89f527b9870fe42dcbdb74fd389d123a95'/>
<id>urn:sha1:3e91ec89f527b9870fe42dcbdb74fd389d123a95</id>
<content type='text'>
Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and
arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored
for future extensions.

Acked-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: Introduce prctl() options to control the tagged user addresses ABI</title>
<updated>2019-08-06T17:08:45Z</updated>
<author>
<name>Catalin Marinas</name>
<email>catalin.marinas@arm.com</email>
</author>
<published>2019-07-23T17:58:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=63f0c60379650d82250f22e4cf4137ef3dc4f43d'/>
<id>urn:sha1:63f0c60379650d82250f22e4cf4137ef3dc4f43d</id>
<content type='text'>
It is not desirable to relax the ABI to allow tagged user addresses into
the kernel indiscriminately. This patch introduces a prctl() interface
for enabling or disabling the tagged ABI with a global sysctl control
for preventing applications from enabling the relaxed ABI (meant for
testing user-space prctl() return error checking without reconfiguring
the kernel). The ABI properties are inherited by threads of the same
application and fork()'ed children but cleared on execve(). A Kconfig
option allows the overall disabling of the relaxed ABI.

The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
MTE-specific settings like imprecise vs precise exceptions.

Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Signed-off-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
</feed>
