<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v5.6</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.6</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.6'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-01-31T18:30:36Z</updated>
<entry>
<title>memcg: fix a crash in wb_workfn when a device disappears</title>
<updated>2020-01-31T18:30:36Z</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2020-01-31T06:11:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=68f23b89067fdf187763e75a56087550624fdbee'/>
<id>urn:sha1:68f23b89067fdf187763e75a56087550624fdbee</id>
<content type='text'>
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.

With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi.  There is a refcount which prevents
the bdi object from being released (and hence, unregistered).  So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).

Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly.  It does
this without informing the file system, or the memcg code, or anything
else.  This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown.  So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb-&gt;bdi-&gt;dev to fetch the device name, but
unfortunately bdi-&gt;dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk().  As a result, *boom*.

Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi-&gt;dev and bdi-&gt;owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi-&gt;dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.

The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on.  It was triggering
several times a day in a heavily loaded production environment.

Google Bug Id: 145475544

Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead</title>
<updated>2019-11-08T20:37:24Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-11-08T20:18:29Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=65de03e251382306a4575b1779c57c87889eee49'/>
<id>urn:sha1:65de03e251382306a4575b1779c57c87889eee49</id>
<content type='text'>
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead.  This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (ie. when blkcg got
disabled / enabled higher up in the hierarchy).

Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups.  When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.

While c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping
has no dirty pages") alleviated the issue significantly as it now only
triggers when the inode has dirty pages.  However, the condition can
still be triggered before the inode is switched to a different cgroup
and the logic simply doesn't make sense.

Skip the immediate switching if the associated memcg is dying.

This is a simplified version of the following two patches:

 * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
 * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz

Cc: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Fixes: e8a7abf5a5bd ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>fs/fs-writeback.c: fix kernel-doc warning</title>
<updated>2019-10-14T22:04:01Z</updated>
<author>
<name>Randy Dunlap</name>
<email>rdunlap@infradead.org</email>
</author>
<published>2019-10-14T21:12:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b46ec1da5eb7d728938c8d115c4c291c7c71a98d'/>
<id>urn:sha1:b46ec1da5eb7d728938c8d115c4c291c7c71a98d</id>
<content type='text'>
Fix kernel-doc warning in fs/fs-writeback.c:

  fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id'

Link: http://lkml.kernel.org/r/756645ac-0ce8-d47e-d30a-04d9e4923a4f@infradead.org
Fixes: d62241c7a406 ("writeback, memcg: Implement cgroup_writeback_by_id()")
Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: fix use-after-free in finish_writeback_work()</title>
<updated>2019-10-07T22:47:19Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-10-07T00:58:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8e00c4e9dd852f7a9bf12234fad65a2f2f93788f'/>
<id>urn:sha1:8e00c4e9dd852f7a9bf12234fad65a2f2f93788f</id>
<content type='text'>
finish_writeback_work() reads @done-&gt;waitq after decrementing
@done-&gt;cnt.  However, once @done-&gt;cnt reaches zero, @done may be freed
(from stack) at any moment and @done-&gt;waitq can contain something
unrelated by the time finish_writeback_work() tries to read it.  This
led to the following crash.

  "BUG: kernel NULL pointer dereference, address: 0000000000000002"
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
  CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
  ...
  Workqueue: writeback wb_workfn (flush-btrfs-1)
  RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
  Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 &lt;f0&gt; 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
  RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
  R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
  Call Trace:
   __wake_up_common_lock+0x63/0xc0
   wb_workfn+0xd2/0x3e0
   process_one_work+0x1f5/0x3f0
   worker_thread+0x2d/0x3d0
   kthread+0x111/0x130
   ret_from_fork+0x1f/0x30

Fix it by reading and caching @done-&gt;waitq before decrementing
@done-&gt;cnt.

Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
Fixes: 5b9cce4c7eb069 ("writeback: Generalize and expose wb_completion")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Debugged-by: Chris Mason &lt;clm@fb.com&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.2+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: add tracepoints for cgroup foreign writebacks</title>
<updated>2019-08-30T13:42:49Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-08-29T22:47:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3a8e9ac89e6a5106cfb6b85d4c9cf9bfa3519bc7'/>
<id>urn:sha1:3a8e9ac89e6a5106cfb6b85d4c9cf9bfa3519bc7</id>
<content type='text'>
cgroup foreign inode handling has quite a bit of heuristics and
internal states which sometimes makes it difficult to understand
what's going on.  Add tracepoints to improve visibility.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback, memcg: Implement cgroup_writeback_by_id()</title>
<updated>2019-08-27T15:22:38Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-08-26T16:06:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d62241c7a406f0680d702bd974f6f17e28ab8e5d'/>
<id>urn:sha1:d62241c7a406f0680d702bd974f6f17e28ab8e5d</id>
<content type='text'>
Implement cgroup_writeback_by_id() which initiates cgroup writeback
from bdi and memcg IDs.  This will be used by memcg foreign inode
flushing.

v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
    spurious wbs.

v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort
    flushing while avoding possible livelocks.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback: Generalize and expose wb_completion</title>
<updated>2019-08-27T15:22:38Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-08-26T16:06:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5b9cce4c7eb0696558dfd4946074ae1fb9d8f05d'/>
<id>urn:sha1:5b9cce4c7eb0696558dfd4946074ae1fb9d8f05d</id>
<content type='text'>
wb_completion is used to track writeback completions.  We want to use
it from memcg side for foreign inode flushes.  This patch updates it
to remember the target waitq instead of assuming bdi-&gt;wb_waitq and
expose it outside of fs-writeback.c.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback, cgroup: inode_switch_wbs() shouldn't give up on wb_switch_rwsem trylock fail</title>
<updated>2019-08-15T19:30:46Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-08-02T19:08:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6444f47eb8678a43d5260c67b89c18b1ea09e79e'/>
<id>urn:sha1:6444f47eb8678a43d5260c67b89c18b1ea09e79e</id>
<content type='text'>
As inode wb switching may make sync(2) miss some inodes, they're
synchronized using wb_switch_rwsem so that no wb switching happens
while sync(2) is in progress.  In addition to synchronizing the actual
switching, the rwsem is also used to prevent queueing new switch
attempts while sync(2) is in progress.  This is to avoid queueing too
many instances while the rwsem is held by sync(2).  Unfortunately,
this is too agressive and can block wb switching for a long time if
sync(2) is frequent.

The goal is avoiding expolding the number of scheduled switches, not
avoiding scheduling anything.  Let's use wb_switch_rwsem only for
synchronizing the actual switching and sync(2) and use
isw_nr_in_flight instead for limiting the maximum number of scheduled
switches.  The limit is set to 1024 which should be more than enough
while still avoiding extreme situations.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback, cgroup: Adjust WB_FRN_TIME_CUT_DIV to accelerate foreign inode switching</title>
<updated>2019-08-15T19:30:44Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-08-15T19:25:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=55a694dffb7fd126b1e047aa46c437731d2700bb'/>
<id>urn:sha1:55a694dffb7fd126b1e047aa46c437731d2700bb</id>
<content type='text'>
WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
to ignore short writeback rounds to prevent getting confused by a
burst of short writebacks.  The parameter is currently 2 meaning that
anything smaller than half of the running average writback duration
will be ignored.

This is unnecessarily aggressive.  The detection logic uses 16 history
slots and is already reasonably protected against some short bursts
confusing it and the current parameter can lead to tens of seconds of
missed detection depending on the writeback pattern.

Let's change the parameter to 8, so that it only ignores writeback
with are smaller than 12.5% of the current running average.

v2: Add comment explaining what's going on with the foreign detection
    parameters.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blkcg, writeback: Add wbc-&gt;no_cgroup_owner</title>
<updated>2019-07-10T15:00:57Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-06-27T20:39:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=27b36d8fa81fa8274fb72f4eb1484026f6b6daa8'/>
<id>urn:sha1:27b36d8fa81fa8274fb72f4eb1484026f6b6daa8</id>
<content type='text'>
When writeback IOs are bounced through async layers, the IOs should
only be accounted against the wbc from the original bdi writeback to
avoid confusing cgroup inode ownership arbitration.  Add
wbc-&gt;no_cgroup_owner to allow disabling wbc cgroup owner accounting.
This will be used make btrfs compression work well with cgroup IO
control.

v2: Renamed from no_wbc_acct to no_cgroup_owner and added comment as
    per Jan.

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
