<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v4.5</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.5</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.5'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2016-03-03T21:42:50Z</updated>
<entry>
<title>writeback: flush inode cgroup wb switches instead of pinning super_block</title>
<updated>2016-03-03T21:42:50Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-29T23:28:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a1a0e23e49037c23ea84bc8cc146a03584d13577'/>
<id>urn:sha1:a1a0e23e49037c23ea84bc8cc146a03584d13577</id>
<content type='text'>
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>writeback: keep superblock pinned during cgroup writeback association switches</title>
<updated>2016-02-16T18:34:07Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-02-16T18:34:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ff8eaac1636bf6deae86491f4818c4c69d1a9ac'/>
<id>urn:sha1:5ff8eaac1636bf6deae86491f4818c4c69d1a9ac</id>
<content type='text'>
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.

 kernel BUG at fs/jbd2/transaction.c:319!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
 Hardware name: Google Google, BIOS Google 01/01/2011
 Workqueue: events inode_switch_wbs_work_fn
 task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
 RIP: 0010:[&lt;ffffffff803e6922&gt;]  [&lt;ffffffff803e6922&gt;] start_this_handle+0x382/0x3e0
 RSP: 0018:ffff880209267c30  EFLAGS: 00010202
 ...
 Call Trace:
  [&lt;ffffffff803e6be4&gt;] jbd2__journal_start+0xf4/0x190
  [&lt;ffffffff803cfc7e&gt;] __ext4_journal_start_sb+0x4e/0x70
  [&lt;ffffffff803b31ec&gt;] ext4_evict_inode+0x12c/0x3d0
  [&lt;ffffffff8035338b&gt;] evict+0xbb/0x190
  [&lt;ffffffff80354190&gt;] iput+0x130/0x190
  [&lt;ffffffff80360223&gt;] inode_switch_wbs_work_fn+0x343/0x4c0
  [&lt;ffffffff80279819&gt;] process_one_work+0x129/0x300
  [&lt;ffffffff80279b16&gt;] worker_thread+0x126/0x480
  [&lt;ffffffff8027ed14&gt;] kthread+0xc4/0xe0
  [&lt;ffffffff809771df&gt;] ret_from_fork+0x3f/0x70

Fix it by bumping s_active while cgroup association switching is in
flight.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Tahsin Erdogan &lt;tahsin@google.com&gt;
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page()</title>
<updated>2016-01-16T01:56:32Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2016-01-16T00:57:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=654a0dd0953fcd87ff7bbb468fb889f0eb67df33'/>
<id>urn:sha1:654a0dd0953fcd87ff7bbb468fb889f0eb67df33</id>
<content type='text'>
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs/writeback.c: fix kernel-doc warnings</title>
<updated>2015-11-09T23:11:24Z</updated>
<author>
<name>Randy Dunlap</name>
<email>rdunlap@infradead.org</email>
</author>
<published>2015-11-09T22:58:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dbce03b9e3e61e122451a7aa4e6900d5f0bb5993'/>
<id>urn:sha1:dbce03b9e3e61e122451a7aa4e6900d5f0bb5993</id>
<content type='text'>
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace.  Also #undef this macro at the end of
the function.

  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/filemap.c: make global sync not clear error status of individual inodes</title>
<updated>2015-11-06T03:34:48Z</updated>
<author>
<name>Junichi Nomura</name>
<email>j-nomura@ce.jp.nec.com</email>
</author>
<published>2015-11-06T02:47:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=aa750fd71c242dba02ee2034e15fbd7d0cdb2461'/>
<id>urn:sha1:aa750fd71c242dba02ee2034e15fbd7d0cdb2461</id>
<content type='text'>
filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura &lt;j-nomura@ce.jp.nec.com&gt;
Acked-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Fengguang Wu &lt;fengguang.wu@gmail.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()</title>
<updated>2015-10-28T12:17:30Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-10-27T05:19:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b33e18f61bd18227a456016a77b1a968f5bc1d65'/>
<id>urn:sha1:b33e18f61bd18227a456016a77b1a968f5bc1d65</id>
<content type='text'>
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi-&gt;wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: Lai Jiangshan &lt;jiangshanlai@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Patrick Marlier &lt;patrick.marlier@gmail.com&gt;
Cc: Paul McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: pranith kumar &lt;bobby.prani@gmail.com&gt;
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>writeback: bdi_writeback iteration must not skip dying ones</title>
<updated>2015-10-12T16:31:12Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-10-02T18:47:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b817525a4a80c04e4ca44192d97a1ffa9f2be572'/>
<id>urn:sha1:b817525a4a80c04e4ca44192d97a1ffa9f2be572</id>
<content type='text'>
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi-&gt;cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi-&gt;wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi-&gt;wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Artem Bityutskiy &lt;dedekind1@gmail.com&gt;
Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()</title>
<updated>2015-10-12T16:31:11Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-09-29T16:47:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6fdf860f15d4a6be8f0947bad608d687fe0c7af7'/>
<id>urn:sha1:6fdf860f15d4a6be8f0947bad608d687fe0c7af7</id>
<content type='text'>
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
unfortunately, it was always waking up bdi-&gt;wb instead of the wb being
walked.  Fix it.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
Reviewed-by: Jan Kara &lt;jack@suse.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>fs-writeback: unplug before cond_resched in writeback_sb_inodes</title>
<updated>2015-09-20T01:50:19Z</updated>
<author>
<name>Chris Mason</name>
<email>clm@fb.com</email>
</author>
<published>2015-09-18T17:35:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=590dca3a71875461e8fea3013af74386945191b2'/>
<id>urn:sha1:590dca3a71875461e8fea3013af74386945191b2</id>
<content type='text'>
Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
which increases the merge rate when relatively contiguous small files
are written by the filesystem.  It helps both on flash and spindles.

For an fs_mark workload creating 4K files in parallel across 8 drives,
this commit improves performance ~9% more by unplugging before calling
cond_resched().  cond_resched() doesn't trigger an implicit unplug, so
explicitly getting the IO down to the device before scheduling reduces
latencies for anyone waiting on clean pages.

It also cuts down on how often we use kblockd to unplug, which means
less work bouncing from one workqueue to another.

Many more details about how we got here:

  https://lkml.org/lkml/2015/9/11/570

Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: plug writeback in wb_writeback() and writeback_inodes_wb()</title>
<updated>2015-09-12T18:13:07Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-09-11T20:37:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=505a666ee3fc611518e85df203eb8c707995ceaa'/>
<id>urn:sha1:505a666ee3fc611518e85df203eb8c707995ceaa</id>
<content type='text'>
We had to revert the pluggin in writeback_sb_inodes() because the
wb-&gt;list_lock is held, but we could easily plug at a higher level before
taking that lock, and unplug after releasing it.  This does that.

Chris will run performance numbers, just to verify that this approach is
comparable to the alternative (we could just drop and re-take the lock
around the blk_finish_plug() rather than these two commits.

I'd have preferred waiting for actual performance numbers before picking
one approach over the other, but I don't want to release rc1 with the
known "sleeping function called from invalid context" issue, so I'll
pick this cleanup version for now.  But if the numbers show that we
really want to plug just at the writeback_sb_inodes() level, and we
should just play ugly games with the spinlock, we'll switch to that.

Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: Josef Bacik &lt;jbacik@fb.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
