<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v4.19</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.19</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.19'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2018-05-03T22:11:37Z</updated>
<entry>
<title>bdi: Fix oops in wb_workfn()</title>
<updated>2018-05-03T22:11:37Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2018-05-03T16:26:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b8b784958eccbf8f51ebeee65282ca3fd59ea391'/>
<id>urn:sha1:b8b784958eccbf8f51ebeee65282ca3fd59ea391</id>
<content type='text'>
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb-&gt;bdi-&gt;dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:

	mod_delayed_work(bdi_wq, &amp;wb-&gt;dwork, 0);

and if this happens while wb_shutdown() is waiting in:

	flush_delayed_work(&amp;wb-&gt;dwork);

the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb-&gt;bdi-&gt;dev.

Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.

CC: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
CC: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
Reported-by: syzbot &lt;syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback: safer lock nesting</title>
<updated>2018-04-21T00:18:35Z</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2018-04-20T21:55:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2e898e4c0a3897ccd434adac5abb8330194f527b'/>
<id>urn:sha1:2e898e4c0a3897ccd434adac5abb8330194f527b</id>
<content type='text'>
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &amp;locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    &lt;interrupts mistakenly enabled&gt;
                                    &lt;interrupt&gt;
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    &lt;deadlock&gt;
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 &gt; a/memory.move_charge_at_immigrate
  echo 1 &gt; b/memory.move_charge_at_immigrate
  (
    echo $BASHPID &gt; a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &amp;
  while true; do
    sync
  done &amp;
  sleep 1h &amp;
  SLEEP=$!
  while true; do
    echo $SLEEP &gt; a/cgroup.procs
    echo $SLEEP &gt; b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Reported-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Wang Long &lt;wanglong19@meituan.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[v4.2+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>page cache: use xa_lock</title>
<updated>2018-04-11T17:28:39Z</updated>
<author>
<name>Matthew Wilcox</name>
<email>mawilcox@microsoft.com</email>
</author>
<published>2018-04-10T23:36:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b93b016313b3ba8003c3b8bb71f569af91f19fc7'/>
<id>urn:sha1:b93b016313b3ba8003c3b8bb71f569af91f19fc7</id>
<content type='text'>
Remove the address_space -&gt;tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space -&gt;page_tree to -&gt;i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Acked-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Cc: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ryusuke Konishi &lt;konishi.ryusuke@lab.ntt.co.jp&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: move I_DIRTY_INODE to fs.h</title>
<updated>2018-03-28T05:39:02Z</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2018-02-21T15:54:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0e11f6443f522f89509495b13ef1f3745640144d'/>
<id>urn:sha1:0e11f6443f522f89509495b13ef1f3745640144d</id>
<content type='text'>
And use it in a few more places rather than opencoding the values.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>writeback: update comment in inode_io_list_move_locked</title>
<updated>2018-01-06T16:18:00Z</updated>
<author>
<name>Wang Long</name>
<email>wanglong19@meituan.com</email>
</author>
<published>2017-12-05T12:23:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bbbc3c1cfaf6900d24e3c9fcaac25d267ad2bc40'/>
<id>urn:sha1:bbbc3c1cfaf6900d24e3c9fcaac25d267ad2bc40</id>
<content type='text'>
The @head can be wb-&gt;b_dirty_time, so update the comment.

Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Wang Long &lt;wanglong19@meituan.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>Rename superblock flags (MS_xyz -&gt; SB_xyz)</title>
<updated>2017-11-27T21:05:09Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-11-27T21:05:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1751e8a6cb935e555fcdbcb9ab4f0446e322ca3e'/>
<id>urn:sha1:1751e8a6cb935e555fcdbcb9ab4f0446e322ca3e</id>
<content type='text'>
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb-&gt;s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: merge try_to_writeback_inodes_sb_nr() into caller</title>
<updated>2017-10-10T14:14:37Z</updated>
<author>
<name>Rakesh Pandit</name>
<email>rakesh@tuxera.com</email>
</author>
<published>2017-10-09T10:34:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8264c3214f28b52b399d9e03bfa7feec275a0d71'/>
<id>urn:sha1:8264c3214f28b52b399d9e03bfa7feec275a0d71</id>
<content type='text'>
Since commit 925a6efb8ff0c ("Btrfs: stop using
try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
been used outside so stop exporting it.

In addition we merge it into try_to_writeback_inodes_sb() which is the
only caller.  Also change return type of try_to_writeback_inodes_sb to
void as the only user ext4 doesn't care.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Rakesh Pandit &lt;rakesh@tuxera.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback: eliminate work item allocation in bd_start_writeback()</title>
<updated>2017-10-04T17:24:12Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2017-09-30T08:09:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=85009b4f5f0399669a44f07cb9a5622c0e71d419'/>
<id>urn:sha1:85009b4f5f0399669a44f07cb9a5622c0e71d419</id>
<content type='text'>
Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.

After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback: only allow one inflight and pending full flush</title>
<updated>2017-10-03T14:38:17Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2017-09-28T17:31:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=aac8d41cd438f25bf3110fc6b98f1d16d7dbc169'/>
<id>urn:sha1:aac8d41cd438f25bf3110fc6b98f1d16d7dbc169</id>
<content type='text'>
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:

1) We consume a ton of memory just allocating writeback work items.
   We've seen as much as 600 million of these writeback work items
   pending. That's a lot of memory to pointlessly hold hostage,
   while the box is under memory pressure.

2) We spend so much time processing these work items, that we
   introduce a softlockup in writeback processing. This is because
   each of the writeback work items don't end up doing any work (it's
   hard when you have millions of identical ones coming in to the
   flush machinery), so we just sit in a tight loop pulling work
   items and deleting/freeing them.

Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.

This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.

Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Tested-by: Chris Mason &lt;clm@fb.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>writeback: move nr_pages == 0 logic to one location</title>
<updated>2017-10-03T14:38:17Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2017-09-28T17:31:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e8e8a0c6c9bfc0b320671166dd795f413f636773'/>
<id>urn:sha1:e8e8a0c6c9bfc0b320671166dd795f413f636773</id>
<content type='text'>
Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.

Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Tested-by: Chris Mason &lt;clm@fb.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
