<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v3.19</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.19</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.19'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2014-11-04T17:42:23Z</updated>
<entry>
<title>writeback: fix a subtle race condition in I_DIRTY clearing</title>
<updated>2014-11-04T17:42:23Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2014-10-24T19:38:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9c6ac78eb3521c5937b2dd8a7d1b300f41092f45'/>
<id>urn:sha1:9c6ac78eb3521c5937b2dd8a7d1b300f41092f45</id>
<content type='text'>
After invoking -&gt;dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode-&gt;i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set.  The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by itself.

And it sure enough was broken.  Please consider the following
scenario.

 CPU 0					CPU 1
 -------------------------------------------------------------------------------

					enters __writeback_single_inode()
					grabs inode-&gt;i_lock
					tests PAGECACHE_TAG_DIRTY which is clear
 enters __set_page_dirty()
 grabs mapping-&gt;tree_lock
 sets PAGECACHE_TAG_DIRTY
 releases mapping-&gt;tree_lock
 leaves __set_page_dirty()

 enters __mark_inode_dirty()
 smp_mb()
 sees I_DIRTY_PAGES set
 leaves __mark_inode_dirty()
					clears I_DIRTY_PAGES
					releases inode-&gt;i_lock

Now @inode has dirty pages w/ I_DIRTY_PAGES clear.  This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers inbetween, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.

The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness.  There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode.  Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from -&gt;dirty_inode().

Fix it by adding an explicit smp_mb() after I_DIRTY clearing.  Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side.  It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.

Also add comments explaining how and why the barriers are paired.

Lightly tested.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>sched: Remove proliferation of wait_on_bit() action functions</title>
<updated>2014-07-16T13:10:39Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-07-07T05:16:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=743162013d40ca612b4cb53d3a200dff2d9ab26e'/>
<id>urn:sha1:743162013d40ca612b4cb53d3a200dff2d9ab26e</id>
<content type='text'>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Acked-by: David Howells &lt;dhowells@redhat.com&gt; (fscache, keys)
Acked-by: Steven Whitehouse &lt;swhiteho@redhat.com&gt; (gfs2)
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Steve French &lt;sfrench@samba.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw</title>
<updated>2014-04-04T21:49:16Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-04-04T21:49:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=34917f9713905a937816ebb7ee5f25bef7a6441c'/>
<id>urn:sha1:34917f9713905a937816ebb7ee5f25bef7a6441c</id>
<content type='text'>
Pull GFS2 updates from Steven Whitehouse:
 "One of the main highlights this time, is not the patches themselves
  but instead the widening contributor base.  It is good to see that
  interest is increasing in GFS2, and I'd like to thank all the
  contributors to this patch set.

  In addition to the usual set of bug fixes and clean ups, there are
  patches to improve inode creation performance when xattrs are required
  and some improvements to the transaction code which is intended to
  help improve scalability after further changes in due course.

  Journal extent mapping is also updated to make it more efficient and
  again, this is a foundation for future work in this area.

  The maximum number of ACLs has been increased to 300 (for a 4k block
  size) which means that even with a few additional xattrs from selinux,
  everything should fit within a single fs block.

  There is also a patch to bring GFS2's own copy of the writepages code
  up to the same level as the core VFS.  Eventually we may be able to
  merge some of this code, since it is fairly similar.

  The other major change this time, is bringing consistency to the
  printing of messages via fs_&lt;level&gt;, pr_&lt;level&gt; macros"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
  GFS2: Fix address space from page function
  GFS2: Fix uninitialized VFS inode in gfs2_create_inode
  GFS2: Fix return value in slot_get()
  GFS2: inline function gfs2_set_mode
  GFS2: Remove extraneous function gfs2_security_init
  GFS2: Increase the max number of ACLs
  GFS2: Re-add a call to log_flush_wait when flushing the journal
  GFS2: Ensure workqueue is scheduled after noexp request
  GFS2: check NULL return value in gfs2_ok_to_move
  GFS2: Convert gfs2_lm_withdraw to use fs_err
  GFS2: Use fs_&lt;level&gt; more often
  GFS2: Use pr_&lt;level&gt; more consistently
  GFS2: Move recovery variables to journal structure in memory
  GFS2: global conversion to pr_foo()
  GFS2: return -E2BIG if hit the maximum limits of ACLs
  GFS2: Clean up journal extent mapping
  GFS2: replace kmalloc - __vmalloc / memset 0
  GFS2: Remove extra "if" in gfs2_log_flush()
  fs: NULL dereference in posix_acl_to_xattr()
  GFS2: Move log buffer accounting to transaction
  ...
</content>
</entry>
<entry>
<title>bdi: avoid oops on device removal</title>
<updated>2014-04-03T23:20:49Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2014-04-03T21:46:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555'/>
<id>urn:sha1:5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555</id>
<content type='text'>
After commit 839a8e8660b6 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() -&gt;
set_worker_desc() because bdi-&gt;dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Derek Basehore &lt;dbasehore@chromium.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>backing_dev: fix hung task on sync</title>
<updated>2014-04-03T23:20:49Z</updated>
<author>
<name>Derek Basehore</name>
<email>dbasehore@chromium.org</email>
</author>
<published>2014-04-03T21:46:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e'/>
<id>urn:sha1:6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e</id>
<content type='text'>
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660b6 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore &lt;dbasehore@chromium.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Alexander Viro &lt;viro@zento.linux.org.uk&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: "Darrick J. Wong" &lt;darrick.wong@oracle.com&gt;
Cc: Derek Basehore &lt;dbasehore@chromium.org&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Benson Leung &lt;bleung@chromium.org&gt;
Cc: Sonny Rao &lt;sonnyrao@chromium.org&gt;
Cc: Luigi Semenzato &lt;semenzato@chromium.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "writeback: do not sync data dirtied after sync start"</title>
<updated>2014-02-22T01:02:28Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2014-02-21T10:19:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0dc83bd30b0bf5410c0933cfbbf8853248eff0a9'/>
<id>urn:sha1:0dc83bd30b0bf5410c0933cfbbf8853248eff0a9</id>
<content type='text'>
This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
Chinner &lt;david@fromorbit.com&gt; has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # &gt;= 3.13
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
</entry>
<entry>
<title>GFS2: journal data writepages update</title>
<updated>2014-02-06T15:47:47Z</updated>
<author>
<name>Steven Whitehouse</name>
<email>swhiteho@redhat.com</email>
</author>
<published>2014-02-06T15:47:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=774016b2d455017935b3e318b6cc4e055e9dd47f'/>
<id>urn:sha1:774016b2d455017935b3e318b6cc4e055e9dd47f</id>
<content type='text'>
GFS2 has carried what is more or less a copy of the
write_cache_pages() for some time. It seems that this
copy has slipped behind the core code over time. This
patch brings it back uptodate, and in addition adds the
tracepoint which would otherwise be missing.

We could go further, and eliminate some or all of the
code duplication here. The issue is that if we do that,
then the function we need to split out from the existing
write_cache_pages(), which will look a lot like
gfs2_jdata_write_pagevec(), would land up putting quite a
lot of extra variables on the stack. I know that has been
a problem in the past in the writeback code path, which
is why I've hesitated to do it here.

Signed-off-by: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
</content>
</entry>
<entry>
<title>writeback: Fix data corruption on NFS</title>
<updated>2013-12-13T20:21:26Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2013-12-13T20:21:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f9b0e058cbd04ada76b13afffa7e1df830543c24'/>
<id>urn:sha1:f9b0e058cbd04ada76b13afffa7e1df830543c24</id>
<content type='text'>
Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added
a condition to skip clean inode. However this is wrong in WB_SYNC_ALL
mode because there we also want to wait for outstanding writeback on
possibly clean inode. This was causing occasional data corruption issues
on NFS because it uses sync_inode() to make sure all outstanding writes
are flushed to the server before truncating the inode and with
sync_inode() returning prematurely file was sometimes extended back
by an outstanding write after it was truncated.

So modify the test to also check for pages under writeback in
WB_SYNC_ALL mode.

CC: stable@vger.kernel.org # &gt;= 3.5
Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93
Reported-and-tested-by: Dan Duval &lt;dan.duval@oracle.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (patches from Andrew Morton)</title>
<updated>2013-11-13T06:45:43Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-11-13T06:45:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5cbb3d216e2041700231bcfc383ee5f8b7fc8b74'/>
<id>urn:sha1:5cbb3d216e2041700231bcfc383ee5f8b7fc8b74</id>
<content type='text'>
Merge first patch-bomb from Andrew Morton:
 "Quite a lot of other stuff is banked up awaiting further
  next-&gt;mainline merging, but this batch contains:

   - Lots of random misc patches
   - OCFS2
   - Most of MM
   - backlight updates
   - lib/ updates
   - printk updates
   - checkpatch updates
   - epoll tweaking
   - rtc updates
   - hfs
   - hfsplus
   - documentation
   - procfs
   - update gcov to gcc-4.7 format
   - IPC"

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;: (269 commits)
  ipc, msg: fix message length check for negative values
  ipc/util.c: remove unnecessary work pending test
  devpts: plug the memory leak in kill_sb
  ./Makefile: export initial ramdisk compression config option
  init/Kconfig: add option to disable kernel compression
  drivers: w1: make w1_slave::flags long to avoid memory corruption
  drivers/w1/masters/ds1wm.cuse dev_get_platdata()
  drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
  drivers/memstick/core/mspro_block.c: fix attributes array allocation
  drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
  kernel/panic.c: reduce 1 byte usage for print tainted buffer
  gcov: reuse kbasename helper
  kernel/gcov/fs.c: use pr_warn()
  kernel/module.c: use pr_foo()
  gcov: compile specific gcov implementation based on gcc version
  gcov: add support for gcc 4.7 gcov format
  gcov: move gcov structs definitions to a gcc version specific file
  kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
  kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
  kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
  ...
</content>
</entry>
<entry>
<title>writeback: do not sync data dirtied after sync start</title>
<updated>2013-11-13T03:09:07Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2013-11-12T23:07:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c4a391b53a72d2df4ee97f96f78c1d5971b47489'/>
<id>urn:sha1:c4a391b53a72d2df4ee97f96f78c1d5971b47489</id>
<content type='text'>
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
(e.g.  causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.

The following script reproduces the problem:

  function run_writers
  {
    for (( i = 0; i &lt; 10; i++ )); do
      mkdir $1/dir$i
      for (( j = 0; j &lt; 40000; j++ )); do
        dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &amp;&gt;/dev/null
      done &amp;
    done
  }

  for dir in "$@"; do
    run_writers $dir
  done

  sleep 40
  time sync

Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.

To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426.  With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
