<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v3.1</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.1</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.1'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2011-07-31T14:52:08Z</updated>
<entry>
<title>don't busy retry the inode on failed grab_super_passive()</title>
<updated>2011-07-31T14:52:08Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-07-30T04:14:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0e995816f4fb69cef602b7fe82da68ced6be3b41'/>
<id>urn:sha1:0e995816f4fb69cef602b7fe82da68ced6be3b41</id>
<content type='text'>
This fixes a soft lockup that occurs under the following conditions:

a) the flusher is executing a work item queued by __bdi_start_writeback(),
   while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), either
   of which grabs sb-&gt;s_umount and enqueues a new work item for the
   flusher to execute

The s_umount grabbed by (b) will fail the grab_super_passive() call in (a).
Then, if the inode is requeued, wb_writeback() will busy-retry it. As a
result, wb_writeback() loops forever without releasing wb-&gt;list_lock,
which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay the
writeback of the inode, but most likely it will be picked up soon by the
work queued by writeback_inodes_sb*(), sync_inodes_sb() or even
writeback_inodes_wb().
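
In sketch form, the change turns the requeue in __writeback_inodes_wb()
into a redirty (a reconstruction from the description above; the committed
patch may differ in detail):

	if (!grab_super_passive(sb)) {
		/*
		 * grab_super_passive() may fail consistently while
		 * someone else holds sb-&gt;s_umount. Don't requeue_io()
		 * the inode here -- that is the busy retry. Redirty it
		 * and move on instead.
		 */
		redirty_tail(inode, wb);
		continue;
	}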

bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

Reported-by: Christoph Hellwig &lt;hch@infradead.org&gt;
Tested-by: Christoph Hellwig &lt;hch@infradead.org&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback</title>
<updated>2011-07-26T17:39:54Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-07-26T17:39:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f01ef569cddb1a8627b1c6b3a134998ad1cf4b22'/>
<id>urn:sha1:f01ef569cddb1a8627b1c6b3a134998ad1cf4b22</id>
<content type='text'>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
  mm: properly reflect task dirty limits in dirty_exceeded logic
  writeback: don't busy retry writeback on new/freeing inodes
  writeback: scale IO chunk size up to half device bandwidth
  writeback: trace global_dirty_state
  writeback: introduce max-pause and pass-good dirty limits
  writeback: introduce smoothed global dirty limit
  writeback: consolidate variable names in balance_dirty_pages()
  writeback: show bdi write bandwidth in debugfs
  writeback: bdi write bandwidth estimation
  writeback: account per-bdi accumulated written pages
  writeback: make writeback_control.nr_to_write straight
  writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
  writeback: trace event writeback_queue_io
  writeback: trace event writeback_single_inode
  writeback: remove .nonblocking and .encountered_congestion
  writeback: remove writeback_control.more_io
  writeback: skip balance_dirty_pages() for in-memory fs
  writeback: add bdi_dirty_limit() kernel-doc
  writeback: avoid extra sync work at enqueue time
  writeback: elevate queue_io() into wb_writeback()
  ...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c
</content>
</entry>
<entry>
<title>writeback: don't busy retry writeback on new/freeing inodes</title>
<updated>2011-07-24T02:46:51Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-07-12T06:08:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fcc5c22218a18509a7412bf074fc9a7a5d874a8a'/>
<id>urn:sha1:fcc5c22218a18509a7412bf074fc9a7a5d874a8a</id>
<content type='text'>
Fix a system hang bug introduced by commits b7a2441f9966 ("writeback:
remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
queue_io() into wb_writeback()"), easily reproducible under high memory
pressure with lots of file creations/deletions, for example a kernel
build in limited memory.

The hang happens when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
state: the flusher gets stuck busy retrying that inode and never releases
wb-&gt;list_lock, which in turn blocks all kinds of other tasks trying to
grab it.

As Jan put it, the change is safe regarding data integrity. I_FREEING and
I_WILL_FREE inodes are written back by iput_final() and it is the reclaim
code that is responsible for eventually removing them, so writeback code
can safely ignore them. I_NEW inodes move out of that state once they are
fully set up, and the writeback round following that will consider them
for writeback. So the change makes sense.
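
A sketch of the fix in the inode scan loop (a reconstruction from the
description above; the usual i_lock protocol applies):

	spin_lock(&amp;inode-&gt;i_lock);
	if (inode-&gt;i_state &amp; (I_NEW | I_FREEING | I_WILL_FREE)) {
		spin_unlock(&amp;inode-&gt;i_lock);
		/*
		 * Was requeue_io(), which busy retries the inode without
		 * ever dropping wb-&gt;list_lock. Redirty instead so the
		 * flusher moves on and revisits the inode later.
		 */
		redirty_tail(inode, wb);
		continue;
	}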

CC: Jan Kara &lt;jack@suse.cz&gt; 
Reported-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>superblock: move pin_sb_for_writeback() to fs/super.c</title>
<updated>2011-07-20T05:44:38Z</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2011-07-08T04:14:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=12ad3ab66103e6582ca69c0c9de18b13487eaaef'/>
<id>urn:sha1:12ad3ab66103e6582ca69c0c9de18b13487eaaef</id>
<content type='text'>
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does exactly this, so move it to fs/super.c, rename
it to grab_super_passive() and export it via fs/internal.h so that all the
VFS code can use it.
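
In sketch form, the renamed helper implements the two requirements above
(a reconstruction; the committed code may differ in detail):

	bool grab_super_passive(struct super_block *sb)
	{
		/* take a passive reference to the sb */
		spin_lock(&amp;sb_lock);
		if (list_empty(&amp;sb-&gt;s_instances)) {
			spin_unlock(&amp;sb_lock);
			return false;
		}
		sb-&gt;s_count++;
		spin_unlock(&amp;sb_lock);

		/* continue only if no unmount is in progress */
		if (down_read_trylock(&amp;sb-&gt;s_umount)) {
			if (sb-&gt;s_root)
				return true;
			up_read(&amp;sb-&gt;s_umount);
		}

		put_super(sb);
		return false;
	}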

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>writeback: scale IO chunk size up to half device bandwidth</title>
<updated>2011-07-10T05:09:03Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-08-29T19:28:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1a12d8bd7b2998be01ee55edb64e7473728abb9c'/>
<id>urn:sha1:1a12d8bd7b2998be01ee55edb64e7473728abb9c</id>
<content type='text'>
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk will
scale up to as many pages as the storage device can write within 500ms.

XFS is observed to do IO completions in batches, with the batch size equal
to the write chunk size. To avoid dirty pages suddenly dropping out of
balance_dirty_pages()'s dirty control scope and creating large
fluctuations, the chunk size is also limited to half the control scope.

The balance_dirty_pages() control scope is

	[(background_thresh + dirty_thresh) / 2, dirty_thresh]

which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.

The adaptive write chunk size will be rounded to the nearest 4MB
boundary.
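
In sketch form, the chunk size computation becomes something like this
(a reconstruction from the description above; MIN_WRITEBACK_PAGES is
assumed to be 4MB worth of pages and global_dirty_limit / DIRTY_SCOPE
to be half the control scope):

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		/* sync()/fsync() must write out everything anyway */
		if (work-&gt;sync_mode == WB_SYNC_ALL || work-&gt;tagged_writepages)
			return LONG_MAX;

		/* 500ms worth of writeout at the estimated bandwidth,
		 * capped to half the dirty control scope */
		pages = min(bdi-&gt;avg_write_bandwidth / 2,
			    global_dirty_limit / DIRTY_SCOPE);
		pages = min(pages, work-&gt;nr_pages);

		/* round to a 4MB boundary */
		return round_down(pages + MIN_WRITEBACK_PAGES,
				  MIN_WRITEBACK_PAGES);
	}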

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o &lt;tytso@mit.edu&gt;
CC: Dave Chinner &lt;david@fromorbit.com&gt;
CC: Chris Mason &lt;chris.mason@oracle.com&gt;
CC: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: introduce smoothed global dirty limit</title>
<updated>2011-07-10T05:09:02Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-03-02T21:54:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c42843f2f0bbc9d716a32caf667d18fc2bf3bc4c'/>
<id>urn:sha1:c42843f2f0bbc9d716a32caf667d18fc2bf3bc4c</id>
<content type='text'>
The start of a heavyweight application (e.g. KVM) may instantly knock down
determine_dirtyable_memory() if swap is disabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold,
with two policies:

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of a sudden drop in
dirtyable memory. It will be used in the next patch for two new types of
dirty limits. Note that the new dirty limits will not avoid throttling the
light dirtiers, but they can limit their sleep time to 200ms.
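
In sketch form, the two follow policies translate to an update step like
this (a reconstruction from the description above):

	static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
	{
		unsigned long limit = global_dirty_limit;

		/* follow up in one shot */
		if (limit &lt; thresh) {
			global_dirty_limit = thresh;
			return;
		}

		/*
		 * Follow downwards slowly. Track the higher of thresh and
		 * dirty so the limit is guaranteed to stay above the
		 * current number of dirty pages.
		 */
		thresh = max(thresh, dirty);
		if (limit &gt; thresh)
			global_dirty_limit = limit - ((limit - thresh) &gt;&gt; 5);
	}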

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: bdi write bandwidth estimation</title>
<updated>2011-07-10T05:09:01Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-08-29T17:22:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e98be2d599207c6b31e9bb340d52a231b2f3662d'/>
<id>urn:sha1:e98be2d599207c6b31e9bb340d52a231b2f3662d</id>
<content type='text'>
The estimate starts at 100MB/s and adapts to the real bandwidth within
seconds.

It tries to update the bandwidth only when the disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth reflects how fast the device can write out when
_fully utilized_, and won't drop to 0 when it goes idle: the value remains
constant while the disk is idle. At busy write time, fluctuations aside,
it will also remain high unless knocked down by concurrent reads that
compete with the async writes for disk time and bandwidth.

The estimation is not done purely in the flusher because there is no
guarantee that write_cache_pages() will return in time to update the
bandwidth.

The bdi-&gt;avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, though it may be a little biased in the long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable because the instantaneous bandwidth
cannot be measured meaningfully anyway, due to large fluctuations.

NFS commits can be as large as several seconds' worth of data. One XFS
completion may be as large as half a second's worth of data if we are
going to increase the write chunk to half a second's worth of data. In
ext4, fluctuations with a period of around 5 seconds are observed, and
SSD tests show another pattern with irregular periods of up to 20 seconds.

That's why we not only do the estimation at 200ms intervals, but also
average the samples over a period of 3 seconds and then apply another
level of smoothing in avg_write_bandwidth.
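
In sketch form, the two levels look like this (a reconstruction from the
description above; written_stamp is assumed to hold the written page
count at the previous update):

	static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
					       unsigned long elapsed,
					       unsigned long written)
	{
		const unsigned long period = roundup_pow_of_two(3 * HZ);
		unsigned long avg = bdi-&gt;avg_write_bandwidth;
		unsigned long old = bdi-&gt;write_bandwidth;
		u64 bw;

		/* first level: fold this sample into a 3s moving average */
		bw = (u64)(written - bdi-&gt;written_stamp) * HZ;
		if (elapsed &gt; period) {
			do_div(bw, elapsed);
			avg = bw;
			goto out;
		}
		bw += (u64)bdi-&gt;write_bandwidth * (period - elapsed);
		bw &gt;&gt;= ilog2(period);

		/* second level: move avg towards bw only when old lies
		 * between them, which filters out sudden spikes */
		if (avg &gt; old &amp;&amp; old &gt;= (unsigned long)bw)
			avg -= (avg - old) &gt;&gt; 3;
		if (avg &lt; old &amp;&amp; old &lt;= (unsigned long)bw)
			avg += (old - avg) &gt;&gt; 3;
	out:
		bdi-&gt;write_bandwidth = bw;
		bdi-&gt;avg_write_bandwidth = avg;
	}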

CC: Li Shaohua &lt;shaohua.li@intel.com&gt;
CC: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: make writeback_control.nr_to_write straight</title>
<updated>2011-07-10T05:09:01Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-05-05T01:54:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d46db3d58233be4be980eb1e42eebe7808bcabab'/>
<id>urn:sha1:d46db3d58233be4be980eb1e42eebe7808bcabab</id>
<content type='text'>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

This immediately cleans things up: e.g. wbc.nr_to_write vs
work-&gt;nr_pages suddenly starts to make sense, and instead of saving and
restoring pages_skipped in writeback_sb_inodes() it can always start with
a clean zero value.

It also makes a neat IO pattern change: large dirty files are now written
in the full 4MB writeback chunk size, rather than whatever quota remained
in wbc-&gt;nr_to_write.
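
In sketch form (a reconstruction from the description above; locking and
requeue handling omitted):

	static long writeback_sb_inodes(struct super_block *sb,
					struct bdi_writeback *wb,
					struct wb_writeback_work *work)
	{
		struct writeback_control wbc = {
			.sync_mode	= work-&gt;sync_mode,
			.range_cyclic	= work-&gt;range_cyclic,
			.range_start	= 0,
			.range_end	= LLONG_MAX,
		};
		long wrote = 0;

		while (!list_empty(&amp;wb-&gt;b_io)) {
			struct inode *inode = wb_inode(wb-&gt;b_io.prev);
			long write_chunk = writeback_chunk_size(wb-&gt;bdi, work);

			/* per-inode quota is a full chunk; pages_skipped
			 * always starts from a clean zero value */
			wbc.nr_to_write = write_chunk;
			wbc.pages_skipped = 0;

			writeback_single_inode(inode, wb, &amp;wbc);

			work-&gt;nr_pages -= write_chunk - wbc.nr_to_write;
			wrote += write_chunk - wbc.nr_to_write;
		}
		return wrote;
	}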

Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Proposed-by: Christoph Hellwig &lt;hch@infradead.org&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: trace event writeback_queue_io</title>
<updated>2011-06-08T00:25:23Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-04-23T18:27:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e84d0a4f8e39a73003a6ec9a11b07702745f4c1f'/>
<id>urn:sha1:e84d0a4f8e39a73003a6ec9a11b07702745f4c1f</id>
<content type='text'>
Note that it adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" accounting may later be
used to limit the number of inodes that can be moved in one shot, in
order to keep the spinlock hold time under control.
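
In sketch form, the hook sits in queue_io() (a reconstruction;
move_expired_inodes() is assumed to return the number of inodes it moved):

	static void queue_io(struct bdi_writeback *wb,
			     unsigned long *older_than_this)
	{
		int moved;

		assert_spin_locked(&amp;wb-&gt;list_lock);
		list_splice_init(&amp;wb-&gt;b_more_io, &amp;wb-&gt;b_io);
		moved = move_expired_inodes(&amp;wb-&gt;b_dirty, &amp;wb-&gt;b_io,
					    older_than_this);
		trace_writeback_queue_io(wb, older_than_this, moved);
	}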

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: trace event writeback_single_inode</title>
<updated>2011-06-08T00:25:23Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-12-01T23:33:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=251d6a471c831e22880b3c146bb4556ddfb1dc82'/>
<id>urn:sha1:251d6a471c831e22880b3c146bb4556ddfb1dc82</id>
<content type='text'>
It is valuable to know how the dirty inodes are iterated and their IO size.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode-&gt;i_state at the end of writeback_single_inode()
- "index" reflects mapping-&gt;writeback_index after the -&gt;writepages() call
- "to_write" is the wbc-&gt;nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner &lt;david@fromorbit.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
</feed>
