<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/fs-writeback.c, branch v4.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.0</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-03-17T16:23:32Z</updated>
<entry>
<title>fs: add dirtytime_expire_seconds sysctl</title>
<updated>2015-03-17T16:23:32Z</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2015-03-17T16:23:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1efff914afac8a965ad63817ecf8861a927c2ace'/>
<id>urn:sha1:1efff914afac8a965ad63817ecf8861a927c2ace</id>
<content type='text'>
Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.

Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
</entry>
<entry>
<title>fs: make sure the timestamps for lazytime inodes eventually get written</title>
<updated>2015-03-17T16:23:19Z</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2015-03-17T16:23:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a2f4870697a5bcf4a87073ec6b32dd2928c1211d'/>
<id>urn:sha1:a2f4870697a5bcf4a87073ec6b32dd2928c1211d</id>
<content type='text'>
Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode-&gt;dirtied_when is constantly getting
updated.  We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.
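
The starvation described above can be sketched with a toy model.  This is
an illustration only, not kernel code: the EXPIRE constant, the simulate()
helper and the one-second redirtying loop are assumptions made for the
example, and only the two field names come from the commit message.

```python
EXPIRE = 30  # hypothetical expiry interval in seconds (assumption)


def simulate(use_dirtied_time_when, seconds=120):
    """Return the second at which the timestamp is written out, or None."""
    dirtied_when = 0        # refreshed every time the inode is redirtied
    dirtied_time_when = 0   # set once, when the timestamp first became dirty
    for now in range(1, seconds + 1):
        # The inode is constantly redirtied with I_DIRTY_PAGES.
        dirtied_when = now
        reference = dirtied_time_when if use_dirtied_time_when else dirtied_when
        if now - reference >= EXPIRE:
            return now
    return None


print(simulate(False), simulate(True))
```

With only dirtied_when as the reference, the age never reaches the expiry
interval in this model; with the separate dirtied_time_when field the
stale timestamp is detected once the interval has passed.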

In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out.  We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.

Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
</entry>
<entry>
<title>trylock_super(): replacement for grab_super_passive()</title>
<updated>2015-02-22T16:38:42Z</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2015-02-19T17:19:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=eb6ef3df4faa5424cf2a24b4e4f3eeceb1482a8e'/>
<id>urn:sha1:eb6ef3df4faa5424cf2a24b4e4f3eeceb1482a8e</id>
<content type='text'>
I've noticed significant locking contention in the memory reclaimer around
sb_lock inside grab_super_passive(). grab_super_passive() is called from
two places: in icache/dcache shrinkers (function super_cache_scan) and
from writeback (function __writeback_inodes_wb). Both are required for
progress in the memory allocator.

grab_super_passive() acquires sb_lock to increment sb-&gt;s_count and check
sb-&gt;s_instances. It seems sb-&gt;s_umount locked for read is enough here:
super-block deactivation always runs under sb-&gt;s_umount locked for write.
Protecting super-block itself isn't a problem: in super_cache_scan() sb
is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
are still active. Inside writeback super-block comes from inode from bdi
writeback list under wb-&gt;list_lock.

This patch removes locking sb_lock and checks s_instances under s_umount:
generic_shutdown_super() unlinks it under sb-&gt;s_umount locked for write.
New variant is called trylock_super() and since it only locks semaphore,
callers must call up_read(&amp;sb-&gt;s_umount) instead of drop_super(sb) when
they're done.

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2015-02-18T00:12:34Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-02-18T00:12:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=038911597e17017cee55fe93d521164a27056866'/>
<id>urn:sha1:038911597e17017cee55fe93d521164a27056866</id>
<content type='text'>
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
</content>
</entry>
<entry>
<title>vfs: add support for a lazytime mount option</title>
<updated>2015-02-05T07:45:00Z</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2015-02-02T05:37:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0ae45f63d4ef8d8eeec49c7d8b44a1775fff13e8'/>
<id>urn:sha1:0ae45f63d4ef8d8eeec49c7d8b44a1775fff13e8</id>
<content type='text'>
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated (a) when the inode needs to be updated for some non-time
related change, (b) when userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.
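
A minimal sketch of these semantics, as a toy model rather than kernel
code.  The LazyInode class and all of its fields and methods are
hypothetical names introduced for illustration; only the three flush
conditions come from the text above.

```python
from dataclasses import dataclass


@dataclass
class LazyInode:
    """Toy model of one inode on a lazytime mount (illustrative only)."""
    mem_mtime: int = 0      # timestamp held in the in-memory inode
    disk_mtime: int = 0     # timestamp last written to the inode table
    disk_writes: int = 0    # number of inode-table writes issued

    def touch(self, now):
        # A pure timestamp update changes only the in-memory copy.
        self.mem_mtime = now

    def write_inode(self):
        # Shared by cases (a), (b) and (c): push in-memory times to disk.
        self.disk_mtime = self.mem_mtime
        self.disk_writes += 1

    def change_size(self):
        # (a) a non-time-related inode change
        self.write_inode()

    def fsync(self):
        # (b) fsync(), syncfs() or sync()
        self.write_inode()

    def evict(self):
        # (c) eviction of an undeleted inode from memory
        self.write_inode()


inode = LazyInode()
for t in range(1, 1001):
    inode.touch(t)
writes_before_sync = inode.disk_writes
inode.fsync()
print(writes_before_sync, inode.disk_writes, inode.disk_mtime)
```

In this model a thousand timestamp updates cost no inode-table writes
until the explicit fsync(), which writes the latest timestamp once.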

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests them via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>fs: make inode_to_bdi() handle NULL inode</title>
<updated>2015-01-22T15:13:17Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2015-01-22T15:13:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b520252aa287b14e1f39a51e20051775b273b82a'/>
<id>urn:sha1:b520252aa287b14e1f39a51e20051775b273b82a</id>
<content type='text'>
Running a heavy fs workload, I ran into a situation where we pass
down a page for writeback/swap that doesn't have an inode mapping:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [&lt;ffffffff8119589f&gt;] inode_to_bdi+0xf/0x50
PGD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: wl(O) tun cfg80211 btusb joydev hid_apple hid_generic usbhid hid bcm5974 usb_storage nouveau snd_hda_codec_hdmi snd_hda_codec_cirrus snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel kvm_intel snd_hda_controller snd_hda_codec kvm snd_hwdep snd_pcm applesmc input_polldev snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_timer snd_seq_device snd xhci_pci xhci_hcd ttm thunderbolt soundcore apple_gmux apple_bl bluetooth binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat [last unloaded: wl]
CPU: 4 PID: 50 Comm: kswapd0 Tainted: G     U     O   3.19.0-rc5+ #60
Hardware name: Apple Inc. MacBookPro11,3/Mac-2BD1B31983FE1663, BIOS MBP112.88Z.0138.B02.1310181745 10/18/2013
task: ffff880462e917f0 ti: ffff880462edc000 task.ti: ffff880462edc000
RIP: 0010:[&lt;ffffffff8119589f&gt;]  [&lt;ffffffff8119589f&gt;] inode_to_bdi+0xf/0x50
RSP: 0000:ffff880462edf8e8  EFLAGS: 00010282
RAX: ffffffff81c4cd80 RBX: ffffea0001b3abc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff880462edf8f8 R08: 00000000001e8500 R09: ffff880460f7cb68
R10: ffff880462edfa00 R11: 0000000000000101 R12: 0000000000000000
R13: ffffffff81c4cd98 R14: 0000000000000000 R15: ffff880460f7c9c0
FS:  0000000000000000(0000) GS:ffff88047f300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000002b6341000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffffea0001b3abc0 ffffffff81c4cd80 ffff880462edf948 ffffffff811244aa
 ffffffff811565b0 ffff880460f7c9c0 ffff880462edf948 ffffea0001b3abc0
 0000000000000001 ffff880462edfb40 ffff880008b999c0 ffff880460f7c9c0
Call Trace:
 [&lt;ffffffff811244aa&gt;] __test_set_page_writeback+0x3a/0x170
 [&lt;ffffffff811565b0&gt;] ? SyS_madvise+0x790/0x790
 [&lt;ffffffff81156bb6&gt;] __swap_writepage+0x216/0x280
 [&lt;ffffffff8133d592&gt;] ? radix_tree_insert+0x32/0xe0
 [&lt;ffffffff81157741&gt;] ? swap_info_get+0x61/0xf0
 [&lt;ffffffff81159bfc&gt;] ? page_swapcount+0x4c/0x60
 [&lt;ffffffff81156c4d&gt;] swap_writepage+0x2d/0x50
 [&lt;ffffffff81131658&gt;] shmem_writepage+0x198/0x2c0
 [&lt;ffffffff8112cae4&gt;] shrink_page_list+0x464/0xa00
 [&lt;ffffffff8112d666&gt;] shrink_inactive_list+0x266/0x500
 [&lt;ffffffff8112e215&gt;] shrink_lruvec+0x5d5/0x720
 [&lt;ffffffff8112e3bb&gt;] shrink_zone+0x5b/0x190
 [&lt;ffffffff8112ee3f&gt;] kswapd+0x48f/0x8d0
 [&lt;ffffffff8112e9b0&gt;] ? try_to_free_pages+0x4c0/0x4c0
 [&lt;ffffffff81067be2&gt;] kthread+0xd2/0xf0
 [&lt;ffffffff81060000&gt;] ? workqueue_congested+0x30/0x80
 [&lt;ffffffff81067b10&gt;] ? kthread_create_on_node+0x180/0x180
 [&lt;ffffffff816b556c&gt;] ret_from_fork+0x7c/0xb0
 [&lt;ffffffff81067b10&gt;] ? kthread_create_on_node+0x180/0x180
Code: 00 48 c7 c7 8d 8d a4 81 e8 3f 62 eb ff e9 fc fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 &lt;48&gt; 8b 5f 28 48 89 df e8 15 f8 00 00 85 c0 75 11 48 8b 83 d8 00
RIP  [&lt;ffffffff8119589f&gt;] inode_to_bdi+0xf/0x50
 RSP &lt;ffff880462edf8e8&gt;
CR2: 0000000000000028
---[ end trace eb0e21aa7dad3ddf ]---

Handle this in inode_to_bdi() by punting it to noop_backing_dev_info,
if mapping-&gt;host is NULL.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>fs: export inode_to_bdi and use it in favor of mapping-&gt;backing_dev_info</title>
<updated>2015-01-20T21:03:04Z</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2015-01-14T09:42:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=de1414a654e66b81b5348dbc5259ecf2fb61655e'/>
<id>urn:sha1:de1414a654e66b81b5348dbc5259ecf2fb61655e</id>
<content type='text'>
Now that we got rid of the bdi abuse on character devices we can always use
sb-&gt;s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping-&gt;backing_dev_info with it to prepare for the removal of
mapping-&gt;backing_dev_info.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block_dev: get bdev inode bdi directly from the block device</title>
<updated>2015-01-20T21:03:01Z</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2015-01-14T09:42:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=495a276e1ca96af622edb67ad0f85431935b20d2'/>
<id>urn:sha1:495a276e1ca96af622edb67ad0f85431935b20d2</id>
<content type='text'>
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>writeback: fix a subtle race condition in I_DIRTY clearing</title>
<updated>2014-11-04T17:42:23Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2014-10-24T19:38:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9c6ac78eb3521c5937b2dd8a7d1b300f41092f45'/>
<id>urn:sha1:9c6ac78eb3521c5937b2dd8a7d1b300f41092f45</id>
<content type='text'>
After invoking -&gt;dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode-&gt;i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set.  The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by themselves.

And it sure enough was broken.  Please consider the following
scenario.

 CPU 0					CPU 1
 -------------------------------------------------------------------------------

					enters __writeback_single_inode()
					grabs inode-&gt;i_lock
					tests PAGECACHE_TAG_DIRTY which is clear
 enters __set_page_dirty()
 grabs mapping-&gt;tree_lock
 sets PAGECACHE_TAG_DIRTY
 releases mapping-&gt;tree_lock
 leaves __set_page_dirty()

 enters __mark_inode_dirty()
 smp_mb()
 sees I_DIRTY_PAGES set
 leaves __mark_inode_dirty()
					clears I_DIRTY_PAGES
					releases inode-&gt;i_lock

Now @inode has dirty pages w/ I_DIRTY_PAGES clear.  This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers in between, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.

The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness.  There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode.  Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from -&gt;dirty_inode().

Fix it by adding an explicit smp_mb() after I_DIRTY clearing.  Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side.  It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
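
The interleaving and the fix can be replayed with a small sequential
model.  This is a sketch under stated assumptions: plain Python cannot
model memory ordering, so the barrier is taken as given and only the
order of the steps is explored.  The replay() helper, the state
dictionary and the slot numbering are hypothetical names for the example.

```python
def replay(fixed, dirtier_slot):
    """Replay one interleaving; return True if the final state is consistent."""
    # The inode starts with I_DIRTY_PAGES set and no dirty page tags left.
    state = {"tag_dirty": False, "i_dirty_pages": True, "saw_tag": False}

    def dirtier():
        # CPU 0: __set_page_dirty(), then the lockless test in
        # __mark_inode_dirty(), which sets the flag only if it looks clear.
        state["tag_dirty"] = True
        if not state["i_dirty_pages"]:
            state["i_dirty_pages"] = True

    def old_test_tag():
        state["saw_tag"] = state["tag_dirty"]

    def old_clear_flag():
        if not state["saw_tag"]:
            state["i_dirty_pages"] = False

    def new_clear_flag():
        state["i_dirty_pages"] = False      # cleared unconditionally

    def new_reinstate():
        if state["tag_dirty"]:              # re-checked after the barrier
            state["i_dirty_pages"] = True

    if fixed:
        writeback = [new_clear_flag, new_reinstate]
    else:
        writeback = [old_test_tag, old_clear_flag]
    # Run the CPU 0 step before, between, or after the two writeback steps.
    steps = writeback[:dirtier_slot] + [dirtier] + writeback[dirtier_slot:]
    for step in steps:
        step()
    # Consistent means: dirty pages imply that I_DIRTY_PAGES is set.
    return state["i_dirty_pages"] or not state["tag_dirty"]


old_bad = [slot for slot in range(3) if not replay(False, slot)]
new_bad = [slot for slot in range(3) if not replay(True, slot)]
print(old_bad, new_bad)
```

In this model the old ordering ends inconsistent only when the dirtier
runs between the tag test and the flag clearing, which is the scenario in
the table above; the new ordering is consistent for all three slots.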

Also add comments explaining how and why the barriers are paired.

Lightly tested.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>sched: Remove proliferation of wait_on_bit() action functions</title>
<updated>2014-07-16T13:10:39Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-07-07T05:16:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=743162013d40ca612b4cb53d3a200dff2d9ab26e'/>
<id>urn:sha1:743162013d40ca612b4cb53d3a200dff2d9ab26e</id>
<content type='text'>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{,_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 supply their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible".

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since the first version of this patch (against 3.15) two new action
functions appeared, one in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Acked-by: David Howells &lt;dhowells@redhat.com&gt; (fscache, keys)
Acked-by: Steven Whitehouse &lt;swhiteho@redhat.com&gt; (gfs2)
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Steve French &lt;sfrench@samba.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
</feed>
