<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/filemap.c, branch v4.9</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.9</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.9'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2016-11-06T18:29:15Z</updated>
<entry>
<title>mm/filemap: don't allow partially uptodate page for pipes</title>
<updated>2016-11-06T18:29:15Z</updated>
<author>
<name>Eryu Guan</name>
<email>guaneryu@gmail.com</email>
</author>
<published>2016-11-01T07:43:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6d6d36bc6e77f8b1f86d81884ad5149931bb4acd'/>
<id>urn:sha1:6d6d36bc6e77f8b1f86d81884ad5149931bb4acd</id>
<content type='text'>
Starting from 4.9-rc1 kernel, I started noticing some test failures
of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
testing on sub-page block size filesystems (tested both XFS and
ext4), these syscalls start to return EIO in the tests. e.g.

sendfile02    1  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
sendfile02    2  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
sendfile02    3  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
sendfile02    4  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

This is because that in sub-page block size cases, we don't need the
whole page to be uptodate, only the part we care about is uptodate
is OK (if fs has -&gt;is_partially_uptodate defined). But
page_cache_pipe_buf_confirm() doesn't have the ability to check the
partially-uptodate case, it needs the whole page to be uptodate. So
it returns EIO in this case.

This is a regression introduced by commit 82c156f85384 ("switch
generic_file_splice_read() to use of -&gt;read_iter()"). Prior to the
change, generic_file_splice_read() doesn't allow partially-uptodate
page either, so it worked fine.

Fix it by skipping the partially-uptodate check if we're working on
a pipe in do_generic_file_read(), so we read the whole page from
disk as long as the page is not uptodate.

Signed-off-by: Eryu Guan &lt;guaneryu@gmail.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: remove per-zone hashtable of bitlock waitqueues</title>
<updated>2016-10-27T16:27:57Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-10-26T17:15:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9dcb8b685fc30813b35ab4b4bf39244430753190'/>
<id>urn:sha1:9dcb8b685fc30813b35ab4b4bf39244430753190</id>
<content type='text'>
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

     wait_on_bit(&amp;gh-&gt;gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices.  In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Tested-by: Bob Peterson &lt;rpeterso@redhat.com&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2016-10-10T20:38:49Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-10-10T20:38:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fed41f7d039bad02f94cad9059e4b14cd81d13f2'/>
<id>urn:sha1:fed41f7d039bad02f94cad9059e4b14cd81d13f2</id>
<content type='text'>
Pull splice fixups from Al Viro:
 "A couple of fixups for interaction of pipe-backed iov_iter with
  O_DIRECT reads + constification of a couple of primitives in uio.h
  missed by previous rounds.

  Kudos to davej - his fuzzing has caught those bugs"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [btrfs] fix check_direct_IO() for non-iovec iterators
  constify iov_iter_count() and iter_is_iovec()
  fix ITER_PIPE interaction with direct_IO
</content>
</entry>
<entry>
<title>fix ITER_PIPE interaction with direct_IO</title>
<updated>2016-10-10T17:36:06Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2016-10-10T17:26:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c3a690240423fc4eb8a0c3c7df025d13eadf140b'/>
<id>urn:sha1:c3a690240423fc4eb8a0c3c7df025d13eadf140b</id>
<content type='text'>
by making sure we call iov_iter_advance() on original
iov_iter even if direct_IO (done on its copy) has returned 0.
It's a no-op for old iov_iter flavours and does the right thing
(== truncation of the stuff we'd allocated, but not filled) in
ITER_PIPE case.  Failures (e.g. -EIO) get caught and dealt with
by cleanup in generic_file_read_iter().

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>vfs,mm: fix a dead loop in truncate_inode_pages_range()</title>
<updated>2016-10-08T01:46:29Z</updated>
<author>
<name>Wei Fang</name>
<email>fangwei1@huawei.com</email>
</author>
<published>2016-10-08T00:01:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c2a9737f45e27d8263ff9643f994bda9bac0b944'/>
<id>urn:sha1:c2a9737f45e27d8263ff9643f994bda9bac0b944</id>
<content type='text'>
We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:

	...
	fd = open();
	write(fd, buf, 4096);
	preadv64(fd, &amp;iovec, 1, 0xffffffff000);
	ftruncate(fd, 0);
	...

Then ftruncate() will not return forever.

The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.

When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of -&gt;mapping.  Then
this page can be found in -&gt;mapping with pagevec_lookup().  After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:

 - find a page with index=0xffffffff, since index&gt;=end, this page won't
   be truncated

 - index++, and index become 0

 - the page with index=0xffffffff will be found again

The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.

Since truncate_inode_pages_range() is executed with holding lock of
inode-&gt;i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:

  INFO: task truncate_test:3364 blocked for more than 120 seconds.
  ...
     call_rwsem_down_write_failed+0x17/0x30
     generic_file_write_iter+0x32/0x1c0
     ubifs_write_iter+0xcc/0x170
     __vfs_write+0xc4/0x120
     vfs_write+0xb2/0x1b0
     SyS_write+0x46/0xa0

The page with index=0xffffffff added to -&gt;mapping is useless.  Fix this
by checking the read position before allocating pages.

Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang &lt;fangwei1@huawei.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>do_generic_file_read(): fail immediately if killed</title>
<updated>2016-10-08T01:46:27Z</updated>
<author>
<name>Bart Van Assche</name>
<email>bart.vanassche@sandisk.com</email>
</author>
<published>2016-10-07T23:58:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c4b209a426847b55c40360c1d04dc7986b55ddc7'/>
<id>urn:sha1:c4b209a426847b55c40360c1d04dc7986b55ddc7</id>
<content type='text'>
If a fatal signal has been received, fail immediately instead of trying
to read more data.

If wait_on_page_locked_killable() was interrupted then this page is most
likely is not PageUptodate() and in this case do_generic_file_read()
will fail after lock_page_killable().

See also commit ebded02788b5 ("mm: filemap: avoid unnecessary calls to
lock_page when waiting for IO to complete during a read")

[oleg@redhat.com: changelog addition]
Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
Signed-off-by: Bart Van Assche &lt;bart.vanassche@sandisk.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs</title>
<updated>2016-10-06T15:18:10Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-10-06T15:18:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8d370595811e13378243832006f8c52bbc9cca5e'/>
<id>urn:sha1:8d370595811e13378243832006f8c52bbc9cca5e</id>
<content type='text'>
Pull xfs and iomap updates from Dave Chinner:
 "The main things in this update are the iomap-based DAX infrastructure,
  an XFS delalloc rework, and a chunk of fixes to how log recovery
  schedules writeback to prevent spurious corruption detections when
  recovery of certain items was not required.

  The other main chunk of code is some preparation for the upcoming
  reflink functionality. Most of it is generic and cleanups that stand
  alone, but they were ready and reviewed so are in this pull request.

  Speaking of reflink, I'm currently planning to send you another pull
  request next week containing all the new reflink functionality. I'm
  working through a similar process to the last cycle, where I sent the
  reverse mapping code in a separate request because of how large it
  was. The reflink code merge is even bigger than reverse mapping, so
  I'll be doing the same thing again....

  Summary for this update:

   - change of XFS mailing list to linux-xfs@vger.kernel.org

   - iomap-based DAX infrastructure w/ XFS and ext2 support

   - small iomap fixes and additions

   - more efficient XFS delayed allocation infrastructure based on iomap

   - a rework of log recovery writeback scheduling to ensure we don't
     fail recovery when trying to replay items that are already on disk

   - some preparation patches for upcoming reflink support

   - configurable error handling fixes and documentation

   - aio access time update race fixes for XFS and
     generic_file_read_iter"

* tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
  fs: update atime before I/O in generic_file_read_iter
  xfs: update atime before I/O in xfs_file_dio_aio_read
  ext2: fix possible integer truncation in ext2_iomap_begin
  xfs: log recovery tracepoints to track current lsn and buffer submission
  xfs: update metadata LSN in buffers during log recovery
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: pass current lsn to log recovery buffer validation
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: quiesce the filesystem after recovery on readonly mount
  xfs: remote attribute blocks aren't really userdata
  ext2: use iomap to implement DAX
  ext2: stop passing buffer_head to ext2_get_blocks
  xfs: use iomap to implement DAX
  xfs: refactor xfs_setfilesize
  xfs: take the ilock shared if possible in xfs_file_iomap_begin
  xfs: fix locking for DAX writes
  dax: provide an iomap based fault handler
  dax: provide an iomap based dax read/write path
  dax: don't pass buffer_head to copy_user_dax
  dax: don't pass buffer_head to dax_insert_mapping
  ...
</content>
</entry>
<entry>
<title>mm: filemap: fix mapping-&gt;nrpages double accounting in fuse</title>
<updated>2016-10-05T16:17:56Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2016-10-04T14:58:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3ddf40e8c31964b744ff10abb48c8e36a83ec6e7'/>
<id>urn:sha1:3ddf40e8c31964b744ff10abb48c8e36a83ec6e7</id>
<content type='text'>
Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache() from
raw radix tree operations to page_cache_tree_insert() but didn't take
into account that the latter function, unlike the raw radix tree op,
handles mapping-&gt;nrpages.  As a result, that counter is bumped for each
page replacement rather than balanced out even.

The mapping-&gt;nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, syncing inodes without pages, as well as
statistics for userspace.  Since the error is positive, we'll do more
page cache tree walks than necessary; we won't miss a necessary one.
And we'll report more buffer pages to userspace than there are.  The
error is limited to fuse inodes.

Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: filemap: don't plant shadow entries without radix tree node</title>
<updated>2016-10-05T16:17:56Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2016-10-04T20:02:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d3798ae8c6f3767c726403c2ca6ecc317752c9dd'/>
<id>urn:sha1:d3798ae8c6f3767c726403c2ca6ecc317752c9dd</id>
<content type='text'>
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

  kernel BUG at ./include/linux/swap.h:276!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
   soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
  CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
  Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
  task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
  RIP: page_cache_tree_insert+0xf1/0x100
  Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 &lt;0f&gt; 0b e8 88 68 ef ff 0f 1f 84 00
  RIP  page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node-&gt;count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node-&gt;count. When there is a shadow entry
directly in root-&gt;rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node-&gt;count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node-&gt;count underflows
and triggers the warning. Afterwards, without node-&gt;count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-and-tested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: update atime before I/O in generic_file_read_iter</title>
<updated>2016-10-02T22:48:08Z</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2016-10-02T22:48:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0d5b0cf246a3227d811e7bf55d756b273408e414'/>
<id>urn:sha1:0d5b0cf246a3227d811e7bf55d756b273408e414</id>
<content type='text'>
After the call to -&gt;direct_IO the final reference to the file might have
been dropped by aio_complete already, and the call to file_accessed might
cause a use after free.

Instead update the access time before the I/O, similar to how we
update the time stamps before writes.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Dave Chinner &lt;david@fromorbit.com&gt;

</content>
</entry>
</feed>
