<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/drivers/md/raid1.c, branch v4.7</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.7</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.7'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2016-05-09T16:24:02Z</updated>
<entry>
<title>md: set MD_CHANGE_PENDING in a atomic region</title>
<updated>2016-05-09T16:24:02Z</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2016-05-04T02:22:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=85ad1d13ee9b3db00615ea24b031c15e5ba14fd1'/>
<id>urn:sha1:85ad1d13ee9b3db00615ea24b031c15e5ba14fd1</id>
<content type='text'>
Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger &lt;martink@posteo.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
Cc: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: &lt;linux-kernel@vger.kernel.org&gt;
Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md:raid1: fix a dead loop when read from a WriteMostly disk</title>
<updated>2016-03-31T17:04:17Z</updated>
<author>
<name>Wei Fang</name>
<email>fangwei1@huawei.com</email>
</author>
<published>2016-03-21T11:18:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=816b0acf3deb6d6be5d0519b286fdd4bafade905'/>
<id>urn:sha1:816b0acf3deb6d6be5d0519b286fdd4bafade905</id>
<content type='text'>
If first_bad == this_sector when we get the WriteMostly disk
in read_balance(), valid disk will be returned with zero
max_sectors. It'll lead to a dead loop in make_request(), and
OOM will happen because of endless allocation of struct bio.

Since we can't get data from this disk in this case, so
continue for another disk.

Signed-off-by: Wei Fang &lt;fangwei1@huawei.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang</title>
<updated>2016-03-17T21:24:51Z</updated>
<author>
<name>Nate Dailey</name>
<email>nate.dailey@stratus.com</email>
</author>
<published>2016-02-29T15:43:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ccfc7bf1f09d6190ef86693ddc761d5fe3fa47cb'/>
<id>urn:sha1:ccfc7bf1f09d6190ef86693ddc761d5fe3fa47cb</id>
<content type='text'>
If raid1d is handling a mix of read and write errors, handle_read_error's
call to freeze_array can get stuck.

This can happen because, though the bio_end_io_list is initially drained,
writes can be added to it via handle_write_finished as the retry_list
is processed. These writes contribute to nr_pending but are not included
in nr_queued.

If a later entry on the retry_list triggers a call to handle_read_error,
freeze array hangs waiting for nr_pending == nr_queued+extra. The writes
on the bio_end_io_list aren't included in nr_queued so the condition will
never be satisfied.

To prevent the hang, include bio_end_io_list writes in nr_queued.

There's probably a better way to handle decrementing nr_queued, but this
seemed like the safest way to avoid breaking surrounding code.

I'm happy to supply the script I used to repro this hang.

Fixes: 55ce74d4bfe1b(md/raid1: ensure device failure recorded before write request returns.)
Cc: stable@vger.kernel.org (v4.3+)
Signed-off-by: Nate Dailey &lt;nate.dailey@stratus.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid1: remove unnecessary BUG_ON</title>
<updated>2016-03-14T18:35:58Z</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2016-03-14T09:01:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b3c95b425e0991840262dba8dd095f2682dbf848'/>
<id>urn:sha1:b3c95b425e0991840262dba8dd095f2682dbf848</id>
<content type='text'>
Since bitmap_start_sync will not return until
sync_blocks is not less than PAGE_SIZE&gt;&gt;9, so
the BUG_ON is not needed anymore.

Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>MD: rename some functions</title>
<updated>2016-01-20T21:52:20Z</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2016-01-20T21:52:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=849674e4fb175e47b7504249f7367367b18fe6a1'/>
<id>urn:sha1:849674e4fb175e47b7504249f7367367b18fe6a1</id>
<content type='text'>
These short function names are hard to search. Rename them to make vim happy.

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid: only permit hot-add of compatible integrity profiles</title>
<updated>2016-01-14T00:49:57Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2016-01-14T00:00:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1501efadc524a0c99494b576923091589a52d2a4'/>
<id>urn:sha1:1501efadc524a0c99494b576923091589a52d2a4</id>
<content type='text'>
It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue.  Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.

The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a671 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.

This policy of disallowing changes the integrity profile once one has
been established is shared with DM.

Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]

[   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[   59.078302] md: data integrity enabled on md0
[..]
[   90.489209] md0: incompatible integrity profile for pmem1m
[..]
[  205.671277] md: super_written gets error=-5
[  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[  205.677386] md/raid1:md0: Operation continuing on 1 devices.
[  205.683037] RAID1 conf printout:
[  205.684699]  --- wd:1 rd:2
[  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
[  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
[  205.691717] md: recovery of RAID array md0

Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: &lt;stable@vger.kernel.org&gt;
Cc: Mike Snitzer &lt;snitzer@redhat.com&gt;
Reported-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'md/4.4' of git://neil.brown.name/md</title>
<updated>2015-11-05T05:12:47Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-11-05T05:12:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ac322de6bf5416cb145b58599297b8be73cd86ac'/>
<id>urn:sha1:ac322de6bf5416cb145b58599297b8be73cd86ac</id>
<content type='text'>
Pull md updates from Neil Brown:
 "Two major components to this update.

   1) The clustered-raid1 support from SUSE is nearly complete.  There
      are a few outstanding issues being worked on.  Maybe half a dozen
      patches will bring this to a usable state.

   2) The first stage of journalled-raid5 support from Facebook makes an
      appearance.  With a journal device configured (typically NVRAM or
      SSD), the "RAID5 write hole" should be closed - a crash during
      degraded operations cannot result in data corruption.

      The next stage will be to use the journal as a write-behind cache
      so that latency can be reduced and in some cases throughput
      increased by performing more full-stripe writes.

* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  MD: set journal disk -&gt;raid_disk
  MD: kick out journal disk if it's not fresh
  raid5-cache: start raid5 readonly if journal is missing
  MD: add new bit to indicate raid array with journal
  raid5-cache: IO error handling
  raid5: journal disk can't be removed
  raid5-cache: add trim support for log
  MD: fix info output for journal disk
  raid5-cache: use bio chaining
  raid5-cache: small log-&gt;seq cleanup
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: take rdev-&gt;data_offset into account early on
  raid5-cache: refactor bio allocation
  raid5-cache: clean up r5l_get_meta
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: free I/O units earlier
  ...
</content>
</entry>
<entry>
<title>Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block</title>
<updated>2015-11-05T04:51:48Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-11-05T04:51:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=527d1529e38b36fd22e65711b653ab773179d9e8'/>
<id>urn:sha1:527d1529e38b36fd22e65711b653ab773179d9e8</id>
<content type='text'>
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
</content>
</entry>
<entry>
<title>md-cluster: Call update_raid_disks() if another node --grow's raid_disks</title>
<updated>2015-10-24T06:16:18Z</updated>
<author>
<name>Goldwyn Rodrigues</name>
<email>rgoldwyn@suse.com</email>
</author>
<published>2015-10-22T05:01:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=28c1b9fdf4562b52fe104384b16238c39c8a8d40'/>
<id>urn:sha1:28c1b9fdf4562b52fe104384b16238c39c8a8d40</id>
<content type='text'>
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.

This leads to call check_reshape() -&gt; md_allow_write() -&gt; md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.

In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.

mddev-&gt;recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev-&gt;recovery_cp
(which is read from sb-&gt;resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().

Signed-off-by: Goldwyn Rodrigues &lt;rgoldwyn@suse.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
</content>
</entry>
<entry>
<title>md/raid1: don't clear bitmap bit when bad-block-list write fails.</title>
<updated>2015-10-24T05:24:22Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2015-10-24T05:02:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bd8688a199b864944bf62eebed0ca13b46249453'/>
<id>urn:sha1:bd8688a199b864944bf62eebed0ca13b46249453</id>
<content type='text'>
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R1BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-and-tested-by: Nate Dailey &lt;nate.dailey@stratus.com&gt;
Cc: Jes Sorensen &lt;Jes.Sorensen@redhat.com&gt;
Fixes: cd5ff9a16f08 ("md/raid1:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
</content>
</entry>
</feed>
