|
If we detect an overlap, we set a flag and wait for a wakeup. When requests
are handled, if the flag was set, we perform the wakeup.
Note that the code currently in -mm is badly broken. With this patch applied,
it passes tests that use O_DIRECT to cause lots of overlapping requests.
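For illustration, a minimal sketch of the flag-and-wakeup idea (the bit name
R5_Overlap and the wait_for_overlap queue are assumptions, not necessarily the
names used in the patch):

    /* submitter sees an overlapping request: flag the slot and sleep */
    set_bit(R5_Overlap, &sh->dev[i].flags);
    wait_event(conf->wait_for_overlap,
               !test_bit(R5_Overlap, &sh->dev[i].flags));

    /* request-completion path: if someone was waiting, clear and wake */
    if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
        wake_up(&conf->wait_for_overlap);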
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The hashtable that linear uses to find the right device stores
two pointers for every entry.
The second is always one of:
 * the first plus 1
 * NULL
When NULL, it is never accessed, so any value can be stored.
Thus it could always be "first plus 1", so we don't need to store
it, as it is trivial to calculate.
This patch halves the size of this table, which results in some simpler
code as well.
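Roughly, the change is from two pointers per slot to one (illustrative
declarations, not the exact ones in the linear driver):

    /* before: each hash slot stored both zones explicitly */
    struct linear_hash {
        dev_info_t *dev0;
        dev_info_t *dev1;   /* always dev0 + 1, or NULL (never accessed) */
    };

    /* after: one pointer per slot; the second zone, when it matters,
     * is simply hash_table[i] + 1 */
    dev_info_t **hash_table;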
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The 'faulty' personality provides a layer over any block device in which
errors may be synthesised.
A variety of errors are possible including transient and persistent read
and write errors, and read errors that persist until the next write.
The error mode can be changed on a live array.
Accessing this personality requires mdadm 2.8.0 or later.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some size fields were "int" instead of "sector_t".
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add some missing data_offset additions and some le_to_cpu conversions, and fix
a few other little mistakes.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Both raid1 and multipath have a "retry_list" which is global, so all raid1
arrays (for example) use the same list. This is rather ugly, and it is simple
enough to make it per-array, so this patch does that.
It also changes the multipath code to use list.h lists instead of
roll-your-own.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a 'raid10' module which provides features similar to both
raid0 and raid1 in the one array. Various combinations of layout are
supported.
This code is still "experimental", but appears to work.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
1/ Introduce "mddev->resync_max_sectors" so that an md personality
can ask for resync to cover a different address range than that of a
single drive. raid10 will use this.
2/ fix is_mddev_idle so that if there seems to be a negative number
of events, it doesn't immediately assume activity.
3/ make "sync_io" (the count of IO sectors used for array resync)
an atomic_t to avoid SMP races.
4/ Pass md_sync_acct a "block_device" rather than the containing "rdev",
as the whole rdev isn't needed. Also make this an inline function (sketched
after this list).
5/ Make sure recovery gets interrupted on any error.
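A sketch of items 3 and 4 (assuming the sync_io counter is an atomic_t in the
gendisk; the exact placement is an assumption):

    static inline void md_sync_acct(struct block_device *bdev,
                                    unsigned long nr_sectors)
    {
        /* account resync IO against the whole disk, safely under SMP */
        atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io);
    }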
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This allows the number of "raid_disks" in a raid1 to be changed.
This requires allocating a new pool of "r1bio" structures with a different
number of bios, suspending IO, and swapping the new pool in place of the old.
(and a few other related changes).
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It is possible to have raid1/4/5/6 arrays that do not use all the space on the
drive. This can be done explicitly, or can happen if you, one by one,
replace all the drives with larger devices.
This patch extends the "SET_ARRAY_INFO" ioctl (which was previously invalid on
active arrays) to allow some attributes of the array to be changed and implements
changing of the "size" attribute.
"size" is the amount of each device that is actually used. If "size" is
increased, the new space will immediately be "resynced".
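From userspace this looks roughly like the following (a hedged sketch of what
mdadm would do; treating "size" as the per-device size in 1K blocks follows my
reading of the classic md interface):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>

    int fd = open("/dev/md0", O_RDWR);
    mdu_array_info_t info;

    ioctl(fd, GET_ARRAY_INFO, &info);   /* read the current attributes */
    info.size = 4096 * 1024;            /* grow each device's used size */
    ioctl(fd, SET_ARRAY_INFO, &info);   /* now permitted on an active array */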
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
->nr_pending hits 0
md_check_recovery only locks a device and does stuff when it thinks there is a
real likelihood that something needs doing. So the test at the top must cover
all possibilities.
But it didn't cover the possibility that the last outstanding request on a
failed device had finished and so the device needed to be removed.
As a result, a failed drive might not get removed from the personality's
perspective on the array, and so it could never be removed from the array as a
whole.
With this patch, whenever ->nr_pending hits zero on a faulty device,
MD_RECOVERY_NEEDED is set so that md_check_recovery will do stuff.
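The idea, roughly (field and helper names are approximate for this era of the
md code):

    /* dropping the last pending request on a faulty device: make sure
     * md_check_recovery will look at the array and remove it */
    if (atomic_dec_and_test(&rdev->nr_pending) && rdev->faulty) {
        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
    }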
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It no longer exists.
|
|
From: Neil Brown <neilb@cse.unsw.edu.au>
I've made a bunch of changes to the 'md' bits - largely moving the
unplugging into the individual personalities which know more about which
drives are actually in use.
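For example, a personality-level unplug can look roughly like this (a sketch;
ITERATE_RDEV is md's helper for walking the component devices):

    static void raid1_unplug(request_queue_t *q)
    {
        mddev_t *mddev = q->queuedata;
        mdk_rdev_t *rdev;
        struct list_head *tmp;

        /* only unplug the queues of drives actually in this array */
        ITERATE_RDEV(mddev, rdev, tmp) {
            request_queue_t *r_queue = bdev_get_queue(rdev->bdev);

            if (r_queue->unplug_fn)
                r_queue->unplug_fn(r_queue);
        }
    }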
|
|
From: Jens Axboe <axboe@suse.de>,
Chris Mason,
me, others.
The global unplug list causes horrid spinlock contention on many-disk
many-CPU setups - throughput is worse than halved.
The other problem with the global unplugging is of course that it will cause
the unplugging of queues which are unrelated to the I/O upon which the caller
is about to wait.
So what we do to solve these problems is to remove the global unplug and set
up the infrastructure under which the VFS can tell the block layer to unplug
only those queues which are relevant to the page or buffer_head which is
about to be waited upon.
We do this via the very appropriate address_space->backing_dev_info structure.
Most of the complexity is in devicemapper, MD and swapper_space, because for
these backing devices, multiple queues may need to be unplugged to complete a
page/buffer I/O. In each case we ensure that data structures are in place to
permit us to identify all the lower-level queues which contribute to the
higher-level backing_dev_info. Each contributing queue is told to unplug in
response to a higher-level unplug.
To simplify things in various places we also introduce the concept of a
"synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will
perform an immediate unplug when it sees one of these go past.
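Two of the pieces, sketched (exact signatures may differ from the patch):

    /* backing_dev_info grows a per-backing-device unplug hook */
    struct backing_dev_info {
        /* existing fields omitted */
        void (*unplug_io_fn)(struct backing_dev_info *bdi, struct page *page);
        void *unplug_io_data;
    };

    /* a synchronous bio asks the block layer to unplug immediately */
    bio->bi_rw |= (1 << BIO_RW_SYNC);
    submit_bio(rw, bio);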
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
This helps raid5 work on at least one very large array.
Thanks to Evan Felix <evan.felix@pnl.gov>
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
With this patch, md uses two major numbers for arrays.
One major is number 9, with name 'md', and has unpartitioned md arrays, one per
minor number.
The other major is allocated dynamically, with name 'mdp', and has one array for
every 64 minors, allowing for up to 63 partitions.
The arrays under one major are completely separate from the arrays under the
other.
The preferred names for devices with the new major are of the form:
/dev/md/d1p3 # partition 3 of device 1 - minor 67
When a partitioned md device is assembled, the partitions are not recognised
until after the whole-array device is opened again. A future version of
mdadm will perform this open so that the need will be transparent.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
For each resync request, we allocate a "r1_bio" which has a bio "master_bio"
attached that goes largely unused. We also allocate a read_bio which is
used. This patch removes the read_bio and just uses the master_bio instead.
This fixes a bug wherein bi_bdev of the master_bio wasn't being set, but was
being used.
We also introduce a new "sectors" field into the r1_bio as we can no longer
rely on master_bio->bi_sectors.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
next_r1 is never used, so it can just go.
read_bio isn't needed as we can easily use one of the pointers in the
write_bios array - write_bios[->read_disk]. So rename "write_bios" to "bios"
and store the pointer to the read bio in there.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
The only time it is really needed is to differentiate a retry-on-fail from a
write-after-read-for-resync request to raid1d. So we use a bit in 'state'
for that.
|
|
messages.
From: NeilBrown <neilb@cse.unsw.edu.au>
Instead of using ("md%d", mdidx(mddev)), we now use ("%s", mdname(mddev))
where mdname is the disk_name field in the associated gendisk structure.
This allows future flexibility in naming.
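A minimal sketch of what such a helper can look like (the real macro may
differ):

    #define mdname(mddev) \
        ((mddev)->gendisk ? (mddev)->gendisk->disk_name : "mdX")

    printk(KERN_INFO "%s: array is degraded\n", mdname(mddev));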
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
Move the pointers into mddev. This reduces dependence on MAX_MD_DEVS.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
Thanks dann frazier <dannf@hp.com>
|
|
From: "H. Peter Anvin" <hpa@zytor.com>
RAID6 implementation. See Kconfig help for usage details.
The next release of `mdadm' has raid6 userspace support.
|
|
Starting the conversion:
* internal dev_t made 32bit.
* new helpers - new_encode_dev(), new_decode_dev(), huge_encode_dev(),
huge_decode_dev(), new_valid_dev(). They do encoding/decoding of 32bit and
64bit values; for now huge_... are aliases for new_... and new_valid_dev()
is always true. We do 12:20 for 32bit; the representation is compatible with the
16bit one - we have major in bits 19--8 and minor in 31--20,7--0. That's
what the userland sees; internally we have (major << 20)|minor, of course.
* MKDEV(), MAJOR() and MINOR() updated.
* several places used to handle Missed'em'V dev_t (14:18 split)
manually; that stuff has been taken into common helpers.
Now we can start replacing old_... with new_... and huge_..., depending
on the width available. MKDEV() callers should (for now) make sure that major
and minor are within 12:20. That's what the next chunk will do.
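One way to implement the 12:20 scheme described above (a sketch consistent
with the stated bit layout, not necessarily the exact helpers):

    static inline u32 new_encode_dev(dev_t dev)
    {
        unsigned major = MAJOR(dev);
        unsigned minor = MINOR(dev);
        /* major in bits 19..8, minor in bits 31..20 and 7..0 */
        return (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12);
    }

    static inline dev_t new_decode_dev(u32 dev)
    {
        unsigned major = (dev & 0xfff00) >> 8;
        unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
    }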
|
|
To be able to properly keep references to block queues,
we make blk_init_queue() return the queue that it initialized, and
let it be independently allocated and then cleaned up on the last
reference.
I have grepped high and low, and there really shouldn't be any broken
uses of blk_init_queue() in the kernel drivers left. The added bonus
is that blk_init_queue() error checking is explicit now; most of the
drivers were broken in this regard (even IDE/SCSI).
No drivers have embedded request queue structures. Drivers that don't
use blk_init_queue() but blk_queue_make_request(), should allocate the
queue with blk_alloc_queue(gfp_mask). I've converted all of them to do
that, too. They can call blk_cleanup_queue() now too, though using the define
blk_put_queue() is probably cleaner.
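Typical usage after this change, sketched (my_request_fn, my_make_request_fn
and my_lock are placeholders):

    /* request_fn-based driver: allocation and error checking are explicit */
    request_queue_t *q = blk_init_queue(my_request_fn, &my_lock);
    if (!q)
        return -ENOMEM;

    /* make_request-based driver: allocate the queue separately */
    q = blk_alloc_queue(GFP_KERNEL);
    if (!q)
        return -ENOMEM;
    blk_queue_make_request(q, my_make_request_fn);

    /* teardown drops the last reference */
    blk_cleanup_queue(q);   /* blk_put_queue() is the cleaner spelling */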
|
|
|
|
Linear uses one array sized by MD_SB_DISKS inside a structure.
We move it to the end of the structure, declare it as size 0,
and arrange for appropriate extra space to be allocated on
structure allocation.
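The pattern, roughly (illustrative structure; only the trailing array matters
here):

    struct linear_conf {
        /* fixed fields first */
        dev_info_t disks[0];    /* sized when the structure is allocated */
    };

    conf = kmalloc(sizeof(*conf) + raid_disks * sizeof(dev_info_t),
                   GFP_KERNEL);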
|
|
raid1 uses MD_SB_DISKS to size two data structures,
but the new version-1 superblock allows for more than
this number of disks (and most actual arrays use many
fewer).
This patch sizes the two arrays dynamically.
One becomes a separate kmalloced array.
The other is moved to the end of the containing structure
and appropriate extra space is allocated.
Also, change r1buf_pool_alloc (which allocates buffers for
a mempool for doing re-sync) to not get r1bio structures
from the r1bio pool (which could exhaust the pool) but instead
to allocate them separately.
|
|
Arrays with a type-1 superblock can have more than
MD_SB_DISKS devices, so we remove the dependency on that number from
raid0, replacing several fixed sized arrays with one
dynamically allocated array.
|
|
One embedded array gets moved to end of structure and
sized dynamically.
|
|
Multipath has a dependency on MD_SB_DISKS, which is no
longer authoritative. We change it to use a separately
allocated array.
|
|
To cope with a raid0 array with differing sized devices,
raid0 divides an array into "strip zones".
The first zone covers the start of all devices, up to an offset
equal to the size of the smallest device.
The second strip zone covers the remaining devices up to the size of the
next smallest device, and so on.
In order to determine which strip zone a given address is in,
the array is logically divided into slices the size of the smallest
zone, and a 'hash' table is created listing the first and, if relevant,
second zone in each slice.
As the smallest slice can be very small (imagine an array with a
76G drive and a 75.5G drive) this hash table can be rather large.
With this patch, we limit the size of the hash table to one page,
at the possible cost of making several probes into the zone list
before we find the correct zone.
We also cope with the possibility that a zone could be larger than
a 32bit sector address would allow.
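The lookup then becomes, roughly (a sketch of the probing described above;
names are approximate):

    sector_t x = block;
    sector_div(x, conf->hash_spacing);      /* which slice of the array? */
    zone = conf->hash_table[x];

    /* at most a few extra probes to the zone that really holds 'block' */
    while (block >= zone->zone_offset + zone->size)
        zone++;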
|
|
Sometimes raid0 and linear are required to take a single page bio that
spans two devices. We use bio_split to split such a bio into two.
At the same time, bio.h is included by linux/raid/md.h so
we don't include it elsewhere anymore.
We also modify the mergeable_bvec functions to allow a bvec
that doesn't fit if it is the first bvec to be added to
the bio, and are careful never to return a negative length from a
mergeable_bvec function.
|
|
From: Christoph Hellwig <hch@lst.de>
partition_name() is a variant of __bdevname() that caches results and
returns a pointer to kmalloc()ed data instead of printing into a buffer.
Due to its caching it gets utterly confused when the name for a dev_t
changes (can happen easily now with device mapper and probably in the
future with dynamic dev_t users).
It's only used by the raid code and most calls are through a wrapper,
bdev_partition_name(), which takes a struct block_device * that may be
NULL.
The patch below changes bdev_partition_name() to call bdevname() where
possible, and the other calls, where we really have nothing more than a dev_t,
to use __bdevname().
Btw, it would be nice if someone who knows the md code a bit better than me
could remove bdev_partition_name() in favour of direct calls to bdevname()
where possible - that would also get rid of the "returns a pointer to a
string on the stack" issue that this patch can't fix yet.
|
|
|
|
Thanks to Angus Sawyer <angus.sawyer@dsl.pipex.com> and
Daniel McNeil <daniel@osdl.org>
|
|
Superblock format '1' resolves a number of issues with
superblock format '0'.
It is more dense and can support many more sub-devices.
It does not contain unneeded redundancy.
It adds a few new useful fields.
|
|
from start of device.
Normally the data stored on a component of a RAID array is stored
from the start of the device. This patch allows a per-device
data_offset so the data can start elsewhere. This will allow
RAID arrays where the metadata is at the head of the device
rather than the tail.
|
|
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
If there are no writes for 20 milliseconds, write out superblock
to mark array as clean. Write out superblock with
dirty flag before allowing any further write to succeed.
If an md thread gets signaled with SIGKILL, reduce the
delay to 0.
Also tidy up some printk's and make sure writing the
superblock isn't noisy.
|
|
The mdrecoveryd thread is responsible for initiating and cleaning
up resync threads.
This job can be equally well done by the per-array threads
for those arrays which might need it.
So the mdrecoveryd thread is gone and the core code that
it ran is now run by raid5d, raid1d or multipathd.
We add an MD_RECOVERY_NEEDED flag so those daemons don't have
to bother trying to lock the md array unless it is likely
that something needs to be done.
Also modify the names of all threads to include the number of the
md device.
|
|
Md uses ->recovery_running and ->recovery_err to keep track of the
status of recovery. This is rather ad hoc and race prone.
This patch changes it to ->recovery which has bit flags for various
states.
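Something like the following set of bits, all living in a single ->recovery
word (the exact set and names here are an assumption):

    #define MD_RECOVERY_RUNNING  0   /* a resync/recovery thread is active */
    #define MD_RECOVERY_SYNC     1   /* resync rather than reconstruction */
    #define MD_RECOVERY_ERR      2   /* the sync hit an error */
    #define MD_RECOVERY_INTR     3   /* the sync was interrupted */
    #define MD_RECOVERY_DONE     4   /* thread finished; main code cleans up */

    if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
        return;     /* a sync thread is already active */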
|
|
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
Mainly a straightforward conversion of sprintf -> seq_printf. seq_start and
seq_next are modelled on /proc/partitions; locking/ref counting is as for
ITERATE_MDDEV.
pos == 0 -> header
pos == n -> nth mddev
pos == 0x10000 -> tail
|
|
When a raid1 or raid5 array is in 'safe-mode', then the array
is marked clean whenever there are no outstanding write requests,
and is marked dirty again before allowing any write request to
proceed.
This means that an unclean shutdown while no write activity is happening
will NOT cause a resync to be required. However it does mean extra
updates to the superblock.
Currently safe-mode is turned on by sending SIGKILL to the raid thread
as would happen at a normal shutdown. This should mean that the
reboot notifier is no longer needed.
After looking more at performance issues I may make safemode be on
all the time. I will almost certainly make it on when RAID5 is degraded
as an unclean shutdown of a degraded RAID5 means data loss.
This code was provided by Angus Sawyer <angus.sawyer@dsl.pipex.com>
|
|
This allows the thread to be easily identified and signalled.
The point of signalling will appear in the next patch.
|
|
from there.
Add a new field to the md superblock, in an unused area, to record where
resync was up to on a clean shutdown while resync is active. Restart from
this point.
The extra field is verified by having a second copy of the event counter.
If the second event counter is wrong, we ignore the extra field.
This patch thanks to Angus Sawyer <angus.sawyer@dsl.pipex.com>
|
|
Define an interface for interpreting and updating superblocks
so we can more easily define new formats.
With this patch, (almost) all superblock layout information is
located in a small set of routines dedicated to superblock
handling. This will allow us to provide a similar set for
a different format.
The two exceptions are:
1/ autostart_array where the devices listed in the superblock
are searched for.
2/ raid5 'knows' the maximum number of devices for
compute_parity.
These will be addressed in a later patch.
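In outline, the interface is a small per-format operations table (a sketch;
the 0.90 handler names are illustrative):

    struct super_type {
        char *name;
        int  (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev,
                           int minor_version);
        int  (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
        void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
    };

    static struct super_type super_types[] = {
        [0] = {
            .name            = "0.90.0",
            .load_super      = super_90_load,
            .validate_super  = super_90_validate,
            .sync_super      = super_90_sync,
        },
    };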
|
|
|
|
From Peter Chubb
Compaq Smart array sector_t cleanup: prepare for possible 64-bit sector_t
Clean up loop device to allow huge backing files.
MD transition to 64-bit sector_t.
- Hold sizes and offsets as sector_t not int;
- use 64-bit arithmetic if necessary to map block-in-raid to zone
and block-in-zone
|
|
partition_name() moved from md.c to partitions/check.c; disk_name() is not
exported anymore; partition_name() takes dev_t instead of kdev_t.
|