2019-05-20 20:08:12 +03:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* raid10.c : Multiple Devices driver for Linux
|
|
|
|
*
|
|
|
|
* Copyright (C) 2000-2004 Neil Brown
|
|
|
|
*
|
|
|
|
* RAID-10 support for md.
|
|
|
|
*
|
2011-03-31 05:57:33 +04:00
|
|
|
* Based on code in raid1.c. See raid1.c for further copyright information.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 11:04:11 +03:00
|
|
|
#include <linux/slab.h>
|
2008-10-15 02:09:21 +04:00
|
|
|
#include <linux/delay.h>
|
2009-03-31 07:33:13 +04:00
|
|
|
#include <linux/blkdev.h>
|
2011-07-03 21:58:33 +04:00
|
|
|
#include <linux/module.h>
|
2009-03-31 07:33:13 +04:00
|
|
|
#include <linux/seq_file.h>
|
2011-07-27 05:00:36 +04:00
|
|
|
#include <linux/ratelimit.h>
|
2012-05-22 07:53:47 +04:00
|
|
|
#include <linux/kthread.h>
|
2018-10-18 11:37:41 +03:00
|
|
|
#include <linux/raid/md_p.h>
|
2016-11-18 05:22:04 +03:00
|
|
|
#include <trace/events/block.h>
|
2009-03-31 07:33:13 +04:00
|
|
|
#include "md.h"
|
2009-03-31 07:27:03 +04:00
|
|
|
#include "raid10.h"
|
2010-03-08 08:02:45 +03:00
|
|
|
#include "raid0.h"
|
2017-10-11 00:02:41 +03:00
|
|
|
#include "md-bitmap.h"
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* RAID10 provides a combination of RAID0 and RAID1 functionality.
|
|
|
|
* The layout of data is defined by
|
|
|
|
* chunk_size
|
|
|
|
* raid_disks
|
|
|
|
* near_copies (stored in low byte of layout)
|
|
|
|
* far_copies (stored in second byte of layout)
|
2006-06-26 11:27:41 +04:00
|
|
|
* far_offset (stored in bit 16 of layout)
|
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
widths - copying them to a different location on the same devices after
shifting the stripe. An example layout of each follows below:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
F A B C D E --> Copy of stripe0, but shifted by 1
L G H I J K
...
"offset" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
F A B C D E --> Copy of stripe0, but shifted by 1
G H I J K L
L G H I J K
...
Redundancy for these algorithms is gained by shifting the copied stripes
one device to the right. This patch proposes that the array be divided into
sets of adjacent devices and when the stripe copies are shifted, they wrap
on set boundaries rather than the array size boundary. That is, for the
purposes of shifting, the copies are confined to their sets within the
array. The sets are 'near_copies * far_copies' in size.
The above "far" algorithm example would change to:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets
H G J I L K Dev sets are 1-2, 3-4, 5-6
...
This has the effect of improving the redundancy of the array. We can
always sustain at least one failure, but sometimes more than one can
be handled. In the first examples, the pairs of devices that CANNOT fail
together are:
(1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
In the example where the copies are confined to sets, the pairs of
devices that cannot fail together are:
(1,2) (3,4) (5,6) [20% of possible pairs]
We cannot simply replace the old algorithms, so bit 17 of the 'layout'
variable is used to indicate whether we use the old or new method of computing
the shift. (This is similar to the way bit 16 indicates whether the
"far" algorithm or the "offset" algorithm is being used.)
This patch only handles the cases where the number of total raid disks is
a multiple of 'far_copies'. A follow-on patch addresses the condition where
this is not true.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-21 06:28:10 +04:00
|
|
|
* use_far_sets (stored in bit 17 of layout)
|
2015-10-22 05:20:15 +03:00
|
|
|
* use_far_sets_bugfixed (stored in bit 18 of layout)
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
2013-02-21 06:28:10 +04:00
|
|
|
* The data to be stored is divided into chunks using chunksize. Each device
|
|
|
|
* is divided into far_copies sections. In each section, chunks are laid out
|
|
|
|
* in a style similar to raid0, but near_copies copies of each chunk are stored
|
|
|
|
* (each on a different drive). The starting device for each section is offset
|
|
|
|
* near_copies from the starting device of the previous section. Thus there
|
|
|
|
* are (near_copies * far_copies) of each chunk, and each is on a different
|
|
|
|
* drive. near_copies and far_copies must be at least one, and their product
|
|
|
|
* is at most raid_disks.
|
2006-06-26 11:27:41 +04:00
|
|
|
*
|
|
|
|
* If far_offset is true, then the far_copies are handled a bit differently.
|
2013-02-21 06:28:10 +04:00
|
|
|
* The copies are still in different stripes, but instead of being very far
|
|
|
|
* apart on disk, they are in adjacent stripes.
|
|
|
|
*
|
|
|
|
* The far and offset algorithms are handled slightly differently if
|
|
|
|
* 'use_far_sets' is true. In this case, the array's devices are grouped into
|
|
|
|
* sets that are (near_copies * far_copies) in size. The far copied stripes
|
|
|
|
* are still shifted by 'near_copies' devices, but this shifting stays confined
|
|
|
|
* to the set rather than the entire array. This is done to improve the number
|
|
|
|
* of device combinations that can fail without causing the array to fail.
|
|
|
|
* Example 'far' algorithm w/o 'use_far_sets' (each letter represents a chunk
|
|
|
|
* on a device):
|
|
|
|
* A B C D A B C D E
|
|
|
|
* ... ...
|
|
|
|
* D A B C E A B C D
|
|
|
|
* Example 'far' algorithm w/ 'use_far_sets' enabled (sets illustrated w/ []'s):
|
|
|
|
* [A B] [C D] [A B] [C D E]
|
|
|
|
* |...| |...| |...| | ... |
|
|
|
|
* [B A] [D C] [B A] [E C D]
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
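The layout description above can be sketched in user-space C. This is a minimal illustration rather than the driver's own code: the `demo_` names are hypothetical, bit 18 (`use_far_sets_bugfixed`) is not modelled, and the far-set shift assumes that the set size is (near_copies * far_copies) and that raid_disks is a multiple of it, as in the four-device example in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the decoded fields; not a kernel structure. */
struct demo_layout {
	unsigned int near_copies;	/* low byte of layout */
	unsigned int far_copies;	/* second byte of layout */
	bool far_offset;		/* bit 16 */
	bool use_far_sets;		/* bit 17 */
};

static struct demo_layout demo_decode_layout(uint32_t layout)
{
	struct demo_layout l;

	l.near_copies = layout & 0xff;
	l.far_copies = (layout >> 8) & 0xff;
	l.far_offset = (layout >> 16) & 1;
	l.use_far_sets = (layout >> 17) & 1;
	return l;
}

/*
 * Device holding the next far copy of a chunk whose previous copy is on
 * device 'dev'. The shift is near_copies devices; with far sets it wraps
 * inside the set instead of across the whole array.
 */
static unsigned int demo_next_far_dev(const struct demo_layout *l,
				      unsigned int raid_disks,
				      unsigned int dev)
{
	unsigned int set_size = l->near_copies * l->far_copies;
	unsigned int set_start;

	assert(set_size >= 1 && dev < raid_disks);
	if (!l->use_far_sets)
		return (dev + l->near_copies) % raid_disks;

	/* Simplification of this sketch: only whole sets are handled. */
	assert(raid_disks % set_size == 0);
	set_start = dev - dev % set_size;
	return set_start + (dev % set_size + l->near_copies) % set_size;
}
```

With four devices, near_copies = 1 and far_copies = 2, the sketch sends the chunk on device 3 to device 0 when far sets are off (the "D A B C" row) and to device 2 when they are on (the "[D C]" set).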
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void allow_barrier(struct r10conf *conf);
|
|
|
|
static void lower_barrier(struct r10conf *conf);
|
2013-06-11 08:57:09 +04:00
|
|
|
static int _enough(struct r10conf *conf, int previous, int ignore);
|
2016-11-18 08:16:12 +03:00
|
|
|
static int enough(struct r10conf *conf, int ignore);
|
2012-05-22 07:53:47 +04:00
|
|
|
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
|
|
|
|
int *skipped);
|
|
|
|
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
|
2015-07-20 16:29:37 +03:00
|
|
|
static void end_reshape_write(struct bio *bio);
|
2012-05-22 07:53:47 +04:00
|
|
|
static void end_reshape(struct r10conf *conf);
|
2006-01-06 11:20:13 +03:00
|
|
|
|
2016-11-14 08:30:21 +03:00
|
|
|
#define raid10_log(md, fmt, args...) \
|
|
|
|
do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, ##args); } while (0)
|
|
|
|
|
2017-07-14 11:14:43 +03:00
|
|
|
#include "raid1-10.c"
|
|
|
|
|
2017-03-16 19:12:33 +03:00
|
|
|
/*
|
|
|
|
* for resync bio, r10bio pointer can be retrieved from the per-bio
|
|
|
|
* 'struct resync_pages'.
|
|
|
|
*/
|
|
|
|
static inline struct r10bio *get_resync_r10bio(struct bio *bio)
|
|
|
|
{
|
|
|
|
return get_resync_pages(bio)->raid_bio;
|
|
|
|
}
|
|
|
|
|
2005-10-07 10:46:04 +04:00
|
|
|
static void *r10bio_pool_alloc(gfp_t gfp_flags, void *data)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = data;
|
2011-10-11 09:48:43 +04:00
|
|
|
int size = offsetof(struct r10bio, devs[conf->copies]);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
/* allocate an r10bio with room for conf->copies entries in the
|
|
|
|
* devs array */
|
2011-03-10 10:52:07 +03:00
|
|
|
return kzalloc(size, gfp_flags);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
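The sizing trick in r10bio_pool_alloc() (a fixed header followed by a variable number of per-device entries) can be tried outside the kernel. The sketch below is an assumption-laden stand-in: `struct demo_r10bio`, `struct demo_dev` and the `demo_` functions are hypothetical, calloc() replaces kzalloc(), and the size is written in the portable "offset of the array plus n elements" form instead of the runtime-index offsetof() that the kernel code relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-device entry, loosely modelled on r10bio's devs[]. */
struct demo_dev {
	void *bio;
	long long addr;
	int devnum;
};

/* Hypothetical stand-in for struct r10bio: a header plus a trailing array. */
struct demo_r10bio {
	int sectors;
	struct demo_dev devs[];	/* flexible array member, sized at allocation */
};

/* Bytes needed for the header plus 'copies' devs[] entries. */
static size_t demo_r10bio_size(int copies)
{
	assert(copies > 0);
	return offsetof(struct demo_r10bio, devs) +
	       (size_t)copies * sizeof(struct demo_dev);
}

/* Zero-filled allocation, standing in for kzalloc(size, gfp_flags). */
static struct demo_r10bio *demo_r10bio_alloc(int copies)
{
	return calloc(1, demo_r10bio_size(copies));
}
```

A single allocation keeps the header and its entries contiguous, so one free releases everything; the caller must remember how many entries it asked for.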
|
|
|
|
|
2017-10-24 10:11:52 +03:00
|
|
|
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
|
2008-08-05 09:54:14 +04:00
|
|
|
/* amount of memory to reserve for resync requests */
|
|
|
|
#define RESYNC_WINDOW (1024*1024)
|
|
|
|
/* maximum number of concurrent requests, memory permitting */
|
|
|
|
#define RESYNC_DEPTH (32*1024*1024/RESYNC_BLOCK_SIZE)
|
2018-01-19 06:37:56 +03:00
|
|
|
#define CLUSTER_RESYNC_WINDOW (32 * RESYNC_WINDOW)
|
2017-10-24 10:11:52 +03:00
|
|
|
#define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
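The arithmetic behind these constants can be checked at compile time. In this sketch the `DEMO_` names are hypothetical copies of the macros above, and the 64 KiB value for the block size is an assumption, because RESYNC_BLOCK_SIZE is defined outside this excerpt.

```c
#include <assert.h>

/* Assumption: 64 KiB; RESYNC_BLOCK_SIZE itself is not defined in this excerpt. */
#define DEMO_RESYNC_BLOCK_SIZE	(64 * 1024)

#define DEMO_RESYNC_SECTORS	(DEMO_RESYNC_BLOCK_SIZE >> 9)
#define DEMO_RESYNC_WINDOW	(1024 * 1024)
#define DEMO_RESYNC_DEPTH	(32 * 1024 * 1024 / DEMO_RESYNC_BLOCK_SIZE)
#define DEMO_CLUSTER_RESYNC_WINDOW	(32 * DEMO_RESYNC_WINDOW)
#define DEMO_CLUSTER_RESYNC_WINDOW_SECTORS	(DEMO_CLUSTER_RESYNC_WINDOW >> 9)

/* Convert a byte count to 512-byte sectors, as the ">> 9" shifts do. */
static unsigned long demo_bytes_to_sectors(unsigned long bytes)
{
	return bytes >> 9;
}

/* Under the 64 KiB assumption: 128 sectors per block, 512 blocks in 32 MiB. */
static_assert(DEMO_RESYNC_SECTORS == 128, "64 KiB is 128 sectors");
static_assert(DEMO_RESYNC_DEPTH == 512, "32 MiB holds 512 blocks of 64 KiB");
/* Independent of the block size: a 32 MiB cluster window is 65536 sectors. */
static_assert(DEMO_CLUSTER_RESYNC_WINDOW_SECTORS == 65536,
	      "32 MiB is 65536 sectors");
```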
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When performing a resync, we need to read and compare, so
|
|
|
|
* we need as many pages as there are copies.
|
|
|
|
* When performing a recovery, we need 2 bios, one for read,
|
|
|
|
* one for write (we recover only one drive per r10buf)
|
|
|
|
*
|
|
|
|
*/
|
2005-10-07 10:46:04 +04:00
|
|
|
static void *r10buf_pool_alloc(gfp_t gfp_flags, void *data)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = data;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10_bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct bio *bio;
|
2017-03-16 19:12:33 +03:00
|
|
|
int j;
|
|
|
|
int nalloc, nalloc_rp;
|
|
|
|
struct resync_pages *rps;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
r10_bio = r10bio_pool_alloc(gfp_flags, conf);
|
2011-03-10 10:52:07 +03:00
|
|
|
if (!r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
return NULL;
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &conf->mddev->recovery))
|
2005-04-17 02:20:36 +04:00
|
|
|
nalloc = conf->copies; /* resync */
|
|
|
|
else
|
|
|
|
nalloc = 2; /* recovery */
|
|
|
|
|
2017-03-16 19:12:33 +03:00
|
|
|
/* allocate once for all bios */
|
|
|
|
if (!conf->have_replacement)
|
|
|
|
nalloc_rp = nalloc;
|
|
|
|
else
|
|
|
|
nalloc_rp = nalloc * 2;
|
treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:
kmalloc(a * b, gfp)
with:
kmalloc_array(a, b, gfp)
as well as handling cases of:
kmalloc(a * b * c, gfp)
with:
kmalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kmalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kmalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 23:55:00 +03:00
|
|
|
rps = kmalloc_array(nalloc_rp, sizeof(struct resync_pages), gfp_flags);
|
2017-03-16 19:12:33 +03:00
|
|
|
if (!rps)
|
|
|
|
goto out_free_r10bio;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Allocate bios.
|
|
|
|
*/
|
|
|
|
for (j = nalloc ; j-- ; ) {
|
2010-10-26 10:33:54 +04:00
|
|
|
bio = bio_kmalloc(gfp_flags, RESYNC_PAGES);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!bio)
|
|
|
|
goto out_free_bio;
|
|
|
|
r10_bio->devs[j].bio = bio;
|
2011-12-23 03:17:54 +04:00
|
|
|
if (!conf->have_replacement)
|
|
|
|
continue;
|
|
|
|
bio = bio_kmalloc(gfp_flags, RESYNC_PAGES);
|
|
|
|
if (!bio)
|
|
|
|
goto out_free_bio;
|
|
|
|
r10_bio->devs[j].repl_bio = bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Allocate RESYNC_PAGES data pages and attach them
|
|
|
|
* where needed.
|
|
|
|
*/
|
2017-03-16 19:12:33 +03:00
|
|
|
for (j = 0; j < nalloc; j++) {
|
2011-12-23 03:17:54 +04:00
|
|
|
struct bio *rbio = r10_bio->devs[j].repl_bio;
|
2017-03-16 19:12:33 +03:00
|
|
|
struct resync_pages *rp, *rp_repl;
|
|
|
|
|
|
|
|
rp = &rps[j];
|
|
|
|
if (rbio)
|
|
|
|
rp_repl = &rps[nalloc + j];
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
bio = r10_bio->devs[j].bio;
|
2017-03-16 19:12:33 +03:00
|
|
|
|
|
|
|
if (!j || test_bit(MD_RECOVERY_SYNC,
|
|
|
|
&conf->mddev->recovery)) {
|
|
|
|
if (resync_alloc_pages(rp, gfp_flags))
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_free_pages;
|
2017-03-16 19:12:33 +03:00
|
|
|
} else {
|
|
|
|
memcpy(rp, &rps[0], sizeof(*rp));
|
|
|
|
resync_get_all_pages(rp);
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-03-16 19:12:33 +03:00
|
|
|
rp->raid_bio = r10_bio;
|
|
|
|
bio->bi_private = rp;
|
|
|
|
if (rbio) {
|
|
|
|
memcpy(rp_repl, rp, sizeof(*rp));
|
|
|
|
rbio->bi_private = rp_repl;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return r10_bio;
|
|
|
|
|
|
|
|
out_free_pages:
|
2017-03-16 19:12:33 +03:00
|
|
|
while (--j >= 0)
|
2019-11-12 03:43:20 +03:00
|
|
|
resync_free_pages(&rps[j]);
|
2017-03-16 19:12:33 +03:00
|
|
|
|
2012-05-22 07:55:03 +04:00
|
|
|
j = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
out_free_bio:
|
2012-05-22 07:55:03 +04:00
|
|
|
for ( ; j < nalloc; j++) {
|
|
|
|
if (r10_bio->devs[j].bio)
|
|
|
|
bio_put(r10_bio->devs[j].bio);
|
2011-12-23 03:17:54 +04:00
|
|
|
if (r10_bio->devs[j].repl_bio)
|
|
|
|
bio_put(r10_bio->devs[j].repl_bio);
|
|
|
|
}
|
2017-03-16 19:12:33 +03:00
|
|
|
kfree(rps);
|
|
|
|
out_free_r10bio:
|
2019-06-15 01:41:10 +03:00
|
|
|
rbio_pool_free(r10_bio, conf);
|
2005-04-17 02:20:36 +04:00
|
|
|
return NULL;
|
|
|
|
}
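The error handling in r10buf_pool_alloc() follows the usual staged-unwind pattern: each label releases only what was set up before the failing step. A minimal user-space sketch of that pattern follows. Everything named `demo_` or `DEMO_` is hypothetical, malloc()/free() stand in for the kernel allocators, and the `fail_at` parameter is a test hook added for this illustration only.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NSLOTS 4

/* Hypothetical object owning several buffers, loosely like an r10bio owning bios. */
struct demo_buf {
	void *slot[DEMO_NSLOTS];
};

/*
 * Allocate the container, then each slot in turn. 'fail_at' is the slot
 * index at which to simulate an allocation failure, or -1 for none.
 */
static struct demo_buf *demo_buf_alloc(int fail_at)
{
	struct demo_buf *b;
	int j;

	assert(fail_at < DEMO_NSLOTS);
	b = calloc(1, sizeof(*b));
	if (!b)
		return NULL;

	for (j = 0; j < DEMO_NSLOTS; j++) {
		b->slot[j] = (j == fail_at) ? NULL : malloc(64);
		if (!b->slot[j])
			goto out_free_slots;
	}
	return b;

out_free_slots:
	/* Unwind only what was set up: slots [0, j). */
	while (--j >= 0)
		free(b->slot[j]);
	free(b);
	return NULL;
}

/* Release a fully constructed object; accepts NULL like free(). */
static void demo_buf_free(struct demo_buf *b)
{
	int j;

	if (!b)
		return;
	for (j = 0; j < DEMO_NSLOTS; j++)
		free(b->slot[j]);
	free(b);
}
```

Counting `j` back down from the failing index mirrors the `while (--j >= 0)` loop under out_free_pages, which frees only the page sets that were actually allocated.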
|
|
|
|
|
|
|
|
static void r10buf_pool_free(void *__r10_bio, void *data)
|
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = data;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10bio = __r10_bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
int j;
|
2017-03-16 19:12:33 +03:00
|
|
|
struct resync_pages *rp = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-03-16 19:12:33 +03:00
|
|
|
for (j = conf->copies; j--; ) {
|
2005-04-17 02:20:36 +04:00
|
|
|
struct bio *bio = r10bio->devs[j].bio;
|
2017-03-16 19:12:33 +03:00
|
|
|
|
2018-04-26 05:56:37 +03:00
|
|
|
if (bio) {
|
|
|
|
rp = get_resync_pages(bio);
|
|
|
|
resync_free_pages(rp);
|
|
|
|
bio_put(bio);
|
|
|
|
}
|
2017-03-16 19:12:33 +03:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
bio = r10bio->devs[j].repl_bio;
|
|
|
|
if (bio)
|
|
|
|
bio_put(bio);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2017-03-16 19:12:33 +03:00
|
|
|
|
|
|
|
/* resync pages array stored in the 1st bio's .bi_private */
|
|
|
|
kfree(rp);
|
|
|
|
|
2019-06-15 01:41:10 +03:00
|
|
|
rbio_pool_free(r10bio, conf);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < conf->copies; i++) {
|
|
|
|
struct bio **bio = &r10_bio->devs[i].bio;
|
2011-07-28 05:39:24 +04:00
|
|
|
if (!BIO_SPECIAL(*bio))
|
2005-04-17 02:20:36 +04:00
|
|
|
bio_put(*bio);
|
|
|
|
*bio = NULL;
|
2011-12-23 03:17:54 +04:00
|
|
|
bio = &r10_bio->devs[i].repl_bio;
|
|
|
|
if (r10_bio->read_slot < 0 && !BIO_SPECIAL(*bio))
|
|
|
|
bio_put(*bio);
|
|
|
|
*bio = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void free_r10bio(struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
put_all_bios(conf, r10_bio);
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_free(r10_bio, &conf->r10bio_pool);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void put_buf(struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_free(r10_bio, &conf->r10buf_pool);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2006-01-06 11:20:13 +03:00
|
|
|
lower_barrier(conf);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void reschedule_retry(struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
2011-10-11 09:47:53 +04:00
|
|
|
struct mddev *mddev = r10_bio->mddev;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
|
|
|
list_add(&r10_bio->retry_list, &conf->retry_list);
|
2006-01-06 11:20:28 +03:00
|
|
|
conf->nr_queued++;
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
|
2008-07-25 23:03:38 +04:00
|
|
|
/* wake up frozen array... */
|
|
|
|
wake_up(&conf->wait_barrier);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* raid_end_bio_io() is called when we have finished servicing a mirrored
|
|
|
|
* operation and are ready to return a success/failure code to the buffer
|
|
|
|
* cache layer.
|
|
|
|
*/
|
2011-10-11 09:48:43 +04:00
|
|
|
static void raid_end_bio_io(struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
struct bio *bio = r10_bio->master_bio;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-07-28 05:39:23 +04:00
|
|
|
if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
|
2017-06-03 10:38:06 +03:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
md/raid10: stop using bi_phys_segments
raid10 currently repurposes bi_phys_segments on each
incoming bio to count how many r10bios were used to encode the
request.
We need to know when the number of attached r10bio reaches
zero to:
1/ call bio_endio() when all IO on the bio is finished
2/ decrement ->nr_pending so that resync IO can proceed.
Now that the bio has its own __bi_remaining counter, that
can be used instead. We can call bio_inc_remaining to
increment the counter and call bio_endio() every time an
r10bio completes, rather than only when bi_phys_segments
reaches zero.
This addresses point 1, but not point 2. bio_endio()
doesn't (and cannot) report when the last r10bio has
finished, so a different approach is needed.
So: instead of counting bios in ->nr_pending, count r10bios.
i.e. every time we attach a bio, increment nr_pending.
Every time an r10bio completes, decrement nr_pending.
Normally we only increment nr_pending after first checking
that ->barrier is zero, or some other non-trivial tests and
possible waiting. When attaching multiple r10bios to a bio,
we only need the tests and the waiting once. After the
first increment, subsequent increments can happen
unconditionally as they are really all part of the one
request.
So introduce inc_pending() which can be used when we know
that nr_pending is already elevated.
Note that this fixes a bug. freeze_array() contains the line
atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
which implies that the units for ->nr_pending, ->nr_queued and extra
are the same.
->nr_queued and extra count r10_bios, but prior to this patch,
->nr_pending counted bios. If a bio ever resulted in multiple
r10_bios (due to bad blocks), freeze_array() would not work correctly.
Now it does.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 06:05:13 +03:00
|
|
|
|
|
|
|
bio_endio(bio);
|
|
|
|
/*
|
|
|
|
* Wake up any possible resync thread that waits for the device
|
|
|
|
* to go idle.
|
|
|
|
*/
|
|
|
|
allow_barrier(conf);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
free_r10bio(r10_bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update disk head position estimator based on IRQ completion info.
|
|
|
|
*/
|
2011-10-11 09:48:43 +04:00
|
|
|
static inline void update_head_pos(int slot, struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
conf->mirrors[r10_bio->devs[slot].devnum].head_position =
|
|
|
|
r10_bio->devs[slot].addr + (r10_bio->sectors);
|
|
|
|
}
|
|
|
|
|
2011-07-18 11:38:47 +04:00
|
|
|
/*
|
|
|
|
* Find the disk number which triggered given bio
|
|
|
|
*/
|
2011-10-11 09:49:02 +04:00
|
|
|
static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
|
2011-12-23 03:17:54 +04:00
|
|
|
struct bio *bio, int *slotp, int *replp)
|
2011-07-18 11:38:47 +04:00
|
|
|
{
|
|
|
|
int slot;
|
2011-12-23 03:17:54 +04:00
|
|
|
int repl = 0;
|
2011-07-18 11:38:47 +04:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
for (slot = 0; slot < conf->copies; slot++) {
|
2011-07-18 11:38:47 +04:00
|
|
|
if (r10_bio->devs[slot].bio == bio)
|
|
|
|
break;
|
2011-12-23 03:17:54 +04:00
|
|
|
if (r10_bio->devs[slot].repl_bio == bio) {
|
|
|
|
repl = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2011-07-18 11:38:47 +04:00
|
|
|
|
|
|
|
BUG_ON(slot == conf->copies);
|
|
|
|
update_head_pos(slot, r10_bio);
|
|
|
|
|
2011-07-28 05:39:24 +04:00
|
|
|
if (slotp)
|
|
|
|
*slotp = slot;
|
2011-12-23 03:17:54 +04:00
|
|
|
if (replp)
|
|
|
|
*replp = repl;
|
2011-07-18 11:38:47 +04:00
|
|
|
return r10_bio->devs[slot].devnum;
|
|
|
|
}
|
|
|
|
|
2015-07-20 16:29:37 +03:00
|
|
|
static void raid10_end_read_request(struct bio *bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2017-06-03 10:38:06 +03:00
|
|
|
int uptodate = !bio->bi_status;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10_bio = bio->bi_private;
|
2017-10-11 13:46:54 +03:00
|
|
|
int slot;
|
2011-12-23 03:17:54 +04:00
|
|
|
struct md_rdev *rdev;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
slot = r10_bio->read_slot;
|
2011-12-23 03:17:54 +04:00
|
|
|
rdev = r10_bio->devs[slot].rdev;
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* this branch is our 'one mirror IO has finished' event handler:
|
|
|
|
*/
|
2006-01-06 11:20:28 +03:00
|
|
|
update_head_pos(slot, r10_bio);
|
|
|
|
|
|
|
|
if (uptodate) {
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Set R10BIO_Uptodate in our master bio, so that
|
|
|
|
* we will return a good error code to the higher
|
|
|
|
* levels even if IO on some other mirrored buffer fails.
|
|
|
|
*
|
|
|
|
* The 'master' represents the composite IO operation to
|
|
|
|
* user-side. So if something waits for IO, then it will
|
|
|
|
* wait for the 'master' bio.
|
|
|
|
*/
|
|
|
|
set_bit(R10BIO_Uptodate, &r10_bio->state);
|
2012-02-14 04:10:10 +04:00
|
|
|
} else {
|
|
|
|
/* If all other devices that store this block have
|
|
|
|
* failed, we want to return the error upwards rather
|
|
|
|
* than fail the last device. Here we redefine
|
|
|
|
* "uptodate" to mean "Don't want to retry"
|
|
|
|
*/
|
2013-06-11 08:57:09 +04:00
|
|
|
if (!_enough(conf, test_bit(R10BIO_Previous, &r10_bio->state),
|
|
|
|
rdev->raid_disk))
|
2012-02-14 04:10:10 +04:00
|
|
|
uptodate = 1;
|
|
|
|
}
|
|
|
|
if (uptodate) {
|
2005-04-17 02:20:36 +04:00
|
|
|
raid_end_bio_io(r10_bio);
|
2011-12-23 03:17:54 +04:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2006-01-06 11:20:28 +03:00
|
|
|
} else {
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
2011-05-11 08:53:17 +04:00
|
|
|
* oops, read error - keep the refcount on the rdev
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
char b[BDEVNAME_SIZE];
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_err_ratelimited("md/raid10:%s: %s: rescheduling sector %llu\n",
|
2011-07-27 05:00:36 +04:00
|
|
|
mdname(conf->mddev),
|
2011-12-23 03:17:54 +04:00
|
|
|
bdevname(rdev->bdev, b),
|
2011-07-27 05:00:36 +04:00
|
|
|
(unsigned long long)r10_bio->sector);
|
2011-07-28 05:39:23 +04:00
|
|
|
set_bit(R10BIO_ReadError, &r10_bio->state);
|
2005-04-17 02:20:36 +04:00
|
|
|
reschedule_retry(r10_bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void close_write(struct r10bio *r10_bio)
|
2011-07-28 05:39:24 +04:00
|
|
|
{
|
|
|
|
/* clear the bitmap if all writes complete successfully */
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_endwrite(r10_bio->mddev->bitmap, r10_bio->sector,
|
|
|
|
r10_bio->sectors,
|
|
|
|
!test_bit(R10BIO_Degraded, &r10_bio->state),
|
|
|
|
0);
|
2011-07-28 05:39:24 +04:00
|
|
|
md_write_end(r10_bio->mddev);
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void one_write_done(struct r10bio *r10_bio)
|
2011-09-10 11:21:17 +04:00
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&r10_bio->remaining)) {
|
|
|
|
if (test_bit(R10BIO_WriteError, &r10_bio->state))
|
|
|
|
reschedule_retry(r10_bio);
|
|
|
|
else {
|
|
|
|
close_write(r10_bio);
|
|
|
|
if (test_bit(R10BIO_MadeGood, &r10_bio->state))
|
|
|
|
reschedule_retry(r10_bio);
|
|
|
|
else
|
|
|
|
raid_end_bio_io(r10_bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
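The completion pattern used by one_write_done() can be sketched in user space: every per-device write completion decrements a shared counter, and only the caller that takes it to zero decides how the request ends. This is a minimal illustration using C11 atomics, not the kernel code. `struct request`, `enum outcome` and `request_complete_one()` are hypothetical names, and the close_write() bitmap update is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of struct r10bio used here. */
struct request {
	atomic_int remaining;	/* outstanding per-device writes */
	bool write_error;	/* plays the role of R10BIO_WriteError */
	bool made_good;		/* plays the role of R10BIO_MadeGood */
};

enum outcome { STILL_PENDING, NEEDS_RETRY, FINISHED };

/*
 * Called once per completed device write. Only the caller that drops
 * the counter to zero decides the outcome; every other caller returns
 * STILL_PENDING. atomic_fetch_sub() returns the value held before the
 * subtraction, so a result of 1 means this call reached zero.
 */
static enum outcome request_complete_one(struct request *req)
{
	/* Sanity check only: completing more writes than issued is a bug. */
	assert(atomic_load(&req->remaining) > 0);

	if (atomic_fetch_sub(&req->remaining, 1) != 1)
		return STILL_PENDING;
	if (req->write_error || req->made_good)
		return NEEDS_RETRY;	/* hand off to the retry thread */
	return FINISHED;		/* safe to complete the master bio */
}
```

With `remaining` initialised to 2 and no flags set, the first call returns STILL_PENDING and the second returns FINISHED. If either flag is set, the last call returns NEEDS_RETRY instead.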
|
|
|
|
|
2015-07-20 16:29:37 +03:00
|
|
|
static void raid10_end_write_request(struct bio *bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10_bio = bio->bi_private;
|
2011-07-18 11:38:47 +04:00
|
|
|
int dev;
|
2011-07-28 05:39:24 +04:00
|
|
|
int dec_rdev = 1;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = r10_bio->mddev->private;
|
2011-12-23 03:17:55 +04:00
|
|
|
int slot, repl;
|
2011-12-23 03:17:55 +04:00
|
|
|
struct md_rdev *rdev = NULL;
|
2016-11-18 08:16:12 +03:00
|
|
|
struct bio *to_put = NULL;
|
2016-10-07 00:13:52 +03:00
|
|
|
bool discard_error;
|
|
|
|
|
2017-06-03 10:38:06 +03:00
|
|
|
discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
if (repl)
|
|
|
|
rdev = conf->mirrors[dev].replacement;
|
2011-12-23 03:17:55 +04:00
|
|
|
if (!rdev) {
|
|
|
|
smp_rmb();
|
|
|
|
repl = 0;
|
2011-12-23 03:17:55 +04:00
|
|
|
rdev = conf->mirrors[dev].rdev;
|
2011-12-23 03:17:55 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* this branch is our 'one mirror IO has finished' event handler:
|
|
|
|
*/
|
2017-06-03 10:38:06 +03:00
|
|
|
if (bio->bi_status && !discard_error) {
|
2011-12-23 03:17:55 +04:00
|
|
|
if (repl)
|
|
|
|
/* Never record new bad blocks to replacement,
|
|
|
|
* just fail it.
|
|
|
|
*/
|
|
|
|
md_error(rdev->mddev, rdev);
|
|
|
|
else {
|
|
|
|
set_bit(WriteErrorSeen, &rdev->flags);
|
2011-12-23 03:17:56 +04:00
|
|
|
if (!test_and_set_bit(WantReplacement, &rdev->flags))
|
|
|
|
set_bit(MD_RECOVERY_NEEDED,
|
|
|
|
&rdev->mddev->recovery);
|
2016-11-18 08:16:12 +03:00
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
dec_rdev = 0;
|
2016-11-18 08:16:12 +03:00
|
|
|
if (test_bit(FailFast, &rdev->flags) &&
|
|
|
|
(bio->bi_opf & MD_FAILFAST)) {
|
|
|
|
md_error(rdev->mddev, rdev);
|
2019-07-19 08:48:47 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When the device is faulty, it is not necessary to
|
|
|
|
* handle write error.
|
|
|
|
* For failfast, this is the only remaining device,
|
|
|
|
* we need to retry the write without FailFast.
|
|
|
|
*/
|
|
|
|
if (!test_bit(Faulty, &rdev->flags))
|
2016-11-18 08:16:12 +03:00
|
|
|
set_bit(R10BIO_WriteError, &r10_bio->state);
|
2019-07-19 08:48:47 +03:00
|
|
|
else {
|
|
|
|
r10_bio->devs[slot].bio = NULL;
|
|
|
|
to_put = bio;
|
|
|
|
dec_rdev = 1;
|
|
|
|
}
|
2011-12-23 03:17:55 +04:00
|
|
|
}
|
2011-07-28 05:39:24 +04:00
|
|
|
} else {
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Set R10BIO_Uptodate in our master bio, so that
|
|
|
|
* we will return a good error code to the higher
|
|
|
|
* levels even if IO on some other mirrored buffer fails.
|
|
|
|
*
|
|
|
|
* The 'master' represents the composite IO operation to
|
|
|
|
* user-side. So if something waits for IO, then it will
|
|
|
|
* wait for the 'master' bio.
|
|
|
|
*/
|
2011-07-28 05:39:24 +04:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
|
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-04 21:42:21 +04:00
|
|
|
/*
|
|
|
|
* Do not set R10BIO_Uptodate if the current device is
|
|
|
|
* rebuilding or Faulty. This is because we cannot use
|
|
|
|
* such device for properly reading the data back (we could
|
|
|
|
* potentially use it, if the current write would have fallen
|
|
|
|
* before rdev->recovery_offset, but for simplicity we don't
|
|
|
|
* check this here.
|
|
|
|
*/
|
|
|
|
if (test_bit(In_sync, &rdev->flags) &&
|
|
|
|
!test_bit(Faulty, &rdev->flags))
|
|
|
|
set_bit(R10BIO_Uptodate, &r10_bio->state);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-07-28 05:39:24 +04:00
|
|
|
/* Maybe we can clear some bad blocks. */
|
2011-12-23 03:17:55 +04:00
|
|
|
if (is_badblock(rdev,
|
2011-07-28 05:39:24 +04:00
|
|
|
r10_bio->devs[slot].addr,
|
|
|
|
r10_bio->sectors,
|
2016-10-07 00:13:52 +03:00
|
|
|
&first_bad, &bad_sectors) && !discard_error) {
|
2011-07-28 05:39:24 +04:00
|
|
|
bio_put(bio);
|
2011-12-23 03:17:55 +04:00
|
|
|
if (repl)
|
|
|
|
r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;
|
|
|
|
else
|
|
|
|
r10_bio->devs[slot].bio = IO_MADE_GOOD;
|
2011-07-28 05:39:24 +04:00
|
|
|
dec_rdev = 0;
|
|
|
|
set_bit(R10BIO_MadeGood, &r10_bio->state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
*
|
|
|
|
* Let's see if all mirrored write operations have finished
|
|
|
|
* already.
|
|
|
|
*/
|
2011-09-10 11:21:17 +04:00
|
|
|
one_write_done(r10_bio);
|
2011-07-28 05:39:24 +04:00
|
|
|
if (dec_rdev)
|
2012-11-22 08:12:09 +04:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2016-11-18 08:16:12 +03:00
|
|
|
if (to_put)
|
|
|
|
bio_put(to_put);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RAID10 layout manager
|
2011-03-31 05:57:33 +04:00
|
|
|
* As well as the chunksize and raid_disks count, there are two
|
2005-04-17 02:20:36 +04:00
|
|
|
* parameters: near_copies and far_copies.
|
|
|
|
* near_copies * far_copies must be <= raid_disks.
|
|
|
|
* Normally one of these will be 1.
|
|
|
|
* If both are 1, we get raid0.
|
|
|
|
* If near_copies == raid_disks, we get raid1.
|
|
|
|
*
|
2011-03-31 05:57:33 +04:00
|
|
|
* Chunks are laid out in raid0 style with near_copies copies of the
|
2005-04-17 02:20:36 +04:00
|
|
|
* first chunk, followed by near_copies copies of the next chunk and
|
|
|
|
* so on.
|
|
|
|
* If far_copies > 1, then after 1/far_copies of the array has been assigned
|
|
|
|
* as described above, we start again with a device offset of near_copies.
|
|
|
|
* So we effectively have another copy of the whole array further down all
|
|
|
|
* the drives, but with blocks on different drives.
|
|
|
|
* With this layout, no block is stored twice on the same device.
|
|
|
|
*
|
|
|
|
* raid10_find_phys finds the sector offset of a given virtual sector
|
2006-06-26 11:27:41 +04:00
|
|
|
* on each device that it is on.
|
2005-04-17 02:20:36 +04:00
|
|
|
*
|
|
|
|
* raid10_find_virt does the reverse mapping, from a device and a
|
|
|
|
* sector offset to a virtual address
|
|
|
|
*/
|
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
static void __raid10_find_phys(struct geom *geo, struct r10bio *r10bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
int n, f;
|
|
|
|
sector_t sector;
|
|
|
|
sector_t chunk;
|
|
|
|
sector_t stripe;
|
|
|
|
int dev;
|
|
|
|
int slot = 0;
|
2013-02-21 06:28:10 +04:00
|
|
|
int last_far_set_start, last_far_set_size;
|
|
|
|
|
|
|
|
last_far_set_start = (geo->raid_disks / geo->far_set_size) - 1;
|
|
|
|
last_far_set_start *= geo->far_set_size;
|
|
|
|
|
|
|
|
last_far_set_size = geo->far_set_size;
|
|
|
|
last_far_set_size += (geo->raid_disks % geo->far_set_size);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* now calculate first sector/dev */
|
2012-05-21 03:28:20 +04:00
|
|
|
chunk = r10bio->sector >> geo->chunk_shift;
|
|
|
|
sector = r10bio->sector & geo->chunk_mask;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
chunk *= geo->near_copies;
|
2005-04-17 02:20:36 +04:00
|
|
|
stripe = chunk;
|
2012-05-21 03:28:20 +04:00
|
|
|
dev = sector_div(stripe, geo->raid_disks);
|
|
|
|
if (geo->far_offset)
|
|
|
|
stripe *= geo->far_copies;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
sector += stripe << geo->chunk_shift;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
/* and calculate all the others */
|
2012-05-21 03:28:20 +04:00
|
|
|
for (n = 0; n < geo->near_copies; n++) {
|
2005-04-17 02:20:36 +04:00
|
|
|
int d = dev;
|
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
widths - copying them to a different location on the same devices after
shifting the stripe. An example layout of each follows below:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
F A B C D E --> Copy of stripe0, but shifted by 1
L G H I J K
...
"offset" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
F A B C D E --> Copy of stripe0, but shifted by 1
G H I J K L
L G H I J K
...
Redundancy for these algorithms is gained by shifting the copied stripes
one device to the right. This patch proposes that array be divided into
sets of adjacent devices and when the stripe copies are shifted, they wrap
on set boundaries rather than the array size boundary. That is, for the
purposes of shifting, the copies are confined to their sets within the
array. The sets are 'near_copies * far_copies' in size.
The above "far" algorithm example would change to:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets
H G J I L K Dev sets are 1-2, 3-4, 5-6
...
This has the effect of improving the redundancy of the array. We can
always sustain at least one failure, but sometimes more than one can
be handled. In the first examples, the pairs of devices that CANNOT fail
together are:
(1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
In the example where the copies are confined to sets, the pairs of
devices that cannot fail together are:
(1,2) (3,4) (5,6) [20% of possible pairs]
We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
variable is used to indicate whether we use the old or new method of computing
the shift. (This is similar to the way the 16th bit indicates whether the
"far" algorithm or the "offset" algorithm is being used.)
This patch only handles the cases where the number of total raid disks is
a multiple of 'far_copies'. A follow-on patch addresses the condition where
this is not true.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-21 06:28:10 +04:00
|
|
|
int set;
|
2005-04-17 02:20:36 +04:00
|
|
|
sector_t s = sector;
|
|
|
|
r10bio->devs[slot].devnum = d;
|
2013-02-21 06:28:09 +04:00
|
|
|
r10bio->devs[slot].addr = s;
|
2005-04-17 02:20:36 +04:00
|
|
|
slot++;
|
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
for (f = 1; f < geo->far_copies; f++) {
|
2013-02-21 06:28:10 +04:00
|
|
|
set = d / geo->far_set_size;
|
2012-05-21 03:28:20 +04:00
|
|
|
d += geo->near_copies;
|
2013-02-21 06:28:10 +04:00
|
|
|
|
2013-02-21 06:28:10 +04:00
|
|
|
if ((geo->raid_disks % geo->far_set_size) &&
|
|
|
|
(d > last_far_set_start)) {
|
|
|
|
d -= last_far_set_start;
|
|
|
|
d %= last_far_set_size;
|
|
|
|
d += last_far_set_start;
|
|
|
|
} else {
|
|
|
|
d %= geo->far_set_size;
|
|
|
|
d += geo->far_set_size * set;
|
|
|
|
}
|
2012-05-21 03:28:20 +04:00
|
|
|
s += geo->stride;
|
2005-04-17 02:20:36 +04:00
|
|
|
r10bio->devs[slot].devnum = d;
|
|
|
|
r10bio->devs[slot].addr = s;
|
|
|
|
slot++;
|
|
|
|
}
|
|
|
|
dev++;
|
2012-05-21 03:28:20 +04:00
|
|
|
if (dev >= geo->raid_disks) {
|
2005-04-17 02:20:36 +04:00
|
|
|
dev = 0;
|
2012-05-21 03:28:20 +04:00
|
|
|
sector += (geo->chunk_mask + 1);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
2012-05-21 03:28:33 +04:00
|
|
|
}
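To check the layout rules against concrete numbers, the mapping above can be ported to user space. The following is a sketch under assumptions, not the kernel function: `struct geom_sketch`, `struct copy_loc` and `find_phys_sketch()` are hypothetical names, `sector_div()` is replaced by plain `/` and `%`, and the `stride` value used in the worked example is made up (the kernel derives it from the device size).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the geometry fields used above. */
struct geom_sketch {
	int raid_disks, near_copies, far_copies;
	int far_offset;		/* non-zero selects the "offset" layout */
	uint64_t stride;	/* sector distance between far copies */
	int far_set_size;
	int chunk_shift;	/* log2 of the chunk size in sectors */
	uint64_t chunk_mask;	/* chunk size in sectors, minus one */
};

struct copy_loc {
	int devnum;
	uint64_t addr;
};

/* Fills out[] with near_copies * far_copies locations; returns the count. */
static int find_phys_sketch(const struct geom_sketch *geo, uint64_t vsector,
			    struct copy_loc *out)
{
	int last_start = (geo->raid_disks / geo->far_set_size - 1) *
			 geo->far_set_size;
	int last_size = geo->far_set_size +
			geo->raid_disks % geo->far_set_size;
	uint64_t chunk = (vsector >> geo->chunk_shift) * geo->near_copies;
	uint64_t sector = vsector & geo->chunk_mask;
	uint64_t stripe = chunk / geo->raid_disks;
	int dev = (int)(chunk % geo->raid_disks);
	int slot = 0;

	if (geo->far_offset)
		stripe *= geo->far_copies;
	sector += stripe << geo->chunk_shift;

	for (int n = 0; n < geo->near_copies; n++) {
		int d = dev;
		uint64_t s = sector;

		out[slot++] = (struct copy_loc){ d, s };
		for (int f = 1; f < geo->far_copies; f++) {
			int set = d / geo->far_set_size;

			d += geo->near_copies;
			/* Wrap inside the far set, not the whole array. */
			if ((geo->raid_disks % geo->far_set_size) &&
			    d > last_start)
				d = (d - last_start) % last_size + last_start;
			else
				d = d % geo->far_set_size +
				    geo->far_set_size * set;
			s += geo->stride;
			out[slot++] = (struct copy_loc){ d, s };
		}
		if (++dev >= geo->raid_disks) {
			dev = 0;
			sector += geo->chunk_mask + 1;
		}
	}
	return slot;
}
```

Take 6 disks, near_copies = 1, far_copies = 2, 8-sector chunks and an assumed stride of 1000. With far_set_size = 6, virtual sector 40 (chunk F) maps to device 5 at sector 0 and device 0 at sector 1000, which is the "F A B C D E" rotation from the commit message. With far_set_size = 2, virtual sector 8 (chunk B) maps to devices 1 and 0, matching "B A D C F E".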
|
|
|
|
|
|
|
|
static void raid10_find_phys(struct r10conf *conf, struct r10bio *r10bio)
|
|
|
|
{
|
|
|
|
struct geom *geo = &conf->geo;
|
|
|
|
|
|
|
|
if (conf->reshape_progress != MaxSector &&
|
|
|
|
((r10bio->sector >= conf->reshape_progress) !=
|
|
|
|
conf->mddev->reshape_backwards)) {
|
|
|
|
set_bit(R10BIO_Previous, &r10bio->state);
|
|
|
|
geo = &conf->prev;
|
|
|
|
} else
|
|
|
|
clear_bit(R10BIO_Previous, &r10bio->state);
|
|
|
|
|
|
|
|
__raid10_find_phys(geo, r10bio);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
sector_t offset, chunk, vchunk;
|
2012-05-21 03:28:33 +04:00
|
|
|
/* Never use conf->prev as this is only called during resync
|
|
|
|
* or recovery, so reshape isn't happening
|
|
|
|
*/
|
2012-05-21 03:28:20 +04:00
|
|
|
struct geom *geo = &conf->geo;
|
2013-02-21 06:28:10 +04:00
|
|
|
int far_set_start = (dev / geo->far_set_size) * geo->far_set_size;
|
|
|
|
int far_set_size = geo->far_set_size;
|
2013-02-21 06:28:10 +04:00
|
|
|
int last_far_set_start;
|
|
|
|
|
|
|
|
if (geo->raid_disks % geo->far_set_size) {
|
|
|
|
last_far_set_start = (geo->raid_disks / geo->far_set_size) - 1;
|
|
|
|
last_far_set_start *= geo->far_set_size;
|
|
|
|
|
|
|
|
if (dev >= last_far_set_start) {
|
|
|
|
far_set_size = geo->far_set_size;
|
|
|
|
far_set_size += (geo->raid_disks % geo->far_set_size);
|
|
|
|
far_set_start = last_far_set_start;
|
|
|
|
}
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
offset = sector & geo->chunk_mask;
|
|
|
|
if (geo->far_offset) {
|
2006-06-26 11:27:41 +04:00
|
|
|
int fc;
|
2012-05-21 03:28:20 +04:00
|
|
|
chunk = sector >> geo->chunk_shift;
|
|
|
|
fc = sector_div(chunk, geo->far_copies);
|
|
|
|
dev -= fc * geo->near_copies;
|
2013-02-21 06:28:10 +04:00
|
|
|
if (dev < far_set_start)
|
|
|
|
dev += far_set_size;
|
2006-06-26 11:27:41 +04:00
|
|
|
} else {
|
2012-05-21 03:28:20 +04:00
|
|
|
while (sector >= geo->stride) {
|
|
|
|
sector -= geo->stride;
|
2013-02-21 06:28:10 +04:00
|
|
|
if (dev < (geo->near_copies + far_set_start))
|
|
|
|
dev += far_set_size - geo->near_copies;
|
2006-06-26 11:27:41 +04:00
|
|
|
else
|
2012-05-21 03:28:20 +04:00
|
|
|
dev -= geo->near_copies;
|
2006-06-26 11:27:41 +04:00
|
|
|
}
|
2012-05-21 03:28:20 +04:00
|
|
|
chunk = sector >> geo->chunk_shift;
|
2006-06-26 11:27:41 +04:00
|
|
|
}
|
2012-05-21 03:28:20 +04:00
|
|
|
vchunk = chunk * geo->raid_disks + dev;
|
|
|
|
sector_div(vchunk, geo->near_copies);
|
|
|
|
return (vchunk << geo->chunk_shift) + offset;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
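The last three statements of the function above turn a (chunk, device) position back into an array sector. A standalone sketch of just that arithmetic follows; it leaves out the earlier far-copy adjustment of `dev`, uses plain division in place of `sector_div()`, and `virt_sector` with its parameter list is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a per-device chunk number and device index back to an array sector.
 * Devices that hold near copies of the same chunk divide down to the
 * same virtual chunk number.
 */
static uint64_t virt_sector(uint64_t chunk, unsigned int dev,
			    unsigned int raid_disks, unsigned int near_copies,
			    unsigned int chunk_shift, uint64_t offset)
{
	uint64_t vchunk;

	assert(near_copies > 0 && dev < raid_disks);
	vchunk = chunk * raid_disks + dev;
	vchunk /= near_copies;
	return (vchunk << chunk_shift) + offset;
}
```

With 4 disks, 2 near copies and 128-sector chunks (`chunk_shift` of 7), chunk 3 on device 2 at offset 5 gives vchunk (3 * 4 + 2) / 2 = 7, so the array sector is 7 * 128 + 5 = 901.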
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This routine returns the disk from which the requested read should
|
|
|
|
* be done. There is a per-array 'next expected sequential IO' sector
|
|
|
|
* number - if this matches on the next IO then we use the last disk.
|
|
|
|
* There is also a per-disk 'last known head position' sector that is
|
|
|
|
* maintained from IRQ contexts, both the normal and the resync IO
|
|
|
|
* completion handlers update this position correctly. If there is no
|
|
|
|
* perfect sequential match then we pick the disk whose head is closest.
|
|
|
|
*
|
|
|
|
* If there are 2 mirrors in the same 2 devices, performance degrades
|
|
|
|
* because position is mirror, not device based.
|
|
|
|
*
|
|
|
|
* The rdev for the device selected will have nr_pending incremented.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* FIXME: possibly should rethink readbalancing and do it differently
|
|
|
|
* depending on near_copies / far_copies geometry.
|
|
|
|
*/
|
2011-12-23 03:17:54 +04:00
|
|
|
static struct md_rdev *read_balance(struct r10conf *conf,
|
|
|
|
struct r10bio *r10_bio,
|
|
|
|
int *max_sectors)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2010-05-08 02:20:17 +04:00
|
|
|
const sector_t this_sector = r10_bio->sector;
|
2011-05-11 08:27:03 +04:00
|
|
|
int disk, slot;
|
2011-07-28 05:39:23 +04:00
|
|
|
int sectors = r10_bio->sectors;
|
|
|
|
int best_good_sectors;
|
2011-05-11 08:27:03 +04:00
|
|
|
sector_t new_distance, best_dist;
|
2019-06-15 01:41:11 +03:00
|
|
|
struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
|
2011-05-11 08:27:03 +04:00
|
|
|
int do_balance;
|
2019-06-15 01:41:11 +03:00
|
|
|
int best_dist_slot, best_pending_slot;
|
|
|
|
bool has_nonrot_disk = false;
|
|
|
|
unsigned int min_pending;
|
2012-05-21 03:28:20 +04:00
|
|
|
struct geom *geo = &conf->geo;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
raid10_find_phys(conf, r10_bio);
|
|
|
|
rcu_read_lock();
|
2019-06-15 01:41:11 +03:00
|
|
|
best_dist_slot = -1;
|
|
|
|
min_pending = UINT_MAX;
|
|
|
|
best_dist_rdev = NULL;
|
|
|
|
best_pending_rdev = NULL;
|
2011-05-11 08:27:03 +04:00
|
|
|
best_dist = MaxSector;
|
2011-07-28 05:39:23 +04:00
|
|
|
best_good_sectors = 0;
|
2011-05-11 08:27:03 +04:00
|
|
|
do_balance = 1;
|
2016-11-18 08:16:12 +03:00
|
|
|
clear_bit(R10BIO_FailFast, &r10_bio->state);
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Check if we can balance. We can balance on the whole
|
2006-01-06 11:20:16 +03:00
|
|
|
* device if no resync is going on (recovery is ok), or below
|
|
|
|
* the resync window. We take the first readable disk when
|
|
|
|
* above the resync window.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2017-10-24 10:11:50 +03:00
|
|
|
if ((conf->mddev->recovery_cp < MaxSector
|
|
|
|
&& (this_sector + sectors >= conf->next_resync)) ||
|
|
|
|
(mddev_is_clustered(conf->mddev) &&
|
|
|
|
md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
|
|
|
|
this_sector + sectors)))
|
2011-05-11 08:27:03 +04:00
|
|
|
do_balance = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-05-11 08:27:03 +04:00
|
|
|
for (slot = 0; slot < conf->copies ; slot++) {
|
2011-07-28 05:39:23 +04:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
sector_t dev_sector;
|
2019-06-15 01:41:11 +03:00
|
|
|
unsigned int pending;
|
|
|
|
bool nonrot;
|
2011-07-28 05:39:23 +04:00
|
|
|
|
2011-05-11 08:27:03 +04:00
|
|
|
if (r10_bio->devs[slot].bio == IO_BLOCKED)
|
|
|
|
continue;
|
2005-04-17 02:20:36 +04:00
|
|
|
disk = r10_bio->devs[slot].devnum;
|
2011-12-23 03:17:54 +04:00
|
|
|
rdev = rcu_dereference(conf->mirrors[disk].replacement);
|
|
|
|
if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
|
|
|
|
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
|
|
|
|
rdev = rcu_dereference(conf->mirrors[disk].rdev);
|
2012-03-19 05:46:39 +04:00
|
|
|
if (rdev == NULL ||
|
2015-04-28 09:48:34 +03:00
|
|
|
test_bit(Faulty, &rdev->flags))
|
2011-12-23 03:17:54 +04:00
|
|
|
continue;
|
|
|
|
if (!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
|
2011-05-11 08:27:03 +04:00
|
|
|
continue;
|
|
|
|
|
2011-07-28 05:39:23 +04:00
|
|
|
dev_sector = r10_bio->devs[slot].addr;
|
|
|
|
if (is_badblock(rdev, dev_sector, sectors,
|
|
|
|
&first_bad, &bad_sectors)) {
|
|
|
|
if (best_dist < MaxSector)
|
|
|
|
/* Already have a better slot */
|
|
|
|
continue;
|
|
|
|
if (first_bad <= dev_sector) {
|
|
|
|
/* Cannot read here. If this is the
|
|
|
|
* 'primary' device, then we must not read
|
|
|
|
* beyond 'bad_sectors' from another device.
|
|
|
|
*/
|
|
|
|
bad_sectors -= (dev_sector - first_bad);
|
|
|
|
if (!do_balance && sectors > bad_sectors)
|
|
|
|
sectors = bad_sectors;
|
|
|
|
if (best_good_sectors > sectors)
|
|
|
|
best_good_sectors = sectors;
|
|
|
|
} else {
|
|
|
|
sector_t good_sectors =
|
|
|
|
first_bad - dev_sector;
|
|
|
|
if (good_sectors > best_good_sectors) {
|
|
|
|
best_good_sectors = good_sectors;
|
2019-06-15 01:41:11 +03:00
|
|
|
best_dist_slot = slot;
|
|
|
|
best_dist_rdev = rdev;
|
2011-07-28 05:39:23 +04:00
|
|
|
}
|
|
|
|
if (!do_balance)
|
|
|
|
/* Must read from here */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
} else
|
|
|
|
best_good_sectors = sectors;
|
|
|
|
|
2011-05-11 08:27:03 +04:00
|
|
|
if (!do_balance)
|
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2019-06-15 01:41:11 +03:00
|
|
|
nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
|
|
|
|
has_nonrot_disk |= nonrot;
|
|
|
|
pending = atomic_read(&rdev->nr_pending);
|
|
|
|
if (min_pending > pending && nonrot) {
|
|
|
|
min_pending = pending;
|
|
|
|
best_pending_slot = slot;
|
|
|
|
best_pending_rdev = rdev;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (best_dist_slot >= 0)
|
2016-11-18 08:16:12 +03:00
|
|
|
/* At least 2 disks to choose from so failfast is OK */
|
|
|
|
set_bit(R10BIO_FailFast, &r10_bio->state);
|
2005-11-29 00:44:09 +03:00
|
|
|
/* This optimisation is debatable, and completely destroys
|
|
|
|
* sequential read speed for 'far copies' arrays. So only
|
|
|
|
* keep it for 'near' arrays, and review those later.
|
|
|
|
*/
|
2019-06-15 01:41:11 +03:00
|
|
|
if (geo->near_copies > 1 && !pending)
|
2016-11-18 08:16:12 +03:00
|
|
|
new_distance = 0;
|
2008-03-05 01:29:34 +03:00
|
|
|
|
|
|
|
/* for far > 1 always use the lowest address */
|
2016-11-18 08:16:12 +03:00
|
|
|
else if (geo->far_copies > 1)
|
2011-05-11 08:27:03 +04:00
|
|
|
new_distance = r10_bio->devs[slot].addr;
|
2008-03-05 01:29:34 +03:00
|
|
|
else
|
2011-05-11 08:27:03 +04:00
|
|
|
new_distance = abs(r10_bio->devs[slot].addr -
|
|
|
|
conf->mirrors[disk].head_position);
|
2019-06-15 01:41:11 +03:00
|
|
|
|
2011-05-11 08:27:03 +04:00
|
|
|
if (new_distance < best_dist) {
|
|
|
|
best_dist = new_distance;
|
2019-06-15 01:41:11 +03:00
|
|
|
best_dist_slot = slot;
|
|
|
|
best_dist_rdev = rdev;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
}
|
2011-12-23 03:17:54 +04:00
|
|
|
if (slot >= conf->copies) {
|
2019-06-15 01:41:11 +03:00
|
|
|
if (has_nonrot_disk) {
|
|
|
|
slot = best_pending_slot;
|
|
|
|
rdev = best_pending_rdev;
|
|
|
|
} else {
|
|
|
|
slot = best_dist_slot;
|
|
|
|
rdev = best_dist_rdev;
|
|
|
|
}
|
2011-12-23 03:17:54 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-05-11 08:27:03 +04:00
|
|
|
if (slot >= 0) {
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
r10_bio->read_slot = slot;
|
|
|
|
} else
|
2011-12-23 03:17:54 +04:00
|
|
|
rdev = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
rcu_read_unlock();
|
2011-07-28 05:39:23 +04:00
|
|
|
*max_sectors = best_good_sectors;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
return rdev;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
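The distance metric that read_balance() minimises for rotational disks can be isolated as a small pure function. This is a simplified sketch, not a drop-in replacement: `read_distance` is a hypothetical name, and the bad-block handling, the non-rotational path and the resync window check are all left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Distance used to rank one candidate mirror:
 *  - an idle device in a 'near' array always wins (distance 0);
 *  - with far copies, prefer the lowest device address;
 *  - otherwise use the seek distance from the last head position.
 */
static uint64_t read_distance(int near_copies, int far_copies,
			      unsigned int pending, uint64_t addr,
			      uint64_t head_position)
{
	assert(near_copies >= 1 && far_copies >= 1);
	if (near_copies > 1 && pending == 0)
		return 0;
	if (far_copies > 1)
		return addr;
	return addr > head_position ? addr - head_position
				    : head_position - addr;
}
```

The candidate with the smallest returned value corresponds to `best_dist_slot` in the function above.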
|
|
|
|
|
2014-12-15 04:56:56 +03:00
|
|
|
static int raid10_congested(struct mddev *mddev, int bits)
|
2006-10-03 12:15:54 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2006-10-03 12:15:54 +04:00
|
|
|
int i, ret = 0;
|
|
|
|
|
2015-05-23 00:13:26 +03:00
|
|
|
if ((bits & (1 << WB_async_congested)) &&
|
2011-10-11 09:50:01 +04:00
|
|
|
conf->pending_count >= max_queued_requests)
|
|
|
|
return 1;
|
|
|
|
|
2006-10-03 12:15:54 +04:00
|
|
|
rcu_read_lock();
|
2012-05-21 03:28:33 +04:00
|
|
|
for (i = 0;
|
|
|
|
(i < conf->geo.raid_disks || i < conf->prev.raid_disks)
|
|
|
|
&& ret == 0;
|
|
|
|
i++) {
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
|
2006-10-03 12:15:54 +04:00
|
|
|
if (rdev && !test_bit(Faulty, &rdev->flags)) {
|
2007-07-24 11:28:11 +04:00
|
|
|
struct request_queue *q = bdev_get_queue(rdev->bdev);
|
2006-10-03 12:15:54 +04:00
|
|
|
|
2017-02-02 17:56:50 +03:00
|
|
|
ret |= bdi_congested(q->backing_dev_info, bits);
|
2006-10-03 12:15:54 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void flush_pending_writes(struct r10conf *conf)
|
2008-03-05 01:29:29 +03:00
|
|
|
{
|
|
|
|
/* Any writes that have been queued but are awaiting
|
|
|
|
* bitmap updates get flushed here.
|
|
|
|
*/
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
|
|
|
|
if (conf->pending_bio_list.head) {
|
2017-12-01 23:12:34 +03:00
|
|
|
struct blk_plug plug;
|
2008-03-05 01:29:29 +03:00
|
|
|
struct bio *bio;
|
2017-12-01 23:12:34 +03:00
|
|
|
|
2008-03-05 01:29:29 +03:00
|
|
|
bio = bio_list_get(&conf->pending_bio_list);
|
2011-10-11 09:50:01 +04:00
|
|
|
conf->pending_count = 0;
|
2008-03-05 01:29:29 +03:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2017-12-04 00:21:04 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* As this is called in a wait_event() loop (see freeze_array),
|
|
|
|
* current->state might be TASK_UNINTERRUPTIBLE which will
|
|
|
|
* cause a warning when we prepare to wait again. As it is
|
|
|
|
* rare that this path is taken, it is perfectly safe to force
|
|
|
|
* us to go around the wait_event() loop again, so the warning
|
|
|
|
* is a false-positive. Silence the warning by resetting
|
|
|
|
* thread state
|
|
|
|
*/
|
|
|
|
__set_current_state(TASK_RUNNING);
|
|
|
|
|
2017-12-01 23:12:34 +03:00
|
|
|
blk_start_plug(&plug);
|
2008-03-05 01:29:29 +03:00
|
|
|
/* flush any pending bitmap writes to disk
|
|
|
|
* before proceeding w/ I/O */
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_unplug(conf->mddev->bitmap);
|
2011-10-11 09:50:01 +04:00
|
|
|
wake_up(&conf->wait_barrier);
|
2008-03-05 01:29:29 +03:00
|
|
|
|
|
|
|
while (bio) { /* submit pending writes */
|
|
|
|
struct bio *next = bio->bi_next;
|
2017-08-23 20:10:32 +03:00
|
|
|
struct md_rdev *rdev = (void*)bio->bi_disk;
|
2008-03-05 01:29:29 +03:00
|
|
|
bio->bi_next = NULL;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, rdev->bdev);
|
2016-11-04 08:46:03 +03:00
|
|
|
if (test_bit(Faulty, &rdev->flags)) {
|
2017-07-21 11:33:44 +03:00
|
|
|
bio_io_error(bio);
|
2016-11-04 08:46:03 +03:00
|
|
|
} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
|
2017-08-23 20:10:32 +03:00
|
|
|
!blk_queue_discard(bio->bi_disk->queue)))
|
2012-10-11 06:30:52 +04:00
|
|
|
/* Just ignore it */
|
2015-07-20 16:29:37 +03:00
|
|
|
bio_endio(bio);
|
2012-10-11 06:30:52 +04:00
|
|
|
else
|
|
|
|
generic_make_request(bio);
|
2008-03-05 01:29:29 +03:00
|
|
|
bio = next;
|
|
|
|
}
|
2017-12-01 23:12:34 +03:00
|
|
|
blk_finish_plug(&plug);
|
2008-03-05 01:29:29 +03:00
|
|
|
} else
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
}
|
2011-03-10 10:52:07 +03:00
|
|
|
|
2006-01-06 11:20:13 +03:00
|
|
|
/* Barriers....
|
|
|
|
* Sometimes we need to suspend IO while we do something else,
|
|
|
|
* either some resync/recovery, or reconfigure the array.
|
|
|
|
* To do this we raise a 'barrier'.
|
|
|
|
* The 'barrier' is a counter that can be raised multiple times
|
|
|
|
* to count how many activities are happening which preclude
|
|
|
|
* normal IO.
|
|
|
|
* We can only raise the barrier if there is no pending IO.
|
|
|
|
* i.e. if nr_pending == 0.
|
|
|
|
* We choose only to raise the barrier if no-one is waiting for the
|
|
|
|
* barrier to go down. This means that as soon as an IO request
|
|
|
|
* is ready, no other operations which require a barrier will start
|
|
|
|
* until the IO request has had a chance.
|
|
|
|
*
|
|
|
|
* So: regular IO calls 'wait_barrier'. When that returns there
|
|
|
|
* is no background IO happening.  It must arrange to call
|
|
|
|
* allow_barrier when it has finished its IO.
|
|
|
|
* Background IO must call raise_barrier.  Once that returns
|
|
|
|
* there is no normal IO happening.  It must arrange to call
|
|
|
|
* lower_barrier when the particular background IO completes.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void raise_barrier(struct r10conf *conf, int force)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2006-01-06 11:20:16 +03:00
|
|
|
BUG_ON(force && !conf->barrier);
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_lock_irq(&conf->resync_lock);
|
2006-01-06 11:20:13 +03:00
|
|
|
|
2006-01-06 11:20:16 +03:00
|
|
|
/* Wait until no block IO is waiting (unless 'force') */
|
|
|
|
wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
|
2012-11-30 14:42:40 +04:00
|
|
|
conf->resync_lock);
|
2006-01-06 11:20:13 +03:00
|
|
|
|
|
|
|
/* block any new IO from starting */
|
|
|
|
conf->barrier++;
|
|
|
|
|
2011-04-18 12:25:43 +04:00
|
|
|
/* Now wait for all pending IO to complete */
|
2006-01-06 11:20:13 +03:00
|
|
|
wait_event_lock_irq(conf->wait_barrier,
|
2016-06-24 15:20:16 +03:00
|
|
|
!atomic_read(&conf->nr_pending) && conf->barrier < RESYNC_DEPTH,
|
2012-11-30 14:42:40 +04:00
|
|
|
conf->resync_lock);
|
2006-01-06 11:20:13 +03:00
|
|
|
|
|
|
|
spin_unlock_irq(&conf->resync_lock);
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void lower_barrier(struct r10conf *conf)
|
2006-01-06 11:20:13 +03:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
spin_lock_irqsave(&conf->resync_lock, flags);
|
|
|
|
conf->barrier--;
|
|
|
|
spin_unlock_irqrestore(&conf->resync_lock, flags);
|
|
|
|
wake_up(&conf->wait_barrier);
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void wait_barrier(struct r10conf *conf)
|
2006-01-06 11:20:13 +03:00
|
|
|
{
|
|
|
|
spin_lock_irq(&conf->resync_lock);
|
|
|
|
if (conf->barrier) {
|
|
|
|
conf->nr_waiting++;
|
2012-03-19 05:46:38 +04:00
|
|
|
/* Wait for the barrier to drop.
|
|
|
|
* However if there are already pending
|
|
|
|
* requests (preventing the barrier from
|
|
|
|
* rising completely), and the
|
|
|
|
* pre-process bio queue isn't empty,
|
|
|
|
* then don't wait, as we need to empty
|
|
|
|
* that queue to get the nr_pending
|
|
|
|
* count down.
|
|
|
|
*/
|
2016-11-14 08:30:21 +03:00
|
|
|
raid10_log(conf->mddev, "wait barrier");
|
2012-03-19 05:46:38 +04:00
|
|
|
wait_event_lock_irq(conf->wait_barrier,
|
|
|
|
!conf->barrier ||
|
2016-06-24 15:20:16 +03:00
|
|
|
(atomic_read(&conf->nr_pending) &&
|
2012-03-19 05:46:38 +04:00
|
|
|
current->bio_list &&
|
2017-03-10 09:00:47 +03:00
|
|
|
(!bio_list_empty(&current->bio_list[0]) ||
|
|
|
|
!bio_list_empty(&current->bio_list[1]))),
|
2012-11-30 14:42:40 +04:00
|
|
|
conf->resync_lock);
|
2006-01-06 11:20:13 +03:00
|
|
|
conf->nr_waiting--;
|
2016-06-24 15:20:16 +03:00
|
|
|
if (!conf->nr_waiting)
|
|
|
|
wake_up(&conf->wait_barrier);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2016-06-24 15:20:16 +03:00
|
|
|
atomic_inc(&conf->nr_pending);
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock_irq(&conf->resync_lock);
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void allow_barrier(struct r10conf *conf)
|
2006-01-06 11:20:13 +03:00
|
|
|
{
|
2016-06-24 15:20:16 +03:00
|
|
|
if ((atomic_dec_and_test(&conf->nr_pending)) ||
|
|
|
|
(conf->array_freeze_pending))
|
|
|
|
wake_up(&conf->wait_barrier);
|
2006-01-06 11:20:13 +03:00
|
|
|
}
|
|
|
|
|
2013-06-12 05:01:22 +04:00
|
|
|
static void freeze_array(struct r10conf *conf, int extra)
|
2006-01-06 11:20:28 +03:00
|
|
|
{
|
|
|
|
/* stop syncio and normal IO and wait for everything to
|
2006-01-06 11:20:42 +03:00
|
|
|
* go quiet.
|
2006-01-06 11:20:28 +03:00
|
|
|
* We increment barrier and nr_waiting, and then
|
2013-06-12 05:01:22 +04:00
|
|
|
* wait until nr_pending matches nr_queued+extra
|
2008-03-05 01:29:35 +03:00
|
|
|
* This is called in the context of one normal IO request
|
|
|
|
* that has failed. Thus any sync request that might be pending
|
|
|
|
* will be blocked by nr_pending, and we need to wait for
|
|
|
|
* pending IO requests to complete or be queued for re-try.
|
2013-06-12 05:01:22 +04:00
|
|
|
* Thus the number queued (nr_queued) plus this request (extra)
|
2008-03-05 01:29:35 +03:00
|
|
|
* must match the number of pending IOs (nr_pending) before
|
|
|
|
* we continue.
|
2006-01-06 11:20:28 +03:00
|
|
|
*/
|
|
|
|
spin_lock_irq(&conf->resync_lock);
|
2016-06-24 15:20:16 +03:00
|
|
|
conf->array_freeze_pending++;
|
2006-01-06 11:20:28 +03:00
|
|
|
conf->barrier++;
|
|
|
|
conf->nr_waiting++;
|
2012-11-30 14:42:40 +04:00
|
|
|
wait_event_lock_irq_cmd(conf->wait_barrier,
|
2016-06-24 15:20:16 +03:00
|
|
|
atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
|
2012-11-30 14:42:40 +04:00
|
|
|
conf->resync_lock,
|
|
|
|
flush_pending_writes(conf));
|
2011-04-18 12:25:43 +04:00
|
|
|
|
2016-06-24 15:20:16 +03:00
|
|
|
conf->array_freeze_pending--;
|
2006-01-06 11:20:28 +03:00
|
|
|
spin_unlock_irq(&conf->resync_lock);
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void unfreeze_array(struct r10conf *conf)
|
2006-01-06 11:20:28 +03:00
|
|
|
{
|
|
|
|
/* reverse the effect of the freeze */
|
|
|
|
spin_lock_irq(&conf->resync_lock);
|
|
|
|
conf->barrier--;
|
|
|
|
conf->nr_waiting--;
|
|
|
|
wake_up(&conf->wait_barrier);
|
|
|
|
spin_unlock_irq(&conf->resync_lock);
|
|
|
|
}
|
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
static sector_t choose_data_offset(struct r10bio *r10_bio,
|
|
|
|
struct md_rdev *rdev)
|
|
|
|
{
|
|
|
|
if (!test_bit(MD_RECOVERY_RESHAPE, &rdev->mddev->recovery) ||
|
|
|
|
test_bit(R10BIO_Previous, &r10_bio->state))
|
|
|
|
return rdev->data_offset;
|
|
|
|
else
|
|
|
|
return rdev->new_data_offset;
|
|
|
|
}
|
|
|
|
|
2012-10-11 06:32:13 +04:00
|
|
|
struct raid10_plug_cb {
|
|
|
|
struct blk_plug_cb cb;
|
|
|
|
struct bio_list pending;
|
|
|
|
int pending_cnt;
|
|
|
|
};
|
|
|
|
|
|
|
|
static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
|
|
|
|
{
|
|
|
|
struct raid10_plug_cb *plug = container_of(cb, struct raid10_plug_cb,
|
|
|
|
cb);
|
|
|
|
struct mddev *mddev = plug->cb.data;
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
struct bio *bio;
|
|
|
|
|
2012-11-27 05:14:40 +04:00
|
|
|
if (from_schedule || current->bio_list) {
|
2012-10-11 06:32:13 +04:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
bio_list_merge(&conf->pending_bio_list, &plug->pending);
|
|
|
|
conf->pending_count += plug->pending_cnt;
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2013-02-25 05:38:29 +04:00
|
|
|
wake_up(&conf->wait_barrier);
|
2012-10-11 06:32:13 +04:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
kfree(plug);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* we aren't scheduling, so we can do the write-out directly. */
|
|
|
|
bio = bio_list_get(&plug->pending);
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_unplug(mddev->bitmap);
|
2012-10-11 06:32:13 +04:00
|
|
|
wake_up(&conf->wait_barrier);
|
|
|
|
|
|
|
|
while (bio) { /* submit pending writes */
|
|
|
|
struct bio *next = bio->bi_next;
|
2017-08-23 20:10:32 +03:00
|
|
|
struct md_rdev *rdev = (void*)bio->bi_disk;
|
2012-10-11 06:32:13 +04:00
|
|
|
bio->bi_next = NULL;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, rdev->bdev);
|
2016-11-04 08:46:03 +03:00
|
|
|
if (test_bit(Faulty, &rdev->flags)) {
|
2017-07-21 11:33:44 +03:00
|
|
|
bio_io_error(bio);
|
2016-11-04 08:46:03 +03:00
|
|
|
} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
|
2017-08-23 20:10:32 +03:00
|
|
|
!blk_queue_discard(bio->bi_disk->queue)))
|
2013-04-28 14:26:38 +04:00
|
|
|
/* Just ignore it */
|
2015-07-20 16:29:37 +03:00
|
|
|
bio_endio(bio);
|
2013-04-28 14:26:38 +04:00
|
|
|
else
|
|
|
|
generic_make_request(bio);
|
2012-10-11 06:32:13 +04:00
|
|
|
bio = next;
|
|
|
|
}
|
|
|
|
kfree(plug);
|
|
|
|
}
|
|
|
|
|
2018-12-07 13:24:21 +03:00
|
|
|
/*
|
|
|
|
* 1. Register the new request and wait if the reconstruction thread has put
|
|
|
|
* up a bar for new requests. Continue immediately if no resync is active
|
|
|
|
* currently.
|
|
|
|
* 2. If the IO spans the reshape position, wait for the reshape to pass.
|
|
|
|
*/
|
|
|
|
static void regular_request_wait(struct mddev *mddev, struct r10conf *conf,
|
|
|
|
struct bio *bio, sector_t sectors)
|
|
|
|
{
|
|
|
|
wait_barrier(conf);
|
|
|
|
while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
bio->bi_iter.bi_sector < conf->reshape_progress &&
|
|
|
|
bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
|
|
|
|
raid10_log(conf->mddev, "wait reshape");
|
|
|
|
allow_barrier(conf);
|
|
|
|
wait_event(conf->wait_barrier,
|
|
|
|
conf->reshape_progress <= bio->bi_iter.bi_sector ||
|
|
|
|
conf->reshape_progress >= bio->bi_iter.bi_sector +
|
|
|
|
sectors);
|
|
|
|
wait_barrier(conf);
|
|
|
|
}
|
|
|
|
}
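The loop condition in regular_request_wait() is a test for whether the request straddles the reshape position. A minimal standalone version of that predicate is sketched here; `spans_reshape` is a hypothetical name and the `MD_RECOVERY_RESHAPE` check is assumed to have been made by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [start, start + sectors) begins before the reshape position
 * and ends after it, i.e. part of the request lies on each side.
 */
static bool spans_reshape(uint64_t start, uint64_t sectors,
			  uint64_t reshape_progress)
{
	assert(sectors > 0);
	return start < reshape_progress &&
	       start + sectors > reshape_progress;
}
```

A request that ends exactly at the reshape position, or starts exactly on it, does not span it and proceeds without waiting.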
|
|
|
|
|
2016-12-05 23:02:58 +03:00
|
|
|
static void raid10_read_request(struct mddev *mddev, struct bio *bio,
|
|
|
|
struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct bio *read_bio;
|
2016-12-05 23:02:58 +03:00
|
|
|
const int op = bio_op(bio);
|
|
|
|
const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
|
|
|
|
int max_sectors;
|
|
|
|
struct md_rdev *rdev;
|
2017-04-05 07:05:51 +03:00
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
int slot = r10_bio->read_slot;
|
|
|
|
struct md_rdev *err_rdev = NULL;
|
|
|
|
gfp_t gfp = GFP_NOIO;
|
2016-12-05 23:02:58 +03:00
|
|
|
|
2017-04-05 07:05:51 +03:00
|
|
|
if (r10_bio->devs[slot].rdev) {
|
|
|
|
/*
|
|
|
|
* This is an error retry, but we cannot
|
|
|
|
* safely dereference the rdev in the r10_bio,
|
|
|
|
* we must use the one in conf.
|
|
|
|
* If it has already been disconnected (unlikely)
|
|
|
|
* we lose the device name in error messages.
|
|
|
|
*/
|
|
|
|
int disk;
|
|
|
|
/*
|
|
|
|
* As we are blocking raid10, it is a little safer to
|
|
|
|
* use __GFP_HIGH.
|
|
|
|
*/
|
|
|
|
gfp = GFP_NOIO | __GFP_HIGH;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
disk = r10_bio->devs[slot].devnum;
|
|
|
|
err_rdev = rcu_dereference(conf->mirrors[disk].rdev);
|
|
|
|
if (err_rdev)
|
|
|
|
bdevname(err_rdev->bdev, b);
|
|
|
|
else {
|
|
|
|
strcpy(b, "???");
|
|
|
|
/* This never gets dereferenced */
|
|
|
|
err_rdev = r10_bio->devs[slot].rdev;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2016-12-05 23:02:58 +03:00
|
|
|
|
2018-12-07 13:24:21 +03:00
|
|
|
regular_request_wait(mddev, conf, bio, r10_bio->sectors);
|
2016-12-05 23:02:58 +03:00
|
|
|
rdev = read_balance(conf, r10_bio, &max_sectors);
|
|
|
|
if (!rdev) {
|
2017-04-05 07:05:51 +03:00
|
|
|
if (err_rdev) {
|
|
|
|
pr_crit_ratelimited("md/raid10:%s: %s: unrecoverable I/O read error for block %llu\n",
|
|
|
|
mdname(mddev), b,
|
|
|
|
(unsigned long long)r10_bio->sector);
|
|
|
|
}
|
2016-12-05 23:02:58 +03:00
|
|
|
raid_end_bio_io(r10_bio);
|
|
|
|
return;
|
|
|
|
}
|
2017-04-05 07:05:51 +03:00
|
|
|
if (err_rdev)
|
|
|
|
pr_err_ratelimited("md/raid10:%s: %s: redirecting sector %llu to another mirror\n",
|
|
|
|
mdname(mddev),
|
|
|
|
bdevname(rdev->bdev, b),
|
|
|
|
(unsigned long long)r10_bio->sector);
|
2017-04-05 07:05:51 +03:00
|
|
|
if (max_sectors < bio_sectors(bio)) {
|
|
|
|
struct bio *split = bio_split(bio, max_sectors,
|
2018-05-21 01:25:52 +03:00
|
|
|
gfp, &conf->bio_split);
|
2017-04-05 07:05:51 +03:00
|
|
|
bio_chain(split, bio);
|
2018-12-19 09:19:25 +03:00
|
|
|
allow_barrier(conf);
|
2017-04-05 07:05:51 +03:00
|
|
|
generic_make_request(bio);
|
2018-12-19 09:19:25 +03:00
|
|
|
wait_barrier(conf);
|
2017-04-05 07:05:51 +03:00
|
|
|
bio = split;
|
|
|
|
r10_bio->master_bio = bio;
|
|
|
|
r10_bio->sectors = max_sectors;
|
|
|
|
}
|
2016-12-05 23:02:58 +03:00
|
|
|
slot = r10_bio->read_slot;
|
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
read_bio = bio_clone_fast(bio, gfp, &mddev->bio_set);
|
2016-12-05 23:02:58 +03:00
|
|
|
|
|
|
|
r10_bio->devs[slot].bio = read_bio;
|
|
|
|
r10_bio->devs[slot].rdev = rdev;
|
|
|
|
|
|
|
|
read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
|
|
|
|
choose_data_offset(r10_bio, rdev);
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(read_bio, rdev->bdev);
|
2016-12-05 23:02:58 +03:00
|
|
|
read_bio->bi_end_io = raid10_end_read_request;
|
|
|
|
bio_set_op_attrs(read_bio, op, do_sync);
|
|
|
|
if (test_bit(FailFast, &rdev->flags) &&
|
|
|
|
test_bit(R10BIO_FailFast, &r10_bio->state))
|
|
|
|
read_bio->bi_opf |= MD_FAILFAST;
|
|
|
|
read_bio->bi_private = r10_bio;
|
|
|
|
|
|
|
|
if (mddev->gendisk)
|
2017-08-23 20:10:32 +03:00
|
|
|
trace_block_bio_remap(read_bio->bi_disk->queue,
|
2016-12-05 23:02:58 +03:00
|
|
|
read_bio, disk_devt(mddev->gendisk),
|
|
|
|
r10_bio->sector);
|
2017-04-05 07:05:51 +03:00
|
|
|
generic_make_request(read_bio);
|
2016-12-05 23:02:58 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-03-20 12:46:04 +03:00
|
|
|
static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
|
|
|
|
struct bio *bio, bool replacement,
|
2017-04-05 07:05:51 +03:00
|
|
|
int n_copy)
|
2016-12-05 23:02:58 +03:00
|
|
|
{
|
2016-06-05 22:32:07 +03:00
|
|
|
const int op = bio_op(bio);
|
2016-08-06 00:35:16 +03:00
|
|
|
const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
|
|
|
|
const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
|
2006-01-06 11:20:16 +03:00
|
|
|
unsigned long flags;
|
2012-10-11 06:32:13 +04:00
|
|
|
struct blk_plug_cb *cb;
|
|
|
|
struct raid10_plug_cb *plug = NULL;
|
2017-03-20 12:46:04 +03:00
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
int devnum = r10_bio->devs[n_copy].devnum;
|
|
|
|
struct bio *mbio;
|
|
|
|
|
|
|
|
if (replacement) {
|
|
|
|
rdev = conf->mirrors[devnum].replacement;
|
|
|
|
if (rdev == NULL) {
|
|
|
|
/* Replacement just got moved to main 'rdev' */
|
|
|
|
smp_mb();
|
|
|
|
rdev = conf->mirrors[devnum].rdev;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
rdev = conf->mirrors[devnum].rdev;
|
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
|
2017-03-20 12:46:04 +03:00
|
|
|
if (replacement)
|
|
|
|
r10_bio->devs[n_copy].repl_bio = mbio;
|
|
|
|
else
|
|
|
|
r10_bio->devs[n_copy].bio = mbio;
|
|
|
|
|
|
|
|
mbio->bi_iter.bi_sector = (r10_bio->devs[n_copy].addr +
|
|
|
|
choose_data_offset(r10_bio, rdev));
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(mbio, rdev->bdev);
|
2017-03-20 12:46:04 +03:00
|
|
|
mbio->bi_end_io = raid10_end_write_request;
|
|
|
|
bio_set_op_attrs(mbio, op, do_sync | do_fua);
|
|
|
|
if (!replacement && test_bit(FailFast,
|
|
|
|
&conf->mirrors[devnum].rdev->flags)
|
|
|
|
&& enough(conf, devnum))
|
|
|
|
mbio->bi_opf |= MD_FAILFAST;
|
|
|
|
mbio->bi_private = r10_bio;
|
|
|
|
|
|
|
|
if (conf->mddev->gendisk)
|
2017-08-23 20:10:32 +03:00
|
|
|
trace_block_bio_remap(mbio->bi_disk->queue,
|
2017-03-20 12:46:04 +03:00
|
|
|
mbio, disk_devt(conf->mddev->gendisk),
|
|
|
|
r10_bio->sector);
|
|
|
|
/* flush_pending_writes() needs access to the rdev so...*/
|
2017-08-23 20:10:32 +03:00
|
|
|
mbio->bi_disk = (void *)rdev;
|
2017-03-20 12:46:04 +03:00
|
|
|
|
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
|
|
|
|
cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
|
|
|
|
if (cb)
|
|
|
|
plug = container_of(cb, struct raid10_plug_cb, cb);
|
|
|
|
else
|
|
|
|
plug = NULL;
|
|
|
|
if (plug) {
|
|
|
|
bio_list_add(&plug->pending, mbio);
|
|
|
|
plug->pending_cnt++;
|
|
|
|
} else {
|
2017-05-10 18:47:11 +03:00
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
2017-03-20 12:46:04 +03:00
|
|
|
bio_list_add(&conf->pending_bio_list, mbio);
|
|
|
|
conf->pending_count++;
|
2017-05-10 18:47:11 +03:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
2017-03-20 12:46:04 +03:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2017-05-10 18:47:11 +03:00
|
|
|
}
|
2017-03-20 12:46:04 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void raid10_write_request(struct mddev *mddev, struct bio *bio,
|
|
|
|
struct r10bio *r10_bio)
|
|
|
|
{
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
int i;
|
|
|
|
struct md_rdev *blocked_rdev;
|
2016-12-05 23:02:58 +03:00
|
|
|
sector_t sectors;
|
2011-07-28 05:39:24 +04:00
|
|
|
int max_sectors;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2017-10-24 10:11:51 +03:00
|
|
|
if ((mddev_is_clustered(mddev) &&
|
|
|
|
md_cluster_ops->area_resyncing(mddev, WRITE,
|
|
|
|
bio->bi_iter.bi_sector,
|
|
|
|
bio_end_sector(bio)))) {
|
|
|
|
DEFINE_WAIT(w);
|
|
|
|
for (;;) {
|
|
|
|
prepare_to_wait(&conf->wait_barrier,
|
|
|
|
&w, TASK_IDLE);
|
|
|
|
if (!md_cluster_ops->area_resyncing(mddev, WRITE,
|
|
|
|
bio->bi_iter.bi_sector, bio_end_sector(bio)))
|
|
|
|
break;
|
|
|
|
schedule();
|
|
|
|
}
|
|
|
|
finish_wait(&conf->wait_barrier, &w);
|
|
|
|
}
|
|
|
|
|
2017-04-05 07:05:51 +03:00
|
|
|
sectors = r10_bio->sectors;
|
2018-12-07 13:24:21 +03:00
|
|
|
regular_request_wait(mddev, conf, bio, sectors);
|
2012-05-22 07:53:47 +04:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
|
|
|
(mddev->reshape_backwards
|
2013-10-12 02:44:27 +04:00
|
|
|
? (bio->bi_iter.bi_sector < conf->reshape_safe &&
|
|
|
|
bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
|
|
|
|
: (bio->bi_iter.bi_sector + sectors > conf->reshape_safe &&
|
|
|
|
bio->bi_iter.bi_sector < conf->reshape_progress))) {
|
2012-05-22 07:53:47 +04:00
|
|
|
/* Need to update reshape_position in metadata */
|
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
2016-12-09 02:48:19 +03:00
|
|
|
set_mask_bits(&mddev->sb_flags, 0,
|
|
|
|
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
|
2012-05-22 07:53:47 +04:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2016-11-14 08:30:21 +03:00
|
|
|
raid10_log(conf->mddev, "wait reshape metadata");
|
2012-05-22 07:53:47 +04:00
|
|
|
wait_event(mddev->sb_wait,
|
2016-12-09 02:48:19 +03:00
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
conf->reshape_safe = mddev->reshape_position;
|
|
|
|
}
|
|
|
|
|
2011-10-11 09:50:01 +04:00
|
|
|
if (conf->pending_count >= max_queued_requests) {
|
|
|
|
md_wakeup_thread(mddev->thread);
|
2016-11-14 08:30:21 +03:00
|
|
|
raid10_log(mddev, "wait queued");
|
2011-10-11 09:50:01 +04:00
|
|
|
wait_event(conf->wait_barrier,
|
|
|
|
conf->pending_count < max_queued_requests);
|
|
|
|
}
|
2008-04-30 11:52:32 +04:00
|
|
|
/* first select target devices under rcu_lock and
|
2005-04-17 02:20:36 +04:00
|
|
|
* inc refcount on their rdev. Record them by setting
|
|
|
|
* bios[x] to bio
|
2011-07-28 05:39:24 +04:00
|
|
|
* If there are known/acknowledged bad blocks on any device
|
|
|
|
* on which we have seen a write error, we want to avoid
|
|
|
|
* writing to those blocks. This potentially requires several
|
|
|
|
* writes to write around the bad blocks. Each set of writes
|
md/raid10: stop using bi_phys_segments
raid10 currently repurposes bi_phys_segments on each
incoming bio to count how many r10bios were used to encode the
request.
We need to know when the number of attached r10bio reaches
zero to:
1/ call bio_endio() when all IO on the bio is finished
2/ decrement ->nr_pending so that resync IO can proceed.
Now that the bio has its own __bi_remaining counter, that
can be used instead. We can call bio_inc_remaining to
increment the counter and call bio_endio() every time an
r10bio completes, rather than only when bi_phys_segments
reaches zero.
This addresses point 1, but not point 2. bio_endio()
doesn't (and cannot) report when the last r10bio has
finished, so a different approach is needed.
So: instead of counting bios in ->nr_pending, count r10bios.
i.e. every time we attach a bio, increment nr_pending.
Every time an r10bio completes, decrement nr_pending.
Normally we only increment nr_pending after first checking
that ->barrier is zero, or some other non-trivial tests and
possible waiting. When attaching multiple r10bios to a bio,
we only need the tests and the waiting once. After the
first increment, subsequent increments can happen
unconditionally as they are really all part of the one
request.
So introduce inc_pending() which can be used when we know
that nr_pending is already elevated.
Note that this fixes a bug. freeze_array() contains the line
atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
which implies that the units for ->nr_pending, ->nr_queued and extra
are the same.
->nr_queue and extra count r10_bios, but prior to this patch,
->nr_pending counted bios. If a bio ever resulted in multiple
r10_bios (due to bad blocks), freeze_array() would not work correctly.
Now it does.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 06:05:13 +03:00
|
|
|
* gets its own r10_bio with a set of bios attached.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2011-04-18 12:25:43 +04:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
|
2005-04-17 02:20:36 +04:00
|
|
|
raid10_find_phys(conf, r10_bio);
|
2011-07-28 05:39:24 +04:00
|
|
|
retry_write:
|
2008-05-07 07:42:32 +04:00
|
|
|
blocked_rdev = NULL;
|
2005-04-17 02:20:36 +04:00
|
|
|
rcu_read_lock();
|
2011-07-28 05:39:24 +04:00
|
|
|
max_sectors = r10_bio->sectors;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
for (i = 0; i < conf->copies; i++) {
|
|
|
|
int d = r10_bio->devs[i].devnum;
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2011-12-23 03:17:55 +04:00
|
|
|
struct md_rdev *rrdev = rcu_dereference(
|
|
|
|
conf->mirrors[d].replacement);
|
2011-12-23 03:17:55 +04:00
|
|
|
if (rdev == rrdev)
|
|
|
|
rrdev = NULL;
|
2008-04-30 11:52:32 +04:00
|
|
|
if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
blocked_rdev = rdev;
|
|
|
|
break;
|
|
|
|
}
|
2011-12-23 03:17:55 +04:00
|
|
|
if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
|
|
|
|
atomic_inc(&rrdev->nr_pending);
|
|
|
|
blocked_rdev = rrdev;
|
|
|
|
break;
|
|
|
|
}
|
2015-04-28 09:48:34 +03:00
|
|
|
if (rdev && (test_bit(Faulty, &rdev->flags)))
|
2012-11-22 07:42:49 +04:00
|
|
|
rdev = NULL;
|
2015-04-28 09:48:34 +03:00
|
|
|
if (rrdev && (test_bit(Faulty, &rrdev->flags)))
|
2011-12-23 03:17:55 +04:00
|
|
|
rrdev = NULL;
|
|
|
|
|
2011-07-28 05:39:24 +04:00
|
|
|
r10_bio->devs[i].bio = NULL;
|
2011-12-23 03:17:55 +04:00
|
|
|
r10_bio->devs[i].repl_bio = NULL;
|
2012-11-22 07:42:49 +04:00
|
|
|
|
|
|
|
if (!rdev && !rrdev) {
|
2006-01-06 11:20:16 +03:00
|
|
|
set_bit(R10BIO_Degraded, &r10_bio->state);
|
2011-07-28 05:39:24 +04:00
|
|
|
continue;
|
|
|
|
}
|
2012-11-22 07:42:49 +04:00
|
|
|
		if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
			sector_t first_bad;
			sector_t dev_sector = r10_bio->devs[i].addr;
			int bad_sectors;
			int is_bad;

			is_bad = is_badblock(rdev, dev_sector, max_sectors,
					     &first_bad, &bad_sectors);
			if (is_bad < 0) {
				/* Mustn't write here until the bad block
				 * is acknowledged
				 */
				atomic_inc(&rdev->nr_pending);
				set_bit(BlockedBadBlocks, &rdev->flags);
				blocked_rdev = rdev;
				break;
			}
			if (is_bad && first_bad <= dev_sector) {
				/* Cannot write here at all */
				bad_sectors -= (dev_sector - first_bad);
				if (bad_sectors < max_sectors)
					/* Mustn't write more than bad_sectors
					 * to other devices yet
					 */
					max_sectors = bad_sectors;
				/* We don't set R10BIO_Degraded as that
				 * only applies if the disk is missing,
				 * so it might be re-added, and we want to
				 * know to recover this chunk.
				 * In this case the device is here, and the
				 * fact that this chunk is not in-sync is
				 * recorded in the bad block log.
				 */
				continue;
			}
			if (is_bad) {
				int good_sectors = first_bad - dev_sector;
				if (good_sectors < max_sectors)
					max_sectors = good_sectors;
			}
		}
		if (rdev) {
			r10_bio->devs[i].bio = bio;
			atomic_inc(&rdev->nr_pending);
		}
		if (rrdev) {
			r10_bio->devs[i].repl_bio = bio;
			atomic_inc(&rrdev->nr_pending);
		}
	}
	rcu_read_unlock();

	if (unlikely(blocked_rdev)) {
		/* Have to wait for this device to get unblocked, then retry */
		int j;
		int d;

		for (j = 0; j < i; j++) {
			if (r10_bio->devs[j].bio) {
				d = r10_bio->devs[j].devnum;
				rdev_dec_pending(conf->mirrors[d].rdev, mddev);
			}
			if (r10_bio->devs[j].repl_bio) {
				struct md_rdev *rdev;
				d = r10_bio->devs[j].devnum;
				rdev = conf->mirrors[d].replacement;
				if (!rdev) {
					/* Race with remove_disk */
					smp_mb();
					rdev = conf->mirrors[d].rdev;
				}
				rdev_dec_pending(rdev, mddev);
			}
		}
		allow_barrier(conf);
		raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
		md_wait_for_blocked_rdev(blocked_rdev, mddev);
		wait_barrier(conf);
		goto retry_write;
	}

	if (max_sectors < r10_bio->sectors)
		r10_bio->sectors = max_sectors;

	if (r10_bio->sectors < bio_sectors(bio)) {
		struct bio *split = bio_split(bio, r10_bio->sectors,
					      GFP_NOIO, &conf->bio_split);
		bio_chain(split, bio);
		allow_barrier(conf);
		generic_make_request(bio);
		wait_barrier(conf);
		bio = split;
		r10_bio->master_bio = bio;
	}

	atomic_set(&r10_bio->remaining, 1);
	md_bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0);

	for (i = 0; i < conf->copies; i++) {
		if (r10_bio->devs[i].bio)
			raid10_write_one_disk(mddev, r10_bio, bio, false, i);
		if (r10_bio->devs[i].repl_bio)
			raid10_write_one_disk(mddev, r10_bio, bio, true, i);
	}
	one_write_done(r10_bio);
}

static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
{
	struct r10conf *conf = mddev->private;
	struct r10bio *r10_bio;

	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);

	r10_bio->master_bio = bio;
	r10_bio->sectors = sectors;

	r10_bio->mddev = mddev;
	r10_bio->sector = bio->bi_iter.bi_sector;
	r10_bio->state = 0;
	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->copies);

	if (bio_data_dir(bio) == READ)
		raid10_read_request(mddev, bio, r10_bio);
	else
		raid10_write_request(mddev, bio, r10_bio);
}

static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
{
	struct r10conf *conf = mddev->private;
	sector_t chunk_mask = (conf->geo.chunk_mask & conf->prev.chunk_mask);
	int chunk_sects = chunk_mask + 1;
	int sectors = bio_sectors(bio);

	if (unlikely(bio->bi_opf & REQ_PREFLUSH)
	    && md_flush_request(mddev, bio))
		return true;

	if (!md_write_start(mddev, bio))
		return false;

	/*
	 * If this request crosses a chunk boundary, we need to split
	 * it.
	 */
	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
		     sectors > chunk_sects
		     && (conf->geo.near_copies < conf->geo.raid_disks
			 || conf->prev.near_copies <
			 conf->prev.raid_disks)))
		sectors = chunk_sects -
			(bio->bi_iter.bi_sector &
			 (chunk_sects - 1));
	__make_request(mddev, bio, sectors);

	/* In case raid10d snuck in to freeze_array */
	wake_up(&conf->wait_barrier);
	return true;
}

static void raid10_status(struct seq_file *seq, struct mddev *mddev)
{
	struct r10conf *conf = mddev->private;
	int i;

	if (conf->geo.near_copies < conf->geo.raid_disks)
		seq_printf(seq, " %dK chunks", mddev->chunk_sectors / 2);
	if (conf->geo.near_copies > 1)
		seq_printf(seq, " %d near-copies", conf->geo.near_copies);
	if (conf->geo.far_copies > 1) {
		if (conf->geo.far_offset)
			seq_printf(seq, " %d offset-copies", conf->geo.far_copies);
		else
			seq_printf(seq, " %d far-copies", conf->geo.far_copies);
		if (conf->geo.far_set_size != conf->geo.raid_disks)
			seq_printf(seq, " %d devices per set", conf->geo.far_set_size);
	}
	seq_printf(seq, " [%d/%d] [", conf->geo.raid_disks,
		   conf->geo.raid_disks - mddev->degraded);
	rcu_read_lock();
	for (i = 0; i < conf->geo.raid_disks; i++) {
		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
		seq_printf(seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_");
	}
	rcu_read_unlock();
	seq_printf(seq, "]");
}

/* check if there are enough drives for
 * every block to appear on at least one.
 * Don't consider the device numbered 'ignore'
 * as we might be about to remove it.
 */
static int _enough(struct r10conf *conf, int previous, int ignore)
{
	int first = 0;
	int has_enough = 0;
	int disks, ncopies;
	if (previous) {
		disks = conf->prev.raid_disks;
		ncopies = conf->prev.near_copies;
	} else {
		disks = conf->geo.raid_disks;
		ncopies = conf->geo.near_copies;
	}

	rcu_read_lock();
	do {
		int n = conf->copies;
		int cnt = 0;
		int this = first;
		while (n--) {
			struct md_rdev *rdev;
			if (this != ignore &&
			    (rdev = rcu_dereference(conf->mirrors[this].rdev)) &&
			    test_bit(In_sync, &rdev->flags))
				cnt++;
			this = (this+1) % disks;
		}
		if (cnt == 0)
			goto out;
		first = (first + ncopies) % disks;
	} while (first != 0);
	has_enough = 1;
out:
	rcu_read_unlock();
	return has_enough;
}

static int enough(struct r10conf *conf, int ignore)
{
	/* when calling 'enough', both 'prev' and 'geo' must
	 * be stable.
	 * This is ensured if ->reconfig_mutex or ->device_lock
	 * is held.
	 */
	return _enough(conf, 0, ignore) &&
		_enough(conf, 1, ignore);
}

static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
{
	char b[BDEVNAME_SIZE];
	struct r10conf *conf = mddev->private;
	unsigned long flags;

	/*
	 * If it is not operational, then we have already marked it as dead
	 * else if it is the last working disk with "fail_last_dev == false",
	 * ignore the error, let the next level up know.
	 * else mark the drive as failed
	 */
	spin_lock_irqsave(&conf->device_lock, flags);
	if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
	    && !enough(conf, rdev->raid_disk)) {
		/*
		 * Don't fail the drive, just return an IO error.
		 */
		spin_unlock_irqrestore(&conf->device_lock, flags);
		return;
	}
	if (test_and_clear_bit(In_sync, &rdev->flags))
		mddev->degraded++;
	/*
	 * If recovery is running, make sure it aborts.
	 */
	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
	set_bit(Blocked, &rdev->flags);
	set_bit(Faulty, &rdev->flags);
	set_mask_bits(&mddev->sb_flags, 0,
		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
	spin_unlock_irqrestore(&conf->device_lock, flags);
	pr_crit("md/raid10:%s: Disk failure on %s, disabling device.\n"
		"md/raid10:%s: Operation continuing on %d devices.\n",
		mdname(mddev), bdevname(rdev->bdev, b),
		mdname(mddev), conf->geo.raid_disks - mddev->degraded);
}

static void print_conf(struct r10conf *conf)
{
	int i;
	struct md_rdev *rdev;

	pr_debug("RAID10 conf printout:\n");
	if (!conf) {
		pr_debug("(!conf)\n");
		return;
	}
	pr_debug(" --- wd:%d rd:%d\n", conf->geo.raid_disks - conf->mddev->degraded,
		 conf->geo.raid_disks);

	/* This is only called with ->reconfig_mutex held, so
	 * rcu protection of rdev is not needed */
	for (i = 0; i < conf->geo.raid_disks; i++) {
		char b[BDEVNAME_SIZE];
		rdev = conf->mirrors[i].rdev;
		if (rdev)
			pr_debug(" disk %d, wo:%d, o:%d, dev:%s\n",
				 i, !test_bit(In_sync, &rdev->flags),
				 !test_bit(Faulty, &rdev->flags),
				 bdevname(rdev->bdev,b));
	}
}

static void close_sync(struct r10conf *conf)
{
	wait_barrier(conf);
	allow_barrier(conf);

	mempool_exit(&conf->r10buf_pool);
}

static int raid10_spare_active(struct mddev *mddev)
{
	int i;
	struct r10conf *conf = mddev->private;
	struct raid10_info *tmp;
	int count = 0;
	unsigned long flags;

	/*
	 * Find all non-in_sync disks within the RAID10 configuration
	 * and mark them in_sync
	 */
	for (i = 0; i < conf->geo.raid_disks; i++) {
		tmp = conf->mirrors + i;
		if (tmp->replacement
		    && tmp->replacement->recovery_offset == MaxSector
		    && !test_bit(Faulty, &tmp->replacement->flags)
		    && !test_and_set_bit(In_sync, &tmp->replacement->flags)) {
			/* Replacement has just become active */
			if (!tmp->rdev
			    || !test_and_clear_bit(In_sync, &tmp->rdev->flags))
				count++;
			if (tmp->rdev) {
				/* Replaced device not technically faulty,
				 * but we need to be sure it gets removed
				 * and never re-added.
				 */
				set_bit(Faulty, &tmp->rdev->flags);
				sysfs_notify_dirent_safe(
					tmp->rdev->sysfs_state);
			}
			sysfs_notify_dirent_safe(tmp->replacement->sysfs_state);
		} else if (tmp->rdev
			   && tmp->rdev->recovery_offset == MaxSector
			   && !test_bit(Faulty, &tmp->rdev->flags)
			   && !test_and_set_bit(In_sync, &tmp->rdev->flags)) {
			count++;
			sysfs_notify_dirent_safe(tmp->rdev->sysfs_state);
		}
	}
	spin_lock_irqsave(&conf->device_lock, flags);
	mddev->degraded -= count;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	print_conf(conf);
	return count;
}

static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
{
	struct r10conf *conf = mddev->private;
	int err = -EEXIST;
	int mirror;
	int first = 0;
	int last = conf->geo.raid_disks - 1;

	if (mddev->recovery_cp < MaxSector)
		/* only hot-add to in-sync arrays, as recovery is
		 * very different from resync
		 */
		return -EBUSY;
	if (rdev->saved_raid_disk < 0 && !_enough(conf, 1, -1))
		return -EINVAL;

	if (md_integrity_add_rdev(rdev, mddev))
		return -ENXIO;

	if (rdev->raid_disk >= 0)
		first = last = rdev->raid_disk;

	if (rdev->saved_raid_disk >= first &&
	    rdev->saved_raid_disk < conf->geo.raid_disks &&
	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
		mirror = rdev->saved_raid_disk;
	else
		mirror = first;
	for ( ; mirror <= last ; mirror++) {
		struct raid10_info *p = &conf->mirrors[mirror];
		if (p->recovery_disabled == mddev->recovery_disabled)
			continue;
		if (p->rdev) {
			if (!test_bit(WantReplacement, &p->rdev->flags) ||
			    p->replacement != NULL)
				continue;
			clear_bit(In_sync, &rdev->flags);
			set_bit(Replacement, &rdev->flags);
			rdev->raid_disk = mirror;
			err = 0;
			if (mddev->gendisk)
				disk_stack_limits(mddev->gendisk, rdev->bdev,
						  rdev->data_offset << 9);
			conf->fullsync = 1;
			rcu_assign_pointer(p->replacement, rdev);
			break;
		}

		if (mddev->gendisk)
			disk_stack_limits(mddev->gendisk, rdev->bdev,
					  rdev->data_offset << 9);

		p->head_position = 0;
		p->recovery_disabled = mddev->recovery_disabled - 1;
		rdev->raid_disk = mirror;
		err = 0;
		if (rdev->saved_raid_disk != mirror)
			conf->fullsync = 1;
		rcu_assign_pointer(p->rdev, rdev);
		break;
	}
	if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
		blk_queue_flag_set(QUEUE_FLAG_DISCARD, mddev->queue);

	print_conf(conf);
	return err;
}

static int raid10_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
{
	struct r10conf *conf = mddev->private;
	int err = 0;
	int number = rdev->raid_disk;
	struct md_rdev **rdevp;
	struct raid10_info *p = conf->mirrors + number;

	print_conf(conf);
	if (rdev == p->rdev)
		rdevp = &p->rdev;
	else if (rdev == p->replacement)
		rdevp = &p->replacement;
	else
		return 0;

	if (test_bit(In_sync, &rdev->flags) ||
	    atomic_read(&rdev->nr_pending)) {
		err = -EBUSY;
		goto abort;
	}
	/* Only remove non-faulty devices if recovery
	 * is not possible.
	 */
	if (!test_bit(Faulty, &rdev->flags) &&
	    mddev->recovery_disabled != p->recovery_disabled &&
	    (!p->replacement || p->replacement == rdev) &&
	    number < conf->geo.raid_disks &&
	    enough(conf, -1)) {
		err = -EBUSY;
		goto abort;
	}
	*rdevp = NULL;
	if (!test_bit(RemoveSynchronized, &rdev->flags)) {
		synchronize_rcu();
		if (atomic_read(&rdev->nr_pending)) {
			/* lost the race, try later */
			err = -EBUSY;
			*rdevp = rdev;
			goto abort;
		}
	}
	if (p->replacement) {
		/* We must have just cleared 'rdev' */
		p->rdev = p->replacement;
		clear_bit(Replacement, &p->replacement->flags);
		smp_mb(); /* Make sure other CPUs may see both as identical
			   * but will never see neither -- if they are careful.
			   */
		p->replacement = NULL;
	}

	clear_bit(WantReplacement, &rdev->flags);
	err = md_integrity_register(mddev);

abort:

	print_conf(conf);
	return err;
}

static void __end_sync_read(struct r10bio *r10_bio, struct bio *bio, int d)
{
	struct r10conf *conf = r10_bio->mddev->private;

	if (!bio->bi_status)
		set_bit(R10BIO_Uptodate, &r10_bio->state);
	else
		/* The write handler will notice the lack of
		 * R10BIO_Uptodate and record any errors etc
		 */
		atomic_add(r10_bio->sectors,
			   &conf->mirrors[d].rdev->corrected_errors);

	/* for reconstruct, we always reschedule after a read.
	 * for resync, only after all reads
	 */
	rdev_dec_pending(conf->mirrors[d].rdev, conf->mddev);
	if (test_bit(R10BIO_IsRecover, &r10_bio->state) ||
	    atomic_dec_and_test(&r10_bio->remaining)) {
		/* we have read all the blocks,
		 * do the comparison in process context in raid10d
		 */
		reschedule_retry(r10_bio);
	}
}

static void end_sync_read(struct bio *bio)
{
	struct r10bio *r10_bio = get_resync_r10bio(bio);
	struct r10conf *conf = r10_bio->mddev->private;
	int d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);

	__end_sync_read(r10_bio, bio, d);
}

static void end_reshape_read(struct bio *bio)
{
	/* reshape read bio isn't allocated from r10buf_pool */
	struct r10bio *r10_bio = bio->bi_private;

	__end_sync_read(r10_bio, bio, r10_bio->read_slot);
}

static void end_sync_request(struct r10bio *r10_bio)
{
	struct mddev *mddev = r10_bio->mddev;

	while (atomic_dec_and_test(&r10_bio->remaining)) {
		if (r10_bio->master_bio == NULL) {
			/* the primary of several recovery bios */
			sector_t s = r10_bio->sectors;
			if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
			    test_bit(R10BIO_WriteError, &r10_bio->state))
				reschedule_retry(r10_bio);
			else
				put_buf(r10_bio);
			md_done_sync(mddev, s, 1);
			break;
		} else {
			struct r10bio *r10_bio2 = (struct r10bio *)r10_bio->master_bio;
			if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
			    test_bit(R10BIO_WriteError, &r10_bio->state))
				reschedule_retry(r10_bio);
			else
				put_buf(r10_bio);
			r10_bio = r10_bio2;
		}
	}
}

static void end_sync_write(struct bio *bio)
{
	struct r10bio *r10_bio = get_resync_r10bio(bio);
	struct mddev *mddev = r10_bio->mddev;
	struct r10conf *conf = mddev->private;
	int d;
	sector_t first_bad;
	int bad_sectors;
	int slot;
	int repl;
	struct md_rdev *rdev = NULL;

	d = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
	if (repl)
		rdev = conf->mirrors[d].replacement;
	else
		rdev = conf->mirrors[d].rdev;

	if (bio->bi_status) {
		if (repl)
			md_error(mddev, rdev);
		else {
			set_bit(WriteErrorSeen, &rdev->flags);
			if (!test_and_set_bit(WantReplacement, &rdev->flags))
				set_bit(MD_RECOVERY_NEEDED,
					&rdev->mddev->recovery);
			set_bit(R10BIO_WriteError, &r10_bio->state);
		}
	} else if (is_badblock(rdev,
			     r10_bio->devs[slot].addr,
			     r10_bio->sectors,
			     &first_bad, &bad_sectors))
		set_bit(R10BIO_MadeGood, &r10_bio->state);

	rdev_dec_pending(rdev, mddev);

	end_sync_request(r10_bio);
}

/*
|
|
|
|
* Note: sync and recover are handled very differently for raid10
|
|
|
|
* This code is for resync.
|
|
|
|
* For resync, we read through virtual addresses and read all blocks.
|
|
|
|
* If there is any error, we schedule a write. The lowest numbered
|
|
|
|
* drive is authoritative.
|
|
|
|
* However requests come for physical address, so we need to map.
|
|
|
|
* For every physical address there are raid_disks/copies virtual addresses,
|
|
|
|
* which is always at least one, but is not necessarily an integer.
|
|
|
|
* This means that a physical address can span multiple chunks, so we may
|
|
|
|
* have to submit multiple io requests for a single sync request.
|
|
|
|
*/
|
|
|
|
/*
|
|
|
|
* We check if all blocks are in-sync and only write to blocks that
|
|
|
|
* aren't in sync
|
|
|
|
*/
|
2011-10-11 09:48:43 +04:00
|
|
|
static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
int i, first;
|
|
|
|
struct bio *tbio, *fbio;
|
2012-04-12 10:04:47 +04:00
|
|
|
int vcnt;
|
2017-03-16 19:12:34 +03:00
|
|
|
struct page **tpages, **fpages;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
atomic_set(&r10_bio->remaining, 1);
|
|
|
|
|
|
|
|
/* find the first device with a block */
|
|
|
|
for (i=0; i<conf->copies; i++)
|
2017-06-03 10:38:06 +03:00
|
|
|
if (!r10_bio->devs[i].bio->bi_status)
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (i == conf->copies)
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
first = i;
|
|
|
|
fbio = r10_bio->devs[i].bio;
|
2015-12-18 07:19:16 +03:00
|
|
|
fbio->bi_iter.bi_size = r10_bio->sectors << 9;
|
|
|
|
fbio->bi_iter.bi_idx = 0;
|
2017-03-16 19:12:34 +03:00
|
|
|
fpages = get_resync_pages(fbio)->pages;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-04-12 10:04:47 +04:00
|
|
|
vcnt = (r10_bio->sectors + (PAGE_SIZE >> 9) - 1) >> (PAGE_SHIFT - 9);
|
2005-04-17 02:20:36 +04:00
|
|
|
/* now find blocks with errors */
|
2006-01-06 11:20:29 +03:00
|
|
|
for (i=0 ; i < conf->copies ; i++) {
|
|
|
|
int j, d;
|
2016-11-18 08:16:12 +03:00
|
|
|
struct md_rdev *rdev;
|
2017-03-16 19:12:33 +03:00
|
|
|
struct resync_pages *rp;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
tbio = r10_bio->devs[i].bio;
|
2006-01-06 11:20:29 +03:00
|
|
|
|
|
|
|
if (tbio->bi_end_io != end_sync_read)
|
|
|
|
continue;
|
|
|
|
if (i == first)
|
2005-04-17 02:20:36 +04:00
|
|
|
continue;
|
2017-03-16 19:12:34 +03:00
|
|
|
|
|
|
|
tpages = get_resync_pages(tbio)->pages;
|
2016-11-18 08:16:12 +03:00
|
|
|
d = r10_bio->devs[i].devnum;
|
|
|
|
rdev = conf->mirrors[d].rdev;
|
2017-06-03 10:38:06 +03:00
|
|
|
if (!r10_bio->devs[i].bio->bi_status) {
|
2006-01-06 11:20:29 +03:00
|
|
|
/* We know that the bi_io_vec layout is the same for
|
|
|
|
* both 'first' and 'i', so we just compare them.
|
|
|
|
* All vec entries are PAGE_SIZE;
|
|
|
|
*/
|
2013-07-16 10:50:47 +04:00
|
|
|
int sectors = r10_bio->sectors;
|
|
|
|
for (j = 0; j < vcnt; j++) {
|
|
|
|
int len = PAGE_SIZE;
|
|
|
|
if (sectors < (len / 512))
|
|
|
|
len = sectors * 512;
|
2017-03-16 19:12:34 +03:00
|
|
|
if (memcmp(page_address(fpages[j]),
|
|
|
|
page_address(tpages[j]),
|
2013-07-16 10:50:47 +04:00
|
|
|
len))
|
2006-01-06 11:20:29 +03:00
|
|
|
break;
|
2013-07-16 10:50:47 +04:00
|
|
|
sectors -= len/512;
|
|
|
|
}
|
2006-01-06 11:20:29 +03:00
|
|
|
if (j == vcnt)
|
|
|
|
continue;
|
2012-10-11 07:17:59 +04:00
|
|
|
atomic64_add(r10_bio->sectors, &mddev->resync_mismatches);
|
2011-07-28 05:39:25 +04:00
|
|
|
if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
|
|
|
|
/* Don't fix anything. */
|
|
|
|
continue;
|
2016-11-18 08:16:12 +03:00
|
|
|
} else if (test_bit(FailFast, &rdev->flags)) {
|
|
|
|
/* Just give up on this device */
|
|
|
|
md_error(rdev->mddev, rdev);
|
|
|
|
continue;
|
2006-01-06 11:20:29 +03:00
|
|
|
}
|
2011-07-28 05:39:25 +04:00
|
|
|
/* Ok, we need to write this bio, either to correct an
|
|
|
|
* inconsistency or to correct an unreadable block.
|
2005-04-17 02:20:36 +04:00
|
|
|
* First we need to fixup bv_offset, bv_len and
|
|
|
|
* bi_vecs, as the read request might have corrupted these
|
|
|
|
*/
|
2017-03-16 19:12:33 +03:00
|
|
|
rp = get_resync_pages(tbio);
|
2012-09-07 01:14:43 +04:00
|
|
|
bio_reset(tbio);
|
|
|
|
|
2017-07-14 11:14:43 +03:00
|
|
|
md_bio_reset_resync_pages(tbio, rp, fbio->bi_iter.bi_size);
|
|
|
|
|
2017-03-16 19:12:33 +03:00
|
|
|
rp->raid_bio = r10_bio;
|
|
|
|
tbio->bi_private = rp;
|
2013-10-12 02:44:27 +04:00
|
|
|
tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;
|
2005-04-17 02:20:36 +04:00
|
|
|
tbio->bi_end_io = end_sync_write;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(tbio, REQ_OP_WRITE, 0);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2015-05-07 09:34:20 +03:00
|
|
|
bio_copy_data(tbio, fbio);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
|
|
|
|
atomic_inc(&r10_bio->remaining);
|
2013-02-06 03:19:29 +04:00
|
|
|
md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-11-18 08:16:12 +03:00
|
|
|
if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
|
|
|
|
tbio->bi_opf |= MD_FAILFAST;
|
2013-10-12 02:44:27 +04:00
|
|
|
tbio->bi_iter.bi_sector += conf->mirrors[d].rdev->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(tbio, conf->mirrors[d].rdev->bdev);
|
2005-04-17 02:20:36 +04:00
|
|
|
generic_make_request(tbio);
|
|
|
|
}
|
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
/* Now write out to any replacement devices
|
|
|
|
* that are active
|
|
|
|
*/
|
|
|
|
for (i = 0; i < conf->copies; i++) {
|
2015-05-07 09:34:20 +03:00
|
|
|
int d;
|
2011-12-23 03:17:55 +04:00
|
|
|
|
|
|
|
tbio = r10_bio->devs[i].repl_bio;
|
|
|
|
if (!tbio || !tbio->bi_end_io)
|
|
|
|
continue;
|
|
|
|
if (r10_bio->devs[i].bio->bi_end_io != end_sync_write
|
|
|
|
&& r10_bio->devs[i].bio != fbio)
|
2015-05-07 09:34:20 +03:00
|
|
|
bio_copy_data(tbio, fbio);
|
2011-12-23 03:17:55 +04:00
|
|
|
d = r10_bio->devs[i].devnum;
|
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
md_sync_acct(conf->mirrors[d].replacement->bdev,
|
2013-02-06 03:19:29 +04:00
|
|
|
bio_sectors(tbio));
|
2011-12-23 03:17:55 +04:00
|
|
|
generic_make_request(tbio);
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
done:
|
|
|
|
if (atomic_dec_and_test(&r10_bio->remaining)) {
|
|
|
|
md_done_sync(mddev, r10_bio->sectors, 1);
|
|
|
|
put_buf(r10_bio);
|
|
|
|
}
|
|
|
|
}
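
The page-by-page comparison in sync_request_write() can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver's code: the names pages_for_sectors() and copies_match() and the SKETCH_* constants are hypothetical, and a 4096-byte page is assumed. It shows how the page count is rounded up and how the last page is compared only up to the final sector.

```c
#include <string.h>

#define SKETCH_PAGE_SIZE   4096u	/* assumed page size, not taken from the kernel */
#define SKETCH_SECTOR_SIZE 512u

/* Pages needed to hold `sectors` sectors, rounding up (mirrors the vcnt computation). */
static unsigned int pages_for_sectors(unsigned int sectors)
{
	unsigned int per_page = SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE;

	return (sectors + per_page - 1) / per_page;
}

/*
 * Compare two page arrays that both cover `sectors` sectors.
 * Returns 1 when all compared bytes match and 0 at the first mismatch.
 * Bytes past the last sector in the final page are ignored.
 */
static int copies_match(unsigned char **fpages, unsigned char **tpages,
			unsigned int sectors)
{
	unsigned int vcnt = pages_for_sectors(sectors);
	unsigned int j;

	for (j = 0; j < vcnt; j++) {
		unsigned int len = SKETCH_PAGE_SIZE;

		/* Trailing partial page: compare only the sectors that remain. */
		if (sectors < len / SKETCH_SECTOR_SIZE)
			len = sectors * SKETCH_SECTOR_SIZE;
		if (memcmp(fpages[j], tpages[j], len))
			return 0;
		sectors -= len / SKETCH_SECTOR_SIZE;
	}
	return 1;
}
```

With 9 sectors and the assumed page size, two pages are compared, and only the first 512 bytes of the second page take part in the comparison.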
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now for the recovery code.
|
|
|
|
* Recovery happens across physical sectors.
|
|
|
|
* We recover all non-in_sync drives by finding the virtual address of
|
|
|
|
* each, and then choose a working drive that also has that virt address.
|
|
|
|
* There is a separate r10_bio for each non-in_sync drive.
|
|
|
|
* Only the first two slots are in use. The first for reading,
|
|
|
|
* the second for writing.
|
|
|
|
*
|
|
|
|
*/
|
2011-10-11 09:48:43 +04:00
|
|
|
static void fix_recovery_read_error(struct r10bio *r10_bio)
|
2011-07-28 05:39:25 +04:00
|
|
|
{
|
|
|
|
/* We got a read error during recovery.
|
|
|
|
* We repeat the read in smaller page-sized sections.
|
|
|
|
* If a read succeeds, write it to the new device or record
|
|
|
|
* a bad block if we cannot.
|
|
|
|
* If a read fails, record a bad block on both old and
|
|
|
|
* new devices.
|
|
|
|
*/
|
2011-10-11 09:47:53 +04:00
|
|
|
struct mddev *mddev = r10_bio->mddev;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2011-07-28 05:39:25 +04:00
|
|
|
struct bio *bio = r10_bio->devs[0].bio;
|
|
|
|
sector_t sect = 0;
|
|
|
|
int sectors = r10_bio->sectors;
|
|
|
|
int idx = 0;
|
|
|
|
int dr = r10_bio->devs[0].devnum;
|
|
|
|
int dw = r10_bio->devs[1].devnum;
|
2017-03-16 19:12:34 +03:00
|
|
|
struct page **pages = get_resync_pages(bio)->pages;
|
2011-07-28 05:39:25 +04:00
|
|
|
|
|
|
|
while (sectors) {
|
|
|
|
int s = sectors;
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev;
|
2011-07-28 05:39:25 +04:00
|
|
|
sector_t addr;
|
|
|
|
int ok;
|
|
|
|
|
|
|
|
if (s > (PAGE_SIZE>>9))
|
|
|
|
s = PAGE_SIZE >> 9;
|
|
|
|
|
|
|
|
rdev = conf->mirrors[dr].rdev;
|
|
|
|
addr = r10_bio->devs[0].addr + sect;
|
|
|
|
ok = sync_page_io(rdev,
|
|
|
|
addr,
|
|
|
|
s << 9,
|
2017-03-16 19:12:34 +03:00
|
|
|
pages[idx],
|
2016-06-05 22:32:07 +03:00
|
|
|
REQ_OP_READ, 0, false);
|
2011-07-28 05:39:25 +04:00
|
|
|
if (ok) {
|
|
|
|
rdev = conf->mirrors[dw].rdev;
|
|
|
|
addr = r10_bio->devs[1].addr + sect;
|
|
|
|
ok = sync_page_io(rdev,
|
|
|
|
addr,
|
|
|
|
s << 9,
|
2017-03-16 19:12:34 +03:00
|
|
|
pages[idx],
|
2016-06-05 22:32:07 +03:00
|
|
|
REQ_OP_WRITE, 0, false);
|
2011-12-23 03:17:56 +04:00
|
|
|
if (!ok) {
|
2011-07-28 05:39:25 +04:00
|
|
|
set_bit(WriteErrorSeen, &rdev->flags);
|
2011-12-23 03:17:56 +04:00
|
|
|
if (!test_and_set_bit(WantReplacement,
|
|
|
|
&rdev->flags))
|
|
|
|
set_bit(MD_RECOVERY_NEEDED,
|
|
|
|
&rdev->mddev->recovery);
|
|
|
|
}
|
2011-07-28 05:39:25 +04:00
|
|
|
}
|
|
|
|
if (!ok) {
|
|
|
|
/* We don't worry if we cannot set a bad block -
|
|
|
|
* it really is bad so there is no loss in not
|
|
|
|
* recording it yet
|
|
|
|
*/
|
|
|
|
rdev_set_badblocks(rdev, addr, s, 0);
|
|
|
|
|
|
|
|
if (rdev != conf->mirrors[dw].rdev) {
|
|
|
|
/* need bad block on destination too */
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev2 = conf->mirrors[dw].rdev;
|
2011-07-28 05:39:25 +04:00
|
|
|
addr = r10_bio->devs[1].addr + sect;
|
|
|
|
ok = rdev_set_badblocks(rdev2, addr, s, 0);
|
|
|
|
if (!ok) {
|
|
|
|
/* just abort the recovery */
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: recovery aborted due to read error\n",
|
|
|
|
mdname(mddev));
|
2011-07-28 05:39:25 +04:00
|
|
|
|
|
|
|
conf->mirrors[dw].recovery_disabled
|
|
|
|
= mddev->recovery_disabled;
|
|
|
|
set_bit(MD_RECOVERY_INTR,
|
|
|
|
&mddev->recovery);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
sectors -= s;
|
|
|
|
sect += s;
|
|
|
|
idx++;
|
|
|
|
}
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2011-07-18 11:38:49 +04:00
|
|
|
int d;
|
2011-12-23 03:17:55 +04:00
|
|
|
struct bio *wbio, *wbio2;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-07-28 05:39:25 +04:00
|
|
|
if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) {
|
|
|
|
fix_recovery_read_error(r10_bio);
|
|
|
|
end_sync_request(r10_bio);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2011-07-18 11:38:49 +04:00
|
|
|
/*
|
|
|
|
* share the pages with the first bio
|
2005-04-17 02:20:36 +04:00
|
|
|
* and submit the write request
|
|
|
|
*/
|
|
|
|
d = r10_bio->devs[1].devnum;
|
2011-12-23 03:17:55 +04:00
|
|
|
wbio = r10_bio->devs[1].bio;
|
|
|
|
wbio2 = r10_bio->devs[1].repl_bio;
|
2013-07-24 09:37:42 +04:00
|
|
|
/* Need to test wbio2->bi_end_io before we call
|
|
|
|
* generic_make_request as if the former is NULL,
|
|
|
|
* the latter is free to free wbio2.
|
|
|
|
*/
|
|
|
|
if (wbio2 && !wbio2->bi_end_io)
|
|
|
|
wbio2 = NULL;
|
2011-12-23 03:17:55 +04:00
|
|
|
if (wbio->bi_end_io) {
|
|
|
|
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
|
2013-02-06 03:19:29 +04:00
|
|
|
md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio));
|
2011-12-23 03:17:55 +04:00
|
|
|
generic_make_request(wbio);
|
|
|
|
}
|
2013-07-24 09:37:42 +04:00
|
|
|
if (wbio2) {
|
2011-12-23 03:17:55 +04:00
|
|
|
atomic_inc(&conf->mirrors[d].replacement->nr_pending);
|
|
|
|
md_sync_acct(conf->mirrors[d].replacement->bdev,
|
2013-02-06 03:19:29 +04:00
|
|
|
bio_sectors(wbio2));
|
2011-12-23 03:17:55 +04:00
|
|
|
generic_make_request(wbio2);
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2009-12-14 04:49:58 +03:00
|
|
|
/*
|
|
|
|
* Used by fix_read_error() to decay the per rdev read_errors.
|
|
|
|
* We halve the read error count for every hour that has elapsed
|
|
|
|
* since the last recorded read error.
|
|
|
|
*
|
|
|
|
*/
|
2011-10-11 09:47:53 +04:00
|
|
|
static void check_decay_read_errors(struct mddev *mddev, struct md_rdev *rdev)
|
2009-12-14 04:49:58 +03:00
|
|
|
{
|
2016-06-17 18:33:10 +03:00
|
|
|
long cur_time_mon;
|
2009-12-14 04:49:58 +03:00
|
|
|
unsigned long hours_since_last;
|
|
|
|
unsigned int read_errors = atomic_read(&rdev->read_errors);
|
|
|
|
|
2016-06-17 18:33:10 +03:00
|
|
|
cur_time_mon = ktime_get_seconds();
|
2009-12-14 04:49:58 +03:00
|
|
|
|
2016-06-17 18:33:10 +03:00
|
|
|
if (rdev->last_read_error == 0) {
|
2009-12-14 04:49:58 +03:00
|
|
|
/* first time we've seen a read error */
|
|
|
|
rdev->last_read_error = cur_time_mon;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-06-17 18:33:10 +03:00
|
|
|
hours_since_last = (long)(cur_time_mon -
|
|
|
|
rdev->last_read_error) / 3600;
|
2009-12-14 04:49:58 +03:00
|
|
|
|
|
|
|
rdev->last_read_error = cur_time_mon;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if hours_since_last is >= the number of bits in read_errors
|
|
|
|
* just set read errors to 0. We do this to avoid
|
|
|
|
* overflowing the shift of read_errors by hours_since_last.
|
|
|
|
*/
|
|
|
|
if (hours_since_last >= 8 * sizeof(read_errors))
|
|
|
|
atomic_set(&rdev->read_errors, 0);
|
|
|
|
else
|
|
|
|
atomic_set(&rdev->read_errors, read_errors >> hours_since_last);
|
|
|
|
}
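
The decay arithmetic in check_decay_read_errors() can be tried in isolation. This is a hedged user-space sketch under stated assumptions: hours_between() and decay_read_errors() are hypothetical names, timestamps are plain seconds, and the bookkeeping of rdev->last_read_error (including the early return on the first recorded error) is left out. It shows the halving per elapsed hour and the guard that avoids shifting by the full width of the counter or more.

```c
#include <limits.h>

/* Whole hours elapsed between two timestamps given in seconds. */
static unsigned long hours_between(long now, long last)
{
	return (unsigned long)((now - last) / 3600);
}

/*
 * Halve the error count once per elapsed hour.
 * A shift count >= the bit width of the operand is undefined in C,
 * so that case is handled explicitly by returning 0.
 */
static unsigned int decay_read_errors(unsigned int read_errors,
				      unsigned long hours_since_last)
{
	if (hours_since_last >= CHAR_BIT * sizeof(read_errors))
		return 0;
	return read_errors >> hours_since_last;
}
```

For instance, a count of 40 with three hours elapsed decays to 5, and any count decays to 0 once the elapsed hours reach the bit width of the counter.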
|
|
|
|
|
2011-10-11 09:45:26 +04:00
|
|
|
static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector,
|
2011-07-28 05:39:25 +04:00
|
|
|
int sectors, struct page *page, int rw)
|
|
|
|
{
|
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
|
|
|
|
if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors)
|
|
|
|
&& (rw == READ || test_bit(WriteErrorSeen, &rdev->flags)))
|
|
|
|
return -1;
|
2016-06-05 22:32:07 +03:00
|
|
|
if (sync_page_io(rdev, sector, sectors << 9, page, rw, 0, false))
|
2011-07-28 05:39:25 +04:00
|
|
|
/* success */
|
|
|
|
return 1;
|
2011-12-23 03:17:56 +04:00
|
|
|
if (rw == WRITE) {
|
2011-07-28 05:39:25 +04:00
|
|
|
set_bit(WriteErrorSeen, &rdev->flags);
|
2011-12-23 03:17:56 +04:00
|
|
|
if (!test_and_set_bit(WantReplacement, &rdev->flags))
|
|
|
|
set_bit(MD_RECOVERY_NEEDED,
|
|
|
|
&rdev->mddev->recovery);
|
|
|
|
}
|
2011-07-28 05:39:25 +04:00
|
|
|
/* need to record an error - either for the block or the device */
|
|
|
|
if (!rdev_set_badblocks(rdev, sector, sectors, 0))
|
|
|
|
md_error(rdev->mddev, rdev);
|
|
|
|
return 0;
|
|
|
|
}
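
r10_sync_page_io() has three outcomes, and its callers in fix_read_error() treat them differently. The decision table can be sketched as a pure function. This is only an illustration: sync_io_outcome() and the SKETCH_* constants are hypothetical, the I/O result is passed in as a flag instead of being performed, and the side effects on failure (setting WriteErrorSeen, recording a bad block, calling md_error()) are omitted.

```c
enum { SKETCH_READ = 0, SKETCH_WRITE = 1 };	/* stand-ins for READ/WRITE */

/*
 * Returns:
 *  -1  the range overlaps a known bad block and the I/O is not attempted
 *      (always for reads, for writes only after an earlier write error),
 *   1  the I/O was attempted and succeeded,
 *   0  the I/O was attempted and failed.
 */
static int sync_io_outcome(int in_badblock, int rw, int write_error_seen,
			   int io_ok)
{
	if (in_badblock && (rw == SKETCH_READ || write_error_seen))
		return -1;
	if (io_ok)
		return 1;
	return 0;
}
```

The write-back pass in fix_read_error() reacts only to 0, and the read-back pass handles 0 and 1, so a skipped I/O (-1) produces no log message in either pass.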
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* This is a kernel thread which:
|
|
|
|
*
|
|
|
|
* 1. Retries failed read operations on working mirrors.
|
|
|
|
* 2. Updates the raid superblock when problems are encountered.
|
2006-10-03 12:15:45 +04:00
|
|
|
* 3. Performs writes following reads for array synchronising.
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10bio *r10_bio)
|
2006-10-03 12:15:45 +04:00
|
|
|
{
|
|
|
|
int sect = 0; /* Offset from r10_bio->sector */
|
|
|
|
int sectors = r10_bio->sectors;
|
2018-04-23 12:37:30 +03:00
|
|
|
struct md_rdev *rdev;
|
2009-12-14 04:49:58 +03:00
|
|
|
int max_read_errors = atomic_read(&mddev->max_corr_read_errors);
|
2010-06-24 07:31:03 +04:00
|
|
|
int d = r10_bio->devs[r10_bio->read_slot].devnum;
|
2009-12-14 04:49:58 +03:00
|
|
|
|
2011-05-11 08:53:17 +04:00
|
|
|
/* still own a reference to this rdev, so it cannot
|
|
|
|
* have been cleared recently.
|
|
|
|
*/
|
|
|
|
rdev = conf->mirrors[d].rdev;
|
2009-12-14 04:49:58 +03:00
|
|
|
|
2011-05-11 08:53:17 +04:00
|
|
|
if (test_bit(Faulty, &rdev->flags))
|
|
|
|
/* drive has already been failed, just ignore any
|
|
|
|
more fix_read_error() attempts */
|
|
|
|
return;
|
2009-12-14 04:49:58 +03:00
|
|
|
|
2011-05-11 08:53:17 +04:00
|
|
|
check_decay_read_errors(mddev, rdev);
|
|
|
|
atomic_inc(&rdev->read_errors);
|
|
|
|
if (atomic_read(&rdev->read_errors) > max_read_errors) {
|
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
bdevname(rdev->bdev, b);
|
2009-12-14 04:49:58 +03:00
|
|
|
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: %s: Raid device exceeded read_error threshold [cur %d:max %d]\n",
|
|
|
|
mdname(mddev), b,
|
|
|
|
atomic_read(&rdev->read_errors), max_read_errors);
|
|
|
|
pr_notice("md/raid10:%s: %s: Failing raid device\n",
|
|
|
|
mdname(mddev), b);
|
2016-06-02 09:19:52 +03:00
|
|
|
md_error(mddev, rdev);
|
2012-02-14 04:10:10 +04:00
|
|
|
r10_bio->devs[r10_bio->read_slot].bio = IO_BLOCKED;
|
2011-05-11 08:53:17 +04:00
|
|
|
return;
|
2009-12-14 04:49:58 +03:00
|
|
|
}
|
|
|
|
|
2006-10-03 12:15:45 +04:00
|
|
|
while(sectors) {
|
|
|
|
int s = sectors;
|
|
|
|
int sl = r10_bio->read_slot;
|
|
|
|
int success = 0;
|
|
|
|
int start;
|
|
|
|
|
|
|
|
if (s > (PAGE_SIZE>>9))
|
|
|
|
s = PAGE_SIZE >> 9;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
do {
|
2011-07-28 05:39:24 +04:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
|
2010-06-24 07:31:03 +04:00
|
|
|
d = r10_bio->devs[sl].devnum;
|
2006-10-03 12:15:45 +04:00
|
|
|
rdev = rcu_dereference(conf->mirrors[d].rdev);
|
|
|
|
if (rdev &&
|
2011-07-28 05:39:24 +04:00
|
|
|
test_bit(In_sync, &rdev->flags) &&
|
2016-06-02 09:19:53 +03:00
|
|
|
!test_bit(Faulty, &rdev->flags) &&
|
2011-07-28 05:39:24 +04:00
|
|
|
is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
|
|
|
|
&first_bad, &bad_sectors) == 0) {
|
2006-10-03 12:15:45 +04:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2010-10-27 08:16:40 +04:00
|
|
|
success = sync_page_io(rdev,
|
2006-10-03 12:15:45 +04:00
|
|
|
r10_bio->devs[sl].addr +
|
2011-01-14 01:14:33 +03:00
|
|
|
sect,
|
2006-10-03 12:15:45 +04:00
|
|
|
s<<9,
|
2016-06-05 22:32:07 +03:00
|
|
|
conf->tmppage,
|
|
|
|
REQ_OP_READ, 0, false);
|
2006-10-03 12:15:45 +04:00
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
rcu_read_lock();
|
|
|
|
if (success)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
sl++;
|
|
|
|
if (sl == conf->copies)
|
|
|
|
sl = 0;
|
|
|
|
} while (!success && sl != r10_bio->read_slot);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
if (!success) {
|
2011-07-28 05:39:25 +04:00
|
|
|
/* Cannot read from anywhere, just mark the block
|
|
|
|
* as bad on the first device to discourage future
|
|
|
|
* reads.
|
|
|
|
*/
|
2006-10-03 12:15:45 +04:00
|
|
|
int dn = r10_bio->devs[r10_bio->read_slot].devnum;
|
2011-07-28 05:39:25 +04:00
|
|
|
rdev = conf->mirrors[dn].rdev;
|
|
|
|
|
|
|
|
if (!rdev_set_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[r10_bio->read_slot].addr
|
|
|
|
+ sect,
|
2012-02-14 04:10:10 +04:00
|
|
|
s, 0)) {
|
2011-07-28 05:39:25 +04:00
|
|
|
md_error(mddev, rdev);
|
2012-02-14 04:10:10 +04:00
|
|
|
r10_bio->devs[r10_bio->read_slot].bio
|
|
|
|
= IO_BLOCKED;
|
|
|
|
}
|
2006-10-03 12:15:45 +04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
start = sl;
|
|
|
|
/* write it back and re-read */
|
|
|
|
rcu_read_lock();
|
|
|
|
while (sl != r10_bio->read_slot) {
|
2009-12-14 04:49:57 +03:00
|
|
|
char b[BDEVNAME_SIZE];
|
2010-06-24 07:31:03 +04:00
|
|
|
|
2006-10-03 12:15:45 +04:00
|
|
|
if (sl==0)
|
|
|
|
sl = conf->copies;
|
|
|
|
sl--;
|
|
|
|
d = r10_bio->devs[sl].devnum;
|
|
|
|
rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2011-07-28 05:39:23 +04:00
|
|
|
if (!rdev ||
|
2016-06-02 09:19:53 +03:00
|
|
|
test_bit(Faulty, &rdev->flags) ||
|
2011-07-28 05:39:23 +04:00
|
|
|
!test_bit(In_sync, &rdev->flags))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2011-07-28 05:39:25 +04:00
|
|
|
if (r10_sync_page_io(rdev,
|
|
|
|
r10_bio->devs[sl].addr +
|
|
|
|
sect,
|
2012-07-03 09:55:33 +04:00
|
|
|
s, conf->tmppage, WRITE)
|
2011-07-28 05:39:23 +04:00
|
|
|
== 0) {
|
|
|
|
/* Well, this device is dead */
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: read correction write failed (%d sectors at %llu on %s)\n",
|
|
|
|
mdname(mddev), s,
|
|
|
|
(unsigned long long)(
|
|
|
|
sect +
|
|
|
|
choose_data_offset(r10_bio,
|
|
|
|
rdev)),
|
|
|
|
bdevname(rdev->bdev, b));
|
|
|
|
pr_notice("md/raid10:%s: %s: failing drive\n",
|
|
|
|
mdname(mddev),
|
|
|
|
bdevname(rdev->bdev, b));
|
2006-10-03 12:15:45 +04:00
|
|
|
}
|
2011-07-28 05:39:23 +04:00
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
rcu_read_lock();
|
2006-10-03 12:15:45 +04:00
|
|
|
}
|
|
|
|
sl = start;
|
|
|
|
while (sl != r10_bio->read_slot) {
|
2011-07-28 05:39:23 +04:00
|
|
|
char b[BDEVNAME_SIZE];
|
2010-06-24 07:31:03 +04:00
|
|
|
|
2006-10-03 12:15:45 +04:00
|
|
|
if (sl==0)
|
|
|
|
sl = conf->copies;
|
|
|
|
sl--;
|
|
|
|
d = r10_bio->devs[sl].devnum;
|
|
|
|
rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2011-07-28 05:39:23 +04:00
|
|
|
if (!rdev ||
|
2016-06-02 09:19:53 +03:00
|
|
|
test_bit(Faulty, &rdev->flags) ||
|
2011-07-28 05:39:23 +04:00
|
|
|
!test_bit(In_sync, &rdev->flags))
|
|
|
|
continue;
|
2006-10-03 12:15:45 +04:00
|
|
|
|
2011-07-28 05:39:23 +04:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2011-07-28 05:39:25 +04:00
|
|
|
switch (r10_sync_page_io(rdev,
|
|
|
|
r10_bio->devs[sl].addr +
|
|
|
|
sect,
|
2012-07-03 09:55:33 +04:00
|
|
|
s, conf->tmppage,
|
2011-07-28 05:39:25 +04:00
|
|
|
READ)) {
|
|
|
|
case 0:
|
2011-07-28 05:39:23 +04:00
|
|
|
/* Well, this device is dead */
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: unable to read back corrected sectors (%d sectors at %llu on %s)\n",
|
2011-07-28 05:39:23 +04:00
|
|
|
mdname(mddev), s,
|
|
|
|
(unsigned long long)(
|
2012-05-21 03:28:33 +04:00
|
|
|
sect +
|
|
|
|
choose_data_offset(r10_bio, rdev)),
|
2011-07-28 05:39:23 +04:00
|
|
|
bdevname(rdev->bdev, b));
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: %s: failing drive\n",
|
2011-07-28 05:39:23 +04:00
|
|
|
mdname(mddev),
|
|
|
|
bdevname(rdev->bdev, b));
|
2011-07-28 05:39:25 +04:00
|
|
|
break;
|
|
|
|
case 1:
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_info("md/raid10:%s: read error corrected (%d sectors at %llu on %s)\n",
|
2011-07-28 05:39:23 +04:00
|
|
|
mdname(mddev), s,
|
|
|
|
(unsigned long long)(
|
2012-05-21 03:28:33 +04:00
|
|
|
sect +
|
|
|
|
choose_data_offset(r10_bio, rdev)),
|
2011-07-28 05:39:23 +04:00
|
|
|
bdevname(rdev->bdev, b));
|
|
|
|
atomic_add(s, &rdev->corrected_errors);
|
2006-10-03 12:15:45 +04:00
|
|
|
}
|
2011-07-28 05:39:23 +04:00
|
|
|
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
rcu_read_lock();
|
2006-10-03 12:15:45 +04:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
sectors -= s;
|
|
|
|
sect += s;
|
|
|
|
}
|
|
|
|
}
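
The slot traversal in fix_read_error() is easier to follow with the I/O removed. The sketch below is an assumption-laden model, not the driver's code: find_good_slot() and writeback_order() are hypothetical names, and a readable[] array stands in for the rdev checks and the sync_page_io() read. The forward pass starts at read_slot and wraps around until a copy can be read; the backward pass then walks from that slot back to read_slot, which is the order in which the write-back and the re-read are attempted.

```c
/*
 * Forward search: starting at read_slot, return the first slot whose copy
 * is readable, wrapping at `copies`. Returns -1 if no copy can be read.
 */
static int find_good_slot(int copies, int read_slot, const int *readable)
{
	int sl = read_slot;

	do {
		if (readable[sl])
			return sl;
		sl++;
		if (sl == copies)
			sl = 0;
	} while (sl != read_slot);
	return -1;
}

/*
 * Backward walk: record the slots visited when stepping from good_slot
 * back to read_slot (inclusive). The good slot itself is not visited.
 * `out` must have room for `copies` entries. Returns the number recorded.
 */
static int writeback_order(int copies, int read_slot, int good_slot, int *out)
{
	int n = 0;
	int sl = good_slot;

	while (sl != read_slot) {
		if (sl == 0)
			sl = copies;
		sl--;
		out[n++] = sl;
	}
	return n;
}
```

With three copies, a failed read on slot 1 and a good copy on slot 0, the forward search tries slots 1, 2 and 0, and the backward walk then visits slots 2 and 1.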
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static int narrow_write_error(struct r10bio *r10_bio, int i)
|
2011-07-28 05:39:24 +04:00
|
|
|
{
|
|
|
|
struct bio *bio = r10_bio->master_bio;
|
2011-10-11 09:47:53 +04:00
|
|
|
struct mddev *mddev = r10_bio->mddev;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev = conf->mirrors[r10_bio->devs[i].devnum].rdev;
|
2011-07-28 05:39:24 +04:00
|
|
|
/* bio has the data to be written to slot 'i' where
|
|
|
|
* we just recently had a write error.
|
|
|
|
* We repeatedly clone the bio and trim down to one block,
|
|
|
|
* then try the write. Where the write fails we record
|
|
|
|
* a bad block.
|
|
|
|
* It is conceivable that the bio doesn't exactly align with
|
|
|
|
* blocks. We must handle this.
|
|
|
|
*
|
|
|
|
* We currently own a reference to the rdev.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int block_sectors;
|
|
|
|
sector_t sector;
|
|
|
|
int sectors;
|
|
|
|
int sect_to_write = r10_bio->sectors;
|
|
|
|
int ok = 1;
|
|
|
|
|
|
|
|
if (rdev->badblocks.shift < 0)
|
|
|
|
return 0;
|
|
|
|
|
2015-02-16 06:51:54 +03:00
|
|
|
block_sectors = roundup(1 << rdev->badblocks.shift,
|
|
|
|
bdev_logical_block_size(rdev->bdev) >> 9);
|
2011-07-28 05:39:24 +04:00
|
|
|
sector = r10_bio->sector;
|
|
|
|
sectors = ((r10_bio->sector + block_sectors)
|
|
|
|
& ~(sector_t)(block_sectors - 1))
|
|
|
|
- sector;
|
|
|
|
|
|
|
|
while (sect_to_write) {
|
|
|
|
struct bio *wbio;
|
2016-08-23 11:53:57 +03:00
|
|
|
sector_t wsector;
|
2011-07-28 05:39:24 +04:00
|
|
|
if (sectors > sect_to_write)
|
|
|
|
sectors = sect_to_write;
|
|
|
|
/* Write at 'sector' for 'sectors' */
|
2018-05-21 01:25:52 +03:00
|
|
|
wbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
|
2013-10-12 02:44:27 +04:00
|
|
|
bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
|
2016-08-23 11:53:57 +03:00
|
|
|
wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
|
|
|
|
wbio->bi_iter.bi_sector = wsector +
|
|
|
|
choose_data_offset(r10_bio, rdev);
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(wbio, rdev->bdev);
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
|
2016-06-05 22:31:41 +03:00
|
|
|
|
|
|
|
if (submit_bio_wait(wbio) < 0)
|
2011-07-28 05:39:24 +04:00
|
|
|
/* Failure! */
|
2016-08-23 11:53:57 +03:00
|
|
|
ok = rdev_set_badblocks(rdev, wsector,
|
2011-07-28 05:39:24 +04:00
|
|
|
sectors, 0)
|
|
|
|
&& ok;
|
|
|
|
|
|
|
|
bio_put(wbio);
|
|
|
|
sect_to_write -= sectors;
|
|
|
|
sector += sectors;
|
|
|
|
sectors = block_sectors;
|
|
|
|
}
|
|
|
|
return ok;
|
|
|
|
}
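
The way narrow_write_error() cuts a failed write into block-sized pieces can be sketched without any bio handling. This is a hedged illustration: split_write() is a hypothetical helper, sector_t is redefined locally, and block_sectors is assumed to be a power of two so that the mask arithmetic is valid. The first piece runs only up to the next block boundary, and every later piece is one full block, except possibly the last.

```c
#include <assert.h>

typedef unsigned long long sector_t;	/* local stand-in for the kernel type */

/*
 * Split the range [start, start + total) into pieces that end on
 * block_sectors boundaries. Piece i is starts[i] with length lens[i].
 * Returns the number of pieces written (at most `max`).
 */
static int split_write(sector_t start, int total, int block_sectors,
		       sector_t *starts, int *lens, int max)
{
	sector_t sector = start;
	int sectors;
	int n = 0;

	/* The mask arithmetic below assumes a power-of-two block size. */
	assert(block_sectors > 0 && (block_sectors & (block_sectors - 1)) == 0);

	/* Distance from `start` to the next block boundary. */
	sectors = (int)(((start + block_sectors) &
			 ~(sector_t)(block_sectors - 1)) - sector);

	while (total && n < max) {
		if (sectors > total)
			sectors = total;
		starts[n] = sector;
		lens[n] = sectors;
		n++;
		total -= sectors;
		sector += sectors;
		sectors = block_sectors;
	}
	return n;
}
```

A 20-sector write starting at sector 10 with 8-sector blocks is split into pieces of 6, 8 and 6 sectors starting at sectors 10, 16 and 24.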
|
|
|
|
|
2011-10-11 09:48:43 +04:00
|
|
|
static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
|
2011-07-28 05:39:23 +04:00
|
|
|
{
|
|
|
|
int slot = r10_bio->read_slot;
|
|
|
|
struct bio *bio;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2011-12-23 03:17:54 +04:00
|
|
|
struct md_rdev *rdev = r10_bio->devs[slot].rdev;
|
2011-07-28 05:39:23 +04:00
|
|
|
|
|
|
|
/* we got a read error. Maybe the drive is bad. Maybe just
|
|
|
|
* the block and we can fix it.
|
|
|
|
* We freeze all other IO, and try reading the block from
|
|
|
|
* other devices. When we find one, we re-write
|
|
|
|
* and check if that fixes the read error.
|
|
|
|
* This is all done synchronously while the array is
|
|
|
|
* frozen.
|
|
|
|
*/
|
2012-02-14 04:10:10 +04:00
|
|
|
bio = r10_bio->devs[slot].bio;
|
|
|
|
bio_put(bio);
|
|
|
|
r10_bio->devs[slot].bio = NULL;
|
|
|
|
|
2016-11-18 08:16:12 +03:00
|
|
|
if (mddev->ro)
|
|
|
|
r10_bio->devs[slot].bio = IO_BLOCKED;
|
|
|
|
else if (!test_bit(FailFast, &rdev->flags)) {
|
2013-06-12 05:01:22 +04:00
|
|
|
freeze_array(conf, 1);
|
2011-07-28 05:39:23 +04:00
|
|
|
fix_read_error(conf, mddev, r10_bio);
|
|
|
|
unfreeze_array(conf);
|
2012-02-14 04:10:10 +04:00
|
|
|
} else
|
2016-11-18 08:16:12 +03:00
|
|
|
md_error(mddev, rdev);
|
2012-02-14 04:10:10 +04:00
|
|
|
|
2011-12-23 03:17:54 +04:00
|
|
|
rdev_dec_pending(rdev, mddev);
|
2017-04-05 07:05:51 +03:00
|
|
|
allow_barrier(conf);
|
|
|
|
r10_bio->state = 0;
|
|
|
|
raid10_read_request(mddev, r10_bio->master_bio, r10_bio);
|
2011-07-28 05:39:23 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
|
2011-07-28 05:39:24 +04:00
|
|
|
{
|
|
|
|
/* Some sort of write request has finished and it
|
|
|
|
* succeeded in writing where we thought there was a
|
|
|
|
* bad block. So forget the bad block.
|
2011-07-28 05:39:25 +04:00
|
|
|
* Or possibly it failed and we need to record
|
|
|
|
* a bad block.
|
2011-07-28 05:39:24 +04:00
|
|
|
*/
|
|
|
|
int m;
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev;
|
2011-07-28 05:39:24 +04:00
|
|
|
|
|
|
|
if (test_bit(R10BIO_IsSync, &r10_bio->state) ||
|
|
|
|
test_bit(R10BIO_IsRecover, &r10_bio->state)) {
|
2011-07-28 05:39:25 +04:00
|
|
|
for (m = 0; m < conf->copies; m++) {
|
|
|
|
int dev = r10_bio->devs[m].devnum;
|
|
|
|
rdev = conf->mirrors[dev].rdev;
|
md raid10: fix NULL deference in handle_write_completed()
In the case of 'recover', an r10bio with R10BIO_WriteError &
R10BIO_IsRecover will be progressed by handle_write_completed().
This function traverses all r10bio->devs[copies].
If devs[m].repl_bio != NULL, it thinks conf->mirrors[dev].replacement
is also not NULL. However, this is not always true.
When there is an rdev of raid10 has replacement, then each r10bio
->devs[m].repl_bio != NULL in conf->r10buf_pool. However, in 'recover',
even if corresponded replacement is NULL, it doesn't clear r10bio
->devs[m].repl_bio, resulting in replacement NULL deference.
This bug was introduced when replacement support for raid10 was
added in Linux 3.3.
As NeilBrown suggested:
Elsewhere the determination of "is this device part of the
resync/recovery" is made by resting bio->bi_end_io.
If this is end_sync_write, then we tried to write here.
If it is NULL, then we didn't try to write.
Fixes: 9ad1aefc8ae8 ("md/raid10: Handle replacement devices during resync.")
Cc: stable (V3.3+)
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-06 12:39:15 +03:00
|
|
|
if (r10_bio->devs[m].bio == NULL ||
|
|
|
|
r10_bio->devs[m].bio->bi_end_io == NULL)
|
2011-07-28 05:39:25 +04:00
|
|
|
continue;
|
2017-06-03 10:38:06 +03:00
|
|
|
if (!r10_bio->devs[m].bio->bi_status) {
|
2011-07-28 05:39:24 +04:00
|
|
|
rdev_clear_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
2012-05-21 03:27:00 +04:00
|
|
|
r10_bio->sectors, 0);
|
2011-07-28 05:39:25 +04:00
|
|
|
} else {
|
|
|
|
if (!rdev_set_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
|
|
|
r10_bio->sectors, 0))
|
|
|
|
md_error(conf->mddev, rdev);
|
2011-07-28 05:39:24 +04:00
|
|
|
}
|
2011-12-23 03:17:55 +04:00
|
|
|
rdev = conf->mirrors[dev].replacement;
|
2018-02-06 12:39:15 +03:00
|
|
|
if (r10_bio->devs[m].repl_bio == NULL ||
|
|
|
|
r10_bio->devs[m].repl_bio->bi_end_io == NULL)
|
2011-12-23 03:17:55 +04:00
|
|
|
continue;
|
2015-07-20 16:29:37 +03:00
|
|
|
|
2017-06-03 10:38:06 +03:00
|
|
|
if (!r10_bio->devs[m].repl_bio->bi_status) {
|
2011-12-23 03:17:55 +04:00
|
|
|
rdev_clear_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
2012-05-21 03:27:00 +04:00
|
|
|
r10_bio->sectors, 0);
|
2011-12-23 03:17:55 +04:00
|
|
|
} else {
|
|
|
|
if (!rdev_set_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
|
|
|
r10_bio->sectors, 0))
|
|
|
|
md_error(conf->mddev, rdev);
|
|
|
|
}
|
2011-07-28 05:39:25 +04:00
|
|
|
}
|
2011-07-28 05:39:24 +04:00
|
|
|
put_buf(r10_bio);
|
|
|
|
} else {
|
2015-08-14 04:26:17 +03:00
|
|
|
bool fail = false;
|
2011-07-28 05:39:24 +04:00
|
|
|
for (m = 0; m < conf->copies; m++) {
|
|
|
|
int dev = r10_bio->devs[m].devnum;
|
|
|
|
struct bio *bio = r10_bio->devs[m].bio;
|
|
|
|
rdev = conf->mirrors[dev].rdev;
|
|
|
|
if (bio == IO_MADE_GOOD) {
|
2011-07-28 05:39:24 +04:00
|
|
|
rdev_clear_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
2012-05-21 03:27:00 +04:00
|
|
|
r10_bio->sectors, 0);
|
2011-07-28 05:39:24 +04:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2017-06-03 10:38:06 +03:00
|
|
|
} else if (bio != NULL && bio->bi_status) {
|
2015-08-14 04:26:17 +03:00
|
|
|
fail = true;
|
2011-07-28 05:39:24 +04:00
|
|
|
if (!narrow_write_error(r10_bio, m)) {
|
|
|
|
md_error(conf->mddev, rdev);
|
|
|
|
set_bit(R10BIO_Degraded,
|
|
|
|
&r10_bio->state);
|
|
|
|
}
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2011-07-28 05:39:24 +04:00
|
|
|
}
|
2011-12-23 03:17:55 +04:00
|
|
|
bio = r10_bio->devs[m].repl_bio;
|
|
|
|
rdev = conf->mirrors[dev].replacement;
|
2011-12-23 03:17:55 +04:00
|
|
|
if (rdev && bio == IO_MADE_GOOD) {
|
2011-12-23 03:17:55 +04:00
|
|
|
rdev_clear_badblocks(
|
|
|
|
rdev,
|
|
|
|
r10_bio->devs[m].addr,
|
2012-05-21 03:27:00 +04:00
|
|
|
r10_bio->sectors, 0);
|
2011-12-23 03:17:55 +04:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
}
|
2011-07-28 05:39:24 +04:00
|
|
|
}
|
2015-08-14 04:26:17 +03:00
|
|
|
if (fail) {
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
list_add(&r10_bio->retry_list, &conf->bio_end_io_list);
|
2016-03-14 21:49:32 +03:00
|
|
|
conf->nr_queued++;
|
2015-08-14 04:26:17 +03:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2017-04-17 12:11:05 +03:00
|
|
|
/*
|
|
|
|
* In case freeze_array() is waiting for condition
|
|
|
|
* nr_pending == nr_queued + extra to be true.
|
|
|
|
*/
|
|
|
|
wake_up(&conf->wait_barrier);
|
2015-08-14 04:26:17 +03:00
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
2015-10-24 08:23:48 +03:00
|
|
|
} else {
|
|
|
|
if (test_bit(R10BIO_WriteError,
|
|
|
|
&r10_bio->state))
|
|
|
|
close_write(r10_bio);
|
2015-08-14 04:26:17 +03:00
|
|
|
raid_end_bio_io(r10_bio);
|
2015-10-24 08:23:48 +03:00
|
|
|
}
|
2011-07-28 05:39:24 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-10-11 06:34:00 +04:00
|
|
|
static void raid10d(struct md_thread *thread)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2012-10-11 06:34:00 +04:00
|
|
|
struct mddev *mddev = thread->mddev;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10_bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
unsigned long flags;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct list_head *head = &conf->retry_list;
|
2011-04-18 12:25:41 +04:00
|
|
|
struct blk_plug plug;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
md_check_recovery(mddev);
|
|
|
|
|
2015-08-14 04:26:17 +03:00
|
|
|
if (!list_empty_careful(&conf->bio_end_io_list) &&
|
2016-12-09 02:48:19 +03:00
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
|
2015-08-14 04:26:17 +03:00
|
|
|
LIST_HEAD(tmp);
|
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
2016-12-09 02:48:19 +03:00
|
|
|
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
|
2016-03-14 21:49:32 +03:00
|
|
|
while (!list_empty(&conf->bio_end_io_list)) {
|
|
|
|
list_move(conf->bio_end_io_list.prev, &tmp);
|
|
|
|
conf->nr_queued--;
|
|
|
|
}
|
2015-08-14 04:26:17 +03:00
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
while (!list_empty(&tmp)) {
|
2015-10-01 22:17:43 +03:00
|
|
|
r10_bio = list_first_entry(&tmp, struct r10bio,
|
|
|
|
retry_list);
|
2015-08-14 04:26:17 +03:00
|
|
|
list_del(&r10_bio->retry_list);
|
2015-10-24 08:23:48 +03:00
|
|
|
if (mddev->degraded)
|
|
|
|
set_bit(R10BIO_Degraded, &r10_bio->state);
|
|
|
|
|
|
|
|
if (test_bit(R10BIO_WriteError,
|
|
|
|
&r10_bio->state))
|
|
|
|
close_write(r10_bio);
|
2015-08-14 04:26:17 +03:00
|
|
|
raid_end_bio_io(r10_bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-04-18 12:25:41 +04:00
|
|
|
blk_start_plug(&plug);
|
2005-04-17 02:20:36 +04:00
|
|
|
for (;;) {
|
2006-01-06 11:20:16 +03:00
|
|
|
|
2012-07-31 11:08:14 +04:00
|
|
|
flush_pending_writes(conf);
|
2006-01-06 11:20:16 +03:00
|
|
|
|
2008-03-05 01:29:29 +03:00
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
|
|
|
if (list_empty(head)) {
|
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
2005-04-17 02:20:36 +04:00
|
|
|
break;
|
2008-03-05 01:29:29 +03:00
|
|
|
}
|
2011-10-11 09:48:43 +04:00
|
|
|
r10_bio = list_entry(head->prev, struct r10bio, retry_list);
|
2005-04-17 02:20:36 +04:00
|
|
|
list_del(head->prev);
|
2006-01-06 11:20:28 +03:00
|
|
|
conf->nr_queued--;
|
2005-04-17 02:20:36 +04:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
|
|
|
|
mddev = r10_bio->mddev;
|
2009-06-16 10:54:21 +04:00
|
|
|
conf = mddev->private;
|
2011-07-28 05:39:24 +04:00
|
|
|
if (test_bit(R10BIO_MadeGood, &r10_bio->state) ||
|
|
|
|
test_bit(R10BIO_WriteError, &r10_bio->state))
|
2011-07-28 05:39:24 +04:00
|
|
|
handle_write_completed(conf, r10_bio);
|
2012-05-22 07:53:47 +04:00
|
|
|
else if (test_bit(R10BIO_IsReshape, &r10_bio->state))
|
|
|
|
reshape_request_write(mddev, r10_bio);
|
2011-07-28 05:39:24 +04:00
|
|
|
else if (test_bit(R10BIO_IsSync, &r10_bio->state))
|
2005-04-17 02:20:36 +04:00
|
|
|
sync_request_write(mddev, r10_bio);
|
2011-03-10 10:52:07 +03:00
|
|
|
else if (test_bit(R10BIO_IsRecover, &r10_bio->state))
|
2005-04-17 02:20:36 +04:00
|
|
|
recovery_request_write(mddev, r10_bio);
|
2011-07-28 05:39:23 +04:00
|
|
|
else if (test_bit(R10BIO_ReadError, &r10_bio->state))
|
2011-07-28 05:39:23 +04:00
|
|
|
handle_read_error(mddev, r10_bio);
|
2017-04-05 07:05:51 +03:00
|
|
|
else
|
|
|
|
WARN_ON_ONCE(1);
|
2011-07-28 05:39:23 +04:00
|
|
|
|
2009-10-16 08:55:32 +04:00
|
|
|
cond_resched();
|
2016-12-09 02:48:19 +03:00
|
|
|
if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
|
2011-07-28 05:31:48 +04:00
|
|
|
md_check_recovery(mddev);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2011-04-18 12:25:41 +04:00
|
|
|
blk_finish_plug(&plug);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static int init_resync(struct r10conf *conf)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2018-05-21 01:25:52 +03:00
|
|
|
int ret, buffs, i;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
buffs = RESYNC_WINDOW / RESYNC_BLOCK_SIZE;
|
2018-05-21 01:25:52 +03:00
|
|
|
BUG_ON(mempool_initialized(&conf->r10buf_pool));
|
2011-12-23 03:17:54 +04:00
|
|
|
conf->have_replacement = 0;
|
2012-05-21 03:28:20 +04:00
|
|
|
for (i = 0; i < conf->geo.raid_disks; i++)
|
2011-12-23 03:17:54 +04:00
|
|
|
if (conf->mirrors[i].replacement)
|
|
|
|
conf->have_replacement = 1;
|
2018-05-21 01:25:52 +03:00
|
|
|
ret = mempool_init(&conf->r10buf_pool, buffs,
|
|
|
|
r10buf_pool_alloc, r10buf_pool_free, conf);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-17 02:20:36 +04:00
|
|
|
conf->next_resync = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-08-25 03:50:40 +03:00
|
|
|
static struct r10bio *raid10_alloc_init_r10buf(struct r10conf *conf)
|
|
|
|
{
|
2018-05-21 01:25:52 +03:00
|
|
|
struct r10bio *r10bio = mempool_alloc(&conf->r10buf_pool, GFP_NOIO);
|
2017-08-25 03:50:40 +03:00
|
|
|
struct rsync_pages *rp;
|
|
|
|
struct bio *bio;
|
|
|
|
int nalloc;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &conf->mddev->recovery))
|
|
|
|
nalloc = conf->copies; /* resync */
|
|
|
|
else
|
|
|
|
nalloc = 2; /* recovery */
|
|
|
|
|
|
|
|
for (i = 0; i < nalloc; i++) {
|
|
|
|
bio = r10bio->devs[i].bio;
|
|
|
|
rp = bio->bi_private;
|
|
|
|
bio_reset(bio);
|
|
|
|
bio->bi_private = rp;
|
|
|
|
bio = r10bio->devs[i].repl_bio;
|
|
|
|
if (bio) {
|
|
|
|
rp = bio->bi_private;
|
|
|
|
bio_reset(bio);
|
|
|
|
bio->bi_private = rp;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return r10bio;
|
|
|
|
}
|
|
|
|
|
2017-10-24 10:11:52 +03:00
|
|
|
/*
|
|
|
|
* Set cluster_sync_high since we need other nodes to add the
|
|
|
|
* range [cluster_sync_low, cluster_sync_high] to suspend list.
|
|
|
|
*/
|
|
|
|
static void raid10_set_cluster_sync_high(struct r10conf *conf)
|
|
|
|
{
|
|
|
|
sector_t window_size;
|
|
|
|
int extra_chunk, chunks;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, define a "stripe" as a unit that spans all member
|
|
|
|
* devices once, so the number of chunks per stripe is
|
|
|
|
* raid_disks / near_copies. Otherwise, if near_copies is
|
|
|
|
* close to raid_disks, the resync window could increase
|
|
|
|
* linearly with raid_disks, which means we would suspend
|
|
|
|
* a very large IO window when that is not
|
|
|
|
* necessary. If raid_disks is not divisible by near_copies,
|
|
|
|
* an extra chunk is needed to ensure the whole "stripe" is
|
|
|
|
* covered.
|
|
|
|
*/
|
|
|
|
|
|
|
|
chunks = conf->geo.raid_disks / conf->geo.near_copies;
|
|
|
|
if (conf->geo.raid_disks % conf->geo.near_copies == 0)
|
|
|
|
extra_chunk = 0;
|
|
|
|
else
|
|
|
|
extra_chunk = 1;
|
|
|
|
window_size = (chunks + extra_chunk) * conf->mddev->chunk_sectors;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* At least use a 32M window to align with raid1's resync window
|
|
|
|
*/
|
|
|
|
window_size = (CLUSTER_RESYNC_WINDOW_SECTORS > window_size) ?
|
|
|
|
CLUSTER_RESYNC_WINDOW_SECTORS : window_size;
|
|
|
|
|
|
|
|
conf->cluster_sync_high = conf->cluster_sync_low + window_size;
|
|
|
|
}
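The window arithmetic in raid10_set_cluster_sync_high() can be tried in isolation. The following is a minimal user-space sketch, not kernel code: `example_cluster_window` and `EXAMPLE_MIN_WINDOW_SECTORS` are hypothetical names, and the minimum is assumed to be 32 MiB expressed in 512-byte sectors, based only on the "32M" comment above.

```c
#include <stdint.h>

/* Assumption: 32 MiB in 512-byte sectors, inferred from the "32M" comment. */
#define EXAMPLE_MIN_WINDOW_SECTORS ((uint64_t)32 * 1024 * 1024 / 512)

/*
 * Hypothetical helper that mirrors the arithmetic of
 * raid10_set_cluster_sync_high(): one "stripe" covers
 * raid_disks / near_copies chunks, plus one extra chunk when the
 * division leaves a remainder, and the result is never smaller than
 * the assumed minimum window.
 */
uint64_t example_cluster_window(int raid_disks, int near_copies,
				uint64_t chunk_sectors)
{
	int chunks = raid_disks / near_copies;
	int extra_chunk = (raid_disks % near_copies == 0) ? 0 : 1;
	uint64_t window = (uint64_t)(chunks + extra_chunk) * chunk_sectors;

	return window < EXAMPLE_MIN_WINDOW_SECTORS ?
		EXAMPLE_MIN_WINDOW_SECTORS : window;
}
```

With 4 disks, 2 near copies and 1024-sector chunks, the stripe is only 2048 sectors, so the assumed minimum applies. With 5 disks and 65536-sector chunks, the extra chunk brings the window to three chunks.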
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* perform a "sync" on one "block"
|
|
|
|
*
|
|
|
|
* We need to make sure that no normal I/O request - particularly write
|
|
|
|
* requests - conflict with active sync requests.
|
|
|
|
*
|
|
|
|
* This is achieved by tracking pending requests and a 'barrier' concept
|
|
|
|
* that can be installed to exclude normal IO requests.
|
|
|
|
*
|
|
|
|
* Resync and recovery are handled very differently.
|
|
|
|
* We differentiate by looking at MD_RECOVERY_SYNC in mddev->recovery.
|
|
|
|
*
|
|
|
|
* For resync, we iterate over virtual addresses, read all copies,
|
|
|
|
* and update if there are differences. If only one copy is live,
|
|
|
|
* skip it.
|
|
|
|
* For recovery, we iterate over physical addresses, read a good
|
|
|
|
* value for each non-in_sync drive, and over-write.
|
|
|
|
*
|
|
|
|
* So, for recovery we may have several outstanding complex requests for a
|
|
|
|
* given address, one for each out-of-sync device. We model this by allocating
|
|
|
|
* a number of r10_bio structures, one for each out-of-sync device.
|
|
|
|
* As we setup these structures, we collect all bio's together into a list
|
|
|
|
* which we then process collectively to add pages, and then process again
|
|
|
|
* to pass to generic_make_request.
|
|
|
|
*
|
|
|
|
* The r10_bio structures are linked using a borrowed master_bio pointer.
|
|
|
|
* This link is counted in ->remaining. When the r10_bio that points to NULL
|
|
|
|
* has its remaining count decremented to 0, the whole complex operation
|
|
|
|
* is complete.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
2016-01-21 00:52:20 +03:00
|
|
|
static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
|
2015-02-19 08:04:40 +03:00
|
|
|
int *skipped)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *r10_bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
struct bio *biolist = NULL, *bio;
|
|
|
|
sector_t max_sector, nr_sectors;
|
|
|
|
int i;
|
2006-01-06 11:20:16 +03:00
|
|
|
int max_sync;
|
2010-10-19 03:03:39 +04:00
|
|
|
sector_t sync_blocks;
|
2005-04-17 02:20:36 +04:00
|
|
|
sector_t sectors_skipped = 0;
|
|
|
|
int chunks_skipped = 0;
|
2012-05-21 03:28:20 +04:00
|
|
|
sector_t chunk_mask = conf->geo.chunk_mask;
|
2017-07-14 11:14:42 +03:00
|
|
|
int page_idx = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
if (!mempool_initialized(&conf->r10buf_pool))
|
2005-04-17 02:20:36 +04:00
|
|
|
if (init_resync(conf))
|
2005-06-22 04:17:13 +04:00
|
|
|
return 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2013-04-24 05:42:42 +04:00
|
|
|
/*
|
|
|
|
* Allow skipping a full rebuild for incremental assembly
|
|
|
|
* of a clean array, like RAID1 does.
|
|
|
|
*/
|
|
|
|
if (mddev->bitmap == NULL &&
|
|
|
|
mddev->recovery_cp == MaxSector &&
|
2013-07-04 10:41:53 +04:00
|
|
|
mddev->reshape_position == MaxSector &&
|
|
|
|
!test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
|
2013-04-24 05:42:42 +04:00
|
|
|
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
|
2013-07-04 10:41:53 +04:00
|
|
|
!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
|
2013-04-24 05:42:42 +04:00
|
|
|
conf->fullsync == 0) {
|
|
|
|
*skipped = 1;
|
2013-07-04 10:41:53 +04:00
|
|
|
return mddev->dev_sectors - sector_nr;
|
2013-04-24 05:42:42 +04:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
skipped:
|
2009-03-31 07:33:13 +04:00
|
|
|
max_sector = mddev->dev_sectors;
|
2012-05-22 07:53:47 +04:00
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
|
|
|
|
test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
2005-04-17 02:20:36 +04:00
|
|
|
max_sector = mddev->resync_max_sectors;
|
|
|
|
if (sector_nr >= max_sector) {
|
2017-10-24 10:11:52 +03:00
|
|
|
conf->cluster_sync_low = 0;
|
|
|
|
conf->cluster_sync_high = 0;
|
|
|
|
|
2006-01-06 11:20:16 +03:00
|
|
|
/* If we aborted, we need to abort the
|
|
|
|
* sync on the 'current' bitmap chunks (there can
|
|
|
|
* be several when recovering multiple devices),
|
|
|
|
* as we may have started syncing it but not finished.
|
|
|
|
* We can find the current address in
|
|
|
|
* mddev->curr_resync, but for recovery,
|
|
|
|
* we need to convert that to several
|
|
|
|
* virtual addresses.
|
|
|
|
*/
|
2012-05-22 07:53:47 +04:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
|
|
|
|
end_reshape(conf);
|
2014-08-18 07:59:50 +04:00
|
|
|
close_sync(conf);
|
2012-05-22 07:53:47 +04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-01-06 11:20:16 +03:00
|
|
|
if (mddev->curr_resync < max_sector) { /* aborted */
|
|
|
|
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
|
|
|
|
&sync_blocks, 1);
|
2012-05-21 03:28:20 +04:00
|
|
|
else for (i = 0; i < conf->geo.raid_disks; i++) {
|
2006-01-06 11:20:16 +03:00
|
|
|
sector_t sect =
|
|
|
|
raid10_find_virt(conf, mddev->curr_resync, i);
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_end_sync(mddev->bitmap, sect,
|
|
|
|
&sync_blocks, 1);
|
2006-01-06 11:20:16 +03:00
|
|
|
}
|
2011-12-23 03:17:55 +04:00
|
|
|
} else {
|
|
|
|
/* completed sync */
|
|
|
|
if ((!mddev->bitmap || conf->fullsync)
|
|
|
|
&& conf->have_replacement
|
|
|
|
&& test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
|
|
|
/* Completed a full sync so the replacements
|
|
|
|
* are now fully recovered.
|
|
|
|
*/
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
|
|
|
for (i = 0; i < conf->geo.raid_disks; i++) {
|
|
|
|
struct md_rdev *rdev =
|
|
|
|
rcu_dereference(conf->mirrors[i].replacement);
|
|
|
|
if (rdev)
|
|
|
|
rdev->recovery_offset = MaxSector;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2011-12-23 03:17:55 +04:00
|
|
|
}
|
2006-01-06 11:20:16 +03:00
|
|
|
conf->fullsync = 0;
|
2011-12-23 03:17:55 +04:00
|
|
|
}
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_close_sync(mddev->bitmap);
|
2005-04-17 02:20:36 +04:00
|
|
|
close_sync(conf);
|
2005-06-22 04:17:13 +04:00
|
|
|
*skipped = 1;
|
2005-04-17 02:20:36 +04:00
|
|
|
return sectors_skipped;
|
|
|
|
}
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
|
|
|
return reshape_request(mddev, sector_nr, skipped);
|
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
if (chunks_skipped >= conf->geo.raid_disks) {
|
2005-04-17 02:20:36 +04:00
|
|
|
/* if there has been nothing to do on any drive,
|
|
|
|
* then there is nothing to do at all..
|
|
|
|
*/
|
2005-06-22 04:17:13 +04:00
|
|
|
*skipped = 1;
|
|
|
|
return (max_sector - sector_nr) + sectors_skipped;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2008-02-06 12:39:52 +03:00
|
|
|
if (max_sector > mddev->resync_max)
|
|
|
|
max_sector = mddev->resync_max; /* Don't do IO beyond here */
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/* make sure whole request will fit in a chunk - if chunks
|
|
|
|
* are meaningful
|
|
|
|
*/
|
2012-05-21 03:28:20 +04:00
|
|
|
if (conf->geo.near_copies < conf->geo.raid_disks &&
|
|
|
|
max_sector > (sector_nr | chunk_mask))
|
|
|
|
max_sector = (sector_nr | chunk_mask) + 1;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2016-06-13 16:51:19 +03:00
|
|
|
/*
|
|
|
|
* If there is non-resync activity waiting for a turn, then let it
|
|
|
|
* through before starting on this new sync request.
|
|
|
|
*/
|
|
|
|
if (conf->nr_waiting)
|
|
|
|
schedule_timeout_uninterruptible(1);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
/* Again, very different code for resync and recovery.
|
|
|
|
* Both must result in an r10bio with a list of bios that
|
2017-08-23 20:10:32 +03:00
|
|
|
* have bi_end_io, bi_sector, bi_disk set,
|
2005-04-17 02:20:36 +04:00
|
|
|
* and bi_private set to the r10bio.
|
|
|
|
* For recovery, we may actually create several r10bios
|
|
|
|
* with 2 bios in each, that correspond to the bios in the main one.
|
|
|
|
* In this case, the subordinate r10bios link back through a
|
|
|
|
* borrowed master_bio pointer, and the counter in the master
|
|
|
|
* includes a ref from each subordinate.
|
|
|
|
*/
|
|
|
|
/* First, we decide what to do and set ->bi_end_io
|
|
|
|
* To end_sync_read if we want to read, and
|
|
|
|
* end_sync_write if we will want to write.
|
|
|
|
*/
|
|
|
|
|
2006-01-06 11:20:16 +03:00
|
|
|
max_sync = RESYNC_PAGES << (PAGE_SHIFT-9);
|
2005-04-17 02:20:36 +04:00
|
|
|
if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
|
|
|
/* recovery... the complicated one */
|
2011-07-28 05:39:24 +04:00
|
|
|
int j;
|
2005-04-17 02:20:36 +04:00
|
|
|
r10_bio = NULL;
|
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
for (i = 0 ; i < conf->geo.raid_disks; i++) {
|
2011-05-11 08:54:41 +04:00
|
|
|
int still_degraded;
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *rb2;
|
2011-05-11 08:54:41 +04:00
|
|
|
sector_t sect;
|
|
|
|
int must_sync;
|
2011-07-28 05:39:24 +04:00
|
|
|
int any_working;
|
md/raid10: Fix raid10 replace hang when new added disk faulty
[Symptom]
The resync thread hangs when a newly added disk is marked faulty during replacement.
[Root Cause]
In raid10_sync_request(), we expect to issue a bio with callback
end_sync_read(), and a bio with callback end_sync_write().
Normally, we add the resyncing sectors to
mddev->recovery_active when raid10_sync_request() returns, and subtract
the resynced sectors from mddev->recovery_active when end_sync_write()
calls end_sync_request().
If the newly added disk, which is replacing the old disk, is set faulty,
there is a race condition:
1. In the first RCU-protected section, the resync thread does not detect
that mreplace is set faulty and passes the condition.
2. In the second RCU-protected section, mreplace is set faulty.
3. But the resync thread prepares the read object first, and then
checks the write condition.
4. It finds that mreplace is set faulty and does not
prepare the write object.
As a result, resync sectors are added but never subtracted.
[How to Reproduce]
This issue can be easily reproduced by the following steps:
mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
mdadm /dev/md0 -a /dev/sde
mdadm /dev/md0 --replace /dev/sdd
sleep 1
mdadm /dev/md0 -f /dev/sde
[How to Fix]
This issue can be fixed by using local variables to record the result
of the test conditions. Once the conditions are satisfied, we can be sure
that we need to issue a bio for read and a bio for write.
The previous commit 24afd80d99f8 ("md/raid10: handle recovery of
replacement devices.") also checks whether bio is NULL, but leaves
a comment saying that it is a pointless test. So we remove this dummy
check.
Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Alex Wu <alexwu@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-09-21 11:05:03 +03:00
|
|
|
int need_recover = 0;
|
|
|
|
int need_replace = 0;
|
2012-07-31 04:03:52 +04:00
|
|
|
struct raid10_info *mirror = &conf->mirrors[i];
|
2016-06-02 09:19:52 +03:00
|
|
|
struct md_rdev *mrdev, *mreplace;
|
2011-12-23 03:17:55 +04:00
|
|
|
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
|
|
|
mrdev = rcu_dereference(mirror->rdev);
|
|
|
|
mreplace = rcu_dereference(mirror->replacement);
|
|
|
|
|
2018-09-21 11:05:03 +03:00
|
|
|
if (mrdev != NULL &&
|
|
|
|
!test_bit(Faulty, &mrdev->flags) &&
|
|
|
|
!test_bit(In_sync, &mrdev->flags))
|
|
|
|
need_recover = 1;
|
|
|
|
if (mreplace != NULL &&
|
|
|
|
!test_bit(Faulty, &mreplace->flags))
|
|
|
|
need_replace = 1;
|
|
|
|
|
|
|
|
if (!need_recover && !need_replace) {
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2011-05-11 08:54:41 +04:00
|
|
|
continue;
|
2016-06-02 09:19:52 +03:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-05-11 08:54:41 +04:00
|
|
|
still_degraded = 0;
|
|
|
|
/* want to reconstruct this device */
|
|
|
|
rb2 = r10_bio;
|
|
|
|
sect = raid10_find_virt(conf, sector_nr, i);
|
2012-07-03 04:37:30 +04:00
|
|
|
if (sect >= mddev->resync_max_sectors) {
|
|
|
|
/* last stripe is not complete - don't
|
|
|
|
* try to recover this sector.
|
|
|
|
*/
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2012-07-03 04:37:30 +04:00
|
|
|
continue;
|
|
|
|
}
|
2016-06-02 09:19:53 +03:00
|
|
|
if (mreplace && test_bit(Faulty, &mreplace->flags))
|
|
|
|
mreplace = NULL;
|
2011-12-23 03:17:55 +04:00
|
|
|
/* Unless we are doing a full sync, or a replacement
|
|
|
|
* we only need to recover the block if it is set in
|
|
|
|
* the bitmap
|
2011-05-11 08:54:41 +04:00
|
|
|
*/
|
2018-08-02 01:20:50 +03:00
|
|
|
must_sync = md_bitmap_start_sync(mddev->bitmap, sect,
|
|
|
|
&sync_blocks, 1);
|
2011-05-11 08:54:41 +04:00
|
|
|
if (sync_blocks < max_sync)
|
|
|
|
max_sync = sync_blocks;
|
|
|
|
if (!must_sync &&
|
2016-06-02 09:19:52 +03:00
|
|
|
mreplace == NULL &&
|
2011-05-11 08:54:41 +04:00
|
|
|
!conf->fullsync) {
|
|
|
|
/* yep, skip the sync_blocks here, but don't assume
|
|
|
|
* that there will never be anything to do here
|
|
|
|
*/
|
|
|
|
chunks_skipped = -1;
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2011-05-11 08:54:41 +04:00
|
|
|
continue;
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
atomic_inc(&mrdev->nr_pending);
|
|
|
|
if (mreplace)
|
|
|
|
atomic_inc(&mreplace->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2006-01-06 11:20:16 +03:00
|
|
|
|
2017-08-25 03:50:40 +03:00
|
|
|
r10_bio = raid10_alloc_init_r10buf(conf);
|
2014-08-18 08:38:45 +04:00
|
|
|
r10_bio->state = 0;
|
2011-05-11 08:54:41 +04:00
|
|
|
raise_barrier(conf, rb2 != NULL);
|
|
|
|
atomic_set(&r10_bio->remaining, 0);
|
2009-05-07 06:48:10 +04:00
|
|
|
|
2011-05-11 08:54:41 +04:00
|
|
|
r10_bio->master_bio = (struct bio*)rb2;
|
|
|
|
if (rb2)
|
|
|
|
atomic_inc(&rb2->remaining);
|
|
|
|
r10_bio->mddev = mddev;
|
|
|
|
set_bit(R10BIO_IsRecover, &r10_bio->state);
|
|
|
|
r10_bio->sector = sect;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2011-05-11 08:54:41 +04:00
|
|
|
raid10_find_phys(conf, r10_bio);
|
|
|
|
|
|
|
|
/* Need to check if the array will still be
|
|
|
|
* degraded
|
|
|
|
*/
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
|
|
|
for (j = 0; j < conf->geo.raid_disks; j++) {
|
|
|
|
struct md_rdev *rdev = rcu_dereference(
|
|
|
|
conf->mirrors[j].rdev);
|
|
|
|
if (rdev == NULL || test_bit(Faulty, &rdev->flags)) {
|
2011-05-11 08:54:41 +04:00
|
|
|
still_degraded = 1;
|
2005-09-10 03:24:04 +04:00
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
}
|
2011-05-11 08:54:41 +04:00
|
|
|
|
2018-08-02 01:20:50 +03:00
|
|
|
must_sync = md_bitmap_start_sync(mddev->bitmap, sect,
|
|
|
|
&sync_blocks, still_degraded);
|
2011-05-11 08:54:41 +04:00
|
|
|
|
2011-07-28 05:39:24 +04:00
|
|
|
any_working = 0;
|
2011-05-11 08:54:41 +04:00
|
|
|
for (j=0; j<conf->copies;j++) {
|
2011-07-28 05:39:24 +04:00
|
|
|
int k;
|
2011-05-11 08:54:41 +04:00
|
|
|
int d = r10_bio->devs[j].devnum;
|
2011-07-28 05:39:25 +04:00
|
|
|
sector_t from_addr, to_addr;
|
2016-06-02 09:19:52 +03:00
|
|
|
struct md_rdev *rdev =
|
|
|
|
rcu_dereference(conf->mirrors[d].rdev);
|
2011-07-28 05:39:24 +04:00
|
|
|
sector_t sector, first_bad;
|
|
|
|
int bad_sectors;
|
2016-06-02 09:19:52 +03:00
|
|
|
if (!rdev ||
|
|
|
|
!test_bit(In_sync, &rdev->flags))
|
2011-05-11 08:54:41 +04:00
|
|
|
continue;
|
|
|
|
/* This is where we read from */
|
2011-07-28 05:39:24 +04:00
|
|
|
any_working = 1;
|
2011-07-28 05:39:24 +04:00
|
|
|
sector = r10_bio->devs[j].addr;
|
|
|
|
|
|
|
|
if (is_badblock(rdev, sector, max_sync,
|
|
|
|
&first_bad, &bad_sectors)) {
|
|
|
|
if (first_bad > sector)
|
|
|
|
max_sync = first_bad - sector;
|
|
|
|
else {
|
|
|
|
bad_sectors -= (sector
|
|
|
|
- first_bad);
|
|
|
|
if (max_sync > bad_sectors)
|
|
|
|
max_sync = bad_sectors;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
2011-05-11 08:54:41 +04:00
|
|
|
bio = r10_bio->devs[0].bio;
|
|
|
|
bio->bi_next = biolist;
|
|
|
|
biolist = bio;
|
|
|
|
bio->bi_end_io = end_sync_read;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(bio, REQ_OP_READ, 0);
|
2016-11-18 08:16:12 +03:00
|
|
|
if (test_bit(FailFast, &rdev->flags))
|
|
|
|
bio->bi_opf |= MD_FAILFAST;
|
2011-07-28 05:39:25 +04:00
|
|
|
from_addr = r10_bio->devs[j].addr;
|
2013-10-12 02:44:27 +04:00
|
|
|
bio->bi_iter.bi_sector = from_addr +
|
|
|
|
rdev->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, rdev->bdev);
|
2011-12-23 03:17:55 +04:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
/* and we write to 'i' (if not in_sync) */
|
2011-05-11 08:54:41 +04:00
|
|
|
|
|
|
|
for (k=0; k<conf->copies; k++)
|
|
|
|
if (r10_bio->devs[k].devnum == i)
|
|
|
|
break;
|
|
|
|
BUG_ON(k == conf->copies);
|
2011-07-28 05:39:25 +04:00
|
|
|
to_addr = r10_bio->devs[k].addr;
|
2011-05-11 08:54:41 +04:00
|
|
|
r10_bio->devs[0].devnum = d;
|
2011-07-28 05:39:25 +04:00
|
|
|
r10_bio->devs[0].addr = from_addr;
|
2011-05-11 08:54:41 +04:00
|
|
|
r10_bio->devs[1].devnum = i;
|
2011-07-28 05:39:25 +04:00
|
|
|
r10_bio->devs[1].addr = to_addr;
|
2011-05-11 08:54:41 +04:00
|
|
|
|
2018-09-21 11:05:03 +03:00
|
|
|
if (need_recover) {
|
2011-12-23 03:17:55 +04:00
|
|
|
bio = r10_bio->devs[1].bio;
|
|
|
|
bio->bi_next = biolist;
|
|
|
|
biolist = bio;
|
|
|
|
bio->bi_end_io = end_sync_write;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
|
2013-10-12 02:44:27 +04:00
|
|
|
bio->bi_iter.bi_sector = to_addr
|
2016-06-02 09:19:52 +03:00
|
|
|
+ mrdev->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, mrdev->bdev);
|
2011-12-23 03:17:55 +04:00
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
} else
|
|
|
|
r10_bio->devs[1].bio->bi_end_io = NULL;
|
|
|
|
|
|
|
|
/* and maybe write to replacement */
|
|
|
|
bio = r10_bio->devs[1].repl_bio;
|
|
|
|
if (bio)
|
|
|
|
bio->bi_end_io = NULL;
|
md/raid10: Fix raid10 replace hang when new added disk faulty
[Symptom]
The resync thread hangs when a newly added disk is marked faulty during replacement.
[Root Cause]
In raid10_sync_request(), we expect to issue one bio with the callback
end_sync_read() and one bio with the callback end_sync_write().
Normally, the sectors being resynced are added to
mddev->recovery_active when raid10_sync_request() returns, and they are
subtracted from mddev->recovery_active when end_sync_write()
calls end_sync_request().
If the newly added disk, which is replacing the old disk, is set faulty,
there is a race condition:
1. In the first RCU-protected section, the resync thread does not detect
that mreplace is set faulty, so the condition passes.
2. In the second RCU-protected section, mreplace is set faulty.
3. However, the resync thread prepares the read object first and then
checks the write condition.
4. It finds that mreplace is set faulty and therefore does not
prepare the write object.
As a result, the resync sectors are added but never subtracted.
[How to Reproduce]
This issue can be easily reproduced by the following steps:
mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
mdadm /dev/md0 -a /dev/sde
mdadm /dev/md0 --replace /dev/sdd
sleep 1
mdadm /dev/md0 -f /dev/sde
[How to Fix]
This issue can be fixed by using local variables to record the results
of the test conditions. Once the conditions are satisfied, we can be sure
that we need to issue one bio for read and one bio for write.
The previous commit 24afd80d99f8 ("md/raid10: handle recovery of
replacement devices.") also checks whether bio is NULL, but leaves
a comment saying that it is a pointless test. So we remove this dummy
check.
Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Alex Wu <alexwu@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
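The fix described in this commit message can be sketched in plain userspace C. This is only an illustration of the "record the test results in local variables" idea, not the kernel code: struct demo_rdev, struct sync_plan, snapshot_plan() and writes_to_issue() are hypothetical names, and the exact flag tests are an assumption modelled on the message above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct md_rdev: only the two flags used here. */
struct demo_rdev {
	bool faulty;
	bool in_sync;
};

/* Decisions taken once, while the device state is being inspected. */
struct sync_plan {
	bool need_recover;	/* a write to the original device is wanted */
	bool need_replace;	/* a write to the replacement device is wanted */
};

/* Evaluate the conditions a single time and remember the answers. */
static struct sync_plan snapshot_plan(const struct demo_rdev *mrdev,
				      const struct demo_rdev *mreplace)
{
	struct sync_plan plan = { false, false };

	if (mrdev != NULL && !mrdev->faulty && !mrdev->in_sync)
		plan.need_recover = true;
	if (mreplace != NULL && !mreplace->faulty)
		plan.need_replace = true;
	return plan;
}

/* Later steps consult only the snapshot, never the live flags again. */
static int writes_to_issue(struct sync_plan plan)
{
	return (plan.need_recover ? 1 : 0) + (plan.need_replace ? 1 : 0);
}
```

Because later code reads only the snapshot, a replacement that turns faulty after the check still gets its write bio queued, so the read and write sides stay paired and the accounting balances.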
2018-09-21 11:05:03 +03:00
|
|
|
/* Note: if need_replace, then bio
|
2011-12-23 03:17:55 +04:00
|
|
|
* cannot be NULL as r10buf_pool_alloc will
|
|
|
|
* have allocated it.
|
|
|
|
*/
|
2018-09-21 11:05:03 +03:00
|
|
|
if (!need_replace)
|
2011-12-23 03:17:55 +04:00
|
|
|
break;
|
|
|
|
bio->bi_next = biolist;
|
|
|
|
biolist = bio;
|
|
|
|
bio->bi_end_io = end_sync_write;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
|
2013-10-12 02:44:27 +04:00
|
|
|
bio->bi_iter.bi_sector = to_addr +
|
2016-06-02 09:19:52 +03:00
|
|
|
mreplace->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, mreplace->bdev);
|
2011-12-23 03:17:55 +04:00
|
|
|
atomic_inc(&r10_bio->remaining);
|
2011-05-11 08:54:41 +04:00
|
|
|
break;
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2011-05-11 08:54:41 +04:00
|
|
|
if (j == conf->copies) {
|
2011-07-28 05:39:24 +04:00
|
|
|
/* Cannot recover, so abort the recovery or
|
|
|
|
* record a bad block */
|
|
|
|
if (any_working) {
|
|
|
|
/* problem is that there are bad blocks
|
|
|
|
* on other device(s)
|
|
|
|
*/
|
|
|
|
int k;
|
|
|
|
for (k = 0; k < conf->copies; k++)
|
|
|
|
if (r10_bio->devs[k].devnum == i)
|
|
|
|
break;
|
2011-12-23 03:17:55 +04:00
|
|
|
if (!test_bit(In_sync,
|
2016-06-02 09:19:52 +03:00
|
|
|
&mrdev->flags)
|
2011-12-23 03:17:55 +04:00
|
|
|
&& !rdev_set_badblocks(
|
2016-06-02 09:19:52 +03:00
|
|
|
mrdev,
|
2011-12-23 03:17:55 +04:00
|
|
|
r10_bio->devs[k].addr,
|
|
|
|
max_sync, 0))
|
|
|
|
any_working = 0;
|
2016-06-02 09:19:52 +03:00
|
|
|
if (mreplace &&
|
2011-12-23 03:17:55 +04:00
|
|
|
!rdev_set_badblocks(
|
2016-06-02 09:19:52 +03:00
|
|
|
mreplace,
|
2011-07-28 05:39:24 +04:00
|
|
|
r10_bio->devs[k].addr,
|
|
|
|
max_sync, 0))
|
|
|
|
any_working = 0;
|
|
|
|
}
|
|
|
|
if (!any_working) {
|
|
|
|
if (!test_and_set_bit(MD_RECOVERY_INTR,
|
|
|
|
&mddev->recovery))
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: insufficient working devices for recovery.\n",
|
2011-07-28 05:39:24 +04:00
|
|
|
mdname(mddev));
|
2011-12-23 03:17:55 +04:00
|
|
|
mirror->recovery_disabled
|
2011-07-28 05:39:24 +04:00
|
|
|
= mddev->recovery_disabled;
|
|
|
|
}
|
2014-01-06 03:35:34 +04:00
|
|
|
put_buf(r10_bio);
|
|
|
|
if (rb2)
|
|
|
|
atomic_dec(&rb2->remaining);
|
|
|
|
r10_bio = rb2;
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev_dec_pending(mrdev, mddev);
|
|
|
|
if (mreplace)
|
|
|
|
rdev_dec_pending(mreplace, mddev);
|
2011-05-11 08:54:41 +04:00
|
|
|
break;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev_dec_pending(mrdev, mddev);
|
|
|
|
if (mreplace)
|
|
|
|
rdev_dec_pending(mreplace, mddev);
|
2016-11-18 08:16:12 +03:00
|
|
|
if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
|
|
|
|
/* Only want this if there is elsewhere to
|
|
|
|
* read from. 'j' is currently the first
|
|
|
|
* readable copy.
|
|
|
|
*/
|
|
|
|
int targets = 1;
|
|
|
|
for (; j < conf->copies; j++) {
|
|
|
|
int d = r10_bio->devs[j].devnum;
|
|
|
|
if (conf->mirrors[d].rdev &&
|
|
|
|
test_bit(In_sync,
|
|
|
|
&conf->mirrors[d].rdev->flags))
|
|
|
|
targets++;
|
|
|
|
}
|
|
|
|
if (targets == 1)
|
|
|
|
r10_bio->devs[0].bio->bi_opf
|
|
|
|
&= ~MD_FAILFAST;
|
|
|
|
}
|
2011-05-11 08:54:41 +04:00
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
if (biolist == NULL) {
|
|
|
|
while (r10_bio) {
|
2011-10-11 09:48:43 +04:00
|
|
|
struct r10bio *rb2 = r10_bio;
|
|
|
|
r10_bio = (struct r10bio*) rb2->master_bio;
|
2005-04-17 02:20:36 +04:00
|
|
|
rb2->master_bio = NULL;
|
|
|
|
put_buf(rb2);
|
|
|
|
}
|
|
|
|
goto giveup;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* resync. Schedule a read for every block at this virt offset */
|
|
|
|
int count = 0;
|
2006-01-06 11:20:16 +03:00
|
|
|
|
2017-10-24 10:11:52 +03:00
|
|
|
/*
|
|
|
|
* curr_resync_completed might not have been updated in
|
|
|
|
* time, and cluster_sync_low is set based on it, so
|
|
|
|
* check against "sector_nr + 2 * RESYNC_SECTORS" to be
|
|
|
|
* safe; this ensures curr_resync_completed is
|
|
|
|
* updated in bitmap_cond_end_sync.
|
|
|
|
*/
|
2018-08-02 01:20:50 +03:00
|
|
|
md_bitmap_cond_end_sync(mddev->bitmap, sector_nr,
|
|
|
|
mddev_is_clustered(mddev) &&
|
|
|
|
(sector_nr + 2 * RESYNC_SECTORS > conf->cluster_sync_high));
|
2009-02-25 05:18:47 +03:00
|
|
|
|
2018-08-02 01:20:50 +03:00
|
|
|
if (!md_bitmap_start_sync(mddev->bitmap, sector_nr,
|
|
|
|
&sync_blocks, mddev->degraded) &&
|
2011-05-11 08:54:41 +04:00
|
|
|
!conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED,
|
|
|
|
&mddev->recovery)) {
|
2006-01-06 11:20:16 +03:00
|
|
|
/* We can skip this block */
|
|
|
|
*skipped = 1;
|
|
|
|
return sync_blocks + sectors_skipped;
|
|
|
|
}
|
|
|
|
if (sync_blocks < max_sync)
|
|
|
|
max_sync = sync_blocks;
|
2017-08-25 03:50:40 +03:00
|
|
|
r10_bio = raid10_alloc_init_r10buf(conf);
|
2014-08-18 08:38:45 +04:00
|
|
|
r10_bio->state = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
r10_bio->mddev = mddev;
|
|
|
|
atomic_set(&r10_bio->remaining, 0);
|
2006-01-06 11:20:16 +03:00
|
|
|
raise_barrier(conf, 0);
|
|
|
|
conf->next_resync = sector_nr;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
r10_bio->master_bio = NULL;
|
|
|
|
r10_bio->sector = sector_nr;
|
|
|
|
set_bit(R10BIO_IsSync, &r10_bio->state);
|
|
|
|
raid10_find_phys(conf, r10_bio);
|
2012-05-21 03:28:20 +04:00
|
|
|
r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
for (i = 0; i < conf->copies; i++) {
|
2005-04-17 02:20:36 +04:00
|
|
|
int d = r10_bio->devs[i].devnum;
|
2011-07-28 05:39:24 +04:00
|
|
|
sector_t first_bad, sector;
|
|
|
|
int bad_sectors;
|
2016-06-02 09:19:52 +03:00
|
|
|
struct md_rdev *rdev;
|
2011-07-28 05:39:24 +04:00
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
if (r10_bio->devs[i].repl_bio)
|
|
|
|
r10_bio->devs[i].repl_bio->bi_end_io = NULL;
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
bio = r10_bio->devs[i].bio;
|
2017-06-03 10:38:06 +03:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev = rcu_dereference(conf->mirrors[d].rdev);
|
|
|
|
if (rdev == NULL || test_bit(Faulty, &rdev->flags)) {
|
|
|
|
rcu_read_unlock();
|
2005-04-17 02:20:36 +04:00
|
|
|
continue;
|
2016-06-02 09:19:52 +03:00
|
|
|
}
|
2011-07-28 05:39:24 +04:00
|
|
|
sector = r10_bio->devs[i].addr;
|
2016-06-02 09:19:52 +03:00
|
|
|
if (is_badblock(rdev, sector, max_sync,
|
2011-07-28 05:39:24 +04:00
|
|
|
&first_bad, &bad_sectors)) {
|
|
|
|
if (first_bad > sector)
|
|
|
|
max_sync = first_bad - sector;
|
|
|
|
else {
|
|
|
|
bad_sectors -= (sector - first_bad);
|
|
|
|
if (max_sync > bad_sectors)
|
2012-10-11 07:20:58 +04:00
|
|
|
max_sync = bad_sectors;
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2011-07-28 05:39:24 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
2005-04-17 02:20:36 +04:00
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
bio->bi_next = biolist;
|
|
|
|
biolist = bio;
|
|
|
|
bio->bi_end_io = end_sync_read;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(bio, REQ_OP_READ, 0);
|
2017-06-13 06:16:08 +03:00
|
|
|
if (test_bit(FailFast, &rdev->flags))
|
2016-11-18 08:16:12 +03:00
|
|
|
bio->bi_opf |= MD_FAILFAST;
|
2016-06-02 09:19:52 +03:00
|
|
|
bio->bi_iter.bi_sector = sector + rdev->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, rdev->bdev);
|
2005-04-17 02:20:36 +04:00
|
|
|
count++;
|
2011-12-23 03:17:55 +04:00
|
|
|
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev = rcu_dereference(conf->mirrors[d].replacement);
|
|
|
|
if (rdev == NULL || test_bit(Faulty, &rdev->flags)) {
|
|
|
|
rcu_read_unlock();
|
2011-12-23 03:17:55 +04:00
|
|
|
continue;
|
2016-06-02 09:19:52 +03:00
|
|
|
}
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
2011-12-23 03:17:55 +04:00
|
|
|
|
|
|
|
/* Need to set up for writing to the replacement */
|
|
|
|
bio = r10_bio->devs[i].repl_bio;
|
2017-06-03 10:38:06 +03:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
2011-12-23 03:17:55 +04:00
|
|
|
|
|
|
|
sector = r10_bio->devs[i].addr;
|
|
|
|
bio->bi_next = biolist;
|
|
|
|
biolist = bio;
|
|
|
|
bio->bi_end_io = end_sync_write;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
|
2017-06-13 06:16:08 +03:00
|
|
|
if (test_bit(FailFast, &rdev->flags))
|
2016-11-18 08:16:12 +03:00
|
|
|
bio->bi_opf |= MD_FAILFAST;
|
2016-06-02 09:19:52 +03:00
|
|
|
bio->bi_iter.bi_sector = sector + rdev->data_offset;
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(bio, rdev->bdev);
|
2011-12-23 03:17:55 +04:00
|
|
|
count++;
|
2017-06-13 06:16:08 +03:00
|
|
|
rcu_read_unlock();
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (count < 2) {
|
|
|
|
for (i = 0; i < conf->copies; i++) {
|
|
|
|
int d = r10_bio->devs[i].devnum;
|
|
|
|
if (r10_bio->devs[i].bio->bi_end_io)
|
2011-05-11 08:54:41 +04:00
|
|
|
rdev_dec_pending(conf->mirrors[d].rdev,
|
|
|
|
mddev);
|
2011-12-23 03:17:55 +04:00
|
|
|
if (r10_bio->devs[i].repl_bio &&
|
|
|
|
r10_bio->devs[i].repl_bio->bi_end_io)
|
|
|
|
rdev_dec_pending(
|
|
|
|
conf->mirrors[d].replacement,
|
|
|
|
mddev);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
put_buf(r10_bio);
|
|
|
|
biolist = NULL;
|
|
|
|
goto giveup;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
nr_sectors = 0;
|
2006-01-06 11:20:16 +03:00
|
|
|
if (sector_nr + max_sync < max_sector)
|
|
|
|
max_sector = sector_nr + max_sync;
|
2005-04-17 02:20:36 +04:00
|
|
|
do {
|
|
|
|
struct page *page;
|
|
|
|
int len = PAGE_SIZE;
|
|
|
|
if (sector_nr + (len>>9) > max_sector)
|
|
|
|
len = (max_sector - sector_nr) << 9;
|
|
|
|
if (len == 0)
|
|
|
|
break;
|
|
|
|
for (bio = biolist; bio; bio = bio->bi_next) {
|
2017-03-16 19:12:33 +03:00
|
|
|
struct resync_pages *rp = get_resync_pages(bio);
|
2017-07-14 11:14:42 +03:00
|
|
|
page = resync_fetch_page(rp, page_idx);
|
2017-03-16 19:12:22 +03:00
|
|
|
/*
|
|
|
|
* won't fail because the vec table is big enough
|
|
|
|
* to hold all these pages
|
|
|
|
*/
|
|
|
|
bio_add_page(bio, page, len, 0);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
nr_sectors += len>>9;
|
|
|
|
sector_nr += len>>9;
|
2017-07-14 11:14:42 +03:00
|
|
|
} while (++page_idx < RESYNC_PAGES);
|
2005-04-17 02:20:36 +04:00
|
|
|
r10_bio->sectors = nr_sectors;
|
|
|
|
|
2017-10-24 10:11:52 +03:00
|
|
|
if (mddev_is_clustered(mddev) &&
|
|
|
|
test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
|
|
|
/* It is resync not recovery */
|
|
|
|
if (conf->cluster_sync_high < sector_nr + nr_sectors) {
|
|
|
|
conf->cluster_sync_low = mddev->curr_resync_completed;
|
|
|
|
raid10_set_cluster_sync_high(conf);
|
|
|
|
/* Send resync message */
|
|
|
|
md_cluster_ops->resync_info_update(mddev,
|
|
|
|
conf->cluster_sync_low,
|
|
|
|
conf->cluster_sync_high);
|
|
|
|
}
|
|
|
|
} else if (mddev_is_clustered(mddev)) {
|
|
|
|
/* This is recovery not resync */
|
|
|
|
sector_t sect_va1, sect_va2;
|
|
|
|
bool broadcast_msg = false;
|
|
|
|
|
|
|
|
for (i = 0; i < conf->geo.raid_disks; i++) {
|
|
|
|
/*
|
|
|
|
* sector_nr is a device address for recovery, so we
|
|
|
|
* need to translate it to an array address before comparing it
|
|
|
|
* with cluster_sync_high.
|
|
|
|
*/
|
|
|
|
sect_va1 = raid10_find_virt(conf, sector_nr, i);
|
|
|
|
|
|
|
|
if (conf->cluster_sync_high < sect_va1 + nr_sectors) {
|
|
|
|
broadcast_msg = true;
|
|
|
|
/*
|
|
|
|
* curr_resync_completed is similar to
|
|
|
|
* sector_nr, so make the translation too.
|
|
|
|
*/
|
|
|
|
sect_va2 = raid10_find_virt(conf,
|
|
|
|
mddev->curr_resync_completed, i);
|
|
|
|
|
|
|
|
if (conf->cluster_sync_low == 0 ||
|
|
|
|
conf->cluster_sync_low > sect_va2)
|
|
|
|
conf->cluster_sync_low = sect_va2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (broadcast_msg) {
|
|
|
|
raid10_set_cluster_sync_high(conf);
|
|
|
|
md_cluster_ops->resync_info_update(mddev,
|
|
|
|
conf->cluster_sync_low,
|
|
|
|
conf->cluster_sync_high);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
while (biolist) {
|
|
|
|
bio = biolist;
|
|
|
|
biolist = biolist->bi_next;
|
|
|
|
|
|
|
|
bio->bi_next = NULL;
|
2017-03-16 19:12:33 +03:00
|
|
|
r10_bio = get_resync_r10bio(bio);
|
2005-04-17 02:20:36 +04:00
|
|
|
r10_bio->sectors = nr_sectors;
|
|
|
|
|
|
|
|
if (bio->bi_end_io == end_sync_read) {
|
2017-08-23 20:10:32 +03:00
|
|
|
md_sync_acct_bio(bio, nr_sectors);
|
2017-06-03 10:38:06 +03:00
|
|
|
bio->bi_status = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
generic_make_request(bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-06-22 04:17:13 +04:00
|
|
|
if (sectors_skipped)
|
|
|
|
/* pretend they weren't skipped, it makes
|
|
|
|
* no important difference in this case
|
|
|
|
*/
|
|
|
|
md_done_sync(mddev, sectors_skipped, 1);
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return sectors_skipped + nr_sectors;
|
|
|
|
giveup:
|
|
|
|
/* There is nowhere to write, so all non-sync
|
2011-07-28 05:39:24 +04:00
|
|
|
* drives must be failed or in resync, or all drives
|
|
|
|
* have a bad block, so try the next chunk...
|
2005-04-17 02:20:36 +04:00
|
|
|
*/
|
2009-02-25 05:18:47 +03:00
|
|
|
if (sector_nr + max_sync < max_sector)
|
|
|
|
max_sector = sector_nr + max_sync;
|
|
|
|
|
|
|
|
sectors_skipped += (max_sector - sector_nr);
|
2005-04-17 02:20:36 +04:00
|
|
|
chunks_skipped++;
|
|
|
|
sector_nr = max_sector;
|
|
|
|
goto skipped;
|
|
|
|
}
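The do/while loop near the end of raid10_sync_request() hands out the sync window one page at a time and shortens the last piece so it does not run past max_sector. A minimal userspace sketch of just that arithmetic follows; fill_sync_window() is a hypothetical name, and the page size and page count are passed in as parameters rather than taken from PAGE_SIZE and RESYNC_PAGES.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many 512-byte sectors one sync request would cover when it
 * starts at sector_nr, must stop at max_sector, and may use at most
 * resync_pages pages of page_size bytes each.
 */
static uint64_t fill_sync_window(uint64_t sector_nr, uint64_t max_sector,
				 unsigned int page_size,
				 unsigned int resync_pages)
{
	uint64_t nr_sectors = 0;
	unsigned int page_idx = 0;

	assert(page_size >= 512 && resync_pages > 0);
	do {
		uint64_t len = page_size;

		/* Trim the final piece to end exactly at max_sector. */
		if (sector_nr + (len >> 9) > max_sector)
			len = (max_sector - sector_nr) << 9;
		if (len == 0)
			break;
		nr_sectors += len >> 9;
		sector_nr += len >> 9;
	} while (++page_idx < resync_pages);

	return nr_sectors;
}
```

With 4096-byte pages and 16 pages, a request can cover at most 128 sectors; a window of 20 sectors is filled as 8 + 8 + 4.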
|
|
|
|
|
2009-03-18 04:10:40 +03:00
|
|
|
static sector_t
|
2011-10-11 09:47:53 +04:00
|
|
|
raid10_size(struct mddev *mddev, sector_t sectors, int raid_disks)
|
2009-03-18 04:10:40 +03:00
|
|
|
{
|
|
|
|
sector_t size;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2009-03-18 04:10:40 +03:00
|
|
|
|
|
|
|
if (!raid_disks)
|
2012-05-22 07:53:47 +04:00
|
|
|
raid_disks = min(conf->geo.raid_disks,
|
|
|
|
conf->prev.raid_disks);
|
2009-03-18 04:10:40 +03:00
|
|
|
if (!sectors)
|
2010-03-08 08:02:45 +03:00
|
|
|
sectors = conf->dev_sectors;
|
2009-03-18 04:10:40 +03:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
size = sectors >> conf->geo.chunk_shift;
|
|
|
|
sector_div(size, conf->geo.far_copies);
|
2009-03-18 04:10:40 +03:00
|
|
|
size = size * raid_disks;
|
2012-05-21 03:28:20 +04:00
|
|
|
sector_div(size, conf->geo.near_copies);
|
2009-03-18 04:10:40 +03:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
return size << conf->geo.chunk_shift;
|
2009-03-18 04:10:40 +03:00
|
|
|
}
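The arithmetic in raid10_size() can be tried out in userspace. The sketch below mirrors the same sequence of shifts, divisions and multiplications with plain integer operators in place of sector_div(); raid10_array_sectors() is a hypothetical name and the geometry is passed in as explicit parameters instead of being read from conf->geo.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Usable array size in sectors, given the per-device size and geometry.
 * Works in whole chunks, exactly like the kernel function above.
 */
static uint64_t raid10_array_sectors(uint64_t dev_sectors,
				     unsigned int chunk_shift,
				     unsigned int raid_disks,
				     unsigned int near_copies,
				     unsigned int far_copies)
{
	uint64_t size;

	assert(near_copies > 0 && far_copies > 0);
	size = dev_sectors >> chunk_shift;	/* chunks per device */
	size /= far_copies;			/* each far copy takes a share */
	size *= raid_disks;			/* chunks across all devices */
	size /= near_copies;			/* near copies hold the same data */
	return size << chunk_shift;		/* back to sectors */
}
```

For example, four devices of 2048000 sectors with 1024-sector chunks and two near copies give 4096000 sectors, which is half of the raw capacity.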
|
|
|
|
|
2012-05-17 04:08:45 +04:00
|
|
|
static void calc_sectors(struct r10conf *conf, sector_t size)
|
|
|
|
{
|
|
|
|
/* Calculate the number of sectors-per-device that will
|
|
|
|
* actually be used, and set conf->dev_sectors and
|
|
|
|
* conf->stride
|
|
|
|
*/
|
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
size = size >> conf->geo.chunk_shift;
|
|
|
|
sector_div(size, conf->geo.far_copies);
|
|
|
|
size = size * conf->geo.raid_disks;
|
|
|
|
sector_div(size, conf->geo.near_copies);
|
2012-05-17 04:08:45 +04:00
|
|
|
/* 'size' is now the number of chunks in the array */
|
|
|
|
/* calculate "used chunks per device" */
|
|
|
|
size = size * conf->copies;
|
|
|
|
|
|
|
|
/* We need to round up when dividing by raid_disks to
|
|
|
|
* get the stride size.
|
|
|
|
*/
|
2012-05-21 03:28:20 +04:00
|
|
|
size = DIV_ROUND_UP_SECTOR_T(size, conf->geo.raid_disks);
|
2012-05-17 04:08:45 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
conf->dev_sectors = size << conf->geo.chunk_shift;
|
2012-05-17 04:08:45 +04:00
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
if (conf->geo.far_offset)
|
|
|
|
conf->geo.stride = 1 << conf->geo.chunk_shift;
|
2012-05-17 04:08:45 +04:00
|
|
|
else {
|
2012-05-21 03:28:20 +04:00
|
|
|
sector_div(size, conf->geo.far_copies);
|
|
|
|
conf->geo.stride = size << conf->geo.chunk_shift;
|
2012-05-17 04:08:45 +04:00
|
|
|
}
|
|
|
|
}
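A userspace sketch of calc_sectors() may help when checking the rounding by hand. struct demo_geo and calc_dev_geometry() are hypothetical names; the number of copies is assumed to be near_copies * far_copies, which is what setup_geo() returns, and DIV_ROUND_UP_SECTOR_T is written out as an ordinary round-up division.

```c
#include <assert.h>
#include <stdint.h>

struct demo_geo {
	uint64_t dev_sectors;	/* sectors actually used on each device */
	uint64_t stride;	/* distance between far copies, in sectors */
};

static struct demo_geo calc_dev_geometry(uint64_t size,
					 unsigned int chunk_shift,
					 unsigned int raid_disks,
					 unsigned int near_copies,
					 unsigned int far_copies,
					 int far_offset)
{
	struct demo_geo geo;
	unsigned int copies = near_copies * far_copies;

	assert(raid_disks > 0 && near_copies > 0 && far_copies > 0);
	size >>= chunk_shift;
	size /= far_copies;
	size *= raid_disks;
	size /= near_copies;		/* chunks in the array */
	size *= copies;			/* chunks stored in total */
	/* Round up when spreading the chunks over the devices. */
	size = (size + raid_disks - 1) / raid_disks;
	geo.dev_sectors = size << chunk_shift;

	if (far_offset) {
		geo.stride = (uint64_t)1 << chunk_shift;
	} else {
		size /= far_copies;
		geo.stride = size << chunk_shift;
	}
	return geo;
}
```

With 1000 chunks of 1024 sectors per device, four devices and two far copies, the stride comes out as half of dev_sectors; with far_offset set, it is a single chunk.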
|
2010-03-08 08:02:45 +03:00
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
enum geo_type {geo_new, geo_old, geo_start};
|
|
|
|
static int setup_geo(struct geom *geo, struct mddev *mddev, enum geo_type new)
|
|
|
|
{
|
|
|
|
int nc, fc, fo;
|
|
|
|
int layout, chunk, disks;
|
|
|
|
switch (new) {
|
|
|
|
case geo_old:
|
|
|
|
layout = mddev->layout;
|
|
|
|
chunk = mddev->chunk_sectors;
|
|
|
|
disks = mddev->raid_disks - mddev->delta_disks;
|
|
|
|
break;
|
|
|
|
case geo_new:
|
|
|
|
layout = mddev->new_layout;
|
|
|
|
chunk = mddev->new_chunk_sectors;
|
|
|
|
disks = mddev->raid_disks;
|
|
|
|
break;
|
|
|
|
default: /* avoid 'may be unused' warnings */
|
|
|
|
case geo_start: /* new when starting reshape - raid_disks not
|
|
|
|
* updated yet. */
|
|
|
|
layout = mddev->new_layout;
|
|
|
|
chunk = mddev->new_chunk_sectors;
|
|
|
|
disks = mddev->raid_disks + mddev->delta_disks;
|
|
|
|
break;
|
|
|
|
}
|
2015-10-22 05:20:15 +03:00
|
|
|
if (layout >> 19)
|
2012-05-21 03:28:33 +04:00
|
|
|
return -1;
|
|
|
|
if (chunk < (PAGE_SIZE >> 9) ||
|
|
|
|
!is_power_of_2(chunk))
|
|
|
|
return -2;
|
|
|
|
nc = layout & 255;
|
|
|
|
fc = (layout >> 8) & 255;
|
|
|
|
fo = layout & (1<<16);
|
|
|
|
geo->raid_disks = disks;
|
|
|
|
geo->near_copies = nc;
|
|
|
|
geo->far_copies = fc;
|
|
|
|
geo->far_offset = fo;
|
2015-10-22 05:20:15 +03:00
|
|
|
switch (layout >> 17) {
|
|
|
|
case 0: /* original layout. simple but not always optimal */
|
|
|
|
geo->far_set_size = disks;
|
|
|
|
break;
|
|
|
|
case 1: /* "improved" layout which was buggy. Hopefully no-one is
|
|
|
|
* actually using this, but leave code here just in case.*/
|
|
|
|
geo->far_set_size = disks/fc;
|
|
|
|
WARN(geo->far_set_size < fc,
|
|
|
|
"This RAID10 layout does not provide data safety - please backup and create new array\n");
|
|
|
|
break;
|
|
|
|
case 2: /* "improved" layout fixed to match documentation */
|
|
|
|
geo->far_set_size = fc * nc;
|
|
|
|
break;
|
|
|
|
default: /* Not a valid layout */
|
|
|
|
return -1;
|
|
|
|
}
|
2012-05-21 03:28:33 +04:00
|
|
|
geo->chunk_mask = chunk - 1;
|
|
|
|
geo->chunk_shift = ffz(~chunk);
|
|
|
|
return nc*fc;
|
|
|
|
}
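The bit layout that setup_geo() decodes can be sketched as a small standalone decoder. struct demo_layout and decode_layout() are hypothetical names. Unlike the kernel code, which keeps far_offset as the raw masked bit, this sketch normalises it to 0 or 1, and it leaves out the chunk-size checks.

```c
#include <assert.h>
#include <stddef.h>

struct demo_layout {
	int near_copies;	/* bits 0-7 */
	int far_copies;		/* bits 8-15 */
	int far_offset;		/* bit 16, normalised to 0 or 1 */
	int variant;		/* bits 17-18: far-set layout variant */
};

/* Returns the total number of copies, or -1 for an invalid layout word. */
static int decode_layout(int layout, struct demo_layout *out)
{
	assert(out != NULL);
	if (layout >> 19)
		return -1;	/* unknown high bits are set */

	out->near_copies = layout & 255;
	out->far_copies = (layout >> 8) & 255;
	out->far_offset = (layout >> 16) & 1;
	out->variant = layout >> 17;
	if (out->variant > 2)
		return -1;	/* only variants 0, 1 and 2 are defined */

	return out->near_copies * out->far_copies;
}
```

For instance, 0x102 decodes to two near copies and one far copy, while 0x10201 decodes to one near copy and two far copies in offset mode.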
|
|
|
|
|
2011-10-11 09:49:02 +04:00
|
|
|
static struct r10conf *setup_conf(struct mddev *mddev)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = NULL;
|
2010-03-08 08:02:45 +03:00
|
|
|
int err = -EINVAL;
|
2012-05-21 03:28:33 +04:00
|
|
|
struct geom geo;
|
|
|
|
int copies;
|
|
|
|
|
|
|
|
copies = setup_geo(&geo, mddev, geo_new);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
if (copies == -2) {
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: chunk size must be at least PAGE_SIZE(%ld) and be a power of 2.\n",
|
|
|
|
mdname(mddev), PAGE_SIZE);
|
2010-03-08 08:02:45 +03:00
|
|
|
goto out;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2006-01-06 11:20:36 +03:00
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
if (copies < 2 || copies > mddev->raid_disks) {
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: unsupported raid10 layout: 0x%8x\n",
|
|
|
|
mdname(mddev), mddev->new_layout);
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
|
|
|
}
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
err = -ENOMEM;
|
2011-10-11 09:49:02 +04:00
|
|
|
conf = kzalloc(sizeof(struct r10conf), GFP_KERNEL);
|
2010-03-08 08:02:45 +03:00
|
|
|
if (!conf)
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out;
|
2010-03-08 08:02:45 +03:00
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
/* FIXME calc properly */
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kcalloc(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
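The point of moving from an open-coded product to a two-argument allocator is that the allocator can reject a count * size multiplication that would overflow. A userspace sketch of that check follows; zalloc_array() is a hypothetical helper built on the standard calloc(), not a kernel API, and it only illustrates the overflow test.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate a zeroed array of n elements of the given size.
 * Returns NULL if n * size cannot be represented in a size_t,
 * instead of silently allocating a too-small buffer.
 */
static void *zalloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the product would overflow */
	return calloc(n, size);
}
```

A caller that wrote malloc(n * size) would get a short buffer when the product wraps around; with the check in one place, every call site gets the protection.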
2018-06-13 00:03:40 +03:00
|
|
|
conf->mirrors = kcalloc(mddev->raid_disks + max(0, -mddev->delta_disks),
|
|
|
|
sizeof(struct raid10_info),
|
2010-03-08 08:02:45 +03:00
|
|
|
GFP_KERNEL);
|
|
|
|
if (!conf->mirrors)
|
|
|
|
goto out;
|
2006-01-06 11:20:28 +03:00
|
|
|
|
|
|
|
conf->tmppage = alloc_page(GFP_KERNEL);
|
|
|
|
if (!conf->tmppage)
|
2010-03-08 08:02:45 +03:00
|
|
|
goto out;
|
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
conf->geo = geo;
|
|
|
|
conf->copies = copies;
|
2019-06-15 01:41:04 +03:00
|
|
|
err = mempool_init(&conf->r10bio_pool, NR_RAID_BIOS, r10bio_pool_alloc,
|
2019-06-15 01:41:10 +03:00
|
|
|
rbio_pool_free, conf);
|
2018-05-21 01:25:52 +03:00
|
|
|
if (err)
|
2010-03-08 08:02:45 +03:00
|
|
|
goto out;
|
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
err = bioset_init(&conf->bio_split, BIO_POOL_SIZE, 0, 0);
|
|
|
|
if (err)
|
2017-04-05 07:05:51 +03:00
|
|
|
goto out;
|
|
|
|
|
2012-05-17 04:08:45 +04:00
|
|
|
calc_sectors(conf, mddev->dev_sectors);
|
2012-05-22 07:53:47 +04:00
|
|
|
if (mddev->reshape_position == MaxSector) {
|
|
|
|
conf->prev = conf->geo;
|
|
|
|
conf->reshape_progress = MaxSector;
|
|
|
|
} else {
|
|
|
|
if (setup_geo(&conf->prev, mddev, geo_old) != conf->copies) {
|
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
conf->reshape_progress = mddev->reshape_position;
|
|
|
|
if (conf->prev.far_offset)
|
|
|
|
conf->prev.stride = 1 << conf->prev.chunk_shift;
|
|
|
|
else
|
|
|
|
/* far_copies must be 1 */
|
|
|
|
conf->prev.stride = conf->dev_sectors;
|
|
|
|
}
|
2015-07-06 10:37:49 +03:00
|
|
|
conf->reshape_safe = conf->reshape_progress;
|
2008-05-15 03:05:54 +04:00
|
|
|
spin_lock_init(&conf->device_lock);
|
2010-03-08 08:02:45 +03:00
|
|
|
INIT_LIST_HEAD(&conf->retry_list);
|
2015-08-14 04:26:17 +03:00
|
|
|
INIT_LIST_HEAD(&conf->bio_end_io_list);
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
spin_lock_init(&conf->resync_lock);
|
|
|
|
init_waitqueue_head(&conf->wait_barrier);
|
2016-06-24 15:20:16 +03:00
|
|
|
atomic_set(&conf->nr_pending, 0);
|
2010-03-08 08:02:45 +03:00
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
err = -ENOMEM;
|
2012-07-03 09:56:52 +04:00
|
|
|
conf->thread = md_register_thread(raid10d, mddev, "raid10");
|
2010-03-08 08:02:45 +03:00
|
|
|
if (!conf->thread)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
conf->mddev = mddev;
|
|
|
|
return conf;
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (conf) {
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_exit(&conf->r10bio_pool);
|
2010-03-08 08:02:45 +03:00
|
|
|
kfree(conf->mirrors);
|
|
|
|
safe_put_page(conf->tmppage);
|
2018-05-21 01:25:52 +03:00
|
|
|
bioset_exit(&conf->bio_split);
|
2010-03-08 08:02:45 +03:00
|
|
|
kfree(conf);
|
|
|
|
}
|
|
|
|
return ERR_PTR(err);
|
|
|
|
}
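setup_conf() uses a single "out" label to undo whatever was allocated before a failure, which works because the structure starts out zeroed and the cleanup calls tolerate unset members. A minimal userspace sketch of the same pattern follows. struct demo_conf, demo_setup() and demo_free() are hypothetical, and the fail_step parameter exists only to inject failures for demonstration.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_conf {
	int *mirrors;
	char *tmppage;
};

/* Safe on a partially built object: free(NULL) is a no-op. */
static void demo_free(struct demo_conf *conf)
{
	if (!conf)
		return;
	free(conf->mirrors);
	free(conf->tmppage);
	free(conf);
}

/* fail_step: 0 = no injected failure, 1..3 = pretend that step failed. */
static struct demo_conf *demo_setup(size_t disks, int fail_step, int *err)
{
	struct demo_conf *conf = NULL;

	*err = -EINVAL;
	if (disks < 2)
		goto out;

	*err = -ENOMEM;
	conf = fail_step == 1 ? NULL : calloc(1, sizeof(*conf));
	if (!conf)
		goto out;
	conf->mirrors = fail_step == 2 ? NULL :
			calloc(disks, sizeof(*conf->mirrors));
	if (!conf->mirrors)
		goto out;
	conf->tmppage = fail_step == 3 ? NULL : calloc(1, 4096);
	if (!conf->tmppage)
		goto out;

	*err = 0;
	return conf;

out:
	demo_free(conf);	/* one exit path releases everything built so far */
	return NULL;
}
```

The error code is set before each group of steps, so the shared exit path reports the right reason without any per-step bookkeeping.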
|
|
|
|
|
2016-01-21 00:52:20 +03:00
|
|
|
static int raid10_run(struct mddev *mddev)
|
2010-03-08 08:02:45 +03:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf;
|
2010-03-08 08:02:45 +03:00
|
|
|
int i, disk_idx, chunk_size;
|
2012-07-31 04:03:52 +04:00
|
|
|
struct raid10_info *disk;
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev;
|
2010-03-08 08:02:45 +03:00
|
|
|
sector_t size;
|
2012-05-22 07:53:47 +04:00
|
|
|
sector_t min_offset_diff = 0;
|
|
|
|
int first = 1;
|
2012-10-11 06:30:52 +04:00
|
|
|
bool discard_supported = false;
|
2010-03-08 08:02:45 +03:00
|
|
|
|
2017-06-05 09:05:13 +03:00
|
|
|
if (mddev_init_writes_pending(mddev) < 0)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-03-08 08:02:45 +03:00
|
|
|
if (mddev->private == NULL) {
|
|
|
|
conf = setup_conf(mddev);
|
|
|
|
if (IS_ERR(conf))
|
|
|
|
return PTR_ERR(conf);
|
|
|
|
mddev->private = conf;
|
|
|
|
}
|
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf)
|
|
|
|
goto out;
|
|
|
|
|
2017-10-24 10:11:52 +03:00
|
|
|
if (mddev_is_clustered(conf->mddev)) {
|
|
|
|
int fc, fo;
|
|
|
|
|
|
|
|
fc = (mddev->layout >> 8) & 255;
|
|
|
|
fo = mddev->layout & (1<<16);
|
|
|
|
if (fc > 1 || fo > 0) {
|
|
|
|
pr_err("only near layout is supported by clustered"
|
|
|
|
" raid10\n");
|
2018-01-23 18:06:12 +03:00
|
|
|
goto out_free_conf;
|
2017-10-24 10:11:52 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-03-08 08:02:45 +03:00
|
|
|
mddev->thread = conf->thread;
|
|
|
|
conf->thread = NULL;
|
|
|
|
|
2009-07-01 05:13:45 +04:00
|
|
|
chunk_size = mddev->chunk_sectors << 9;
|
2012-07-31 04:03:53 +04:00
|
|
|
if (mddev->queue) {
|
2012-10-11 06:30:52 +04:00
|
|
|
blk_queue_max_discard_sectors(mddev->queue,
|
|
|
|
mddev->chunk_sectors);
|
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drives behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-12 18:37:43 +04:00
|
|
|
blk_queue_max_write_same_sectors(mddev->queue, 0);
|
2017-04-05 20:21:03 +03:00
|
|
|
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
|
2012-07-31 04:03:53 +04:00
|
|
|
blk_queue_io_min(mddev->queue, chunk_size);
|
|
|
|
if (conf->geo.raid_disks % conf->geo.near_copies)
|
|
|
|
blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
|
|
|
|
else
|
|
|
|
blk_queue_io_opt(mddev->queue, chunk_size *
|
|
|
|
(conf->geo.raid_disks / conf->geo.near_copies));
|
|
|
|
}
|
2009-07-01 05:13:45 +04:00
|
|
|
|
2012-03-19 05:46:39 +04:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2012-05-22 07:53:47 +04:00
|
|
|
long long diff;
|
2011-07-28 05:31:47 +04:00
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
disk_idx = rdev->raid_disk;
|
2012-05-21 03:28:33 +04:00
|
|
|
if (disk_idx < 0)
|
|
|
|
continue;
|
|
|
|
if (disk_idx >= conf->geo.raid_disks &&
|
|
|
|
disk_idx >= conf->prev.raid_disks)
|
2005-04-17 02:20:36 +04:00
|
|
|
continue;
|
|
|
|
disk = conf->mirrors + disk_idx;
|
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
if (test_bit(Replacement, &rdev->flags)) {
|
|
|
|
if (disk->replacement)
|
|
|
|
goto out_free_conf;
|
|
|
|
disk->replacement = rdev;
|
|
|
|
} else {
|
|
|
|
if (disk->rdev)
|
|
|
|
goto out_free_conf;
|
|
|
|
disk->rdev = rdev;
|
|
|
|
}
|
2012-05-22 07:53:47 +04:00
|
|
|
diff = (rdev->new_data_offset - rdev->data_offset);
|
|
|
|
if (!mddev->reshape_backwards)
|
|
|
|
diff = -diff;
|
|
|
|
if (diff < 0)
|
|
|
|
diff = 0;
|
|
|
|
if (first || diff < min_offset_diff)
|
|
|
|
min_offset_diff = diff;
|
2011-12-23 03:17:55 +04:00
|
|
|
|
2012-07-31 04:03:53 +04:00
|
|
|
if (mddev->gendisk)
|
|
|
|
disk_stack_limits(mddev->gendisk, rdev->bdev,
|
|
|
|
rdev->data_offset << 9);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
disk->head_position = 0;
|
2012-10-11 06:30:52 +04:00
|
|
|
|
|
|
|
if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
|
|
|
|
discard_supported = true;
|
2017-04-06 04:12:18 +03:00
|
|
|
first = 0;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2012-05-22 07:53:47 +04:00
|
|
|
|
2012-10-31 04:42:30 +04:00
|
|
|
if (mddev->queue) {
|
|
|
|
if (discard_supported)
|
2018-03-08 04:10:10 +03:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DISCARD,
|
2012-10-31 04:42:30 +04:00
|
|
|
mddev->queue);
|
|
|
|
else
|
2018-03-08 04:10:10 +03:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_DISCARD,
|
2012-10-31 04:42:30 +04:00
|
|
|
mddev->queue);
|
|
|
|
}
|
2005-09-10 03:24:03 +04:00
|
|
|
/* need to check that every block has at least one working mirror */
|
2011-07-27 05:00:36 +04:00
|
|
|
if (!enough(conf, -1)) {
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_err("md/raid10:%s: not enough operational mirrors.\n",
|
2005-09-10 03:24:03 +04:00
|
|
|
mdname(mddev));
|
2005-04-17 02:20:36 +04:00
|
|
|
goto out_free_conf;
|
|
|
|
}
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
if (conf->reshape_progress != MaxSector) {
|
|
|
|
/* must ensure that shape change is supported */
|
|
|
|
if (conf->geo.far_copies != 1 &&
|
|
|
|
conf->geo.far_offset == 0)
|
|
|
|
goto out_free_conf;
|
|
|
|
if (conf->prev.far_copies != 1 &&
|
2013-07-02 09:58:05 +04:00
|
|
|
conf->prev.far_offset == 0)
|
2012-05-22 07:53:47 +04:00
|
|
|
goto out_free_conf;
|
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
mddev->degraded = 0;
|
2012-05-21 03:28:33 +04:00
|
|
|
for (i = 0;
|
|
|
|
i < conf->geo.raid_disks
|
|
|
|
|| i < conf->prev.raid_disks;
|
|
|
|
i++) {
|
2005-04-17 02:20:36 +04:00
|
|
|
|
|
|
|
disk = conf->mirrors + i;
|
|
|
|
|
2011-12-23 03:17:55 +04:00
|
|
|
if (!disk->rdev && disk->replacement) {
|
|
|
|
/* The replacement is all we have - use it */
|
|
|
|
disk->rdev = disk->replacement;
|
|
|
|
disk->replacement = NULL;
|
|
|
|
clear_bit(Replacement, &disk->rdev->flags);
|
|
|
|
}
|
|
|
|
|
2006-06-26 11:27:40 +04:00
|
|
|
if (!disk->rdev ||
|
2006-10-21 21:24:07 +04:00
|
|
|
!test_bit(In_sync, &disk->rdev->flags)) {
|
2005-04-17 02:20:36 +04:00
|
|
|
disk->head_position = 0;
|
|
|
|
mddev->degraded++;
|
2014-01-14 09:30:10 +04:00
|
|
|
if (disk->rdev &&
|
|
|
|
disk->rdev->saved_raid_disk < 0)
|
Ensure interrupted recovery completed properly (v1 metadata plus bitmap)
If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.
Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keeps track of how in sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recover all blocks.
This fix is already in place for raid1, but not raid5/6 or raid10.
So without this fix, a raid10 or raid4/5/6 array with version 1.x
metadata and a write-intent bitmap, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.
If you might have an array like that, issuing
echo repair > /sys/block/mdXX/md/sync_action
will make sure recovery completes properly.
Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 02:30:52 +04:00
|
|
|
conf->fullsync = 1;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
2018-06-28 13:40:11 +03:00
|
|
|
|
|
|
|
if (disk->replacement &&
|
|
|
|
!test_bit(In_sync, &disk->replacement->flags) &&
|
|
|
|
disk->replacement->saved_raid_disk < 0) {
|
|
|
|
conf->fullsync = 1;
|
|
|
|
}
|
|
|
|
|
2011-10-26 04:54:39 +04:00
|
|
|
disk->recovery_disabled = mddev->recovery_disabled - 1;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2009-06-18 02:48:06 +04:00
|
|
|
if (mddev->recovery_cp != MaxSector)
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_notice("md/raid10:%s: not clean -- starting background reconstruction\n",
|
|
|
|
mdname(mddev));
|
|
|
|
pr_info("md/raid10:%s: active with %d out of %d devices\n",
|
2012-05-21 03:28:20 +04:00
|
|
|
mdname(mddev), conf->geo.raid_disks - mddev->degraded,
|
|
|
|
conf->geo.raid_disks);
|
2005-04-17 02:20:36 +04:00
|
|
|
/*
|
|
|
|
* Ok, everything is just fine now
|
|
|
|
*/
|
2010-03-08 08:02:45 +03:00
|
|
|
mddev->dev_sectors = conf->dev_sectors;
|
|
|
|
size = raid10_size(mddev, 0, 0);
|
|
|
|
md_set_array_sectors(mddev, size);
|
|
|
|
mddev->resync_max_sectors = size;
|
2016-11-18 08:16:11 +03:00
|
|
|
set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-07-31 04:03:53 +04:00
|
|
|
if (mddev->queue) {
|
2012-05-21 03:28:20 +04:00
|
|
|
int stripe = conf->geo.raid_disks *
|
2009-06-18 02:45:01 +04:00
|
|
|
((mddev->chunk_sectors << 9) / PAGE_SIZE);
|
2012-07-31 04:03:53 +04:00
|
|
|
|
|
|
|
/* Calculate max read-ahead size.
|
|
|
|
* We need to readahead at least twice a whole stripe....
|
|
|
|
* maybe...
|
|
|
|
*/
|
2012-05-21 03:28:20 +04:00
|
|
|
stripe /= conf->geo.near_copies;
|
2017-02-02 17:56:50 +03:00
|
|
|
if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
|
|
|
|
mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
2011-03-17 13:11:05 +03:00
|
|
|
if (md_integrity_register(mddev))
|
|
|
|
goto out_free_conf;
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
if (conf->reshape_progress != MaxSector) {
|
|
|
|
unsigned long before_length, after_length;
|
|
|
|
|
|
|
|
before_length = ((1 << conf->prev.chunk_shift) *
|
|
|
|
conf->prev.far_copies);
|
|
|
|
after_length = ((1 << conf->geo.chunk_shift) *
|
|
|
|
conf->geo.far_copies);
|
|
|
|
|
|
|
|
if (max(before_length, after_length) > min_offset_diff) {
|
|
|
|
/* This cannot work */
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10: offset difference not enough to continue reshape\n");
|
2012-05-22 07:53:47 +04:00
|
|
|
goto out_free_conf;
|
|
|
|
}
|
|
|
|
conf->offset_diff = min_offset_diff;
|
|
|
|
|
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
mddev->sync_thread = md_register_thread(md_do_sync, mddev,
|
|
|
|
"reshape");
|
2019-03-05 01:48:54 +03:00
|
|
|
if (!mddev->sync_thread)
|
|
|
|
goto out_free_conf;
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
|
2005-04-17 02:20:36 +04:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_free_conf:
|
2011-09-21 09:30:20 +04:00
|
|
|
md_unregister_thread(&mddev->thread);
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_exit(&conf->r10bio_pool);
|
2006-01-06 11:20:40 +03:00
|
|
|
safe_put_page(conf->tmppage);
|
2005-06-22 04:17:30 +04:00
|
|
|
kfree(conf->mirrors);
|
2005-04-17 02:20:36 +04:00
|
|
|
kfree(conf);
|
|
|
|
mddev->private = NULL;
|
|
|
|
out:
|
|
|
|
return -EIO;
|
|
|
|
}
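The queue hints that raid10_run() derives from the geometry can be worked through in userspace. The following is a minimal sketch, assuming 512-byte sectors (the `<< 9` shift above) and a caller-supplied page size. The function names `r10_io_opt_bytes` and `r10_min_ra_pages` are hypothetical and are not part of the driver.

```c
#include <assert.h>

/* Optimal I/O size in bytes, mirroring the two blk_queue_io_opt() branches. */
static unsigned int r10_io_opt_bytes(unsigned int chunk_sectors,
				     int raid_disks, int near_copies)
{
	unsigned int chunk_size = chunk_sectors << 9; /* sectors to bytes */

	assert(near_copies > 0);
	if (raid_disks % near_copies)
		/* copies do not divide the disks evenly: use every disk */
		return chunk_size * raid_disks;
	return chunk_size * (raid_disks / near_copies);
}

/* Read-ahead floor in pages: twice a whole stripe, as in the comment above. */
static unsigned long r10_min_ra_pages(unsigned int chunk_sectors,
				      int raid_disks, int near_copies,
				      unsigned long page_size)
{
	unsigned long stripe;

	assert(near_copies > 0 && page_size > 0);
	stripe = raid_disks * ((chunk_sectors << 9) / page_size);
	stripe /= near_copies;
	return 2 * stripe;
}
```

With 128-sector (64 KiB) chunks, four disks and two near copies, this gives an optimal I/O size of 128 KiB. The driver only raises ra_pages when the current value is below the computed floor.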
|
|
|
|
|
2014-12-15 04:56:58 +03:00
|
|
|
static void raid10_free(struct mddev *mddev, void *priv)
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
2014-12-15 04:56:58 +03:00
|
|
|
struct r10conf *conf = priv;
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_exit(&conf->r10bio_pool);
|
2013-04-24 05:42:44 +04:00
|
|
|
safe_put_page(conf->tmppage);
|
2005-06-22 04:17:30 +04:00
|
|
|
kfree(conf->mirrors);
|
2014-08-23 14:19:26 +04:00
|
|
|
kfree(conf->mirrors_old);
|
|
|
|
kfree(conf->mirrors_new);
|
2018-05-21 01:25:52 +03:00
|
|
|
bioset_exit(&conf->bio_split);
|
2005-04-17 02:20:36 +04:00
|
|
|
kfree(conf);
|
|
|
|
}
|
|
|
|
|
2017-10-19 04:49:15 +03:00
|
|
|
static void raid10_quiesce(struct mddev *mddev, int quiesce)
|
2006-01-06 11:20:16 +03:00
|
|
|
{
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf = mddev->private;
|
2006-01-06 11:20:16 +03:00
|
|
|
|
2017-10-19 04:49:15 +03:00
|
|
|
if (quiesce)
|
2006-01-06 11:20:16 +03:00
|
|
|
raise_barrier(conf, 0);
|
2017-10-19 04:49:15 +03:00
|
|
|
else
|
2006-01-06 11:20:16 +03:00
|
|
|
lower_barrier(conf);
|
|
|
|
}
|
2005-04-17 02:20:36 +04:00
|
|
|
|
2012-03-19 05:46:40 +04:00
|
|
|
static int raid10_resize(struct mddev *mddev, sector_t sectors)
|
|
|
|
{
|
|
|
|
/* Resize of 'far' arrays is not supported.
|
|
|
|
* For 'near' and 'offset' arrays we can set the
|
|
|
|
* number of sectors used to be an appropriate multiple
|
|
|
|
* of the chunk size.
|
|
|
|
* For 'offset', this is far_copies*chunksize.
|
|
|
|
* For 'near' the multiplier is the LCM of
|
|
|
|
* near_copies and raid_disks.
|
|
|
|
* So if far_copies > 1 && !far_offset, fail.
|
|
|
|
* Else find LCM(raid_disks, near_copies)*far_copies and
|
|
|
|
* multiply by chunk_size. Then round to this number.
|
|
|
|
* This is mostly done by raid10_size()
|
|
|
|
*/
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
sector_t oldsize, size;
|
|
|
|
|
2012-05-21 03:28:33 +04:00
|
|
|
if (mddev->reshape_position != MaxSector)
|
|
|
|
return -EBUSY;
|
|
|
|
|
2012-05-21 03:28:20 +04:00
|
|
|
if (conf->geo.far_copies > 1 && !conf->geo.far_offset)
|
2012-03-19 05:46:40 +04:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
oldsize = raid10_size(mddev, 0, 0);
|
|
|
|
size = raid10_size(mddev, sectors, 0);
|
2012-05-22 07:55:27 +04:00
|
|
|
if (mddev->external_size &&
|
|
|
|
mddev->array_sectors > size)
|
2012-03-19 05:46:40 +04:00
|
|
|
return -EINVAL;
|
2012-05-22 07:55:27 +04:00
|
|
|
if (mddev->bitmap) {
|
2018-08-02 01:20:50 +03:00
|
|
|
int ret = md_bitmap_resize(mddev->bitmap, size, 0, 0);
|
2012-05-22 07:55:27 +04:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
md_set_array_sectors(mddev, size);
|
2012-03-19 05:46:40 +04:00
|
|
|
if (sectors > mddev->dev_sectors &&
|
|
|
|
mddev->recovery_cp > oldsize) {
|
|
|
|
mddev->recovery_cp = oldsize;
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
}
|
2012-05-17 04:08:45 +04:00
|
|
|
calc_sectors(conf, sectors);
|
|
|
|
mddev->dev_sectors = conf->dev_sectors;
|
2012-03-19 05:46:40 +04:00
|
|
|
mddev->resync_max_sectors = size;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-02-12 06:09:57 +03:00
|
|
|
static void *raid10_takeover_raid0(struct mddev *mddev, sector_t size, int devs)
|
2010-03-08 08:02:45 +03:00
|
|
|
{
|
2011-10-11 09:45:26 +04:00
|
|
|
struct md_rdev *rdev;
|
2011-10-11 09:49:02 +04:00
|
|
|
struct r10conf *conf;
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
if (mddev->degraded > 0) {
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: Error: degraded raid0!\n",
|
|
|
|
mdname(mddev));
|
2010-03-08 08:02:45 +03:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
2015-02-12 06:09:57 +03:00
|
|
|
sector_div(size, devs);
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
/* Set new parameters */
|
|
|
|
mddev->new_level = 10;
|
|
|
|
/* new layout: far_copies = 1, near_copies = 2 */
|
|
|
|
mddev->new_layout = (1<<8) + 2;
|
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
|
|
|
mddev->delta_disks = mddev->raid_disks;
|
|
|
|
mddev->raid_disks *= 2;
|
|
|
|
/* make sure it will not be marked as dirty */
|
|
|
|
mddev->recovery_cp = MaxSector;
|
2015-02-12 06:09:57 +03:00
|
|
|
mddev->dev_sectors = size;
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
conf = setup_conf(mddev);
|
2011-02-04 16:18:26 +03:00
|
|
|
if (!IS_ERR(conf)) {
|
2012-03-19 05:46:39 +04:00
|
|
|
rdev_for_each(rdev, mddev)
|
2015-02-12 06:09:57 +03:00
|
|
|
if (rdev->raid_disk >= 0) {
|
2010-06-15 12:36:03 +04:00
|
|
|
rdev->new_raid_disk = rdev->raid_disk * 2;
|
2015-02-12 06:09:57 +03:00
|
|
|
rdev->sectors = size;
|
|
|
|
}
|
2011-02-04 16:18:26 +03:00
|
|
|
conf->barrier = 1;
|
|
|
|
}
|
|
|
|
|
2010-03-08 08:02:45 +03:00
|
|
|
return conf;
|
|
|
|
}
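The layout word set by raid10_takeover_raid0(), `(1<<8) + 2`, packs several fields into one integer. The sketch below decodes and encodes it in userspace. It assumes near_copies occupies the low byte, which is consistent with the "far_copies = 1, near_copies = 2" comment and with the `(layout >> 8) & 255` and `layout & (1<<16)` tests in raid10_run(). The struct and function names are hypothetical, and any higher layout bits are ignored here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container; the driver keeps these fields in struct geom. */
struct r10_layout {
	int near_copies; /* bits 0-7 (assumed) */
	int far_copies;  /* bits 8-15 */
	int far_offset;  /* bit 16: non-zero selects 'offset' mode */
};

static struct r10_layout r10_decode_layout(uint32_t layout)
{
	struct r10_layout l;

	l.near_copies = layout & 255;
	l.far_copies = (layout >> 8) & 255;
	l.far_offset = (layout & (1u << 16)) != 0;
	return l;
}

static uint32_t r10_encode_layout(int near_copies, int far_copies,
				  int far_offset)
{
	assert(near_copies > 0 && near_copies < 256);
	assert(far_copies > 0 && far_copies < 256);
	return (uint32_t)near_copies | ((uint32_t)far_copies << 8) |
	       (far_offset ? (1u << 16) : 0);
}
```

Encoding two near copies and one far copy without offset mode reproduces the value used in the takeover path.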
|
|
|
|
|
2011-10-11 09:47:53 +04:00
|
|
|
static void *raid10_takeover(struct mddev *mddev)
|
2010-03-08 08:02:45 +03:00
|
|
|
{
|
2011-10-11 09:48:59 +04:00
|
|
|
struct r0conf *raid0_conf;
|
2010-03-08 08:02:45 +03:00
|
|
|
|
|
|
|
/* raid10 can take over:
|
|
|
|
* raid0 - providing it has only one strip zone
|
|
|
|
*/
|
|
|
|
if (mddev->level == 0) {
|
|
|
|
/* for raid0 takeover only one zone is supported */
|
2011-10-11 09:48:59 +04:00
|
|
|
raid0_conf = mddev->private;
|
|
|
|
if (raid0_conf->nr_strip_zones > 1) {
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: cannot takeover raid 0 with more than one zone.\n",
|
|
|
|
mdname(mddev));
|
2010-03-08 08:02:45 +03:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
2015-02-12 06:09:57 +03:00
|
|
|
return raid10_takeover_raid0(mddev,
|
|
|
|
raid0_conf->strip_zone->zone_end,
|
|
|
|
raid0_conf->strip_zone->nb_dev);
|
2010-03-08 08:02:45 +03:00
|
|
|
}
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
static int raid10_check_reshape(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
/* Called when there is a request to change
|
|
|
|
* - layout (to ->new_layout)
|
|
|
|
* - chunk size (to ->new_chunk_sectors)
|
|
|
|
* - raid_disks (by delta_disks)
|
|
|
|
* or when trying to restart a reshape that was ongoing.
|
|
|
|
*
|
|
|
|
* We need to validate the request and possibly allocate
|
|
|
|
* space if that might be an issue later.
|
|
|
|
*
|
|
|
|
* Currently we reject any reshape of a 'far' mode array,
|
|
|
|
* allow chunk size to change if new is generally acceptable,
|
|
|
|
* allow raid_disks to increase, and allow
|
|
|
|
* a switch between 'near' mode and 'offset' mode.
|
|
|
|
*/
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
struct geom geo;
|
|
|
|
|
|
|
|
if (conf->geo.far_copies != 1 && !conf->geo.far_offset)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (setup_geo(&geo, mddev, geo_start) != conf->copies)
|
|
|
|
/* mustn't change number of copies */
|
|
|
|
return -EINVAL;
|
|
|
|
if (geo.far_copies > 1 && !geo.far_offset)
|
|
|
|
/* Cannot switch to 'far' mode */
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (mddev->array_sectors & geo.chunk_mask)
|
|
|
|
/* not factor of array size */
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (!enough(conf, -1))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
kfree(conf->mirrors_new);
|
|
|
|
conf->mirrors_new = NULL;
|
|
|
|
if (mddev->delta_disks > 0) {
|
|
|
|
/* allocate new 'mirrors' list */
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 00:03:40 +03:00
|
|
|
conf->mirrors_new =
|
|
|
|
kcalloc(mddev->raid_disks + mddev->delta_disks,
|
|
|
|
sizeof(struct raid10_info),
|
|
|
|
GFP_KERNEL);
|
2012-05-22 07:53:47 +04:00
|
|
|
if (!conf->mirrors_new)
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Need to check if array has failed when deciding whether to:
|
|
|
|
* - start an array
|
|
|
|
* - remove non-faulty devices
|
|
|
|
* - add a spare
|
|
|
|
* - allow a reshape
|
|
|
|
* This determination is simple when no reshape is happening.
|
|
|
|
* However if there is a reshape, we need to carefully check
|
|
|
|
* both the before and after sections.
|
|
|
|
* This is because some failed devices may only affect one
|
|
|
|
* of the two sections, and some non-in_sync devices may
|
|
|
|
* be insync in the section most affected by failed devices.
|
|
|
|
*/
|
|
|
|
static int calc_degraded(struct r10conf *conf)
|
|
|
|
{
|
|
|
|
int degraded, degraded2;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
degraded = 0;
|
|
|
|
/* 'prev' section first */
|
|
|
|
for (i = 0; i < conf->prev.raid_disks; i++) {
|
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
|
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags))
|
|
|
|
degraded++;
|
|
|
|
else if (!test_bit(In_sync, &rdev->flags))
|
|
|
|
/* When we can reduce the number of devices in
|
|
|
|
* an array, this might not contribute to
|
|
|
|
* 'degraded'. It does now.
|
|
|
|
*/
|
|
|
|
degraded++;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
if (conf->geo.raid_disks == conf->prev.raid_disks)
|
|
|
|
return degraded;
|
|
|
|
rcu_read_lock();
|
|
|
|
degraded2 = 0;
|
|
|
|
for (i = 0; i < conf->geo.raid_disks; i++) {
|
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
|
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags))
|
|
|
|
degraded2++;
|
|
|
|
else if (!test_bit(In_sync, &rdev->flags)) {
|
|
|
|
/* If reshape is increasing the number of devices,
|
|
|
|
* this section has already been recovered, so
|
|
|
|
* it doesn't contribute to degraded.
|
|
|
|
* else it does.
|
|
|
|
*/
|
|
|
|
if (conf->geo.raid_disks <= conf->prev.raid_disks)
|
|
|
|
degraded2++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
if (degraded2 > degraded)
|
|
|
|
return degraded2;
|
|
|
|
return degraded;
|
|
|
|
}
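The counting rule in calc_degraded() can be exercised outside the kernel. This is a simplified sketch, assuming each slot is reduced to one of four states. The enum and the function name are hypothetical, the RCU locking is omitted, and the slots array must hold at least max(prev_disks, new_disks) entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-slot state standing in for the rdev pointer and flags. */
enum slot_state { SLOT_MISSING, SLOT_FAULTY, SLOT_RECOVERING, SLOT_IN_SYNC };

static int calc_degraded_sketch(const enum slot_state *slots,
				int prev_disks, int new_disks)
{
	int degraded = 0, degraded2 = 0, i;

	assert(slots != NULL);
	/* 'prev' section: anything that is not in sync counts */
	for (i = 0; i < prev_disks; i++)
		if (slots[i] != SLOT_IN_SYNC)
			degraded++;
	if (new_disks == prev_disks)
		return degraded;
	/* 'new' section: a recovering device counts only when not growing */
	for (i = 0; i < new_disks; i++) {
		if (slots[i] == SLOT_MISSING || slots[i] == SLOT_FAULTY)
			degraded2++;
		else if (slots[i] == SLOT_RECOVERING && new_disks <= prev_disks)
			degraded2++;
	}
	return degraded2 > degraded ? degraded2 : degraded;
}
```

The result is the larger of the two counts, which matches the later comment that ->degraded is measured against the larger of the pre and post device numbers.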
|
|
|
|
|
|
|
|
static int raid10_start_reshape(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
/* A 'reshape' has been requested. This commits
|
|
|
|
* the various 'new' fields and sets MD_RECOVERY_RESHAPE
|
|
|
|
* This also checks if there are enough spares and adds them
|
|
|
|
* to the array.
|
|
|
|
* We currently require enough spares to make the final
|
|
|
|
* array non-degraded. We also require that the difference
|
|
|
|
* between old and new data_offset - on each device - is
|
|
|
|
* enough that we never risk over-writing.
|
|
|
|
*/
|
|
|
|
|
|
|
|
unsigned long before_length, after_length;
|
|
|
|
sector_t min_offset_diff = 0;
|
|
|
|
int first = 1;
|
|
|
|
struct geom new;
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
int spares = 0;
|
2012-05-22 07:55:28 +04:00
|
|
|
int ret;
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
if (setup_geo(&new, mddev, geo_start) != conf->copies)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
before_length = ((1 << conf->prev.chunk_shift) *
|
|
|
|
conf->prev.far_copies);
|
|
|
|
after_length = ((1 << conf->geo.chunk_shift) *
|
|
|
|
conf->geo.far_copies);
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (!test_bit(In_sync, &rdev->flags)
|
|
|
|
&& !test_bit(Faulty, &rdev->flags))
|
|
|
|
spares++;
|
|
|
|
if (rdev->raid_disk >= 0) {
|
|
|
|
long long diff = (rdev->new_data_offset
|
|
|
|
- rdev->data_offset);
|
|
|
|
if (!mddev->reshape_backwards)
|
|
|
|
diff = -diff;
|
|
|
|
if (diff < 0)
|
|
|
|
diff = 0;
|
|
|
|
if (first || diff < min_offset_diff)
|
|
|
|
min_offset_diff = diff;
|
2017-05-01 22:15:07 +03:00
|
|
|
first = 0;
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (max(before_length, after_length) > min_offset_diff)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (spares < mddev->delta_disks)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
conf->offset_diff = min_offset_diff;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
if (conf->mirrors_new) {
|
|
|
|
memcpy(conf->mirrors_new, conf->mirrors,
|
2012-07-31 04:03:52 +04:00
|
|
|
sizeof(struct raid10_info)*conf->prev.raid_disks);
|
2012-05-22 07:53:47 +04:00
|
|
|
smp_mb();
|
2014-08-23 14:19:26 +04:00
|
|
|
kfree(conf->mirrors_old);
|
2012-05-22 07:53:47 +04:00
|
|
|
conf->mirrors_old = conf->mirrors;
|
|
|
|
conf->mirrors = conf->mirrors_new;
|
|
|
|
conf->mirrors_new = NULL;
|
|
|
|
}
|
|
|
|
setup_geo(&conf->geo, mddev, geo_start);
|
|
|
|
smp_mb();
|
|
|
|
if (mddev->reshape_backwards) {
|
|
|
|
sector_t size = raid10_size(mddev, 0, 0);
|
|
|
|
if (size < mddev->array_sectors) {
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2016-11-02 06:16:50 +03:00
|
|
|
pr_warn("md/raid10:%s: array size must be reduced before number of disks\n",
|
|
|
|
mdname(mddev));
|
2012-05-22 07:53:47 +04:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
mddev->resync_max_sectors = size;
|
|
|
|
conf->reshape_progress = size;
|
|
|
|
} else
|
|
|
|
conf->reshape_progress = 0;
|
2015-07-06 10:37:49 +03:00
|
|
|
conf->reshape_safe = conf->reshape_progress;
|
2012-05-22 07:53:47 +04:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
|
2012-05-22 07:55:28 +04:00
|
|
|
if (mddev->delta_disks && mddev->bitmap) {
|
2018-10-18 11:37:41 +03:00
|
|
|
struct mdp_superblock_1 *sb = NULL;
|
|
|
|
sector_t oldsize, newsize;
|
|
|
|
|
|
|
|
oldsize = raid10_size(mddev, 0, 0);
|
|
|
|
newsize = raid10_size(mddev, 0, conf->geo.raid_disks);
|
|
|
|
|
|
|
|
if (!mddev_is_clustered(mddev)) {
|
|
|
|
ret = md_bitmap_resize(mddev->bitmap, newsize, 0, 0);
|
|
|
|
if (ret)
|
|
|
|
goto abort;
|
|
|
|
else
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (rdev->raid_disk > -1 &&
|
|
|
|
!test_bit(Faulty, &rdev->flags))
|
|
|
|
sb = page_address(rdev->sb_page);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Some other node is already performing the reshape, so there is no
|
|
|
|
* need to call md_bitmap_resize again: it is called when the
|
|
|
|
* BITMAP_RESIZE message is received.
|
|
|
|
*/
|
|
|
|
if ((sb && (le32_to_cpu(sb->feature_map) &
|
|
|
|
MD_FEATURE_RESHAPE_ACTIVE)) || (oldsize == newsize))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = md_bitmap_resize(mddev->bitmap, newsize, 0, 0);
|
2012-05-22 07:55:28 +04:00
|
|
|
if (ret)
|
|
|
|
goto abort;
|
2018-10-18 11:37:41 +03:00
|
|
|
|
|
|
|
ret = md_cluster_ops->resize_bitmaps(mddev, newsize, oldsize);
|
|
|
|
if (ret) {
|
|
|
|
md_bitmap_resize(mddev->bitmap, oldsize, 0, 0);
|
|
|
|
goto abort;
|
|
|
|
}
|
2012-05-22 07:55:28 +04:00
|
|
|
}
|
2018-10-18 11:37:41 +03:00
|
|
|
out:
|
2012-05-22 07:53:47 +04:00
|
|
|
if (mddev->delta_disks > 0) {
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (rdev->raid_disk < 0 &&
|
|
|
|
!test_bit(Faulty, &rdev->flags)) {
|
|
|
|
if (raid10_add_disk(mddev, rdev) == 0) {
|
|
|
|
if (rdev->raid_disk >=
|
|
|
|
conf->prev.raid_disks)
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
|
|
|
else
|
|
|
|
rdev->recovery_offset = 0;
|
|
|
|
|
|
|
|
if (sysfs_link_rdev(mddev, rdev))
|
|
|
|
/* Failure here is OK */;
|
|
|
|
}
|
|
|
|
} else if (rdev->raid_disk >= conf->prev.raid_disks
|
|
|
|
&& !test_bit(Faulty, &rdev->flags)) {
|
|
|
|
/* This is a spare that was manually added */
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/* When a reshape changes the number of devices,
|
|
|
|
* ->degraded is measured against the larger of the
|
|
|
|
* pre and post numbers.
|
|
|
|
*/
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
mddev->degraded = calc_degraded(conf);
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
mddev->raid_disks = conf->geo.raid_disks;
|
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
2016-12-09 02:48:19 +03:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
2015-06-12 13:05:04 +03:00
|
|
|
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
|
2012-05-22 07:53:47 +04:00
|
|
|
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
|
|
|
|
mddev->sync_thread = md_register_thread(md_do_sync, mddev,
|
|
|
|
"reshape");
|
|
|
|
if (!mddev->sync_thread) {
|
2012-05-22 07:55:28 +04:00
|
|
|
ret = -EAGAIN;
|
|
|
|
goto abort;
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
conf->reshape_checkpoint = jiffies;
|
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
|
|
|
md_new_event(mddev);
|
|
|
|
return 0;
|
2012-05-22 07:55:28 +04:00
|
|
|
|
|
|
|
abort:
|
|
|
|
mddev->recovery = 0;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
conf->geo = conf->prev;
|
|
|
|
mddev->raid_disks = conf->geo.raid_disks;
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
rdev->new_data_offset = rdev->data_offset;
|
|
|
|
smp_wmb();
|
|
|
|
conf->reshape_progress = MaxSector;
|
2015-07-06 10:37:49 +03:00
|
|
|
conf->reshape_safe = MaxSector;
|
2012-05-22 07:55:28 +04:00
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
return ret;
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Calculate the last device-address that could contain
 * any block from the chunk that includes the array-address 's'
 * and report the next address.
 * i.e. the address returned will be chunk-aligned and after
 * any data that is in the chunk containing 's'.
 */
static sector_t last_dev_address(sector_t s, struct geom *geo)
{
	s = (s | geo->chunk_mask) + 1;
	s >>= geo->chunk_shift;
	s *= geo->near_copies;
	s = DIV_ROUND_UP_SECTOR_T(s, geo->raid_disks);
	s *= geo->far_copies;
	s <<= geo->chunk_shift;
	return s;
}

/* Calculate the first device-address that could contain
 * any block from the chunk that includes the array-address 's'.
 * This too will be the start of a chunk
 */
static sector_t first_dev_address(sector_t s, struct geom *geo)
{
	s >>= geo->chunk_shift;
	s *= geo->near_copies;
	sector_div(s, geo->raid_disks);
	s *= geo->far_copies;
	s <<= geo->chunk_shift;
	return s;
}
|
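The two helpers above can be tried outside the kernel. The following is a minimal userspace sketch, not the driver's code: `struct geom_sketch` is a hypothetical cut-down stand-in for `struct geom`, `uint64_t` stands in for `sector_t`, and plain integer division replaces `DIV_ROUND_UP_SECTOR_T()` and `sector_div()`.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's struct geom. */
struct geom_sketch {
	int raid_disks;
	int near_copies;
	int far_copies;
	int chunk_shift;	/* log2 of the chunk size in sectors */
	uint64_t chunk_mask;	/* (1 << chunk_shift) - 1 */
};

/* First chunk-aligned device address after any data of the chunk holding 's'. */
static uint64_t last_dev_address_sketch(uint64_t s, const struct geom_sketch *geo)
{
	s = (s | geo->chunk_mask) + 1;	/* round up to the next chunk boundary */
	s >>= geo->chunk_shift;		/* sectors -> chunks */
	s *= geo->near_copies;
	s = (s + geo->raid_disks - 1) / geo->raid_disks;	/* divide, rounding up */
	s *= geo->far_copies;
	s <<= geo->chunk_shift;		/* chunks -> sectors */
	return s;
}

/* Lowest chunk-aligned device address that may hold data of the chunk holding 's'. */
static uint64_t first_dev_address_sketch(uint64_t s, const struct geom_sketch *geo)
{
	s >>= geo->chunk_shift;		/* sectors -> chunks, rounding down */
	s *= geo->near_copies;
	s /= geo->raid_disks;		/* divide, rounding down */
	s *= geo->far_copies;
	s <<= geo->chunk_shift;
	return s;
}
```

With 4 devices, 2 near copies, 1 far copy and 128-sector chunks, array sector 1000 gives 384 from `first_dev_address_sketch()` and 512 from `last_dev_address_sketch()`: one helper rounds down, the other rounds up, so together they bracket the chunk on the device.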
|
|
|
|
|
|
|
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
|
|
|
|
int *skipped)
|
|
|
|
{
|
|
|
|
/* We simply copy at most one chunk (smallest of old and new)
|
|
|
|
* at a time, possibly less if that exceeds RESYNC_PAGES,
|
|
|
|
* or we hit a bad block or something.
|
|
|
|
* This might mean we pause for normal IO in the middle of
|
2015-07-06 09:33:47 +03:00
|
|
|
* a chunk, but that is not a problem as mddev->reshape_position
|
2012-05-22 07:53:47 +04:00
|
|
|
* can record any location.
|
|
|
|
*
|
|
|
|
* If we want to write to a location that isn't
|
|
|
|
* yet recorded as 'safe' (i.e. in metadata on disk) then
|
|
|
|
* we need to flush all reshape requests and update the metadata.
|
|
|
|
*
|
|
|
|
* When reshaping forwards (e.g. to more devices), we interpret
|
|
|
|
* 'safe' as the earliest block which might not have been copied
|
|
|
|
* down yet. We divide this by previous stripe size and multiply
|
|
|
|
* by previous stripe length to get lowest device offset that we
|
|
|
|
* cannot write to yet.
|
|
|
|
* We interpret 'sector_nr' as an address that we want to write to.
|
|
|
|
* From this we use last_dev_address() to find where we might
|
|
|
|
* write to, and first_dev_address() on the 'safe' position.
|
|
|
|
* If this 'next' write position is after the 'safe' position,
|
|
|
|
* we must update the metadata to increase the 'safe' position.
|
|
|
|
*
|
|
|
|
* When reshaping backwards, we round in the opposite direction
|
|
|
|
* and perform the reverse test: next write position must not be
|
|
|
|
* less than current safe position.
|
|
|
|
*
|
|
|
|
* In all this the minimum difference in data offsets
|
|
|
|
* (conf->offset_diff - always positive) allows a bit of slack,
|
2015-07-06 09:33:47 +03:00
|
|
|
* so next can be after 'safe', but not by more than offset_diff.
|
2012-05-22 07:53:47 +04:00
|
|
|
*
|
|
|
|
* We need to prepare all the bios here before we start any IO
|
|
|
|
* to ensure the size we choose is acceptable to all devices.
|
|
|
|
* This means one for each copy for write-out and an extra one for
|
|
|
|
* read-in.
|
|
|
|
* We store the read-in bio in ->master_bio and the others in
|
|
|
|
* ->devs[x].bio and ->devs[x].repl_bio.
|
|
|
|
*/
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
struct r10bio *r10_bio;
|
|
|
|
sector_t next, safe, last;
|
|
|
|
int max_sectors;
|
|
|
|
int nr_sectors;
|
|
|
|
int s;
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
int need_flush = 0;
|
|
|
|
struct bio *blist;
|
|
|
|
struct bio *bio, *read_bio;
|
|
|
|
int sectors_done = 0;
|
2017-03-16 19:12:33 +03:00
|
|
|
struct page **pages;
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
if (sector_nr == 0) {
|
|
|
|
/* If restarting in the middle, skip the initial sectors */
|
|
|
|
if (mddev->reshape_backwards &&
|
|
|
|
conf->reshape_progress < raid10_size(mddev, 0, 0)) {
|
|
|
|
sector_nr = (raid10_size(mddev, 0, 0)
|
|
|
|
- conf->reshape_progress);
|
|
|
|
} else if (!mddev->reshape_backwards &&
|
|
|
|
conf->reshape_progress > 0)
|
|
|
|
sector_nr = conf->reshape_progress;
|
|
|
|
if (sector_nr) {
|
|
|
|
mddev->curr_resync_completed = sector_nr;
|
|
|
|
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
|
|
|
|
*skipped = 1;
|
|
|
|
return sector_nr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We don't use sector_nr to track where we are up to
|
|
|
|
* as that doesn't work well for ->reshape_backwards.
|
|
|
|
* So just use ->reshape_progress.
|
|
|
|
*/
|
|
|
|
if (mddev->reshape_backwards) {
|
|
|
|
/* 'next' is the earliest device address that we might
|
|
|
|
* write to for this chunk in the new layout
|
|
|
|
*/
|
|
|
|
next = first_dev_address(conf->reshape_progress - 1,
|
|
|
|
&conf->geo);
|
|
|
|
|
|
|
|
/* 'safe' is the last device address that we might read from
|
|
|
|
* in the old layout after a restart
|
|
|
|
*/
|
|
|
|
safe = last_dev_address(conf->reshape_safe - 1,
|
|
|
|
&conf->prev);
|
|
|
|
|
|
|
|
if (next + conf->offset_diff < safe)
|
|
|
|
need_flush = 1;
|
|
|
|
|
|
|
|
last = conf->reshape_progress - 1;
|
|
|
|
sector_nr = last & ~(sector_t)(conf->geo.chunk_mask
|
|
|
|
& conf->prev.chunk_mask);
|
|
|
|
if (sector_nr + RESYNC_BLOCK_SIZE/512 < last)
|
|
|
|
sector_nr = last + 1 - RESYNC_BLOCK_SIZE/512;
|
|
|
|
} else {
|
|
|
|
/* 'next' is after the last device address that we
|
|
|
|
* might write to for this chunk in the new layout
|
|
|
|
*/
|
|
|
|
next = last_dev_address(conf->reshape_progress, &conf->geo);
|
|
|
|
|
|
|
|
/* 'safe' is the earliest device address that we might
|
|
|
|
* read from in the old layout after a restart
|
|
|
|
*/
|
|
|
|
safe = first_dev_address(conf->reshape_safe, &conf->prev);
|
|
|
|
|
|
|
|
/* Need to update metadata if 'next' might be beyond 'safe'
|
|
|
|
* as that would possibly corrupt data
|
|
|
|
*/
|
|
|
|
if (next > safe + conf->offset_diff)
|
|
|
|
need_flush = 1;
|
|
|
|
|
|
|
|
sector_nr = conf->reshape_progress;
|
|
|
|
last = sector_nr | (conf->geo.chunk_mask
|
|
|
|
& conf->prev.chunk_mask);
|
|
|
|
|
|
|
|
if (sector_nr + RESYNC_BLOCK_SIZE/512 <= last)
|
|
|
|
last = sector_nr + RESYNC_BLOCK_SIZE/512 - 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (need_flush ||
|
|
|
|
time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
|
|
|
|
/* Need to update reshape_position in metadata */
|
|
|
|
wait_barrier(conf);
|
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
|
|
|
if (mddev->reshape_backwards)
|
|
|
|
mddev->curr_resync_completed = raid10_size(mddev, 0, 0)
|
|
|
|
- conf->reshape_progress;
|
|
|
|
else
|
|
|
|
mddev->curr_resync_completed = conf->reshape_progress;
|
|
|
|
conf->reshape_checkpoint = jiffies;
|
2016-12-09 02:48:19 +03:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2012-05-22 07:53:47 +04:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2016-12-09 02:48:19 +03:00
|
|
|
wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
|
2013-11-19 05:02:01 +04:00
|
|
|
test_bit(MD_RECOVERY_INTR, &mddev->recovery));
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
|
|
|
|
allow_barrier(conf);
|
|
|
|
return sectors_done;
|
|
|
|
}
|
2012-05-22 07:53:47 +04:00
|
|
|
conf->reshape_safe = mddev->reshape_position;
|
|
|
|
allow_barrier(conf);
|
|
|
|
}
|
|
|
|
|
2018-08-30 10:57:09 +03:00
|
|
|
raise_barrier(conf, 0);
|
2012-05-22 07:53:47 +04:00
|
|
|
read_more:
|
|
|
|
/* Now schedule reads for blocks from sector_nr to last */
|
2017-08-25 03:50:40 +03:00
|
|
|
r10_bio = raid10_alloc_init_r10buf(conf);
|
2014-08-18 08:38:45 +04:00
|
|
|
r10_bio->state = 0;
|
2018-08-30 10:57:09 +03:00
|
|
|
raise_barrier(conf, 1);
|
2012-05-22 07:53:47 +04:00
|
|
|
atomic_set(&r10_bio->remaining, 0);
|
|
|
|
r10_bio->mddev = mddev;
|
|
|
|
r10_bio->sector = sector_nr;
|
|
|
|
set_bit(R10BIO_IsReshape, &r10_bio->state);
|
|
|
|
r10_bio->sectors = last - sector_nr + 1;
|
|
|
|
rdev = read_balance(conf, r10_bio, &max_sectors);
|
|
|
|
BUG_ON(!test_bit(R10BIO_Previous, &r10_bio->state));
|
|
|
|
|
|
|
|
if (!rdev) {
|
|
|
|
/* Cannot read from here, so need to record bad blocks
|
|
|
|
* on all the target devices.
|
|
|
|
*/
|
|
|
|
// FIXME
|
2018-05-21 01:25:52 +03:00
|
|
|
mempool_free(r10_bio, &conf->r10buf_pool);
|
2012-05-22 07:53:47 +04:00
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
|
|
|
return sectors_done;
|
|
|
|
}
|
|
|
|
|
|
|
|
read_bio = bio_alloc_mddev(GFP_KERNEL, RESYNC_PAGES, mddev);
|
|
|
|
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(read_bio, rdev->bdev);
|
2013-10-12 02:44:27 +04:00
|
|
|
read_bio->bi_iter.bi_sector = (r10_bio->devs[r10_bio->read_slot].addr
|
2012-05-22 07:53:47 +04:00
|
|
|
+ rdev->data_offset);
|
|
|
|
read_bio->bi_private = r10_bio;
|
2017-03-16 19:12:32 +03:00
|
|
|
read_bio->bi_end_io = end_reshape_read;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(read_bio, REQ_OP_READ, 0);
|
2014-08-18 07:56:38 +04:00
|
|
|
read_bio->bi_flags &= (~0UL << BIO_RESET_BITS);
|
2017-06-03 10:38:06 +03:00
|
|
|
read_bio->bi_status = 0;
|
2012-05-22 07:53:47 +04:00
|
|
|
read_bio->bi_vcnt = 0;
|
2013-10-12 02:44:27 +04:00
|
|
|
read_bio->bi_iter.bi_size = 0;
|
2012-05-22 07:53:47 +04:00
|
|
|
r10_bio->master_bio = read_bio;
|
|
|
|
r10_bio->read_slot = r10_bio->devs[r10_bio->read_slot].devnum;
|
|
|
|
|
md-cluster/raid10: support add disk under grow mode
For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.
Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.
For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.
Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.
2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 11:37:42 +03:00
|
|
|
/*
|
|
|
|
* Broadcast RESYNC message to other nodes, so all nodes would not
|
|
|
|
* write to the region to avoid conflict.
|
|
|
|
*/
|
|
|
|
if (mddev_is_clustered(mddev) && conf->cluster_sync_high <= sector_nr) {
|
|
|
|
struct mdp_superblock_1 *sb = NULL;
|
|
|
|
sector_t sb_reshape_pos = 0;
|
|
|
|
|
|
|
|
conf->cluster_sync_low = sector_nr;
|
|
|
|
conf->cluster_sync_high = sector_nr + CLUSTER_RESYNC_WINDOW_SECTORS;
|
|
|
|
sb = page_address(rdev->sb_page);
|
|
|
|
if (sb) {
|
|
|
|
sb_reshape_pos = le64_to_cpu(sb->reshape_position);
|
|
|
|
/*
 * Set cluster_sync_low again if the next address for array
 * reshape is less than cluster_sync_low, since we can't
 * update cluster_sync_low until the reshape has finished.
 */
|
|
|
|
if (sb_reshape_pos < conf->cluster_sync_low)
|
|
|
|
conf->cluster_sync_low = sb_reshape_pos;
|
|
|
|
}
|
|
|
|
|
|
|
|
md_cluster_ops->resync_info_update(mddev, conf->cluster_sync_low,
|
|
|
|
conf->cluster_sync_high);
|
|
|
|
}
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
/* Now find the locations in the new layout */
|
|
|
|
__raid10_find_phys(&conf->geo, r10_bio);
|
|
|
|
|
|
|
|
blist = read_bio;
|
|
|
|
read_bio->bi_next = NULL;
|
|
|
|
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
2012-05-22 07:53:47 +04:00
|
|
|
for (s = 0; s < conf->copies*2; s++) {
|
|
|
|
struct bio *b;
|
|
|
|
int d = r10_bio->devs[s/2].devnum;
|
|
|
|
struct md_rdev *rdev2;
|
|
|
|
if (s&1) {
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev2 = rcu_dereference(conf->mirrors[d].replacement);
|
2012-05-22 07:53:47 +04:00
|
|
|
b = r10_bio->devs[s/2].repl_bio;
|
|
|
|
} else {
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev2 = rcu_dereference(conf->mirrors[d].rdev);
|
2012-05-22 07:53:47 +04:00
|
|
|
b = r10_bio->devs[s/2].bio;
|
|
|
|
}
|
|
|
|
if (!rdev2 || test_bit(Faulty, &rdev2->flags))
|
|
|
|
continue;
|
2012-09-07 01:14:43 +04:00
|
|
|
|
2017-08-23 20:10:32 +03:00
|
|
|
bio_set_dev(b, rdev2->bdev);
|
2013-10-12 02:44:27 +04:00
|
|
|
b->bi_iter.bi_sector = r10_bio->devs[s/2].addr +
|
|
|
|
rdev2->new_data_offset;
|
2012-05-22 07:53:47 +04:00
|
|
|
b->bi_end_io = end_reshape_write;
|
2016-06-05 22:32:07 +03:00
|
|
|
bio_set_op_attrs(b, REQ_OP_WRITE, 0);
|
2012-05-22 07:53:47 +04:00
|
|
|
b->bi_next = blist;
|
|
|
|
blist = b;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now add as many pages as possible to all of these bios. */
|
|
|
|
|
|
|
|
nr_sectors = 0;
|
2017-03-16 19:12:33 +03:00
|
|
|
pages = get_resync_pages(r10_bio->devs[0].bio)->pages;
|
2012-05-22 07:53:47 +04:00
|
|
|
for (s = 0 ; s < max_sectors; s += PAGE_SIZE >> 9) {
|
2017-03-16 19:12:33 +03:00
|
|
|
struct page *page = pages[s / (PAGE_SIZE >> 9)];
|
2012-05-22 07:53:47 +04:00
|
|
|
int len = (max_sectors - s) << 9;
|
|
|
|
if (len > PAGE_SIZE)
|
|
|
|
len = PAGE_SIZE;
|
|
|
|
for (bio = blist; bio ; bio = bio->bi_next) {
|
2017-03-16 19:12:22 +03:00
|
|
|
/*
|
|
|
|
* won't fail because the vec table is big enough
|
|
|
|
* to hold all these pages
|
|
|
|
*/
|
|
|
|
bio_add_page(bio, page, len, 0);
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
sector_nr += len >> 9;
|
|
|
|
nr_sectors += len >> 9;
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2012-05-22 07:53:47 +04:00
|
|
|
r10_bio->sectors = nr_sectors;
|
|
|
|
|
|
|
|
/* Now submit the read */
|
2017-08-23 20:10:32 +03:00
|
|
|
md_sync_acct_bio(read_bio, r10_bio->sectors);
|
2012-05-22 07:53:47 +04:00
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
read_bio->bi_next = NULL;
|
|
|
|
generic_make_request(read_bio);
|
|
|
|
sectors_done += nr_sectors;
|
|
|
|
if (sector_nr <= last)
|
|
|
|
goto read_more;
|
|
|
|
|
2018-08-30 10:57:09 +03:00
|
|
|
lower_barrier(conf);
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
/* Now that we have done the whole section we can
|
|
|
|
* update reshape_progress
|
|
|
|
*/
|
|
|
|
if (mddev->reshape_backwards)
|
|
|
|
conf->reshape_progress -= sectors_done;
|
|
|
|
else
|
|
|
|
conf->reshape_progress += sectors_done;
|
|
|
|
|
|
|
|
return sectors_done;
|
|
|
|
}
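The window selection and the metadata-flush test in `reshape_request()` come down to a few lines of arithmetic. The sketch below restates them in userspace C under stated assumptions: the function and struct names are hypothetical, `uint64_t` stands in for `sector_t`, and `max_sectors` stands in for `RESYNC_BLOCK_SIZE/512`, whose value is not shown in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of array sectors handled by one call (hypothetical type). */
struct window_sketch {
	uint64_t first;
	uint64_t last;
};

/* Forward reshape: flush once 'next' runs more than offset_diff past 'safe'. */
static bool need_flush_forward(uint64_t next, uint64_t safe, uint64_t offset_diff)
{
	return next > safe + offset_diff;
}

/* Backward reshape: the mirror-image test. */
static bool need_flush_backward(uint64_t next, uint64_t safe, uint64_t offset_diff)
{
	return next + offset_diff < safe;
}

/* Forward: start at 'progress', stop at the end of the smaller chunk,
 * but never cover more than max_sectors. */
static struct window_sketch forward_window(uint64_t progress, uint64_t new_mask,
					   uint64_t old_mask, uint64_t max_sectors)
{
	struct window_sketch w;

	w.first = progress;
	w.last = progress | (new_mask & old_mask);
	if (w.first + max_sectors <= w.last)
		w.last = w.first + max_sectors - 1;
	return w;
}

/* Backward: end just below 'progress', start at the beginning of the
 * smaller chunk, again capped at max_sectors. */
static struct window_sketch backward_window(uint64_t progress, uint64_t new_mask,
					    uint64_t old_mask, uint64_t max_sectors)
{
	struct window_sketch w;

	w.last = progress - 1;
	w.first = w.last & ~(new_mask & old_mask);
	if (w.first + max_sectors < w.last)
		w.first = w.last + 1 - max_sectors;
	return w;
}
```

ANDing the two chunk masks selects the smaller of the old and new chunk sizes, which matches the comment's "at most one chunk (smallest of old and new)". For example, with 1024-sector chunks and a 128-sector cap, a forward reshape at sector 0 covers sectors 0 to 127, and a backward reshape with progress 1024 covers sectors 896 to 1023.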
|
|
|
|
|
|
|
|
static void end_reshape_request(struct r10bio *r10_bio);
|
|
|
|
static int handle_reshape_read_error(struct mddev *mddev,
|
|
|
|
struct r10bio *r10_bio);
|
|
|
|
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio)
|
|
|
|
{
|
|
|
|
/* Reshape read completed. Hopefully we have a block
|
|
|
|
* to write out.
|
|
|
|
* If we got a read error then we do sync 1-page reads from
|
|
|
|
* elsewhere until we find the data - or give up.
|
|
|
|
*/
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
int s;
|
|
|
|
|
|
|
|
if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
|
|
|
|
if (handle_reshape_read_error(mddev, r10_bio) < 0) {
|
|
|
|
/* Reshape has been aborted */
|
|
|
|
md_done_sync(mddev, r10_bio->sectors, 0);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We definitely have the data in the pages, schedule the
|
|
|
|
* writes.
|
|
|
|
*/
|
|
|
|
atomic_set(&r10_bio->remaining, 1);
|
|
|
|
for (s = 0; s < conf->copies*2; s++) {
|
|
|
|
struct bio *b;
|
|
|
|
int d = r10_bio->devs[s/2].devnum;
|
|
|
|
struct md_rdev *rdev;
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
2012-05-22 07:53:47 +04:00
|
|
|
if (s&1) {
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev = rcu_dereference(conf->mirrors[d].replacement);
|
2012-05-22 07:53:47 +04:00
|
|
|
b = r10_bio->devs[s/2].repl_bio;
|
|
|
|
} else {
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2012-05-22 07:53:47 +04:00
|
|
|
b = r10_bio->devs[s/2].bio;
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags)) {
|
|
|
|
rcu_read_unlock();
|
2012-05-22 07:53:47 +04:00
|
|
|
continue;
|
2016-06-02 09:19:52 +03:00
|
|
|
}
|
2012-05-22 07:53:47 +04:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2017-08-23 20:10:32 +03:00
|
|
|
md_sync_acct_bio(b, r10_bio->sectors);
|
2012-05-22 07:53:47 +04:00
|
|
|
atomic_inc(&r10_bio->remaining);
|
|
|
|
b->bi_next = NULL;
|
|
|
|
generic_make_request(b);
|
|
|
|
}
|
|
|
|
end_reshape_request(r10_bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void end_reshape(struct r10conf *conf)
|
|
|
|
{
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
|
|
|
|
return;
|
|
|
|
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
conf->prev = conf->geo;
|
|
|
|
md_finish_reshape(conf->mddev);
|
|
|
|
smp_wmb();
|
|
|
|
conf->reshape_progress = MaxSector;
|
2015-07-06 10:37:49 +03:00
|
|
|
conf->reshape_safe = MaxSector;
|
2012-05-22 07:53:47 +04:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
|
|
|
|
/* read-ahead size must cover two whole stripes, which is
|
|
|
|
* 2 * (datadisks) * chunksize, where datadisks = raid_disks / near_copies
|
|
|
|
*/
|
|
|
|
if (conf->mddev->queue) {
|
|
|
|
int stripe = conf->geo.raid_disks *
|
|
|
|
((conf->mddev->chunk_sectors << 9) / PAGE_SIZE);
|
|
|
|
stripe /= conf->geo.near_copies;
|
2017-02-02 17:56:50 +03:00
|
|
|
if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
|
|
|
|
conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
conf->fullsync = 0;
|
|
|
|
}
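The read-ahead adjustment at the end of `end_reshape()` is plain arithmetic and can be checked in isolation. This is a minimal sketch with hypothetical function names; the page size is passed in as a parameter instead of using the kernel's `PAGE_SIZE`.

```c
/* Pages needed to cover two whole stripes, following the same steps as
 * end_reshape(): pages per chunk, times devices, divided by near copies. */
static unsigned long min_ra_pages_sketch(int raid_disks, int near_copies,
					 unsigned int chunk_sectors,
					 unsigned long page_size)
{
	unsigned long stripe;

	stripe = raid_disks * (((unsigned long)chunk_sectors << 9) / page_size);
	stripe /= near_copies;
	return 2 * stripe;
}

/* The read-ahead size is only ever raised, never lowered. */
static unsigned long bump_ra_pages_sketch(unsigned long current_pages,
					  unsigned long wanted_pages)
{
	return current_pages < wanted_pages ? wanted_pages : current_pages;
}
```

For 4 devices, 2 near copies, 1024-sector (512 KiB) chunks and 4096-byte pages, one chunk is 128 pages, a stripe is 4 * 128 / 2 = 256 pages, and the minimum read-ahead is 512 pages.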
|
|
|
|
|
2018-10-18 11:37:42 +03:00
|
|
|
static void raid10_update_reshape_pos(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
struct r10conf *conf = mddev->private;
|
2018-10-18 11:37:43 +03:00
|
|
|
sector_t lo, hi;
|
2018-10-18 11:37:42 +03:00
|
|
|
|
2018-10-18 11:37:43 +03:00
|
|
|
md_cluster_ops->resync_info_get(mddev, &lo, &hi);
|
|
|
|
if (((mddev->reshape_position <= hi) && (mddev->reshape_position >= lo))
|
|
|
|
|| mddev->reshape_position == MaxSector)
|
|
|
|
conf->reshape_progress = mddev->reshape_position;
|
|
|
|
else
|
|
|
|
WARN_ON_ONCE(1);
|
2018-10-18 11:37:42 +03:00
|
|
|
}
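The acceptance test in `raid10_update_reshape_pos()` can be read as a single predicate. The sketch below is an illustration only: the function name is hypothetical, `uint64_t` stands in for `sector_t`, and `MAX_SECTOR_SKETCH` is an assumed stand-in for the kernel's `MaxSector`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for MaxSector, which marks "no reshape in progress". */
#define MAX_SECTOR_SKETCH UINT64_MAX

/* A reshape position reported through the superblock is accepted when it
 * lies inside the [lo, hi] resync window of the node doing the reshape,
 * or when it signals that the reshape has finished. */
static bool reshape_pos_acceptable(uint64_t pos, uint64_t lo, uint64_t hi)
{
	return (pos >= lo && pos <= hi) || pos == MAX_SECTOR_SKETCH;
}
```

In the driver, a position that fails this test is not copied into `conf->reshape_progress` and triggers `WARN_ON_ONCE(1)` instead.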
|
|
|
|
|
2012-05-22 07:53:47 +04:00
|
|
|
static int handle_reshape_read_error(struct mddev *mddev,
|
|
|
|
struct r10bio *r10_bio)
|
|
|
|
{
|
|
|
|
/* Use sync reads to get the blocks from somewhere else */
|
|
|
|
int sectors = r10_bio->sectors;
|
|
|
|
struct r10conf *conf = mddev->private;
|
2017-10-05 21:28:47 +03:00
|
|
|
struct r10bio *r10b;
|
2012-05-22 07:53:47 +04:00
|
|
|
int slot = 0;
|
|
|
|
int idx = 0;
|
2017-03-16 19:12:35 +03:00
|
|
|
struct page **pages;
|
|
|
|
|
2019-06-15 01:41:09 +03:00
|
|
|
r10b = kmalloc(struct_size(r10b, devs, conf->copies), GFP_NOIO);
|
2017-10-05 21:28:47 +03:00
|
|
|
if (!r10b) {
|
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2017-03-16 19:12:35 +03:00
|
|
|
/* reshape IOs share pages from .devs[0].bio */
|
|
|
|
pages = get_resync_pages(r10_bio->devs[0].bio)->pages;
|
2012-05-22 07:53:47 +04:00
|
|
|
|
2012-08-18 03:51:42 +04:00
|
|
|
r10b->sector = r10_bio->sector;
|
|
|
|
__raid10_find_phys(&conf->prev, r10b);
|
2012-05-22 07:53:47 +04:00
|
|
|
|
|
|
|
while (sectors) {
|
|
|
|
int s = sectors;
|
|
|
|
int success = 0;
|
|
|
|
int first_slot = slot;
|
|
|
|
|
|
|
|
if (s > (PAGE_SIZE >> 9))
|
|
|
|
s = PAGE_SIZE >> 9;
|
|
|
|
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
2012-05-22 07:53:47 +04:00
|
|
|
while (!success) {
|
2012-08-18 03:51:42 +04:00
|
|
|
int d = r10b->devs[slot].devnum;
|
2016-06-02 09:19:52 +03:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2012-05-22 07:53:47 +04:00
|
|
|
sector_t addr;
|
|
|
|
if (rdev == NULL ||
|
|
|
|
test_bit(Faulty, &rdev->flags) ||
|
|
|
|
!test_bit(In_sync, &rdev->flags))
|
|
|
|
goto failed;
|
|
|
|
|
2012-08-18 03:51:42 +04:00
|
|
|
addr = r10b->devs[slot].addr + idx * PAGE_SIZE;
|
2016-06-02 09:19:52 +03:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2012-05-22 07:53:47 +04:00
|
|
|
success = sync_page_io(rdev,
|
|
|
|
addr,
|
|
|
|
s << 9,
|
2017-03-16 19:12:35 +03:00
|
|
|
pages[idx],
|
2016-06-05 22:32:07 +03:00
|
|
|
REQ_OP_READ, 0, false);
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
rcu_read_lock();
|
2012-05-22 07:53:47 +04:00
|
|
|
if (success)
|
|
|
|
break;
|
|
|
|
failed:
|
|
|
|
slot++;
|
|
|
|
if (slot >= conf->copies)
|
|
|
|
slot = 0;
|
|
|
|
if (slot == first_slot)
|
|
|
|
break;
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2012-05-22 07:53:47 +04:00
|
|
|
if (!success) {
|
|
|
|
/* couldn't read this block, must give up */
|
|
|
|
set_bit(MD_RECOVERY_INTR,
|
|
|
|
&mddev->recovery);
|
2017-10-05 21:28:47 +03:00
|
|
|
kfree(r10b);
|
2012-05-22 07:53:47 +04:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
sectors -= s;
|
|
|
|
idx++;
|
|
|
|
}
|
2017-10-05 21:28:47 +03:00
|
|
|
kfree(r10b);
|
2012-05-22 07:53:47 +04:00
|
|
|
return 0;
|
|
|
|
}
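The inner loop of `handle_reshape_read_error()` tries each copy once, starting from the slot that worked last time and wrapping around. A minimal sketch of that retry order follows; the callback type, the function names and the `slot_is_good()` example callback are all hypothetical and exist only to make the loop testable outside the kernel.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true if a read from 'slot' succeeded. */
typedef bool (*try_read_fn)(int slot, void *ctx);

/* Try every copy once, starting at *slot and wrapping around.
 * On success *slot names the copy that worked, so the next page starts
 * there; on failure *slot is back at its starting value. */
static bool read_from_any_copy(int *slot, int copies, try_read_fn try_read,
			       void *ctx)
{
	int first_slot = *slot;

	do {
		if (try_read(*slot, ctx))
			return true;
		*slot = (*slot + 1) % copies;	/* move on to the next copy */
	} while (*slot != first_slot);

	return false;	/* every copy failed: the caller must give up */
}

/* Example callback: 'ctx' points at an array of per-slot "readable" flags. */
static bool slot_is_good(int slot, void *ctx)
{
	const bool *good = ctx;

	return good[slot];
}
```

In the driver, a slot is skipped without attempting I/O when its device is missing, `Faulty`, or not `In_sync`; in this sketch that case is folded into the callback returning false.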
|
|
|
|
|
2015-07-20 16:29:37 +03:00
|
|
|
static void end_reshape_write(struct bio *bio)
|
2012-05-22 07:53:47 +04:00
|
|
|
{
|
2017-03-16 19:12:33 +03:00
|
|
|
struct r10bio *r10_bio = get_resync_r10bio(bio);
|
2012-05-22 07:53:47 +04:00
|
|
|
struct mddev *mddev = r10_bio->mddev;
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
int d;
|
|
|
|
int slot;
|
|
|
|
int repl;
|
|
|
|
struct md_rdev *rdev = NULL;
|
|
|
|
|
|
|
|
d = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
|
|
|
|
if (repl)
|
|
|
|
rdev = conf->mirrors[d].replacement;
|
|
|
|
if (!rdev) {
|
|
|
|
smp_mb();
|
|
|
|
rdev = conf->mirrors[d].rdev;
|
|
|
|
}
|
|
|
|
|
2017-06-03 10:38:06 +03:00
|
|
|
if (bio->bi_status) {
|
2012-05-22 07:53:47 +04:00
|
|
|
/* FIXME should record badblock */
|
|
|
|
md_error(mddev, rdev);
|
|
|
|
}
|
|
|
|
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
end_reshape_request(r10_bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void end_reshape_request(struct r10bio *r10_bio)
{
	if (!atomic_dec_and_test(&r10_bio->remaining))
		return;
	md_done_sync(r10_bio->mddev, r10_bio->sectors, 1);
	bio_put(r10_bio->master_bio);
	put_buf(r10_bio);
}
|
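`end_reshape_request()` relies on a common completion-counting pattern: the submitter holds one reference while it queues writes, each write adds one, and whoever drops the count to zero performs the cleanup. The sketch below shows the pattern with C11 atomics; the struct and function names are hypothetical, and `<stdatomic.h>` stands in for the kernel's `atomic_t` API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the 'remaining' counter in struct r10bio. */
struct pending_sketch {
	atomic_int remaining;
};

/* The submitter takes its own reference before queuing any I/O, so the
 * count cannot reach zero while writes are still being submitted. */
static void pending_start(struct pending_sketch *p)
{
	atomic_init(&p->remaining, 1);
}

/* One extra reference per submitted write. */
static void pending_add(struct pending_sketch *p)
{
	atomic_fetch_add(&p->remaining, 1);
}

/* Drop one reference; returns true for exactly one caller, the one that
 * brings the count to zero and must therefore do the final cleanup. */
static bool pending_done(struct pending_sketch *p)
{
	return atomic_fetch_sub(&p->remaining, 1) == 1;
}
```

This mirrors how `reshape_request_write()` sets the counter to 1, increments it per write, and then calls `end_reshape_request()` itself to drop the initial reference.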
|
|
|
|
|
|
|
static void raid10_finish_reshape(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
struct r10conf *conf = mddev->private;
|
|
|
|
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (mddev->delta_disks > 0) {
|
|
|
|
if (mddev->recovery_cp > mddev->resync_max_sectors) {
|
|
|
|
mddev->recovery_cp = mddev->resync_max_sectors;
|
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
}
|
md: fix a potential deadlock of raid5/raid10 reshape
There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.
How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
write through critical data into raid5 emulated block device. So it
waits for raid5 kernel thread handling stripes in order to finish it
I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
called first in order to reap the raid5 resync thread. That is,
raid5_finish_reshape() will be called. In this function, it will try
to update conf and call VFS revalidate_disk() to grow the raid5
emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.
The raid5 kernel thread is an emulated block device. It is responsible for
handling I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.
For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.
2017/12/29:
Thanks to Guoqing Jiang <gqjiang@suse.com>,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.
Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-22 08:34:46 +03:00
|
|
|
mddev->resync_max_sectors = mddev->array_sectors;
|
2012-05-22 07:55:33 +04:00
|
|
|
} else {
|
|
|
|
int d;
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_lock();
|
2012-05-22 07:55:33 +04:00
|
|
|
for (d = conf->geo.raid_disks ;
|
|
|
|
d < conf->geo.raid_disks - mddev->delta_disks;
|
|
|
|
d++) {
|
2016-06-02 09:19:52 +03:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
|
2012-05-22 07:55:33 +04:00
|
|
|
if (rdev)
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2016-06-02 09:19:52 +03:00
|
|
|
rdev = rcu_dereference(conf->mirrors[d].replacement);
|
2012-05-22 07:55:33 +04:00
|
|
|
if (rdev)
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
}
|
2016-06-02 09:19:52 +03:00
|
|
|
rcu_read_unlock();
|
2012-05-22 07:53:47 +04:00
|
|
|
}
|
|
|
|
mddev->layout = mddev->new_layout;
|
|
|
|
mddev->chunk_sectors = 1 << conf->geo.chunk_shift;
|
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
mddev->delta_disks = 0;
|
|
|
|
mddev->reshape_backwards = 0;
|
|
|
|
}
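The loop bounds in the shrinking branch of `raid10_finish_reshape()` are easy to misread because `delta_disks` is not positive there. A small sketch, with hypothetical names, of which device slots that loop visits:

```c
/* Half-open range [first, end) of device slots (hypothetical type). */
struct slot_range_sketch {
	int first;
	int end;
};

/* After a shrink, the slots that no longer belong to the array start at
 * the new device count and end at the old one. delta_disks is <= 0 on
 * this path, so subtracting it adds the number of removed devices. */
static struct slot_range_sketch vacated_slots(int new_raid_disks, int delta_disks)
{
	struct slot_range_sketch r;

	r.first = new_raid_disks;
	r.end = new_raid_disks - delta_disks;
	return r;
}
```

Shrinking from 6 to 4 devices gives `delta_disks == -2`, so slots 4 and 5 are visited and any device or replacement found there has its `In_sync` flag cleared. With `delta_disks == 0` the range is empty.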
|
|
|
|
|
2011-10-11 09:49:58 +04:00
|
|
|
static struct md_personality raid10_personality =
|
2005-04-17 02:20:36 +04:00
|
|
|
{
|
|
|
|
.name = "raid10",
|
2006-01-06 11:20:36 +03:00
|
|
|
.level = 10,
|
2005-04-17 02:20:36 +04:00
|
|
|
.owner = THIS_MODULE,
|
2016-01-21 00:52:20 +03:00
|
|
|
.make_request = raid10_make_request,
|
|
|
|
.run = raid10_run,
|
2014-12-15 04:56:58 +03:00
|
|
|
.free = raid10_free,
|
2016-01-21 00:52:20 +03:00
|
|
|
.status = raid10_status,
|
|
|
|
.error_handler = raid10_error,
|
2005-04-17 02:20:36 +04:00
|
|
|
.hot_add_disk = raid10_add_disk,
|
|
|
|
.hot_remove_disk= raid10_remove_disk,
|
|
|
|
.spare_active = raid10_spare_active,
|
2016-01-21 00:52:20 +03:00
|
|
|
.sync_request = raid10_sync_request,
|
2006-01-06 11:20:16 +03:00
|
|
|
.quiesce = raid10_quiesce,
|
2009-03-18 04:10:40 +03:00
|
|
|
.size = raid10_size,
|
2012-03-19 05:46:40 +04:00
|
|
|
.resize = raid10_resize,
|
2010-03-08 08:02:45 +03:00
|
|
|
.takeover = raid10_takeover,
|
2012-05-22 07:53:47 +04:00
|
|
|
.check_reshape = raid10_check_reshape,
|
|
|
|
.start_reshape = raid10_start_reshape,
|
|
|
|
.finish_reshape = raid10_finish_reshape,
|
md-cluster/raid10: support add disk under grow mode
For the clustered raid10 scenario, all nodes need to know when
a new disk is added to the array. The reshape triggered by
adding the new member only needs to run on one node, but the
other nodes must be told about the change.
A reshape reads data from a region that the array already uses
and writes it to an unused region. Clearly, it would be
disastrous for one node to read from an address while another
node is writing to the same address. Since suspending writes in
the resyncing area is already implemented, we can simply
broadcast the address being read to the other nodes to avoid
the problem.
The master node calls reshape_request and then updates the
superblock during the reshape. To avoid the problem above,
reshape_request calls resync_info_update to send a RESYNCING message.
From the slave node's point of view, it then receives two types of message:
1. RESYNCING message
The slave node adds the address range that the master node is reading
from to its suspend list.
2. METADATA_UPDATED message
Once the slave nodes know that reshaping has started on the master node,
they update the reshape position and call start_reshape to
follow the master node. After the reshape is done, only the reshape
position needs to be updated, so most of the reshape work
happens on the master node.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
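The slave-side handling described in this message can be sketched roughly as follows. This is a simplified user-space illustration, not the md-cluster code: every type, field and function name below (cluster_msg, slave_node, slave_handle_msg, slave_write_allowed, RESHAPE_DONE) is hypothetical, a single suspended range stands in for the suspend list, and RESHAPE_DONE is assumed to play the role of MaxSector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message types, named after the two messages in the text. */
enum msg_type { MSG_RESYNCING, MSG_METADATA_UPDATED };

/* Assumption: stands in for MaxSector, i.e. "no reshape in progress". */
#define RESHAPE_DONE UINT64_MAX

struct cluster_msg {
	enum msg_type type;
	uint64_t lo, hi;        /* RESYNCING: range [lo, hi) the master reads */
	uint64_t reshape_pos;   /* METADATA_UPDATED: master's reshape position */
};

struct slave_node {
	uint64_t suspend_lo, suspend_hi; /* writes in [lo, hi) are held back */
	uint64_t reshape_pos;
	bool reshaping;
};

static void slave_init(struct slave_node *n)
{
	n->suspend_lo = 0;
	n->suspend_hi = 0;      /* empty range: nothing suspended */
	n->reshape_pos = RESHAPE_DONE;
	n->reshaping = false;
}

static void slave_handle_msg(struct slave_node *n, const struct cluster_msg *m)
{
	assert(n && m);

	switch (m->type) {
	case MSG_RESYNCING:
		/* Master is reading this range; suspend local writes to it. */
		n->suspend_lo = m->lo;
		n->suspend_hi = m->hi;
		break;
	case MSG_METADATA_UPDATED:
		/*
		 * A position other than RESHAPE_DONE means the master is
		 * reshaping: follow it (start_reshape in the text). Once the
		 * reshape is done, only the position needs updating; clearing
		 * the flag is bookkeeping added for this sketch.
		 */
		n->reshaping = (m->reshape_pos != RESHAPE_DONE);
		n->reshape_pos = m->reshape_pos;
		break;
	}
}

/* A write is allowed only if it falls outside the suspended range. */
static bool slave_write_allowed(const struct slave_node *n, uint64_t sector)
{
	return sector < n->suspend_lo || sector >= n->suspend_hi;
}
```

In this sketch a slave that has received a RESYNCING message for sectors 64 to 128 refuses a write to sector 100 but still accepts one to sector 128, since the range is treated as half-open.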
2018-10-18 11:37:42 +03:00
|
|
|
.update_reshape_pos = raid10_update_reshape_pos,
|
2014-12-15 04:56:56 +03:00
|
|
|
.congested = raid10_congested,
|
2005-04-17 02:20:36 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __init raid_init(void)
|
|
|
|
{
|
2006-01-06 11:20:36 +03:00
|
|
|
return register_md_personality(&raid10_personality);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void raid_exit(void)
|
|
|
|
{
|
2006-01-06 11:20:36 +03:00
|
|
|
unregister_md_personality(&raid10_personality);
|
2005-04-17 02:20:36 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
module_init(raid_init);
|
|
|
|
module_exit(raid_exit);
|
|
|
|
MODULE_LICENSE("GPL");
|
2009-12-14 04:49:58 +03:00
|
|
|
MODULE_DESCRIPTION("RAID10 (striped mirror) personality for MD");
|
2005-04-17 02:20:36 +04:00
|
|
|
MODULE_ALIAS("md-personality-9"); /* RAID10 */
|
2006-01-06 11:20:51 +03:00
|
|
|
MODULE_ALIAS("md-raid10");
|
2006-01-06 11:20:36 +03:00
|
|
|
MODULE_ALIAS("md-level-10");
|
2011-10-11 09:50:01 +04:00
|
|
|
|
|
|
|
module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
|