* 'for-linus' of git://neil.brown.name/md:
md: support bitmaps on RAID10 arrays larger then 2 terabytes
md: update sync_completed and reshape_position even more often.
md: improve usefulness and accuracy of sysfs file md/sync_completed.
md: allow setting newly added device to 'in_sync' via sysfs.
md: tiny md.h cleanups
.. and other arrays with components larger than 2 terabytes.
We use a "long" rather than a "sector_t" in part of the bitmap
size calculations, which is sad.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
There are circumstances when a user-space process might need to
"oversee" a resync/reshape process. For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).
The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
suspend IO to the first section and back it up
set 'sync_max' to the end of the section
wait for 'sync_completed' to reach that point
resume IO on the first section and move on to the next section.
However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.
It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.
One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that). This defeats some of the double buffering.
With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.
To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max. This will usually be a good time for user
space to update sync_max.
If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops. So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.
We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).
So:
- make curr_resync_completed be uptodate a little more often,
particularly when raid5 reshape updates status in the metadata
- report curr_resync_completed in the sysfs file
- allow poll/select to report all updates to md/sync_completed.
This makes sync_completed completed usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.
So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>
- update inclusion guard and make sure it covers the whole file
- remove superflous #ifdef CONFIG_BLOCK
- make sure all required headers are included so that new users aren't
required to include others before
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.
dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()
Snapshots depend on the fact that callbacks are called from the singlethreaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().
This patch fixes two problems:
1. job->fn being called from the thread that submitted the job (see above).
- Fix: hand over the completion callback to the kcopyd thread.
2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.
- Fix: pass job->write_err to the callback to report all error bits
(it is done already in run_complete_job)
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined. These changes are needed by the next patch.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Barriers are submitted to a worker thread that issues them in-order.
The thread is modified so that when it sees a barrier request it waits
for all pending IO before the request then submits the barrier and
waits for it. (We must wait, otherwise it could be intermixed with
following requests.)
Errors from the barrier request are recorded in a per-device barrier_error
variable. There may be only one barrier request in progress at once.
For now, the barrier request is converted to a non-barrier request when
sending it to the underlying device.
This patch guarantees correct barrier behavior if the underlying device
doesn't perform write-back caching. The same requirement existed before
barriers were supported in dm.
Bottom layer barrier support (sending barriers by target drivers) and
handling devices with write-back caches will be done in further patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove queue_io return value and a loop in dm_request.
IO may be submitted to a worker thread with queue_io(). queue_io() sets
DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
point on, requests are submitted from dm_request again. This will be used
for processing barriers.
Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread
even if DMF_QUEUE_IO_TO_THREAD was not set.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rework shutting down on suspend and document the associated rules.
Drop write lock in __split_and_process_bio to allow more processing
concurrency.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Refactor the code in dm_request().
Require the new DMF_BLOCK_FOR_SUSPEND flag on readahead bios we will
discard so we don't drop such bios while processing a barrier.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Split the DMF_BLOCK_IO flag into two.
DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
device. DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
worker thread for later processing.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Prepare for full barrier implementation: first remove the restricted support.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch provides support for data integrity passthrough in the device
mapper.
- If one or more component devices support integrity an integrity
profile is preallocated for the DM device.
- If all component devices have compatible profiles the DM device is
flagged as capable.
- Handle integrity metadata when splitting and cloning bios.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix this build error:
drivers/md/raid1.c: In function 'raid1_congested':
drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared
BDI_write_congested was changed in commit 1faa16d228 ("block: change the
request allocation/congestion logic to be sync/async based")
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit d3f761104b
newly allocated bvecs aren't initialised to NULL, so we have
to be more careful about freeing a bio which only managed
to get a few pages allocated to it. Otherwise the resync
process crashes.
This patch is appropriate for 2.6.29-stable.
Cc: stable@kernel.org
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Reported-by: Gabriele Tozzi <gabriele@tozzi.eu>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (53 commits)
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
md/raid5: enhance raid5_size to work correctly with negative delta_disks
md/raid5: drop qd_idx from r6_state
md/raid6: move raid6 data processing to raid6_pq.ko
md: raid5 run(): Fix max_degraded for raid level 4.
md: 'array_size' sysfs attribute
md: centralize ->array_sectors modifications
...
Set queue ordered mode. It doesn't really matter what we set here
because we don't ever put any requests on the queue. But we need to set
something other than QUEUE_ORDERED_NONE so that __generic_make_request
passes barrier requests to us.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move wait queue declaration and unplug to dm_wait_for_completion.
The purpose is to minimize duplicate code in the further patches.
The patch reorders functions a little bit. It doesn't change any
functionality. For proper non-deadlock operation, add_wait_queue must
happen before set_current_state(interruptible) and before the test for
!atomic_read(&md->pending).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merge pushback and deferred lists into one list - use deferred list
for both deferred and pushed-back bios.
This will be needed for proper support of barrier bios: it is impossible to
support ordering correctly with two lists because the requests on both lists
will be mixed up.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow uninterruptible wait for pending IOs.
Add argument "interruptible" to dm_wait_for_completion that specifies
either interruptible or uninterruptible waiting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merge __flush_deferred_io() into the only caller, dm_wq_work().
There's no need to have a function that has only one caller.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the bio_io_error() calls directly into __split_and_process_bio().
This avoids some code duplication in later patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename __split_bio() to __split_and_process_bio() because it not only splits
the bio to serveral parts, but also submits them to target drivers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove struct dm_wq_req and move "work" directly into struct mapped_device.
In the revised implementation, the thread will do just one type of work
(processing the queue).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the context field from struct dm_wq_req because we will no longer
need it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove "type" field from struct dm_wq_req because we no longer need it
to have more than one value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a function that adds a bio to the head of the list for
use by the patch that will support barriers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The persistent exception store destructor does not properly
account for all conditions in which it can be called. If it
is called after 'ctr' but before 'read_metadata' (e.g. if
something else in 'snapshot_ctr' fails) then it will attempt
to free areas of memory that haven't been allocated yet.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Let the exception store types print out their status through
the new API, rather than having the snapshot code do it.
Adjust the buffer position to allow for the preceding DMEMIT in the
arguments to type->status().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
First step of having the exception stores parse their own arguments -
generalizing the interface.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use DMEMIT in place of snprintf. This makes it easier later when
other modules are helping to populate our status output.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move some of the last bits from dm-snap.h into dm-snap.c where they
belong and remove dm-snap.h.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move useful functions out of dm-snap.h and stop using dm-snap.h.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move COW device from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move chunk fields from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move target pointer from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The logging API needs an extra function to make cluster mirroring
possible. This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster. This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.
Cluster-aware log modules will implement this function. Single
machine log modules will not. So, there is no performance
penalty for single machine mirrors.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the 'dm_dirty_log_internal' structure. The resulting cleanup
eliminates extra memory allocations. Therefore exposing the internal
list_head to the external 'dm_dirty_log_type' structure is a worthwhile
compromise.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Avoid private module usage accounting by removing 'use' from
dm_dirty_log_internal. The standard module reference counting is
sufficient.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use kzfree() instead of memset() + kfree().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The tt_internal is really just a list_head to manage registered target_type
in a double linked list,
Here embed the list_head into target_type directly,
1. to avoid kmalloc/kfree;
2. then tt_internal is really unneeded;
Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.
In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.
Fix this by not changing that field during the mode upgrade.
Cc: stable@kernel.org
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The tt_internal's 'use' field is superfluous: the module's refcount can do
the work properly. An acceptable side-effect is that this increases the
reference counts reported by 'lsmod'.
Remove the superfluous test when removing a target module.
[Crash possible without this on SMP - agk]
Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
We need to check if the exception was completed after dropping the lock.
After regaining the lock, __find_pending_exception checks if the exception
was already placed into &s->pending hash.
But we don't check if the exception was already completed and placed into
&s->complete hash. If the process waiting in alloc_pending_exception was
delayed at this point because of a scheduling latency and the exception
was meanwhile completed, we'd miss that and allocate another pending
exception for already completed chunk.
It would lead to a situation where two records for the same chunk exist
and potential data corruption because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the
snapshot.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
It is uncommon and bug-prone to drop a lock in a function that is called with
the lock held, so this is moved to the caller.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move looking-up of a pending exception from __find_pending_exception to another
function.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If someone sends signal to a process performing synchronous dm-io call,
the kernel may crash.
The function sync_io attempts to exit with -EINTR if it has pending signal,
however the structure "io" is allocated on stack, so already submitted io
requests end up touching unallocated stack space and corrupting kernel memory.
sync_io sets its state to TASK_UNINTERRUPTIBLE, so the signal can't break out
of io_schedule() --- however, if the signal was pending before sync_io entered
while (1) loop, the corruption of kernel memory will happen.
There is no way to cancel in-progress IOs, so the best solution is to ignore
signals at this point.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
is significantly increased (the vector list takes 3072 bytes on 32-bit machines
and 4096 bytes on 64-bit machines).
The structure dm_raid1_read_record used to be allocated with kmalloc,
but kmalloc aligns the size on the next power-of-two so an object
slightly greater than 4096 will allocate 8192 bytes of memory and half of
that memory will be wasted.
This patch turns kmalloc into a slab cache which doesn't have this
padding so it will reduce the memory consumed.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Device mapper saves and restores various fields in the bio, but it doesn't save
bi_io_vec. If the device driver modifies this after a partially successful
request, dm-raid1 and dm-multipath may attempt to resubmit a bio that has
bi_size inconsistent with the size of vector.
To make requests resubmittable in dm-raid1 and dm-multipath, we must save
and restore the bio vector as well.
To reduce the memory overhead involved in this, we do not save the pages in a
vector and use a 16-bit field size if the page size is less than 65536.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
We currently update the metadata :
1/ every 3Megabytes
2/ When the place we will write new-layout data to is recorded in
the metadata as still containing old-layout data.
Rule one exists to avoid having to re-do too much reshaping in the
face of a crash/restart. So it should really be time based rather
than size based. So change it to "every 10 seconds".
Rule two turns out to be too harsh when restriping an array
'in-place', as in that case the metadata much be updates for every
stripe.
For the in-place update, it can only possibly be safe from a crash if
some user-space program data a backup of every e.g. few hundred
stripes before allowing them to be reshaped. In that case, the
constant metadata update is pointless.
So only update the metadata if the new metadata will report that the
end of the 'old-layout' data is beyond where we are currently
writing 'new-layout' data.
Signed-off-by: NeilBrown <neilb@suse.de>
... and to be certain the that make_request doesn't wait forever,
add a 'wake_up' when ->reshape_progress has been set to MaxSector
Signed-off-by: NeilBrown <neilb@suse.de>
This was only needed when the code was experimental. Most of it
is well tested now, so the option is no longer useful.
Signed-off-by: NeilBrown <neilb@suse.de>
When we are reshaping an array, it is very important that we read
the data from a particular sector offset before writing new data
at that offset.
In most cases when growing or shrinking an array we read long before
we even consider writing. But when restriping an array without
changing it size, there is a small possibility that we might have
some data to available write before the read has happened at the same
location. This would require some stripes to be in cache already.
To guard against this small possibility, we check, before writing,
that the 'old' stripe at the same location is not in the process of
being read. And we ensure that we mark all 'source' stripes as such
before allowing new 'destination' stripes to proceed.
Signed-off-by: NeilBrown <neilb@suse.de>
When no resync if happening, both of these files currently have
meaningless values (is slightly different ways).
Change them to "none" in that case.
Signed-off-by: NeilBrown <neilb@suse.de>
If an array has 3 or more devices, we allow the chunksize or layout
to be changed and when a reshape starts, we use these as the 'new'
values.
Signed-off-by: NeilBrown <neilb@suse.de>
This ensures that even when old and new stripes are overlapping,
we will try to read all of the old before having to write any
of the new.
Signed-off-by: NeilBrown <neilb@suse.de>
Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.
This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.
Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position. In contrast, the new_chunk value
is set when the sysfs file is written which could be well before the
reshape starts.
Signed-off-by: NeilBrown <neilb@suse.de>
During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'. They are currently differentiated by having different
->disks values as the only reshape current supported involves changing
the number of disks.
However we will soon support reshapes that do not change the number
of disks (chunk parity or chunk size). So make the difference more
explicit with a 'generation' number.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end. So largely ignore them use the internal state variable
"reshape_progress" to keep track of what to do next.
Never allow the size to be reduced below the minimum (4 for raid6,
3 otherwise).
We require that the size of the array has already been reduced before
the array is reshaped to a smaller size. This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.
When reshape finished, we remove any drives that are no longer
needed and fix up ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning. So we need to handle expand_progress and expand_lo
differently.
This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
This is the first of four patches which combine to allow md/raid5 to
reduce the number of devices in the array by restriping the data over
a subset of the devices.
If the number of disks in a raid4/5/6 is being reduced, then the
default size must be based on the new number, not the old number
of devices.
In general, it should be based on the smaller of new and old.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules. This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.
To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
compile
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Allow userspace to set the size of the array according to the following
semantics:
1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality
Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.
Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.
In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hello,
I found a typo Bosto"m" in FSF address.
And I am checking around linux source code.
Here is the only place which uses Bosto"m" (not Boston).
Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid6 is still in the layout that comes from converting raid5
into a raid6. this will allow us to convert it back again.
Signed-off-by: NeilBrown <neilb@suse.de>
2-drive raid5's aren't very interesting. But if you are converting
a raid1 into a raid5, you will at least temporarily have one. And
that it a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.
layout and chunksize don't actually affect the placement of data
on a 2-drive raid5, so we just do some internal book-keeping.
Signed-off-by: NeilBrown <neilb@suse.de>
Implement this for RAID6 to be able to 'takeover' a RAID5 array. The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
Signed-off-by: NeilBrown <neilb@suse.de>
To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.
The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.
So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly md_unregister_thread is only called when we know that the
thread is NULL, but sometimes we need to check first. It is safer
to put the check inside md_unregister_thread itself.
Signed-off-by: NeilBrown <neilb@suse.de>
.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.
Signed-off-by: NeilBrown <neilb@suse.de>
When an md array is undergoing a change, we have new_* fields that
show the new values.
When no change is happening, it is least confusing if these have
the same value as the normal fields.
This is true in most cases, but not when the values are set via sysfs.
So fix this up.
A subsequent patch will BUG_ON if these things aren't consistent.
Signed-off-by: NeilBrown <neilb@suse.de>
DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.
This requires unfortunately complex loops...
Signed-off-by: NeilBrown <neilb@suse.de>
DDF uses different layouts for P and Q blocks than current md/raid6
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last to will allow online level conversion.
Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
a 'struct stripe_head *' and fill in the relevant fields. This is
more extensible.
Signed-off-by: NeilBrown <neilb@suse.de>
Code currently assumes that the devices in a raid6 stripe are
0 1 ... N-1 P Q
in some rotated order. We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation. We still assume that the data disks
are roughly in-order. However P and Q can be inserted anywhere within
that order.
Signed-off-by: NeilBrown <neilb@suse.de>
This similar to the recent change to get_active_stripe.
There is no functional change, just come rearrangement to make
future patches cleaner.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than passing 'pd_idx' and 'disks' to these functions, just pass
'previous' which tells whether to use the 'previous' or 'current'
geometry during a reshape, and let init_stripe calculate
disks and pd_idx and anything else it might need.
This is not a substantial simplification and even adds a division.
However we will shortly be adding more complexity to init_stripe
to handle more interesting 'reshape' activities, and without this
change, the interface to these functions would get very complex.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.
All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly which gets rid of many multiplications and divisions.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.
All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows to get
rid of a couple of divisions/multiplications by two.
In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed through the IOCTL.
For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on if it is set (i.e. we
effectively 'or' the two sources together).
This makes it awkward to clear, and is at best inconsistent.
As documented code (in mdadm) requires that setting
MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
inconsistency by always using the value for this flag from the ioctl,
and ignoring the value on disk.
Signed-off-by: NeilBrown <neilb@suse.de>
Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much to the 'sync' effort after an
unclean shutdown.
One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed. And we need to know what has completed to record
how much is recovered. So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.
When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date. So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number. But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.
Signed-off-by: NeilBrown <neilb@suse.de>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
.. as they are part of the user-space interface.
Also move MdpMinorShift into there so we can remove duplication.
Lastly move mdp_major in. It is less obviously part of the user-space
interface, but do_mounts_md.c uses it, and it is acting a bit like
user-space.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Use the -y variables instead of the old -objs so we can easily add
conditional objects to the modules. Also always use += to add
subobjects to avoid problems when placing additional objects in
some place in the file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
kernels, so no need to keep it around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
md: Add support for data integrity to MD
If all subdevices support the same protection format the MD device is
flagged as integrity capable.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add some spares to an array and start recovery, and we have
a bitmap which is stored 'internally' on all devices, we call
bitmap_write_all to make sure the bitmap is correct on the new
device(s).
However that doesn't work as write_sb_page only writes to
'In_sync' devices, and devices undergoing recovery are not
'In_sync' until recovery finishes.
So extend write_sb_page (actually next_active_rdev) to include devices
that are under recovery.
Signed-off-by: NeilBrown <neilb@suse.de>
It is safe to clear a bit from the write-intent bitmap for a raid1
if we know the data has been written to all devices, which is
what the current test does.
But it is not always safe to update the 'events_cleared' counter in
that case. This is because one request could complete successfully
after some other request has partially failed.
So simply disable the clearing and updating of events_cleared whenever
the array is degraded. This might end up not clearing some bits that
could safely be cleared, but it is safest approach.
Note that the bug fixed here did not risk corrupting data by letting
the array get out-of-sync. Rather it meant that when a device is
removed and re-added to the array, it might incorrectly require a full
recovery rather than just recovering based on the bitmap.
Signed-off-by: NeilBrown <neilb@suse.de>
md currently insists that the chunk size used for write-intent
bitmaps (the amount of data that corresponds to one chunk)
be at least one page.
The reason for this restriction is lost in the mists of time,
but a review of the code (and a vague memory) suggests that the only
problem would be related to resync. Resync tries very hard to
work in multiples of a page, but also needs to sync with units
of a bitmap_chunk too.
This connection comes out in the bitmap_start_sync call.
So change bitmap_start_sync to always work in multiples of a page.
If the bitmap chunk size is less that one page, we flag multiple
chunks as 'syncing' and generally make them all appear to the
resync routines like one chunk.
All other code either already works with data ranges that could
span multiple chunks, or explicitly only cares about a single chunk.
Signed-off-by: Neil Brown <neilb@suse.de>
There are two problems with is_mddev_idle.
1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
rest are 'long'.
So if sync_io were to wrap on a 64bit host, the value of
curr_events would go very negative suddenly, and take a very
long time to return to positive.
So do all calculations as 'int'. That gives us plenty of precision
for what we need.
2/ To initialise rdev->last_events we simply call is_mddev_idle, on
the assumption that it will make sure that last_events is in a
suitable range. It used to do this, but now it does not.
So now we need to be more explicit about initialisation.
Signed-off-by: NeilBrown <neilb@suse.de>
The following oops has been reported when dm-crypt runs over a loop device.
...
[ 70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
...
[ 70.381058] Call Trace:
[ 70.381058] [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
[ 70.381058] [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt]
[ 70.381058] [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt]
[ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e
[ 70.381058] [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod]
[ 70.381058] [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod]
[ 70.381058] [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod]
[ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e
[ 70.381058] [<c02bad86>] ? loop_thread+0x380/0x3b7
[ 70.381058] [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165
[ 70.381058] [<c013754f>] ? autoremove_wake_function+0x0/0x33
[ 70.381058] [<c02baa06>] ? loop_thread+0x0/0x3b7
When a table is being replaced, it waits for I/O to complete
before destroying the mempool, but the endio function doesn't
call mempool_free() until after completing the bio.
Fix it by swapping the order of those two operations.
The same problem occurs in dm.c with md referenced after dec_pending.
Again, we swap the order.
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
In the async encryption-complete function (kcryptd_async_done), the
crypto_async_request passed in may be different from the one passed to
crypto_ablkcipher_encrypt/decrypt. Only crypto_async_request->data is
guaranteed to be same as the one passed in. The current
kcryptd_async_done uses the passed-in crypto_async_request directly
which may cause the AES-NI-based AES algorithm implementation to panic.
This patch fixes this bug by only using crypto_async_request->data,
which points to dm_crypt_request, the crypto_async_request passed in.
The original data (convert_context) is gotten from dm_crypt_request.
[mbroz@redhat.com: reworked]
Cc: stable@kernel.org
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-io calls bio_get_nr_vecs to get the maximum number of pages to use
for a given device. It allocates one additional bio_vec to use
internally but failed to respect BIO_MAX_PAGES, so fix this.
This was the likely cause of:
https://bugzilla.redhat.com/show_bug.cgi?id=173153
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix an error introduced in dm-table-rework-reference-counting.patch.
When there is failure after table initialization, we need to use
dm_table_destroy, not dm_table_put, to free the table.
dm_table_put may be used only after dm_table_get.
Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When renaming a mapped device validate the length of the new name.
The rename ioctl accepted any correctly-terminated string enclosed
within the data passed from userspace. The other ioctls enforce a
size limit of DM_NAME_LEN. If the name is changed and becomes longer
than that, the device can no longer be addressed by name.
Fix it by properly checking for device name length (including
terminating zero).
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.
When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed. So using conf after reschedule_retry is not
safe.
Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.
The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
The both walk through the device addresses in order.
For raid10 it can be quite different. resync follows the 'array'
address, and makes sure all copies are the same. Recover walks
through 'device' addresses and recreates each missing block.
The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.
In particularly, it can cause bitmap-directed recovery of a raid10 to
not recovery some of the blocks that need to be recovered.
So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.
However if we choose to skip a chunk as it isn't flag, the code
currently skips the whole raid10-chunk, thus it might not recovery
some blocks that need recovering.
This patch fixes it.
In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.
This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.
We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.
We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.
So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.
This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
ab5bd5cbc8 introduced the following
bug in linear software raid for large arrays on 32 bit machines:
which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.
Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.
The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.
As reported by Simon Kirby, this leads to errors such as
linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464
Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.
Cc: stable@kernel.org
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid1 only has a single working device and gets a read error,
we choose to simply return that error up to the filesystem (or whatever)
rather than failing the whole array.
However the codes doesn't quite do that. We attempt a readbalance
which allocates the same drive, so we retry the read - indefinitely.
Instead: If read_balance in the error case chooses the same drive that just
failed, treat it as a failure and don't retry.
Signed-off-by: NeilBrown <neilb@suse.de>
If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.
So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.
Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.
Signed-off-by: NeilBrown <neilb@suse.de>
Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.
Also, creating md devices by opening the device special file is a bit
awkward.
So this patch provides a new option for creating and naming devices.
Writing a name such as "md_home" to
/sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created. It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.
md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.
The name of the array must start 'md_' to avoid conflict with
other devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS sys case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.
Signed-off-by: NeilBrown <neilb@suse.de>
It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk. It should allocate the
queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.
But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.
In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the hash_spacing and preshift members of struct
raid0_private_data to spacing and sector_shift respectively and
changes the semantics as follows:
We always have spacing = 2 * hash_spacing. In case
sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
while sector_shift = preshift = 0 otherwise.
Note that the values of nb_zone and zone are unaffected by these changes
because in the sector_div() preceeding the assignement of these two
variables both arguments double.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
This completes the block -> sector conversion of struct strip_zone.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
current_offset and curr_zone_offset stored the corresponding offsets
as 1K quantities. Rename them to current_start and curr_zone_start
to match the naming of struct strip_zone and store the offsets as
sector counts.
Also, add KERN_INFO to the printk() affected by this change to make
checkpatch happy.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
For the same reason as in the previous patch, rename it from zone_offset
to zone_start.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Rename zone->dev_offset to zone->dev_start to make sure all users
have been converted.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
There is no compelling need for this, but sysfs_notify_dirent is a
nicer interface and the change is good for consistency.
Signed-off-by: NeilBrown <neilb@suse.de>
commit a2ed9615e3
fixed a bug with 'internal' bitmaps, but in the process broke
'in a file' bitmaps. So they are broken in 2.6.28
This fixes it, and needs to go in 2.6.28-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Supply dm_add_exception as a callback to the read_metadata function.
Add a status function ready for a later patch and name the functions
consistently.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the existing snapshot exception store implementations out into
separate files. Later patches will place these behind a new
interface in preparation for alternative implementations.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename struct exception_store to dm_exception_store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Pull structures that bridge the gap between snapshot and
exception store out of dm-snap.h and put them in a new
.h file - dm-exception-store.h. This file will define the
API for new exception stores.
Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
should be using it.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The same workqueue is used both for sending uevents and processing queued I/O.
Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting
for the queued I/O to be processed. Use scheduled_work() for the asynchronous
uevents instead.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Implement simple read-only sysfs entry for device-mapper block device.
This patch adds a simple sysfs directory named "dm" under block device
properties and implements
- name attribute (string containing mapped device name)
- uuid attribute (string containing UUID, or empty string if not set)
The kobject is embedded in mapped_device struct, so no additional
memory allocation is needed for initializing sysfs entry.
During the processing of sysfs attribute we need to lock mapped device
which is done by a new function dm_get_from_kobj, which returns the md
associated with kobject and increases the usage count.
Each 'show attribute' function is responsible for its own locking.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rework table reference counting.
The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.
This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
same process.
* stale reference from some other kernel code keeps the table
constructed, which keeps some devices open, even after successful
return from "dmsetup remove". This can confuse lvm and prevent closing
of underlying devices or reusing device minor numbers.
The patch changes reference counting so that the table destructor can be
called only at predetermined places.
The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders. A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.
Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.
When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor. We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't performance-critical task: the user doesn't
care if it takes one tick more or not.
This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.
Finally remove the temporary protection added to dm_any_congested().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Implement barrier support for single device DM devices
This patch implements barrier support in DM for the common case of dm linear
just remapping a single underlying device. In this case we can safely
pass the barrier through because there can be no reordering between
devices.
NB. Any DM device might cease to support barriers if it gets
reconfigured so code must continue to allow for a possible
-EOPNOTSUPP on every barrier bio submitted. - agk
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow NULL buffer in dm_copy_name_and_uuid if you only want to return one of
the fields.
(Required by a following patch that adds these fields to sysfs.)
Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>