Граф коммитов

1202 Коммитов

Автор SHA1 Сообщение Дата
Jens Axboe bb799ca020 bio: allow individual slabs in the bio_set
Instead of having a global bio slab cache, add a reference to one
in each bio_set that is created. This allows for personalized slabs
in each bio_set, so that they can have bios of different sizes.

This means we can personalize the bios we return. File systems may
want to embed the bio inside another structure, to avoid allocation
more items (and stuffing them in ->bi_private) after the get a bio.
Or we may want to embed a number of bio_vecs directly at the end
of a bio, to avoid doing two allocations to return a bio. This is now
possible.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-29 08:29:23 +01:00
Ingo Molnar db8862eafe Merge branch 'linus' into tracing/hw-branch-tracing 2008-12-24 21:08:26 +01:00
NeilBrown a2ed9615e3 md: Don't read past end of bitmap when reading bitmap.
When we read the write-intent-bitmap off the device, we currently
read a whole number of pages.
When PAGE_SIZE is 4K, this works due to the alignment we enforce
on the superblock and bitmap.
When PAGE_SIZE is 64K, this case read past the end-of-device
which causes an error.

When we write the superblock, we ensure to clip the last page
to just be the required size.  Copy that code into the read path
to just read the required number of sectors.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: stable@kernel.org
2008-12-19 16:25:01 +11:00
Ingo Molnar 970987beb9 Merge branches 'tracing/ftrace', 'tracing/function-graph-tracer' and 'tracing/urgent' into tracing/core 2008-12-05 14:45:22 +01:00
Milan Broz 0e435ac26e block: fix setting of max_segment_size and seg_boundary mask
Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
devices.

When stacking devices (LVM over MD over SCSI) some of the request queue
parameters are not set up correctly in some cases by default, namely
max_segment_size and and seg_boundary mask.

If you create MD device over SCSI, these attributes are zeroed.

Problem become when there is over this mapping next device-mapper mapping
- queue attributes are set in DM this way:

request_queue   max_segment_size  seg_boundary_mask
SCSI                65536             0xffffffff
MD RAID1                0                      0
LVM                 65536                 -1 (64bit)

Unfortunately bio_add_page (resp.  bio_phys_segments) calculates number of
physical segments according to these parameters.

During the generic_make_request() is segment cout recalculated and can
increase bio->bi_phys_segments count over the allowed limit.  (After
bio_clone() in stack operation.)

Thi is specially problem in CCISS driver, where it produce OOPS here

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

(MAXSEGENTRIES is 31 by default.)

Sometimes even this command is enough to cause oops:

  dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

This command generates bios with 250 sectors, allocated in 32 4k-pages
(last page uses only 1024 bytes).

For LVM layer, it allocates bio with 31 segments (still OK for CCISS),
unfortunatelly on lower layer it is recalculated to 32 segments and this
violates CCISS restriction and triggers BUG_ON().

The patch tries to fix it by:

 * initializing attributes above in queue request constructor
   blk_queue_make_request()

 * make sure that blk_queue_stack_limits() inherits setting

 (DM uses its own function to set the limits because it
 blk_queue_stack_limits() was introduced later.  It should probably switch
 to use generic stack limit function too.)

 * sets the default seg_boundary value in one place (blkdev.h)

 * use this mask as default in DM (instead of -1, which differs in 64bit)

Bugs related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=471639
http://bugzilla.kernel.org/show_bug.cgi?id=8672

Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-03 12:55:55 +01:00
Ingo Molnar 0bfc24559d blktrace: port to tracepoints, update
Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
sites. Spread them out to the usage sites, as suggested by
Mathieu Desnoyers.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
2008-11-26 13:04:35 +01:00
Arnaldo Carvalho de Melo 5f3ea37c77 blktrace: port to tracepoints
This was a forward port of work done by Mathieu Desnoyers, I changed it to
encode the 'what' parameter on the tracepoint name, so that one can register
interest in specific events and not on classes of events to then check the
'what' parameter.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-26 12:13:34 +01:00
Chandra Seetharaman 8a57dfc6f9 dm: avoid destroying table in dm_any_congested
dm_any_congested() just checks for the DMF_BLOCK_IO and has no
code to make sure that suspend waits for dm_any_congested() to
complete.  This patch adds such a check.

Without it, a race can occur with dm_table_put() attempting to
destroying the table in the wrong thread, the one running
dm_any_congested() which is meant to be quick and return
immediately.

Two examples of problems:
1. Sleeping functions called from congested code, the caller
   of which holds a spin lock.
2. An ABBA deadlock between pdflush and multipathd. The two locks
   in contention are inode lock and kernel lock.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:14 +00:00
Mikulas Patocka d221d2e776 dm: move pending queue wake_up end_io_acct
This doesn't fix any bug, just moves wake_up immediately after decrementing
md->pending, for better code readability.

It must be clear to anyone manipulating md->pending to wake up
the queue if md->pending reaches zero, so move the wakeup as close to
the decrementing as possible.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:10 +00:00
Chandra Seetharaman 14e98c5ca8 dm mpath: warn if args ignored
Currently dm ignores the parameters provided to hardware handlers
without providing any notifications to the user.

This patch just prints a warning message so that the user knows that
the arguments are ignored.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:06 +00:00
Chandra Seetharaman b81aa1c792 dm mpath: avoid attempting to activate null path
Path activation code is called even when the pgpath is NULL. This could
lead to a panic in activate_path(). Such a panic is seen in -rt kernel.

This problem has been there before the pg_init() was moved to a
workqueue.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:39:00 +00:00
Heinz Mauelshagen 6edebdee48 dm stripe: fix init failure
Don't proceed if dm_stripe_init() fails to register itself as a dm target.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-11-13 23:38:56 +00:00
Mikulas Patocka 18776c7316 dm raid1: flush workqueue before destruction
We queue work on keventd queue --- so this queue must be flushed in the
destructor. Otherwise, keventd could access mirror_set after it was freed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-11-13 23:38:52 +00:00
Andre Noll f1cd14ae52 md: linear: Fix a division by zero bug for very small arrays.
We currently oops with a divide error on starting a linear software
raid array consisting of at least two very small (< 500K) devices.

The bug is caused by the calculation of the hash table size which
tries to compute sector_div(sz, base) with "base" being zero due to
the small size of the component devices of the array.

Fix this by requiring the hash spacing to be at least one which
implies that also "base" is non-zero.

This bug has existed since about 2.6.14.

Cc: stable@kernel.org
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-11-06 19:41:24 +11:00
NeilBrown a53a6c8575 md: fix bug in raid10 recovery.
Adding a spare to a raid10 doesn't cause recovery to start.
This is due to an silly type in
  commit 6c2fce2ef6
and so is a bug in 2.6.27 and .28-rc.

Thanks to Thomas Backlund for bisecting to find this.

Cc: Thomas Backlund <tmb@mandriva.org>
Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>
2008-11-06 17:28:20 +11:00
NeilBrown cb3ac42b8a md: revert the recent addition of a call to the BLKRRPART ioctl.
It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close).  And it
is quite possible for do_md_stop to be called when the device is not
open.  So discard the call to blkdev_ioctl(BLKRRPART) which was
added in
   commit 934d9c23b4

It is just as easy to call this ioctl from userspace when needed (on
mdadm -S) so leave it out of the kernel

Signed-off-by: NeilBrown <neilb@suse.de>
2008-11-06 17:28:01 +11:00
Linus Torvalds 721d5dfe7e Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: destroy partitions and notify udev when md array is stopped.
2008-10-30 18:36:16 -07:00
Mikulas Patocka 879129d208 dm snapshot: wait for chunks in destructor
If there are several snapshots sharing an origin and one is removed
while the origin is being written to, the snapshot's mempool may get
deleted while elements are still referenced.

Prior to dm-snapshot-use-per-device-mempools.patch the pending
exceptions may still have been referenced after the snapshot was
destroyed, but this was not a problem because the shared mempool
was still there.

This patch fixes the problem by tracking the number of mempool elements
in use.

The scenario:
- You have an origin and two snapshots 1 and 2.
- Someone writes to the origin.
- It creates two exceptions in the snapshots, snapshot 1 will be primary
exception, snapshot 2's pending_exception->primary_pe will point to the
exception in snapshot 1.
- The exceptions are being relocated, relocation of exception 1 finishes
(but it's pending_exception is still allocated, because it is referenced
by an exception from snapshot 2)
- The user lvremoves snapshot 1 --- it calls just suspend (does nothing)
and destructor. md->pending is zero (there is no I/O submitted to the
snapshot by md layer), so it won't help us.
- The destructor waits for kcopyd jobs to finish on snapshot 1 --- but
there are none.
- The destructor on snapshot 1 cleans up everything.
- The relocation of exception on snapshot 2 finishes, it drops reference
on primary_pe. This frees its primary_pe pointer. Primary_pe points to
pending exception created for snapshot 1. So it frees memory into
non-existing mempool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-30 13:33:16 +00:00
Mikulas Patocka 60c856c8e2 dm snapshot: fix register_snapshot deadlock
register_snapshot() performs a GFP_KERNEL allocation while holding
_origins_lock for write, but that could write out dirty pages onto a
device that attempts to acquire _origins_lock for read, resulting in
deadlock.

So move the allocation up before taking the lock.

This path is not performance-critical, so it doesn't matter that we
allocate memory and free it if we find that we won't need it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-30 13:33:12 +00:00
Ilpo Jarvinen b34578a484 dm raid1: fix do_failures
Missing braces.  Commit 1f965b1943 (dm raid1: separate region_hash interface
part1) broke it.

Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heinz Mauelshagen <hjm@redhat.com>
2008-10-30 13:33:07 +00:00
NeilBrown 934d9c23b4 md: destroy partitions and notify udev when md array is stopped.
md arrays are not currently destroyed when they are stopped - they
remain in /sys/block.  Last time I tried this I tripped over locking
too much.

A consequence of this is that udev doesn't remove anything from /dev.
This is rather ugly.

As an interim measure until proper device removal can be achieved,
make sure all partitions are removed using the BLKRRPART ioctl, and
send a KOBJ_CHANGE when an md array is stopped.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-28 17:01:23 +11:00
Linus Torvalds f8d56f1771 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: allow extended partitions on md devices.
  md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
  md: use sysfs_notify_dirent to notify changes to md/array_state
2008-10-26 16:42:18 -07:00
Linus Torvalds 2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Linus Torvalds 5ed487bc2c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
  [PATCH] fs: add a sanity check in d_free
  [PATCH] i_version: remount support
  [patch] vfs: make security_inode_setattr() calling consistent
  [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
  [PATCH] move executable checking into ->permission()
  [PATCH] fs/dcache.c: update comment of d_validate()
  [RFC PATCH] touch_mnt_namespace when the mount flags change
  [PATCH] reiserfs: add missing llseek method
  [PATCH] fix ->llseek for more directories
  [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
  [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
  [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
  [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
  [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
  [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
  [PATCH] get rid of on-stack dentry in udf
  [PATCH 2/2] anondev: switch to IDA
  [PATCH 1/2] anondev: init IDR statically
  [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
  [PATCH] Optimise NFS readdir hack slightly.
  ...
2008-10-23 10:22:40 -07:00
Christoph Hellwig 72e8264eda [PATCH] dm: kill lookup_device wrapper
Now that lookup_bdev is exported and used by dm just use it directly
instead of through a trivial wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-23 05:12:57 -04:00
Kiyoshi Ueda 51157b4ab4 dm: tidy local_init
This patch tidies local_init() in preparation for request-based dm.
No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:08 +01:00
Kiyoshi Ueda f431d9666f dm: remove unused flush_all
This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary.

The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
is never invoked because:
  - 'goto flush_and_out' is the same as 'goto out' because
    the 'goto flush_and_out' is called only when '!noflush'
  - If r is non-zero, then the code above will invoke 'goto out'
    and skip this code.

No functional change.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:07 +01:00
Heinz Mauelshagen 1f965b1943 dm raid1: separate region_hash interface part1
Separate the region hash code from raid1 so it can be shared by forthcoming
targets.  Use BUG_ON() for failed async dm_io() calls.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:06 +01:00
Martin K. Petersen f3e1d26ede dm: mark split bio as cloned
When a bio gets split, mark its fragments with the BIO_CLONED flag.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:04 +01:00
Milan Broz 0a4a1047a4 dm crypt: remove waitqueue
Remove waitqueue no longer needed with the async crypto interface.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:03 +01:00
Milan Broz 393b47ef23 dm crypt: fix async split
When writing io, dm-crypt has to allocate a new cloned bio
and encrypt the data into newly-allocated pages attached to this bio.
In rare cases, because of hw restrictions (e.g. physical segment limit)
or memory pressure, sometimes more than one cloned bio has to be used,
each processing a different fragment of the original.

Currently there is one waitqueue which waits for one fragment to finish
and continues processing the next fragment.

But when using asynchronous crypto this doesn't work, because several
fragments may be processed asynchronously or in parallel and there is
only one crypt context that cannot be shared between the bio fragments.
The result may be corruption of the data contained in the encrypted bio.

The patch fixes this by allocating new dm_crypt_io structs (with new
crypto contexts) and running them independently.

The fragments contains a pointer to the base dm_crypt_io struct to
handle reference counting, so the base one is properly deallocated
after all the fragments are finished.

In a low memory situation, this only uses one additional object from the
mempool.  If the mempool is empty, the next allocation simple waits for
previous fragments to complete.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:02 +01:00
Milan Broz b635b00e0e dm crypt: tidy sector
Prepare local sector variable (offset) for later patch.
Do not update io->sector for still-running I/O.

No functional change.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:00 +01:00
Mikulas Patocka 586e80e6ee dm: remove dm header from targets
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:44:59 +01:00
Mikulas Patocka d63a5ce3c0 dm: publish array_too_big
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.

Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail.  (Even for stripes it seems rather
unlikely.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:44:57 +01:00
Mikulas Patocka 7acedc5b98 dm exception store: fix misordered writes
We must zero the next chunk on disk *before* writing out the current chunk, not
after.  Otherwise if the machine crashes at the wrong time, the "end of metadata"
marker may be missing.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-10-21 17:44:56 +01:00
Alasdair G Kergon 7c9e6c1732 dm exception store: refactor zero_area
Use a separate buffer for writing zeroes to the on-disk snapshot
exception store, make the updating of ps->current_area explicit and
refactor the code in preparation for the fix in the next patch.

No functional change.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
2008-10-21 17:44:55 +01:00
Mikulas Patocka f68d4f3d39 dm snapshot: drop unused last_percent
The last_percent field is unused - remove it.
(It dates from when events were triggered as each X% filled up.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:44:53 +01:00
Mikulas Patocka 7c5f78b9d7 dm snapshot: fix primary_pe race
Fix a race condition with primary_pe ref_count handling.

put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test
on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count.

__origin_write does atomic_dec_and_test on primary_pe->ref_count without holding
dm_snapshot->lock.

This opens the following race condition:
Assume two CPUs, CPU1 is executing put_pending_exception (and holding
dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
primary_pe->ref_count == 2.

CPU1:
if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
	origin_bios = bio_list_get(&primary_pe->origin_bios);
... decrements primary_pe->ref_count to 1. Doesn't load origin_bios

CPU2:
if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
	flush_bios(bio_list_get(&primary_pe->origin_bios));
	free_pending_exception(primary_pe);
	/* If we got here, pe_queue is necessarily empty. */
	return r;
}
... decrements primary_pe->ref_count to 0, submits pending bios, frees
primary_pe.

CPU1:
if (!primary_pe || primary_pe != pe)
	free_pending_exception(pe);
... this has no effect.
if (primary_pe && !atomic_read(&primary_pe->ref_count))
	free_pending_exception(primary_pe);
... sees ref_count == 0 (written by CPU 2), does double free !!

This bug can happen only if someone is simultaneously writing to both the
origin and the snapshot.

If someone is writing only to the origin, __origin_write will submit kcopyd
request after it decrements primary_pe->ref_count (so it can't happen that the
finished copy races with primary_pe->ref_count decrementation).

If someone is writing only to the snapshot, __origin_write isn't invoked at all
and the race can't happen.

The race happens when someone writes to the snapshot --- this creates
pending_exception with primary_pe == NULL and starts copying. Then, someone
writes to the same chunk in the snapshot, and __origin_write races with
termination of already submitted request in pending_complete (that calls
put_pending_exception).

This race may be reason for bugs:
  http://bugzilla.kernel.org/show_bug.cgi?id=11636
  https://bugzilla.redhat.com/show_bug.cgi?id=465825

The patch fixes the code to make sure that:
1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
must no longer dereference primary_pe (because someone else may free it under
us).
2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
is responsible for freeing primary_pe.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-10-21 17:44:51 +01:00
Kazuo Ito b673c3a819 dm kcopyd: avoid queue shuffle
Write throughput to LVM snapshot origin volume is an order
of magnitude slower than those to LV without snapshots or
snapshot target volumes, especially in the case of sequential
writes with O_SYNC on.

The following patch originally written by Kevin Jamieson and
Jan Blunck and slightly modified for the current RCs by myself
tries to improve the performance by modifying the behaviour
of kcopyd, so that it pushes back an I/O job to the head of
the job queue instead of the tail as process_jobs() currently
does when it has to wait for free pages. This way, write
requests aren't shuffled to cause extra seeks.

I tested the patch against 2.6.27-rc5 and got the following results.
The test is a dd command writing to snapshot origin followed by fsync
to the file just created/updated.  A couple of filesystem benchmarks
gave me similar results in case of sequential writes, while random
writes didn't suffer much.

dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
   [conv=notrunc when updating]

1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
average throughput (MB/s)
                     10M     100M    1000M
create,dd         511.46   610.72    11.81
create,dd+fsync     7.10     6.77     8.13
update,dd         431.63   917.41    12.75
update,dd+fsync     7.79     7.43     8.12

compared with write throughput to LV without any snapshots,
all dd+fsync and 1000 MiB writes perform very poorly.

                     10M     100M    1000M
create,dd         555.03   608.98   123.29
create,dd+fsync   114.27    72.78    76.65
update,dd         152.34  1267.27   124.04
update,dd+fsync   130.56    77.81    77.84

2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
average throughput (MB/s)

                     10M     100M    1000M
create,dd         537.06   589.44    46.21
create,dd+fsync    31.63    29.19    29.23
update,dd         487.59   897.65    37.76
update,dd+fsync    34.12    30.07    26.85

Although still not on par with plain LV performance -
cannot be avoided because it's copy on write anyway -
this simple patch successfully improves throughtput
of dd+fsync while not affecting the rest.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-10-21 17:44:50 +01:00
Al Viro 9a1c354276 [PATCH] pass fmode_t to blkdev_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:58 -04:00
Al Viro a39907fa2f [PATCH] switch md
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:31 -04:00
Al Viro fe5f9f2cd5 [PATCH] switch dm
ioctl() doesn't need BKL here

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:29 -04:00
Al Viro d4430d62fa [PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
	1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both.  That's this changeset.
	2) for each driver convert to new methods.  *ALL* drivers
are converted in this series.
	3) kill the old (renamed) methods.

Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain.  The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.

New methods:
	open(bdev, mode)
	release(disk, mode)
	ioctl(bdev, mode, cmd, arg)		/* Called without BKL */
	compat_ioctl(bdev, mode, cmd, arg)
	locked_ioctl(bdev, mode, cmd, arg)	/* Called with BKL, legacy */

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:32 -04:00
Al Viro 633a08b812 [PATCH] introduce __blkdev_driver_ioctl()
Analog of blkdev_driver_ioctl() with sane arguments.  For
now uses fake struct file, by the end of the series it won't
and blkdev_driver_ioctl() will become a wrapper around it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:26 -04:00
Al Viro 647b3d0084 [PATCH] lose unused arguments in dm ioctl callbacks
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:18 -04:00
Al Viro aeb5d72706 [PATCH] introduce fmode_t, do annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:06 -04:00
NeilBrown 92850bbd71 md: allow extended partitions on md devices.
The new extended partition support provides a much nicer was
to have partitions on md devices that the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-21 13:25:32 +11:00
NeilBrown 3c0ee63a64 md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
The 'state' file for a device reports, for example, when the device
has failed.  Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory.  sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-21 13:25:28 +11:00
NeilBrown b62b75905d md: use sysfs_notify_dirent to notify changes to md/array_state
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-21 13:25:21 +11:00
Linus Torvalds ed09441dac Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (39 commits)
  [SCSI] sd: fix compile failure with CONFIG_BLK_DEV_INTEGRITY=n
  libiscsi: fix locking in iscsi_eh_device_reset
  libiscsi: check reason why we are stopping iscsi session to determine error value
  [SCSI] iscsi_tcp: return a descriptive error value during connection errors
  [SCSI] libiscsi: rename host reset to target reset
  [SCSI] iscsi class: fix endpoint id handling
  [SCSI] libiscsi: Support drivers initiating session removal
  [SCSI] libiscsi: fix data corruption when target has to resend data-in packets
  [SCSI] sd: Switch kernel printing level for DIF messages
  [SCSI] sd: Correctly handle all combinations of DIF and DIX
  [SCSI] sd: Always print actual protection_type
  [SCSI] sd: Issue correct protection operation
  [SCSI] scsi_error: fix target reset handling
  [SCSI] lpfc 8.2.8 v2 : Add statistical reporting control and additional fc vendor events
  [SCSI] lpfc 8.2.8 v2 : Add sysfs control of target queue depth handling
  [SCSI] lpfc 8.2.8 v2 : Revert target busy in favor of transport disrupted
  [SCSI] scsi_dh_alua: remove REQ_NOMERGE
  [SCSI] lpfc 8.2.8 : update driver version to 8.2.8
  [SCSI] lpfc 8.2.8 : Add MSI-X support
  [SCSI] lpfc 8.2.8 : Update driver to use new Host byte error code DID_TRANSPORT_DISRUPTED
  ...
2008-10-17 09:00:23 -07:00
Linus Torvalds c472273f86 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: fix input truncation in safe_delay_store()
  md: check for memory allocation failure in faulty personality
  md: build failure due to missing delay.h
  md: Relax minimum size restrictions on chunk_size.
  md: remove space after function name in declaration and call.
  md: Remove unnecessary #includes, #defines, and function declarations.
  md: Convert remaining 1k representations in linear.c to sectors.
  md: linear.c: Make two local variables sector-based.
  md: linear: Represent dev_info->size and dev_info->offset in sectors.
  md: linear.c: Remove broken debug code.
  md: linear.c: Remove pointless initialization of curr_offset.
  md: linear.c: Fix typo in comment.
  md: Don't try to set an array to 'read-auto' if it is already in that state.
  md: Allow metadata_version to be updated for externally managed metadata.
  md: Fix rdev_size_store with size == 0
2008-10-16 11:55:11 -07:00
Dan Williams 97ce0a7f9c md: fix input truncation in safe_delay_store()
safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one.  sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-16 17:03:08 +11:00
Sven Wegener 08ff39f1c8 md: check for memory allocation failure in faulty personality
It's a fault injection module, but I don't think we should oops here.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-10-16 14:16:53 +11:00
Stephen Rothwell 255707274e md: build failure due to missing delay.h
Today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'

Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations").  I added
the following patch.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-15 21:57:05 +11:00
Mike Christie 6000a368cd [SCSI] block: separate failfast into multiple bits.
Multipath is best at handling transport errors. If it gets a device
error then there is not much the multipath layer can do. It will just
access the same device but from a different path.

This patch breaks up failfast into device, transport and driver errors.
The multipath layers (md and dm mutlipath) only ask the lower levels to
fast fail transport errors. The user of failfast, read ahead, will ask
to fast fail on all errors.

Note that blk_noretry_request will return true if any failfast bit
is set. This allows drivers that do not support the multipath failfast
bits to continue to fail on any failfast error like before. Drivers
like scsi that are able to fail fast specific errors can check
for the specific fail fast type. In the next patch I will convert
scsi.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-10-13 09:28:52 -04:00
NeilBrown 4bbf3771ca md: Relax minimum size restrictions on chunk_size.
Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.

This makes moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.

For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe.  For other raid personalities, this restriction is
not needed at all and can be dropped.

So remove the test on chunk_size from common can, and add it in just
the places where it is needed: raid10 and raid4/5/6.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
NeilBrown d710e13812 md: remove space after function name in declaration and call.
Having
   function (args)
instead of
   function(args)

make is harder to search for calls of particular functions.
So remove all those spaces.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
NeilBrown fb4d8c76e5 md: Remove unnecessary #includes, #defines, and function declarations.
A lot of cruft has gathered over the years.  Time to remove it.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll ab5bd5cbc8 md: Convert remaining 1k representations in linear.c to sectors.
This patch renames hash_spacing and preshift to  spacing and
sector_shift respectively with the following change of semantics:

Case 1: (sizeof(sector_t) <= sizeof(u32)).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, we have sector_shift = preshift = 0 and spacing =
2 * hash_spacing.

Hence, the index for the hash table which is computed by the new code
in which_dev() as sector / spacing equals the old value which was
(sector/2) / hash_spacing.

Note also that the value of nb_zone stays the same because both sz
and base double.

Case 2: (sizeof(sector_t) > sizeof(u32)).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(aka the shifting dance case). Here we have sector_shift = preshift +
1 and

spacing = 2 * hash_spacing

during the computation of nb_zone and curr_sector, but

spacing = hash_spacing

in which_dev() because in the last hunk of the patch for linear.c we
shift down conf->spacing (= 2 * hash_spacing) by one more bit than
in the old code.

Hence in the computation of nb_zone, sz and base have the same value
as before, so nb_zone is not affected. Also curr_sector in the next
hunk stays the same.

In which_dev() the hash table index is computed as

(sector >> sector_shift) / spacing

In view of sector_shift = preshift + 1 and spacing = hash_spacing,
this equals

((sector/2) >> preshift) / hash_spacing

which is the value computed by the old code.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll 23242fbb47 md: linear.c: Make two local variables sector-based.
This is a preparation for representing also the remaining fields of struct
linear_private_data as sectors.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll 6283815d18 md: linear: Represent dev_info->size and dev_info->offset in sectors.
Rename them to num_sectors and start_sector which is more descriptive.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll 451708d2a4 md: linear.c: Remove broken debug code.
conf->smallest_size is undefined since day one of the git repo..

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll 481d86c7eb md: linear.c: Remove pointless initialization of curr_offset.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
Andre Noll e61130228e md: linear.c: Fix typo in comment.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
NeilBrown 80268ee927 md: Don't try to set an array to 'read-auto' if it is already in that state.
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.

Calling do_md_stop to set the array readonly when it is already readonly
returns an error.  So make sure not to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:12 +11:00
NeilBrown ea43ddd849 md: Allow metadata_version to be updated for externally managed metadata.
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.

Normally changes to metadata_version are not permitted while the array
is active.  Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:11 +11:00
Chris Webb 7d3c6f8717 md: Fix rdev_size_store with size == 0
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.

This fixes a regression introduced by
 commit d7027458d6

Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-10-13 11:55:11 +11:00
Alan Jenkins ce52aebd02 raid, fastboot: hide RAID autodetect option if MD is compiled as a module
Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-12 08:25:14 -07:00
Arjan van de Ven a364092a41 raid: make RAID autodetect default a KConfig option
RAID autodetect has the side effect of requiring synchronisation
of all device drivers, which can make the boot several seconds longer
(I've measured 7 on one of my laptops).... even for systems that don't
have RAID setup for the root filesystem (the only FS where this matters).

This patch makes the default for autodetect a config option; either way
the user can always override via the kernel command line.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: NeilBrown <neilb@suse.de>
2008-10-12 08:25:02 -07:00
Linus Torvalds b0af205afb Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm: detect lost queue
  dm: publish dm_vcalloc
  dm: publish dm_table_unplug_all
  dm: publish dm_get_mapinfo
  dm: export struct dm_dev
  dm crypt: avoid unnecessary wait when splitting bio
  dm crypt: tidy ctx pending
  dm crypt: fix async inc_pending
  dm crypt: move dec_pending on error into write_io_submit
  dm crypt: remove inc_pending from write_io_submit
  dm crypt: tidy write loop pending
  dm crypt: tidy crypt alloc
  dm crypt: tidy inc pending
  dm exception store: use chunk_t for_areas
  dm exception store: introduce area_location function
  dm raid1: kcopyd should stop on error if errors handled
  dm mpath: remove is_active from struct dm_path
  dm mpath: use more error codes

Fixed up trivial conflict in drivers/md/dm-mpath.c manually.
2008-10-10 11:11:47 -07:00
Alasdair G Kergon 0c2322e4ce dm: detect lost queue
Detect and report buggy drivers that destroy their request_queue.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Stefan Raspl <raspl@linux.vnet.ibm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2008-10-10 13:37:13 +01:00
Mikulas Patocka 5416090426 dm: publish dm_vcalloc
Publish dm_vcalloc in include/linux/device-mapper.h because this function is
used by targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:12 +01:00
Mikulas Patocka ea0ec64094 dm: publish dm_table_unplug_all
Publish dm_table_unplug_all in include/linux/device-mapper.h because this
function is used by targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:11 +01:00
Mikulas Patocka 89343da077 dm: publish dm_get_mapinfo
Publish dm_get_mapinfo in include/linux/device-mapper.h because this function
is used by targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:10 +01:00
Mikulas Patocka 82b1519b34 dm: export struct dm_dev
Split struct dm_dev in two and publish the part that other targets need in
include/linux/device-mapper.h.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:09 +01:00
Milan Broz 933f01d433 dm crypt: avoid unnecessary wait when splitting bio
Don't wait between submitting crypt requests for a bio unless
we are short of memory.

There are two situations when we must split an encrypted bio:
  1) there are no free pages;
  2) the new bio would violate underlying device restrictions
(e.g. max hw segments).

In case (2) we do not need to wait.

Add output variable to crypt_alloc_buffer() to distinguish between
these cases.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:08 +01:00
Milan Broz c8081618a9 dm crypt: tidy ctx pending
Move the initialisation of ctx->pending into one place, at the
start of crypt_convert().

Introduce crypt_finished to indicate whether or not the encryption
is finished, for use in a later patch.

No functional change.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:08 +01:00
Milan Broz 4e59409891 dm crypt: fix async inc_pending
The pending reference count must be incremented *before* the async work is
queued to another thread, not after.  Otherwise there's a race if the
work completes and decrements the reference count before it gets incremented.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:07 +01:00
Milan Broz 6c031f41db dm crypt: move dec_pending on error into write_io_submit
Make kcryptd_crypt_write_io_submit() responsible for decrementing
the pending count after an error.

Also fixes a bug in the async path that forgot to decrement it.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:06 +01:00
Alasdair G Kergon 1e37bb8e55 dm crypt: remove inc_pending from write_io_submit
Make the caller reponsible for incrementing the pending count before calling
kcryptd_crypt_write_io_submit() in the non-async case to bring it into line
with the async case.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:05 +01:00
Milan Broz fc5a5e9aa8 dm crypt: tidy write loop pending
Move kcryptd_crypt_write_convert_loop inside kcryptd_crypt_write_convert.
This change is needed for a later patch.

No functional change.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:04 +01:00
Milan Broz dc440d1e56 dm crypt: tidy crypt alloc
Factor out crypt io allocation code.
Later patches will call it from another place.

No functional change.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:03 +01:00
Milan Broz 3e1a8bdd05 dm crypt: tidy inc pending
Move io pending to one place.

No functional change, usefull to simplify debugging.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:02 +01:00
Mikulas Patocka fd14acf6fc dm exception store: use chunk_t for_areas
Change uint32_t into chunk_t to remove 32-bit limitation on the
number of chunks on systems with 64-bit sector numbers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:01 +01:00
Mikulas Patocka a481db7846 dm exception store: introduce area_location function
Move this logic to a function, because it will be reused later.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:37:00 +01:00
Jonathan Brassow f7c83e2e47 dm raid1: kcopyd should stop on error if errors handled
dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally
when assigning kcopyd work.  kcopyd is responsible for copying an
assigned section of disk to one or more other disks.  The
'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way:

When not set:
kcopyd will immediately stop the copy operation when an error is
encountered.

When set:
kcopyd will try to proceed regardless of errors and try to continue
copying any remaining amount.

Since dm-raid1 tracks regions of the address space that are (or
are not) in sync and it now has the ability to handle these
errors, we can safely enable this optimization.  This optimization
is conditional on whether mirror error handling has been enabled.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:36:59 +01:00
Kiyoshi Ueda 6680073d3e dm mpath: remove is_active from struct dm_path
This patch moves 'is_active' from struct dm_path to struct pgpath
as it does not need exporting.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:36:58 +01:00
Benjamin Marzinski 01460f3520 dm mpath: use more error codes
This patch allows path errors from the multipath ctr function to
propagate up to userspace as errno values from the ioctl() call.

This is in response to
  https://www.redhat.com/archives/dm-devel/2008-May/msg00000.html
and
  https://bugzilla.redhat.com/show_bug.cgi?id=444421

The patch only lets through the errors that it needs to in order to
get the path errors from parse_path().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:36:57 +01:00
Denis ChengRq 6feef531f5 block: mark bio_split_pool static
Since all bio_split calls refer the same single bio_split_pool, the bio_split
function can use bio_split_pool directly instead of the mempool_t parameter;

then the mempool_t parameter can be removed from bio_split param list, and
bio_split_pool is only referred in fs/bio.c file, can be marked static.

Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:57:05 +02:00
Mike Anderson 224cb3e981 dm: Call blk_abort_queue on failed paths
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
Tejun Heo 074a7aca7a block: move stats from disk to part0
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
  is not part0.  ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similary.  It handles part0 stats
  automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_fligh() is implemented which automatically updates
  part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
  Combined with the above changes, this makes NULL special case
  handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
  stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00
Tejun Heo 0762b8bde9 block: always set bdev->bd_part
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0.  This patch makes bdev->bd_part always set so that code
paths don't have to differenciate common handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00
Tejun Heo b7db9956e5 block: move policy from disk to part0
Move disk->policy to part0->policy.  Implement and use get_disk_ro().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:07 +02:00
Tejun Heo ed9e198234 block: implement and use {disk|part}_to_dev()
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev.  To make sure no
user is left behind, rename generic devices fields to __dev.

This is in preparation of unifying partition 0 handling with other
partitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:07 +02:00
Tejun Heo c995905916 block: fix diskstats access
There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters.  It's unclear
whether the underbarred ones assume that preemtion is disabled on
entry as some callers don't do that.

This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access).  diskstats access
should always be enclosed between the two functions.  As such, there's
no need for the versions which disables preemption.  They're removed
and double underbars ones are renamed to drop the underbars.  As an
extra argument is added, there's no danger of using the old version
unconverted.

disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now has @cpu
argument to help RT.

This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others.  Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:06 +02:00
Tejun Heo f331c0296f block: don't depend on consecutive minor space
* Implement disk_devt() and part_devt() and use them to directly
  access devt instead of computing it from ->major and ->first_minor.

  Note that all references to ->major and ->first_minor outside of
  block layer is used to determine devt of the disk (the part0) and as
  ->major and ->first_minor will continue to represent devt for the
  disk, converting these users aren't strictly necessary.  However,
  convert them for consistency.

* Implement disk_max_parts() to avoid directly deferencing
  genhd->minors.

* Update bdget_disk() such that it doesn't assume consecutive minor
  space.

* Move devt computation from register_disk() to add_disk() and make it
  the only one (all other usages use the initially determined value).

These changes clean up the code and will help disk->part dereference
fix and extended block device numbers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:05 +02:00
Jens Axboe 5b99c2ffa9 block: make bi_phys_segments an unsigned int instead of short
raid5 can overflow with more than 255 stripes, and we can increase it
to an int for free on both 32 and 64-bit archs due to the padding.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:03 +02:00
Jens Axboe 960e739d9e block: raid fixups for removal of bi_hw_segments
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:03 +02:00
Mikulas Patocka 5df97b91b5 drop vmerge accounting
Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:03 +02:00
Chandra Seetharaman 7253a33434 dm mpath: add missing path switching locking
Moving the path activation to workqueue along with scsi_dh patches introduced
a race. It is due to the fact that the current_pgpath (in the multipath data
structure) can be modified if changes happen in any of the paths leading to
the lun. If the changes lead to current_pgpath being set to NULL, then it
leads to the invalid access which results in the panic below.

This patch fixes that by storing the pgpath to activate in the multipath data
structure and properly protecting it.

Note that if activate_path is called twice in succession with different pgpath,
with the second one being called before the first one is done, then activate
path will be called twice for the second pgpath, which is fine.

Unable to handle kernel paging request for data at address 0x00000020
Faulting instruction address: 0xd000000000aa1844
cpu 0x1: Vector: 300 (Data Access) at [c00000006b987a80]
    pc: d000000000aa1844: .activate_path+0x30/0x218 [dm_multipath]
    lr: c000000000087a2c: .run_workqueue+0x114/0x204
    sp: c00000006b987d00
   msr: 8000000000009032
   dar: 20
 dsisr: 40000000
  current = 0xc0000000676bb3f0
  paca    = 0xc0000000006f3680
    pid   = 2528, comm = kmpath_handlerd
enter ? for help
[c00000006b987da0] c000000000087a2c .run_workqueue+0x114/0x204
[c00000006b987e40] c000000000088b58 .worker_thread+0x120/0x144
[c00000006b987f00] c00000000008ca70 .kthread+0x78/0xc4
[c00000006b987f90] c000000000027cc8 .kernel_thread+0x4c/0x68

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-01 14:39:27 +01:00
Mikulas Patocka b01cd5ac43 dm: cope with access beyond end of device in dm_merge_bvec
If for any reason dm_merge_bvec() is given an offset beyond the end of the
device, avoid an oops and always allow one page to be added to an empty bio.
We'll reject the I/O later after the bio is submitted.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-01 14:39:24 +01:00
Mikulas Patocka 5037108acd dm: always allow one page in dm_merge_bvec
Some callers assume they can always add at least one page to an empty bio,
so dm_merge_bvec should not return 0 in this case: we'll reject the I/O
later after the bio is submitted.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-01 14:39:17 +01:00
NeilBrown 9744197c3d md: Don't wait UNINTERRUPTIBLE for other resync to finish
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.

This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.

So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-09-19 11:49:54 +10:00
NeilBrown b2d2c4cead Fix problem with waiting while holding rcu read lock in md/bitmap.c
A recent patch to protect the rdev list with rcu locking leaves us
with a problem because we can sleep on memalloc while holding the
rcu lock.

The rcu lock is only needed while walking the linked list as
uninteresting devices (failed or spares) can be removed at any time.

So only take the rcu lock while actually walking the linked list.
Take a refcount on the rdev during the time when we drop the lock
and do the memalloc to start IO.
When we return to the locked code, all the interesting devices
on the list will not have moved, so we can simply use
list_for_each_continue_rcu to pick up where we left off.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-09-01 12:48:13 +10:00
NeilBrown 271f5a9b8f Remove invalidate_partition call from do_md_stop.
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).

However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that updates requires taking the mddev
lock - which is already held.

This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.

All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev.  Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.

So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.

So remove it.

This patch possibly addresses the regression recored in
   http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
   http://bugzilla.kernel.org/show_bug.cgi?id=11452

though it isn't yet clear how it ever worked.


Signed-off-by: NeilBrown <neilb@suse.de>
2008-09-01 12:32:52 +10:00
Dan Williams 56ac36d722 md: cancel check/repair requests when recovery is needed
If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening.  Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-08-07 10:02:47 -07:00
NeilBrown 0310fa216d Allow raid10 resync to happening in larger chunks.
The raid10 resync/recovery code currently limits the amount of
in-flight resync IO to 2Meg.  This was copied from raid1 where
it seems quite adequate.  However for raid10, some layouts require
a bit of seeking to perform a resync, and allowing a larger buffer
size means that the seeking can be significantly reduced.

There is probably no real need to limit the amount of in-flight
IO at all.  Any shortage of memory will naturally reduce the
amount of buffer space available down to a set minimum, and any
concurrent normal IO will quickly cause resync IO to back off.

The only problem would be that normal IO has to wait for all resync IO
to finish, so a very large amount of resync IO could cause unpleasant
latency when normal IO starts up.

So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
available) which seems to be a good amount.  Also reduce the amount
of memory reserved as there is no need to keep 2Meg just for resync if
memory is tight.

Thanks to Keld for the suggestion.

Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown c89a8eee61 Allow faulty devices to be removed from a readonly array.
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device.  Then it can be removed
by a request from user space.

The first step is currently not performed for read-only arrays,
so the second step can never succeed.

So allow readonly arrays to remove failed devices (which aren't
blocked).

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown ac4090d24c Don't let a blocked_rdev interfere with read request in raid5/6
When we have externally managed metadata, we need to mark a failed
device as 'Blocked' and not allow any writes until that device
have been marked as faulty in the metadata and the Blocked flag has
been removed.

However it is perfectly OK to allow read requests when there is a
Blocked device, and with a readonly array, there may not be any
metadata-handler watching for blocked devices.

So in raid5/raid6 only allow a Blocked device to interfere with
Write request or resync.  Read requests go through untouched.

raid1 and raid10 already differentiate between read and write
properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown dba034eef2 Fail safely when trying to grow an array with a write-intent bitmap.
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.

For now, simply reject any request to change the size of an array
which has a bitmap.  mdadm can remove the bitmap and add a new one
after the array has changed size.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:32 +10:00
NeilBrown 2b25000bf5 Restore force switch of md array to readonly at reboot time.
A recent patch allowed do_md_stop to know whether it was being called
via an ioctl or not, and thus where to allow for an extra open file
descriptor when checking if it is in use.
This broke then switch to readonly performed by the shutdown notifier,
which needs to work even when the array is still (apparently) active
(as md doesn't get told when the filesystem becomes readonly).

So restore this feature by pretending that there can be lots of
file descriptors open, but we still want do_md_stop to switch to
readonly.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:31 +10:00
NeilBrown 19052c0e85 Make writes to md/safe_mode_delay immediately effective.
If we reduce the 'safe_mode_delay', it could still wait for the old
delay to completely expire before doing anything about safe_mode.
Thus the effect if the change is delayed.

To make the effect more immediate, run the timeout function
immediately if the delay was reduced.  This may cause it to run
slightly earlier that required, but that is the safer option.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05 15:56:31 +10:00
Linus Torvalds 1e24b15b26 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: raid10: wake up frozen array
  md: do not count blocked devices as spares
  md: do not progress the resync process if the stripe was blocked
  md: delay notification of 'active_idle' to the recovery thread
  md: fix merge error
  md: move async_tx_issue_pending_all outside spin_lock_irq
2008-08-01 11:56:07 -07:00
Linus Torvalds b17b3d479c Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  md: the bitmap code needs to use blk_plug_device_unlocked()
  block: add a blk_plug_device_unlocked() that grabs the queue lock
2008-08-01 11:46:00 -07:00
Jens Axboe 93769f5807 md: the bitmap code needs to use blk_plug_device_unlocked()
It doesn't hold the queue lock, so it's both racey on the queue flags
and thus spews a warning.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-08-01 20:32:31 +02:00
Al Viro d5686b444f [PATCH] switch mtd and dm-table to lookup_bdev()
No need to open-code it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:31 -04:00
Arthur Jones 388667bed5 md: raid10: wake up frozen array
When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect.  This causes the array to
remain frozen for eternity.  We add a wake_up
to allow the array to de-freeze.  This code is
nearly identical to the raid1 code, which has
this fix already.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-01 12:55:14 +10:00
Dan Williams e542713529 md: do not count blocked devices as spares
remove_and_add_spares() assumes that failed devices have been hot-removed
from the array.  Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-28 17:52:44 -07:00
Dan Williams df10cfbc4d md: do not progress the resync process if the stripe was blocked
handle_stripe will take no action on a stripe when waiting for userspace
to unblock the array, so do not report completed sectors.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-28 17:52:37 -07:00
Hannes Reinecke ae11b1b36d [SCSI] scsi_dh: attach to hardware handler from dm-mpath
multipath keeps a separate device table which may be
more current than the built-in one.
So we should make sure to always call ->attach whenever
a multipath map with hardware handler is instantiated.
And we should call ->detach on removal, too.

[sekharan: update as per comments from agk]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-07-26 15:14:53 -04:00
Dan Williams d8e64406a0 md: delay notification of 'active_idle' to the recovery thread
sysfs_notify might sleep, so do not call it from md_safemode_timeout.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23 13:09:48 -07:00
Dan Williams 2339788376 md: fix merge error
The original STRIPE_OP_IO removal patch had the following hunk:

-               for (i = conf->raid_disks; i--; ) {
+               for (i = conf->raid_disks; i--; )
                        set_bit(R5_Wantwrite, &sh->dev[i].flags);
-                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-                               sh->ops.count++;
-               }

However it appears the hunk became broken after merging:
-               for (i = conf->raid_disks; i--; ) {
+               for (i = conf->raid_disks; i--; )
                        set_bit(R5_Wantwrite, &sh->dev[i].flags);
                        set_bit(R5_LOCKED, &dev->flags);
                        s.locked++;
-                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-                               sh->ops.count++;
-               }

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23 13:09:45 -07:00
Dan Williams c9f21aaff1 md: move async_tx_issue_pending_all outside spin_lock_irq
Some dma drivers need to call spin_lock_bh in their device_issue_pending
routines.  This change avoids:

WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85()

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23 12:05:51 -07:00
Linus Torvalds b7e6f62fe2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm crypt: add merge
  dm table: remove merge_bvec sector restriction
  dm: linear add merge
  dm: introduce merge_bvec_fn
  dm snapshot: use per device mempools
  dm snapshot: fix race during exception creation
  dm snapshot: track snapshot reads
  dm mpath: fix test for reinstate_path
  dm mpath: return parameter error
  dm io: remove struct padding
  dm log: make dm_dirty_log init and exit static
  dm mpath: free path selector on invalid args
2008-07-21 10:30:10 -07:00
Linus Torvalds 8a392625b6 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (52 commits)
  md: Protect access to mddev->disks list using RCU
  md: only count actual openers as access which prevent a 'stop'
  md: linear: Make array_size sector-based and rename it to array_sectors.
  md: Make mddev->array_size sector-based.
  md: Make super_type->rdev_size_change() take sector-based sizes.
  md: Fix check for overlapping devices.
  md: Tidy up rdev_size_store a bit:
  md: Remove some unused macros.
  md: Turn rdev->sb_offset into a sector-based quantity.
  md: Make calc_dev_sboffset() return a sector count.
  md: Replace calc_dev_size() by calc_num_sectors().
  md: Make update_size() take the number of sectors.
  md: Better control of when do_md_stop is allowed to stop the array.
  md: get_disk_info(): Don't convert between signed and unsigned and back.
  md: Simplify restart_array().
  md: alloc_disk_sb(): Return proper error value.
  md: Simplify sb_equal().
  md: Simplify uuid_equal().
  md: sb_equal(): Fix misleading printk.
  md: Fix a typo in the comment to cmd_match().
  ...
2008-07-21 10:29:12 -07:00
Milan Broz d41e26b901 dm crypt: add merge
This patch implements biovec merge function for crypt target.

If the underlying device has merge function defined, call it.
If not, keep precomputed value.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:40 +01:00
Milan Broz 9980c638a6 dm table: remove merge_bvec sector restriction
Remove max_sector restriction - merge function replaced it.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:39 +01:00
Milan Broz 7bc3447b69 dm: linear add merge
This patch implements biovec merge function for linear target.

If the underlying device has merge function defined, call it.
If not, keep precomputed value.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:38 +01:00
Milan Broz f6fccb1213 dm: introduce merge_bvec_fn
Introduce a bvec merge function for device mapper devices
for dynamic size restrictions.

This code ensures the requested biovec lies within a single
target and then calls a target-specific function to check
against any constraints imposed by underlying devices.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:37 +01:00
Mikulas Patocka 92e868122e dm snapshot: use per device mempools
Change snapshot per-module mempool to per-device mempool.

Per-module mempools could cause a deadlock if multiple
snapshot devices are stacked above each other.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:35 +01:00
Mikulas Patocka a8d41b59f3 dm snapshot: fix race during exception creation
Fix a race condition that returns incorrect data when a write causes an
exception to be allocated whilst a read is still in flight.

The race condition happens as follows:
* A read to non-reallocated sector in the snapshot is submitted so that the
  read is routed to the original device.
* A write to the original device is submitted. The write causes an exception
  that reallocates the block.  The write proceeds.
* The original read is dequeued and reads the wrong data.

This race can be triggered with CFQ scheduler and one thread writing and
multiple threads reading simultaneously.

(This patch relies upon the earlier dm-kcopyd-per-device.patch to avoid a
deadlock.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:34 +01:00
Mikulas Patocka cd45daffd1 dm snapshot: track snapshot reads
Whenever a snapshot read gets mapped through to the origin, track it in
a per-snapshot hash table indexed by chunk number, using memory allocated
from a new per-snapshot mempool.

We need to track these reads to avoid race conditions which will be fixed
by patches that follow.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:32 +01:00
Alasdair G Kergon def052d21c dm mpath: fix test for reinstate_path
Fix test for reinstate_path method before attempting to use it.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
2008-07-21 12:00:31 +01:00
Mikulas Patocka 148acff615 dm mpath: return parameter error
Return a specific error message if there are an invalid number of multipath
arguments.

This invalid command returns an "Unknown error" because the ti->error field is
not set

dmsetup create --table '0 2 multipath 0 0 1 1 round-robin 0 1 1 /dev/sdh' mpath0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:30 +01:00
Richard Kennedy 6ae2fa6718 dm io: remove struct padding
Rearrange struct dm_io.
Shrinks size from 40 -> 32 allowing more objects/slab.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:28 +01:00
Adrian Bunk c8da2f8dd8 dm log: make dm_dirty_log init and exit static
dm_dirty_log_{init,exit}() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:27 +01:00
Mikulas Patocka 371b2e348b dm mpath: free path selector on invalid args
Free path selector if the arguments are invalid.

This command (note that it is invalid) causes reference leak on module
"dm_round_robin" and prevents the module from being removed.

dmsetup create --table '0 2 multipath 0 0 1 1 round-robin /dev/sdh' mpath0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21 12:00:24 +01:00
NeilBrown 4b80991c6c md: Protect access to mddev->disks list using RCU
All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock.  However there are three places where
the list is walked without any locking.  If a reconfig happens at this
time, havoc (and oops) can ensue.

So use RCU to protect these accesses:
  - wrap them in rcu_read_{,un}lock()
  - use list_for_each_entry_rcu
  - add to the list with list_add_rcu
  - delete from the list with list_del_rcu
  - delay the 'free' with call_rcu rather than schedule_work

Note that export_rdev did a list_del_init on this list.  In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe.  It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
  - after unbind_rdev_from_array, in which case the delete has
     already been done,
  - after bind_rdev_to_array fails, in which case the delete isn't needed.
  - before the device has been put on a list at all (e.g. in
      add_new_disk where reading the superblock fails).
  - and in autorun devices after a failure when the device is on a
      different list.

So remove the list_del_init call from export_rdev, and add it back
immediately before the called to export_rdev for that last case.

Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates).  In these cases rcu is not needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:25 +10:00
NeilBrown f2ea68cf42 md: only count actual openers as access which prevent a 'stop'
Open isn't the only thing that increments ->active.  e.g. reading
/proc/mdstat will increment it briefly.  So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device it open.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:25 +10:00
Andre Noll d6e2215052 md: linear: Make array_size sector-based and rename it to array_sectors.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:25 +10:00
Andre Noll f233ea5c9e md: Make mddev->array_size sector-based.
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 17:05:22 +10:00
Andre Noll 15f4a5fdf3 md: Make super_type->rdev_size_change() take sector-based sizes.
Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:42:12 +10:00
Andre Noll d07bd3bcc4 md: Fix check for overlapping devices.
The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.

This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:42:07 +10:00
Neil Brown d7027458d6 md: Tidy up rdev_size_store a bit:
- used strict_strtoull in place of simple_strtoull
 - use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
 - don't adjust mddev->size to fit, rather reject changes which make
   rdev->size smaller than mddev->size

Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing.  But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that.  The previous way invites confusion.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-07-21 14:22:18 +10:00
Linus Torvalds 89a93f2f48 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (102 commits)
  [SCSI] scsi_dh: fix kconfig related build errors
  [SCSI] sym53c8xx: Fix bogus sym_que_entry re-implementation of container_of
  [SCSI] scsi_cmnd.h: remove double inclusion of linux/blkdev.h
  [SCSI] make struct scsi_{host,target}_type static
  [SCSI] fix locking in host use of blk_plug_device()
  [SCSI] zfcp: Cleanup external header file
  [SCSI] zfcp: Cleanup code in zfcp_erp.c
  [SCSI] zfcp: zfcp_fsf cleanup.
  [SCSI] zfcp: consolidate sysfs things into one file.
  [SCSI] zfcp: Cleanup of code in zfcp_aux.c
  [SCSI] zfcp: Cleanup of code in zfcp_scsi.c
  [SCSI] zfcp: Move status accessors from zfcp to SCSI include file.
  [SCSI] zfcp: Small QDIO cleanups
  [SCSI] zfcp: Adapter reopen for large number of unsolicited status
  [SCSI] zfcp: Fix error checking for ELS ADISC requests
  [SCSI] zfcp: wait until adapter is finished with ERP during auto-port
  [SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver
  [SCSI] sg: Add target reset support
  [SCSI] lib: Add support for the T10 (SCSI) Data Integrity Field CRC
  [SCSI] sd: Move scsi_disk() accessor function to sd.h
  ...
2008-07-15 18:58:04 -07:00
Chandra Seetharaman fe9233fb69 [SCSI] scsi_dh: fix kconfig related build errors
Do not automatically "select" SCSI_DH for dm-multipath. If SCSI_DH
doesn't exist,just do not allow  hardware handlers to be used.

Handle SCSI_DH being a module also. Make sure it doesn't allow DM_MULTIPATH
to be compiled in when SCSI_DH is a module.

[jejb: added comment for Kconfig syntax]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-07-15 09:16:43 -05:00
Linus Torvalds dddec01eb8 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
  splice: fix generic_file_splice_read() race with page invalidation
  ramfs: enable splice write
  drivers/block/pktcdvd.c: avoid useless memset
  cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
  scsi: sr avoids useless buffer allocation
  block: blk_rq_map_kern uses the bounce buffers for stack buffers
  block: add blk_queue_update_dma_pad
  DAC960: push down BKL
  pktcdvd: push BKL down into driver
  paride: push ioctl down into driver
  block: use get_unaligned_* helpers
  block: extend queue_flag bitops
  block: request_module(): use format string
  Add bvec_merge_data to handle stacked devices and ->merge_bvec()
  block: integrity flags can't use bit ops on unsigned short
  cmdfilter: extend default read filter
  sg: fix odd style (extra parenthesis) introduced by cmd filter patch
  block: add bounce support to blk_rq_map_user_iov
  cfq-iosched: get rid of enable_idle being unused warning
  allow userspace to modify scsi command filter on per device basis
  ...
2008-07-14 13:15:14 -07:00
Andre Noll 0f420358e3 md: Turn rdev->sb_offset into a sector-based quantity.
Rename it to sb_start to make sure all users have been converted.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll b73df2d3d6 md: Make calc_dev_sboffset() return a sector count.
As BLOCK_SIZE_BITS is 10 and

	MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),

the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll e7debaa495 md: Replace calc_dev_size() by calc_num_sectors().
Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:23 +10:00
Andre Noll d71f9f88d7 md: Make update_size() take the number of sectors.
Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.

This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:22 +10:00
Neil Brown df5b20cf68 md: Better control of when do_md_stop is allowed to stop the array.
do_md_stop check the number of active users before allowing the array
to be stopped.
Two problems:
  1/ it assumes the request is coming through an open file descriptor
     (via ioctl) so it allows for that.  This is not always the case.
  2/ it doesn't do the check it the array hasn't been activated.
     This is not good for cases when we use an inactive array to hold
     some devices in a container.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:22 +10:00
Andre Noll 26ef379f53 md: get_disk_info(): Don't convert between signed and unsigned and back.
The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:21 +10:00
Andre Noll 80fab1d77b md: Simplify restart_array().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:21 +10:00
Andre Noll ebc2433728 md: alloc_disk_sb(): Return proper error value.
If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Andre Noll ce0c8e05f8 md: Simplify sb_equal().
The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Andre Noll 05710466c9 md: Simplify uuid_equal().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-11 22:02:20 +10:00
Linus Torvalds 2283af5b0b Merge branch 'for-2.6.26' of git://neil.brown.name/md
* 'for-2.6.26' of git://neil.brown.name/md:
  md: ensure all blocks are uptodate or locked when syncing
2008-07-10 09:49:46 -07:00
Dan Williams 7a1fc53c5a md: ensure all blocks are uptodate or locked when syncing
Remove the dubious attempt to prefer 'compute' over 'read'.  Not only is it
wrong given commit c337869d (md: do not compute parity unless it is on a failed
drive), but it can trigger a BUG_ON in handle_parity_checks5().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-10 15:25:18 +10:00
Andre Noll 35020f1a06 md: sb_equal(): Fix misleading printk.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:53:20 +10:00
Andre Noll 7f6ce76928 md: Fix a typo in the comment to cmd_match().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:53:00 +10:00
Andre Noll 910d8cb3f4 md: Fix typo in array_state comment.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:45 +10:00
Andre Noll 9687a60c78 md: sync_speed_show(): Trivial cleanups.
- Remove superfluous parentheses.
- Make format string match the type of the variable that is printed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:26 +10:00
Andre Noll 13e53df354 md: do_md_run(): Fix misleading error message.
In case pers->run() succeeds but creating the bitmap fails, we
print an error message stating that pers->run() has failed.

Print this message only if pers->run() really failed.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:15 +10:00
Andre Noll 2f9618ce63 md: md_getgeo(): Move comment to proper position.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:52:00 +10:00
Andre Noll bb57fc64b2 md: md_ioctl(): Fix misleading indentation.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-07-08 10:51:29 +10:00
Neil Brown 0529613a19 Merge branch 'for-neil' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md into for-next 2008-07-08 10:13:28 +10:00
Neil Brown 5b1a4bf220 Merge branch 'master' into for-next 2008-07-08 10:11:50 +10:00
Alasdair G Kergon cc371e66e3 Add bvec_merge_data to handle stacked devices and ->merge_bvec()
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.

The following bio fields are used:
  bio->bi_sector
  bio->bi_bdev
  bio->bi_size
  bio->bi_rw  using bio_data_dir()

This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up.  (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-03 13:21:15 +02:00
Linus Torvalds cefcade9e7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm crypt: use cond_resched
2008-07-02 18:55:17 -07:00
Milan Broz c7f1b20441 dm crypt: use cond_resched
Add cond_resched() to prevent monopolising CPU when processing large bios.

dm-crypt processes encryption of bios in sector units.  If the bio request
is big it can spend a long time in the encryption call.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Tested-by: Yan Li <elliot.li.tech@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-02 09:34:28 +01:00
Dan Williams b5470dc5fc md: resolve external metadata handling deadlock in md_allow_write
md_allow_write() marks the metadata dirty while holding mddev->lock and then
waits for the write to complete.  For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.

Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN.  The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.

md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.

[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-06-30 17:18:19 -07:00
Dan Williams 1fe797e67f md: rationalize raid5 function names
From: Dan Williams <dan.j.williams@intel.com>

Commit a4456856 refactored some of the deep code paths in raid5.c into separate
functions.  The names chosen at the time do not consistently indicate what is
going to happen to the stripe.  So, update the names, and since a stripe is a
cache element use cache semantics like fill, dirty, and clean.

(also, fix up the indentation in fetch_block5)

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 09:16:30 +10:00
Dan Williams 7b3a871ed9 md: handle operation chaining in raid5_run_ops
From: Dan Williams <dan.j.williams@intel.com>

Neil said:
> At the end of ops_run_compute5 you have:
>         /* ack now if postxor is not set to be run */
>         if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run))
>                 async_tx_ack(tx);
>
> It looks odd having that test there.  Would it fit in raid5_run_ops
> better?

The intended global interpretation is that raid5_run_ops can build a chain
of xor and memcpy operations.  When MD registers the compute-xor it tells
async_tx to keep the operation handle around so that another item in the
dependency chain can be submitted. If we are just computing a block to
satisfy a read then we can terminate the chain immediately.  raid5_run_ops
gives a better context for this test since it cares about the entire chain.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:32:09 +10:00
Dan Williams d8ee0728b5 md: replace R5_WantPrexor with R5_WantDrain, add 'prexor' reconstruct_states
From: Dan Williams <dan.j.williams@intel.com>

Currently ops_run_biodrain and other locations have extra logic to determine
which blocks are processed in the prexor and non-prexor cases.  This can be
eliminated if handle_write_operations5 flags the blocks to be processed in all
cases via R5_Wantdrain.  The presence of the prexor operation is tracked in
sh->reconstruct_state.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:32:06 +10:00
Dan Williams 600aa10993 md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>

Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion)  Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.

This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:32:05 +10:00
Dan Williams 976ea8d475 md: replace STRIPE_OP_COMPUTE_BLK with STRIPE_COMPUTE_RUN
From: Dan Williams <dan.j.williams@intel.com>

Track the state of compute operations (recalculating a block from all the other
blocks in a stripe) with a state flag.  Reduces the scope of the
STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has
been requested via the ops_request field of struct stripe_head_state.

Note, the compute operation that is performed in the course of doing a 'repair'
operation (check the parity block, recalculate it and write it back if the
check result is not zero) is tracked separately with the 'check_state'
variable.  Compute operations are held off while a 'check' is in progress, and
moving this check out to handle_issuing_new_read_requests5 the helper routine
__handle_issuing_new_read_requests5 can be simplified.

This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the
state of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:32:03 +10:00
Dan Williams 83de75cc92 md: replace STRIPE_OP_BIOFILL with STRIPE_BIOFILL_RUN
From: Dan Williams <dan.j.williams@intel.com>

Track the state of read operations (copying data from the stripe cache to bio
buffers outside the lock) with a state flag.  Reduce the scope of the
STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been
requested via the ops_request field of struct stripe_head_state.

This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state
of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:58 +10:00
Dan Williams ecc65c9b3f md: replace STRIPE_OP_CHECK with 'check_states'
From: Dan Williams <dan.j.williams@intel.com>

The STRIPE_OP_* flags record the state of stripe operations which are
performed outside the stripe lock.  Their use in indicating which
operations need to be run is straightforward; however, interpolating what
the next state of the stripe should be based on a given combination of
these flags is not straightforward, and has led to bugs.  An easier to read
implementation with minimal degrees of freedom is needed.

Towards this goal, this patch introduces explicit states to replace what was
previously interpolated from the STRIPE_OP_* flags.  For now this only converts
the handle_parity_checks5 path, removing a user of the
ops.{pending,ack,complete,count} fields of struct stripe_operations.

This conversion also found a remaining issue with the current code.  There is
a small window for a drive to fail between when we schedule a repair and when
the parity calculation for that repair completes.  When this happens we will
writeback to 'failed_num' when we really want to write back to 'pd_idx'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:57 +10:00
Dan Williams f0e43bcdeb md: unify raid5/6 i/o submission
From: Dan Williams <dan.j.williams@intel.com>

Let the raid6 path call ops_run_io to get pending i/o submitted.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:55 +10:00
Dan Williams c4e5ac0a22 md: use stripe_head_state in ops_run_io()
From: Dan Williams <dan.j.williams@intel.com>

In handle_stripe after taking sh->lock we sample some bits into 's' (struct
stripe_head_state):

	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);

Use these values from 's' in ops_run_io() rather than re-sampling the bits.
This ensures a consistent snapshot (as seen under sh->lock) is used.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:53 +10:00
Dan Williams 2b7497f0e0 md: kill STRIPE_OP_IO flag
From: Dan Williams <dan.j.williams@intel.com>

The R5_Want{Read,Write} flags already gate i/o.  So, this flag is
superfluous and we can unconditionally call ops_run_io().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:52 +10:00
Dan Williams b203886edb md: kill STRIPE_OP_MOD_DMA in raid5 offload
From: Dan Williams <dan.j.williams@intel.com>

This micro-optimization allowed the raid code to skip a re-read of the
parity block after checking parity.  It took advantage of the fact that
xor-offload-engines have their own internal result buffer and can check
parity without writing to memory.  Remove it for the following reasons:

1/ It is a layering violation for MD to need to manage the DMA and
   non-DMA paths within async_xor_zero_sum
2/ Bad precedent to toggle the 'ops' flags outside the lock
3/ Hard to realize a performance gain as reads will not need an updated
   parity block and writes will dirty it anyways.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:50 +10:00
Chris Webb 0cd17fec98 Support changing rdev size on running arrays.
From: Chris Webb <chris@arachsys.com>

Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.

Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:46 +10:00
Neil Brown 526647320e Make sure all changes to md/dev-XX/state are notified
The important state change happens during an interrupt
in md_error.  So just set a flag there and call sysfs_notify
later in process context.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:44 +10:00
Neil Brown a99ac97113 Make sure all changes to md/degraded are notified.
When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:43 +10:00
Neil Brown 72a23c211e Make sure all changes to md/sync_action are notified.
When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.

To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flags - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:41 +10:00
Neil Brown 0fd62b861e Make sure all changes to md/array_state are notified.
Changes in md/array_state could be of interest to a monitoring
program.  So make sure all changes trigger a notification.

Exceptions:
   changing active_idle to active is not reported because it
      is frequent and not interesting.
   changing active to active_idle is only reported on arrays
      with externally managed metadata, as it is not interesting
      otherwise.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:36 +10:00
Neil Brown c7d0c941ae Don't reject HOT_REMOVE_DISK request for an array that is not yet started.
There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
it not actually active.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:34 +10:00
Neil Brown 199050ea1f rationalise return value for ->hot_add_disk method.
For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:33 +10:00
Neil Brown 6c2fce2ef6 Support adding a spare to a live md array with external metadata.
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:31 +10:00
Neil Brown 8ed0a5216a Enable setting of 'offset' and 'size' of a hot-added spare.
offset_store and rdev_size_store allow control of the region of a
device which is to be using in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery.  So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:29 +10:00
Neil Brown 1a0fd49773 Don't try to make md arrays dirty if that is not meaningful.
Arrays personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.

Such arrays types are detected by ->sync_request being NULL.  If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:27 +10:00
Neil Brown f48ed53838 Close race in md_probe
There is a possible race in md_probe.  If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.

So extend the range of protection of disks_mutex slightly to
avoid this possibility.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:26 +10:00
Neil Brown 5e96ee65c8 Allow setting start point for requested check/repair
This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.

Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:24 +10:00
Neil Brown a0da84f35b Improve setting of "events_cleared" for write-intent bitmaps.
When an array is degraded, bits in the write-intent bitmap are not
cleared, so that if the missing device is re-added, it can be synced
by only updated those parts of the device that have changed since
it was removed.

The enable this a 'events_cleared' value is stored. It is the event
counter for the array the last time that any bits were cleared.

Sometimes - if a device disappears from an array while it is 'clean' -
the events_cleared value gets updated incorrectly (there are subtle
ordering issues between updateing events in the main metadata and the
bitmap metadata) resulting in the missing device appearing to require
a full resync when it is re-added.

With this patch, we update events_cleared precisely when we are about
to clear a bit in the bitmap.  We record events_cleared when we clear
the bit internally, and copy that to the superblock which is written
out before the bit on storage.  This makes it more "obviously correct".

We also need to update events_cleared when the event_count is going
backwards (as happens on a dirty->clean transition of a non-degraded
array).

Thanks to Mike Snitzer for identifying this problem and testing early
"fixes".

Cc:  "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:22 +10:00
Neil Brown 0e13fe23a0 use bio_endio instead of a call to bi_end_io
Turn calls to bi->bi_end_io() into bio_endio(). Apparently bio_endio does
exactly the same error processing as is hardcoded at these places.

bio_endio() avoids recursion (or will soon), so it should be used.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:20 +10:00
Nikanth Karthikesan 13864515f7 linear: correct disk numbering error check
From: "Nikanth Karthikesan" <knikanth@novell.com>

Correct disk numbering problem check.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:19 +10:00
Neil Brown 9bbbca3a0e Fix error paths if md_probe fails.
md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it alway returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded.  This means checking
that mdev->gendisk is non-NULL.

cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:17 +10:00
Neil Brown efe3114318 Don't acknowlege that stripe-expand is complete until it really is.
We shouldn't acknowledge that a stripe has been expanded (When
reshaping a raid5 by adding a device) until the moved data has
actually been written out.  However we are currently
acknowledging (by calling md_done_sync) when the POST_XOR
is complete and before the write.

So track in s.locked whether there are pending writes, and don't
call md_done_sync yet if there are.

Note: we all set R5_LOCKED on devices which are are about to
read from.  This probably isn't technically necessary, but is
usually done when writing a block, and justifies the use of
s.locked here.

This bug can lead to a crash if an array is stopped while an reshape
is in progress.

Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 08:31:14 +10:00