WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Mike Christie	6296b9604f	block, drivers, fs: shrink bi_rw from long to int We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	796a5cf083	md: use bio op accessors Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Song Liu	4125758074	right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW In current handle_stripe_dirtying, the code prefers rmw with PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW. This patch reverses this behavior. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-25 21:26:07 -07:00
Guoqing Jiang	85ad1d13ee	md: set MD_CHANGE_PENDING in a atomic region Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:02 -07:00
Heinz Mauelshagen	fe67d19a2d	md: raid5: add prerequisite to run underneath dm-raid In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, thus avoid accesses. This patch adds a missing conditional to the raid5 personality. Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:02 -07:00
Shaohua Li	b8a0b8e946	raid5: delete unnecessary warnning If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page != orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the SkipCopy flag and set page to orig_page. So the warning is unnecessary. Reported-by: Joey Liao <joeyliao@qnap.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-29 14:18:03 -07:00
Anna-Maria Gleixner	1d034e68e2	md/raid5: Cleanup cpu hotplug notifier The raid456_cpu_notify() hotplug callback lacks handling of the CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch buffer is leaked. Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions to free the scratch buffer. CC: Shaohua Li <shli@kernel.org> CC: linux-raid@vger.kernel.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-17 14:30:15 -07:00
Shaohua Li	fb3229d5cd	md/raid5: output stripe state for debug Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be quite convenient if we know the stripe state. This is what this patch does. Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-09 10:08:38 -08:00
NeilBrown	550da24f8d	md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch, and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag after it has just been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently this could becomes incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero. So when the above mention race happens, those stripe_heads become blocked and never progress, resulting is write to the array handing. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: `1b956f7a8f` ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-09 09:31:41 -08:00
Shaohua Li	6ab2a4b806	RAID5: revert `e9e4c377e2` to fix a livelock Revert commit e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe) The problem is raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes in this order: - release all stripes with hash 0 - raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4 - release all stripes with hash other than 0. active_stripes becomes 0 - raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0] The system live locks. The problem is active_stripes isn't a per-hash count. Revert the patch makes the live lock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:56 -08:00
Shaohua Li	27a353c026	RAID5: check_reshape() shouldn't call mddev_suspend check_reshape() is called from raid5d thread. raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO finish but IO is handled in raid5d thread, we could easily deadlock here. This issue is introduced by `738a273` ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:11 -08:00
Jes Sorensen	e7597e69de	md/raid5: Compare apples to apples (or sectors to sectors) 'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: `620125f2bf` ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-25 16:38:53 -08:00
Shaohua Li	849674e4fb	MD: rename some functions These short function names are hard to search. Rename them to make vim happy. Signed-off-by: Shaohua Li <shli@fb.com>	2016-01-20 13:52:20 -08:00
Shaohua Li	f6b6ec5cfa	raid5-cache: add journal hot add/remove support Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:57 +11:00
Roman Gushchin	b46020aa3a	md/raid5: remove redundant check in stripe_add_to_batch_list() The stripe_add_to_batch_list() function is called only if stripe_can_batch() returned true, so there is no need for double check. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:22 +11:00
Linus Torvalds	ac322de6bf	md updates for 4.4. Two major components to this update. 1/ the clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2/ The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU 2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/ lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1 MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o 73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0 /BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/ Gh2/tKn7Xc25HuLAFEbs =TD5o -----END PGP SIGNATURE----- Merge tag 'md/4.4' of git://neil.brown.name/md Pull md updates from Neil Brown: "Two major components to this update. 1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. * tag 'md/4.4' of git://neil.brown.name/md: (66 commits) MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW MD: set journal disk ->raid_disk MD: kick out journal disk if it's not fresh raid5-cache: start raid5 readonly if journal is missing MD: add new bit to indicate raid array with journal raid5-cache: IO error handling raid5: journal disk can't be removed raid5-cache: add trim support for log MD: fix info output for journal disk raid5-cache: use bio chaining raid5-cache: small log->seq cleanup raid5-cache: new helper: r5_reserve_log_entry raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta raid5-cache: take rdev->data_offset into account early on raid5-cache: refactor bio allocation raid5-cache: clean up r5l_get_meta raid5-cache: simplify state machine when caches flushes are not needed raid5-cache: factor out a helper to run all stripes for an I/O unit raid5-cache: rename flushed_ios to finished_ios raid5-cache: free I/O units earlier ...	2015-11-04 21:12:47 -08:00
Shaohua Li	f2076e7d06	MD: set journal disk ->raid_disk Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes sysfs entry creation conflict. A lot of places assumes disk with ->raid_disk >=0 is normal raid disk, so we add check for journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	7dde2ad3c5	raid5-cache: start raid5 readonly if journal is missing If raid array is expected to have journal (eg, journal is set in MD superblock feature map) and the array is started without journal disk, start the array readonly. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	6e74a9cfb5	raid5-cache: IO error handling There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	c2bb6242ec	raid5: journal disk can't be removed raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places. Don't allow journal disk disappear magically. On the other hand, we do need to update superblock for other disks to bump up ->events, so next time journal disk will be identified as stale. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	e6c033f79a	raid5-cache: move reclaim stop to quiesce Move reclaim stop to quiesce handling, where is safer for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	828cbe989e	raid5-cache: optimize FLUSH IO with log enabled With log enabled, bio is written to raid disks after the bio is settled down in log disk. The recovery guarantees we can recovery the bio data from log disk, so we we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	a8c34f9159	raid5-cache: switching to state machine for log disk cache flush Before we write stripe data to raid disks, we must guarantee stripe data is settled down in log disk. To do this, we flush log disk cache and wait the flush finish. That wait introduces sleep time in raid5d thread and impact performance. This patch moves the log disk cache flush process to the stripe handling state machine, which can remove the wait in raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	5c7e81c3de	raid5: enable log for raid array with cache disk Now log is safe to enable for raid array with cache disk Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	713cf5a639	raid5: don't allow resize/reshape with cache(log) support If cache(log) support is enabled, don't allow resize/reshape in current stage. In the future, we can flush all data from cache(log) to raid before resize/reshape and then allow resize/reshape. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	9c3e333d3f	raid5: disable batch with log enabled With log enabled, r5l_write_stripe will add the stripe to log. With batch, several stripes are linked together. The stripes must be in the same state. While with log, the log/reclaim unit is stripe, we can't guarantee the several stripes are in the same state. Disabling batch for log now. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Roman Gushchin	b8a9d66d04	md/raid5: fix locking in handle_stripe_clean_event() After commit `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") __find_stripe() is called under conf->hash_locks + hash. But handle_stripe_clean_event() calls remove_hash() under conf->device_lock. Under some cirscumstances the hash chain can be circuited, and we get an infinite loop with disabled interrupts and locked hash lock in __find_stripe(). This leads to hard lockup on multiple CPUs and following system crash. I was able to reproduce this behavior on raid6 over 6 ssd disks. The devices_handle_discard_safely option should be set to enable trim support. The following script was used: for i in `seq 1 32`; do dd if=/dev/zero of=large$i bs=10M count=100 & done neilb: original was against a 3.x kernel. I forward-ported to 4.3-rc. This verison is suitable for any kernel since Commit: `59fc630b8b` ("RAID5: batch adjacent full stripe write") (v4.1+). I'll post a version for earlier kernels to stable. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Fixes: `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") Signed-off-by: NeilBrown <neilb@suse.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> # 3.13 - 4.2	2015-10-31 10:53:50 +11:00
Shaohua Li	0576b1c618	raid5: log reclaim support This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run if there is specific reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency. The size based force reclaim is to make sure log isn't too big, so recovery doesn't scan log too much. Recovery make sure raid disks and log disk have the same data of a stripe. If crash happens before 4, recovery might/might not recovery stripe's data/parity depending on if data/parity and its checksum matches. In either case, this doesn't change the syntax of an IO write. After step 3, stripe is guaranteed recoverable, because stripe's data/parity is persistent in log disk. In some cases, log disk content and raid disks content of a stripe are the same, but recovery will still copy log disk content to raid disks, this doesn't impact data consistency. space reuse happens after superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log causes recovery can't find meta at the head of log. If operations require meta at the head persistent in log, we must make sure meta before it persistent in log too. The case is stripe data/parity is in log and we start write stripe to raid disks (before step 4). stripe data/parity must be persistent in log before we do the write to raid disks. The solution is we restrictly maintain io_unit list order. In this case, we only write stripes of an io_unit to raid disks till the io_unit is the first one whose data/parity is in log. The io_unit list order is important for other cases too. For example, some io_unit are reclaimable and others not. They can be mixed in the list, we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	f6bed0ef0a	raid5: add basic stripe log This introduces a simple log for raid5. Data/parity writing to raid array first writes to the log, then write to raid array disks. If crash happens, we can recovery data from the log. This can speed up raid resync and fix write hole issue. The log structure is pretty simple. Data/meta data is stored in block unit, which is 4k generally. It has only one type of meta data block. The meta data block can track 3 types of data, stripe data, stripe parity and flush block. MD superblock will point to the last valid meta data block. Each meta data block has checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity to the metadata block, so meta data and stripe data/parity can be written to log disk together. otherwise, meta data write must wait till stripe data/parity is finished. For stripe data, meta data block will record stripe data sector and size. Currently the size is always 4k. This meta data record can be made simpler if we just fix write hole (eg, we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, meta data block will record stripe sector. It's size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. flush block indicates a stripe is in raid array disks. Fixing write hole doesn't need this type of meta data, it's for caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	b70abcb247	raid5: add a new state for stripe log handling When a stripe finishes construction, we write the stripe to raid in ops_run_io normally. With log, we do a bunch of other operations before the stripe is written to raid. Mainly write the stripe to log disk, flush disk cache and so on. The operations are still driven by raid5d and run in the stripe state machine. We introduce a new state for such stripe (trapped into log). The stripe is in this state from the time it first enters ops_run_io (finish construction) to the time it is written to raid. Since we know the state is only for log, we bypass other check/operation in handle_stripe. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	6d036f7d52	raid5: export some functions Next several patches use some raid5 functions, rename them with raid5 prefix and export out. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues	c40f341f1e	md-cluster: Use a small window for resync Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-12 01:32:05 -05:00
Julia Lawall	644df1a85f	md: drop null test before destroy functions Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) $kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy$(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	36707bb2e7	md/raid5: don't index beyond end of array in need_this_block(). When need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] of fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-02 17:23:43 +10:00
Shaohua Li	ebda780bce	raid5: update analysis state for failed stripe handle_failed_stripe() makes the stripe fail, eg, all IO will return with a failure, but it doesn't update stripe_head_state. Later handle_stripe() has special handling for raid6 for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe and we get a kernel crash in need_this_block. This patch clear the analysis state to make sure no functions wrongly called after handle_failed_stripe() Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
NeilBrown	e89c6fdf9e	Merge linux-block/for-4.3/core into md/for-linux There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>	2015-09-05 11:08:32 +02:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
NeilBrown	c3cce6cda1	md/raid5: ensure device failure recorded before write request returns. When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed when MD_CHANGE_PENDING is set to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:59 +02:00
NeilBrown	34a6f80e16	md/raid5: use bio_list for the list of bios to return. This will make it easier to splice two lists together which will be needed in future patch. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:50 +02:00
NeilBrown	6cbd81487f	md/raid5: handle possible race as reshape completes. It is possible (though unlikely) for a reshape to be interrupted between the time that end_reshape is called and the time when raid5_finish_reshape is called. This can leave conf->reshape_progress set to MaxSector, but mddev->reshape_position not. This combination confused reshape_request() when ->reshape_backwards. As conf->reshape_progress is so high, it seems the reshape hasn't really begun. But assuming MaxSector is a valid address only leads to sorrow. So ensure reshape_position and reshape_progress both agree, and add an extra check in reshape_request() just in case they don't. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:38:59 +02:00
NeilBrown	c5e19d906a	md: be careful when testing resync_max against curr_resync_completed. While it generally shouldn't happen, it is not impossible for curr_resync_completed to exceed resync_max. This can particularly happen when reshaping RAID5 - the current status isn't copied to curr_resync_completed promptly, so when it is, it can exceed resync_max. This happens when the reshape is 'frozen', resync_max is set low, and reshape is re-enabled. Taking a difference between two unsigned numbers is always dangerous anyway, so add a test to behave correctly if curr_resync_completed > resync_max Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:37:33 +02:00
NeilBrown	c74c0d760e	md/raid5: remove incorrect "min_t()" when calculating writepos. This code is calculating: writepos, which is the furthest along address (device-space) that we will be writing to readpos, which is the earliest address that we could possible read from, and safepos, which is the earliest address in the 'old' section that we might read from after a crash when the reshape position is recovered from metadata. The first is a precise calculation, so clipping at zero doesn't make sense. As the reshape position is now guaranteed to always be a multiple of reshape_sectors and as we already BUG_ON when reshape_progress is zero, there is no point in this min_t() call. The readpos and safepos are worst case - actual value depends on precise geometry. That worst case could be negative, which is only a problem because we are storing the value in an unsigned. So leave the min_t() for those. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:36:06 +02:00
NeilBrown	05256d9884	md/raid5: strengthen check on reshape_position at run. When reshaping, we work in units of the largest chunk size. If changing from a larger to a smaller chunk size, that means we reshape more than one stripe at a time. So the required alignment of reshape_position needs to take into account both the old and new chunk size. This means that both 'here_new' and 'here_old' are calculated with respect to the same (maximum) chunk size, so testing if they are the same when delta_disks is zero becomes pointless. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:34:21 +02:00
NeilBrown	3cb5edf454	md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible The chunk_sectors and new_chunk_sectors fields of mddev can be changed any time (via sysfs) that the reconfig mutex can be taken. So raid5 keeps internal copies in 'conf' which are stable except for a short locked moment when reshape stops/starts. So any access that does not hold reconfig_mutex should use the 'conf' values, not the 'mddev' values. Several don't. This could result in corruption if new values were written at awkward times. Also use min() or max() rather than open-coding. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:48 +02:00
NeilBrown	5cac6bcb93	md/raid5: always set conf->prev_chunk_sectors and ->prev_algo These aren't really needed when no reshape is happening, but it is safer to have them always set to a meaningful value. The next patch will use ->prev_chunk_sectors without checking if a reshape is happening (because that makes the code simpler), and this patch makes that safe. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:25 +02:00
NeilBrown	92140480ed	md/raid5: consider updating reshape_position at start of reshape. md/raid5 only updates ->reshape_position (which is stored in metadata and is authoritative) occasionally, but particularly when getting closed to ->resync_max as it must be correct when ->resync_max is reached. When mdadm tries to stop an array which is reshaping it will: - freeze the reshape, - set resync_max to where the reshape has reached. - unfreeze the reshape. When this happens, the reshape is aborted and then restarted. The restart doesn't check that resync_max is close, and so doesn't update ->reshape_position like it should. This results in the reshape stopping, but ->reshape_position being incorrect. So on that first call to reshape_request, make sure ->reshape_position is updated if needed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:31:20 +02:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Kent Overstreet	7140aafce2	md/raid5: get rid of bio_fits_rdev() Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now performed by blk_queue_split() in md_make_request(). Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:54 -06:00
Ming Lin	7ef6b12a19	md/raid5: split bio for chunk_aligned_read If a read request fits entirely in a chunk, it will be passed directly to the underlying device (providing it hasn't failed of course). If it doesn't fit, the slightly less efficient path that uses the stripe_cache is used. Requests that get to the stripe cache are always completely split up as necessary. So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work, but could cause it to take the less efficient path more often. All that is needed to manage this is for 'chunk_aligned_read' do some bio splitting, much like the RAID0 code does. Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:51 -06:00
Sasha Levin	9b81c84235	block: don't access bio->bi_error after bio_put() Commit `4246a0b6` ("block: add a bi_error field to struct bio") has added a few dereferences of 'bio' after a call to bio_put(). This causes use-after-frees such as: [521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714 [521120.720638] Read of size 4 by task mount.ocfs2/9644 [521120.721212] ============================================================================= [521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected [521120.722968] ----------------------------------------------------------------------------- [521120.722968] [521120.723915] Disabling lock debugging due to kernel taint [521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080 [521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800 [521120.726037] [521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff ...6............ [521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................ [521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00 ................ [521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff ................ [521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff ...........6.... [521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff .s".....@..<.... [521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 ................ [521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G B 4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420 [521120.744274] ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0 [521120.745025] ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00 [521120.745908] ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854 [521120.747063] Call Trace: [521120.747520] dump_stack (lib/dump_stack.c:52) [521120.748053] print_trailer (mm/slub.c:653) [521120.748582] object_err (mm/slub.c:660) [521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194) [521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250) [521120.753580] dio_bio_complete (fs/direct-io.c:478) [521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291) [521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322) [521120.761658] blkdev_direct_IO (fs/block_dev.c:162) [521120.762993] generic_file_read_iter (mm/filemap.c:1738) [521120.767405] blkdev_read_iter (fs/block_dev.c:1649) [521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434) [521120.772126] vfs_read (fs/read_write.c:454) [521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594) [521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186) [521120.777375] Memory state around the buggy address: [521120.778118] ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.779211] ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.781465] ^ [521120.782083] ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.783717] ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [521120.784818] ================================================================== This patch fixes a few of those places that I caught while auditing the patch, but the original patch should be audited further for more occurences of this issue since I'm not too familiar with the code. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-11 11:34:32 -06:00
NeilBrown	49895bcc7e	md/raid5: don't let shrink_slab shrink too far. I have a report of drop_one_stripe() called from raid5_cache_scan() apparently finding ->max_nr_stripes == 0. This should not be allowed. So add a test to keep max_nr_stripes above min_nr_stripes. Also use a 'mask' rather than a 'mod' in drop_one_stripe to ensure 'hash' is valid even if max_nr_stripes does reach zero. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1 - please release with `2d5b569b66`) Reported-by: Tomas Papan <tomas.papan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 17:10:56 +10:00
Jens Axboe	b7c44ed9d2	block: manipulate bio->bi_flags through helpers Some places use helpers now, others don't. We only have the 'is set' helper, add helpers for setting and clearing flags too. It was a bit of a mess of atomic vs non-atomic access. With BIO_UPTODATE gone, we don't have any risk of concurrent access to the flags. So relax the restriction and don't make any of them atomic. The flags that do have serialization issues (reffed and chained), we already handle those separately. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:20 -06:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
NeilBrown	e6030cb06c	md/raid5: clear R5_NeedReplace when no longer needed. This flag is currently never cleared, which can in rare cases trigger a warn-on if it is still set but the block isn't InSync. So clear it when it isn't need, which includes if the replacement device has failed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:38:04 +10:00
NeilBrown	2d5b569b66	md/raid5: avoid races when changing cache size. Cache size can grow or shrink due to various pressures at any time. So when we resize the cache as part of a 'grow' operation (i.e. change the size to allow more devices) we need to blocks that automatic growing/shrinking. So introduce a mutex. auto grow/shrink uses mutex_trylock() and just doesn't bother if there is a blockage. Resizing the whole cache holds the mutex to ensure that the correct number of new stripes is allocated. This bug can result in some stripes not being freed when an array is stopped. This leads to the kmem_cache not being freed and a subsequent array can try to use the same kmem_cache and get confused. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2) Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-22 14:04:15 +10:00
Shaohua Li	713bc5c2de	md/raid5: ignore released_stripes check conf->released_stripes list isn't always related to where there are free stripes pending. Active stripes can be in the list too. And even free stripes were active very recently. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:33 +10:00
Yuanhan Liu	e9e4c377e2	md/raid5: per hash value and exclusive wait_for_stripe I noticed heavy spin lock contention at get_active_stripe() with fsmark multiple thread write workloads. Here is how this hot contention comes from. We have limited stripes, and it's a multiple thread write workload. Hence, those stripes will be taken soon, which puts later processes to sleep for waiting free stripes. When enough stripes(>= 1/4 total stripes) are released, all process are woken, trying to get the lock. But there is one only being able to get this lock for each hash lock, making other processes spinning out there for acquiring the lock. Thus, it's effectiveless to wakeup all processes and let them battle for a lock that permits one to access only each time. Instead, we could make it be a exclusive wake up: wake up one process only. That avoids the heavy spin lock contention naturally. To do the exclusive wake up, we've to split wait_for_stripe into multiple wait queues, to make it per hash value, just like the hash lock. Here are some test results I have got with this patch applied(all test run 3 times): `fsmark.files_per_sec' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose 17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose 188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync 254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync `time.system_time' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose 3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose Where, 1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark 1t, 64t: where 't' means thread 4M: means the single file size, corresponding to the '-s' option of fsmark 40G, 30G, 120G: means the total test size 4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means the size of one ramdisk. So, it would be 48G in total. And we made a raid on those ramdisk As you can see, though there are no much performance gain for hard disk workload, the system time is dropped heavily, up to 97%. And as expected, the performance increased a lot, up to 260%, for fast device(ram disk). v2: use bits instead of array to note down wait queue need to wake up. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:27 +10:00
Yuanhan Liu	b1b4648648	md/raid5: split wait_for_stripe and introduce wait_for_quiescent I noticed heavy spin lock contention at get_active_stripe(), introduced at being wake up stage, where a bunch of processes try to re-hold the spin lock again. After giving some thoughts on this issue, I found the lock could be relieved(and even avoided) if we turn the wait_for_stripe to per waitqueue for each lock hash and make the wake up exclusive: wake up one process each time, which avoids the lock contention naturally. Before go hacking with wait_for_stripe, I found it actually has 2 usages: for the array to enter or leave the quiescent state, and also to wait for an available stripe in each of the hash lists. So this patch splits the first usage off into a separate wait_queue, wait_for_quiescent, and the next patch will turn the second usage into one waitqueue for each hash value, and make it exclusive, to relieve the lock contention. v2: wake_up(wait_for_quiescent) when (active_stripes == 0) Commit log refactor suggestion from Neil. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:21 +10:00
NeilBrown	ea358cd0d2	md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-12 20:16:33 +10:00
NeilBrown	626f2092c8	md/raid5: break stripe-batches when the array has failed. Once the array has too much failure, we need to break stripe-batches up so they can all be dealt with. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:48:59 +10:00
NeilBrown	787b76fa37	md/raid5: call break_stripe_batch_list from handle_stripe_clean_event Now that the code in break_stripe_batch_list() is nearly identical to the end of handle_stripe_clean_event, replace the later with a function call. The only remaining difference of any interest is the masking that is applieds to dev[i].flags copied from head_sh. R5_WriteError certainly isn't wanted as it is set per-stripe, not per-patch. R5_Overlap isn't wanted as it is explicitly handled. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:47:02 +10:00
NeilBrown	1b956f7a8f	md/raid5: be more selective about distributing flags across batch. When a batch of stripes is broken up, we keep some of the flags that were per-stripe, and copy other flags from the head to all others. This only happens while a stripe is being handled, so many of the flags are irrelevant. The "SYNC_FLAGS" (which I've renamed to make it clear there are several) and STRIPE_DEGRADED are set per-stripe and so need to be preserved. STRIPE_INSYNC is the only flag that is set on the head that needs to be propagated to all others. For safety, add a WARN_ON if others are set, except: STRIPE_HANDLE - this is safe and per-stripe and we are going to set in several cases anyway STRIPE_INSYNC STRIPE_IO_STARTED - this is just a hint and doesn't hurt. STRIPE_ON_PLUG_LIST STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched stripe to be on one of these lists, but it can happen as can be safely ignored. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:40:01 +10:00
NeilBrown	3960ce7961	md/raid5: add handle_flags arg to break_stripe_batch_list. When we break a stripe_batch_list we sometimes want to set STRIPE_HANDLE on the individual stripes, and sometimes not. So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE (on non-head stripes). If not zero, only set it if any of the given flags are present. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:39:30 +10:00
NeilBrown	fb642b92c2	md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list break_stripe_batch list didn't clear head_sh->batch_head. This was probably a bug. Also clear all R5_Overlap flags and if any were cleared, wake up 'wait_for_overlap'. This isn't always necessary but the worst effect is a little extra checking for code that is waiting on wait_for_overlap. Also, don't use wake_up_nr() because that does the wrong thing if 'nr' is zero, and it number of flags cleared doesn't strongly correlate with the number of threads to wake. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:25 +10:00
NeilBrown	4e3d62ff49	md/raid5: remove condition test from check_break_stripe_batch_list. handle_stripe_clean_event() contains a chunk of code very similar to check_break_stripe_batch_list(). If we make the latter more like the former, we can end up with just one copy of this code. This first step removed the condition (and the 'check_') part of the name. This has the added advantage of making it clear what check is being performed at the point where the function is called. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:06 +10:00
NeilBrown	b15a9dbdbf	md/raid5: Ensure a batch member is not handled prematurely. If a stripe is a member of a batch, but not the head, it must not be handled separately from the rest of the batch. 'clear_batch_ready()' handles this requirement to some extent but not completely. If a member is passed to handle_stripe() a second time it returns '0' indicating the stripe can be handled, which is wrong. So add an extra test. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:35:47 +10:00
NeilBrown	d0852df543	md/raid5: close race between STRIPE_BIT_DELAY and batching. When we add a write to a stripe we need to make sure the bitmap bit is set. While doing that the stripe is not locked so it could be added to a batch after which further changes to STRIPE_BIT_DELAY and ->bm_seq are ineffective. So we need to hold off adding to a stripe until bitmap_startwrite has completed at least once, and we need to avoid further changes to STRIPE_BIT_DELAY once the stripe has been added to a batch. If a bitmap_startwrite() completes after the stripe was added to a batch, it will not have set the bit, only incremented a counter, so no extra delay of the stripe is needed. Reported-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:34:40 +10:00
NeilBrown	2b6b245742	md/raid5: ensure whole batch is delayed for all required bitmap updates. When we add a stripe to a batch, we need to be sure that head stripe will wait for the bitmap update required for the new stripe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:29:14 +10:00
Shaohua Li	487696957e	raid5: fix broken async operation chain ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned by async_gen_syndrome should be added as the dependent tx of next stripe. The issue is introduced by commit `59fc630b8b` RAID5: batch adjacent full stripe write Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-21 09:14:20 +10:00
NeilBrown	bb27051f9f	md/raid5: fix handling of degraded stripes in batches. There is no need for special handling of stripe-batches when the array is degraded. There may be if there is a failure in the batch, but STRIPE_DEGRADED does not imply an error. So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is degraded. This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in check_break_stripe_batch_list() and so the bitmap bit gets cleared when it shouldn't. So in check_break_stripe_batch_list(), split the batch up completely - again STRIPE_DEGRADED isn't meaningful. Also don't set STRIPE_BATCH_ERR when there is a write error to a replacement device. This simply removes the replacement device and requires no extra handling. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:57 +10:00
NeilBrown	738a273806	md/raid5: fix allocation of 'scribble' array. As the new 'scribble' array is sized based on chunk size, we need to make sure the size matches the largest of 'old' and 'new' chunk sizes when the array is undergoing reshape. We also potentially need to resize it even when not resizing the stripe cache, as chunk size can change without changing number of devices. So move the 'resize' code into a separate function, and consider old and new sizes when allocating. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `46d5b78562` ("raid5: use flex_array for scribble data")	2015-05-08 18:47:48 +10:00
NeilBrown	6e9eac2dce	md/raid5: don't record new size if resize_stripes fails. If any memory allocation in resize_stripes fails we will return -ENOMEM, but in some cases we update conf->pool_size anyway. This means that if we try again, the allocations will be assumed to be larger than they are, and badness results. So only update pool_size if there is no error. This bug was introduced in 2.6.17 and the patch is suitable for -stable. Fixes: `ad01c9e375` ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Cc: stable@vger.kernel.org (v2.6.17+) Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:35 +10:00
NeilBrown	10d82c5f0d	md/raid5: avoid reading parity blocks for full-stripe write to degraded array When performing a reconstruct write, we need to read all blocks that are not being over-written .. except the parity (P and Q) blocks. The code currently reads these (as they are not being over-written!) unnecessarily. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `ea664c8245` ("md/raid5: need_this_block: tidy/fix last condition.")	2015-05-08 18:47:17 +10:00
NeilBrown	b0c783b323	md/raid5: more incorrect BUG_ON in handle_stripe_fill. It is not incorrect to call handle_stripe_fill() when a batch of full-stripe writes is active. It is, however, a BUG if fetch_block() then decides it needs to actually fetch anything. So move the 'BUG_ON' to where it belongs. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:46:52 +10:00
NeilBrown	f18c1a35f6	md/raid5: new alloc_stripe() to allocate an initialize a stripe. The new batch_lock and batch_list fields are being initialized in grow_one_stripe() but not in resize_stripes(). This causes a crash on resize. So separate the core initialization into a new function and call it from both allocation sites. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:40:01 +10:00
Eric Mei	9ffc8f7cb9	md/raid5: don't do chunk aligned read on degraded array. When array is degraded, read data landed on failed drives will result in reading rest of data in a stripe. So a single sequential read would result in same data being read twice. This patch is to avoid chunk aligned read for degraded array. The downside is to involve stripe cache which means associated CPU overhead and extra memory copy. Test Results: Following test are done on a enterprise storage node with Seagate 6T SAS drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2, chunk size 128 KiB. I use FIO, using direct-io with various bs size, enough queue depth, tested sequential and 100% random read against 3 array config: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test I only did once so there might be some variations, but we just focus on big trend. Sequential Read: bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s) 1024 1608 656 995 512 1624 710 956 256 1635 728 980 128 1636 771 983 64 1612 1119 1000 32 1580 1420 1004 16 1368 688 986 8 768 647 953 4 411 413 850 Random Read: bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS) 1024 163 160 156 512 274 273 272 256 426 428 424 128 576 592 591 64 726 724 726 32 849 848 837 16 900 970 971 8 927 940 929 4 948 940 955 Some notes: * In sequential + optimal, as bs size getting smaller, the FIO thread become CPU bound. * In sequential + degraded, there's big increase when bs is 64K and 32K, I don't have explanation. * In sequential + degraded-with-patch, the MD thread mostly become CPU bound. If you want to we can discuss specific data point in those data. But in general it seems with this patch, we have more predictable and in most cases significant better sequential read performance when array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing, the patch works well for this particular configuration, but may not be universal. For example I imagine testing on all SSD array may have very different result. But I personally think in most cases IO bandwidth is more scarce resource than CPU. Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	edbe83ab4c	md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	5423399a84	md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	486f0644c3	md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
NeilBrown	a9683a795b	md/raid5: pass gfp_t arg to grow_one_stripe() This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	d06f191f8e	md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be benefical for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be benefical otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	584acdd49c	md/raid5: activate raid6 rmw feature Glue it altogehter. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4) bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0 skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0 4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s 8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s 16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s 32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s 64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s 128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s 256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s 512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
shli@kernel.org	dabc4ec6ba	raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	72ac733015	raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	59fc630b8b	RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	7a87f43405	raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	da41ba6597	raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	46d5b78562	raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
Eric Mei	16d9cfab93	raid5: check faulty flag for array status during recovery. When we have more than 1 drive failure, it's possible we start rebuild one drive while leaving another faulty drive in array. To determine whether array will be optimal after building, current code only check whether a drive is missing, which could potentially lead to data corruption. This patch is to add checking Faulty flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-25 11:38:26 +11:00
NeilBrown	26ac107378	md/raid5: Fix livelock when array is both resyncing and degraded. Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f: md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. Causes an RCW cycle to be forced even when the array is degraded. A degraded array cannot support RCW as that requires reading all data blocks, and one may be missing. Forcing an RCW when it is not possible causes a live-lock and the code spins, repeatedly deciding to do something that cannot succeed. So change the condition to only force RCW on non-degraded arrays. Reported-by: Manibalan P <pmanibalan@amiindia.co.in> Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com> Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `a7854487cd` Cc: stable@vger.kernel.org (v3.7+)	2015-02-18 11:35:14 +11:00
NeilBrown	6791875e2e	md: make reconfig_mutex optional for writes to md sysfs files. Rather than using mddev_lock() to take the reconfig_mutex when writing to any md sysfs file, we only take mddev_lock() in the particular _store() functions that require it. Admittedly this is most, but it isn't all. This also allows us to remove special-case handling for new_dev_store (in md_attr_store). Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:56 +11:00
NeilBrown	7b1485bab9	md/raid5: use ->lock to protect accessing raid5 sysfs attributes. It is important that mddev->private isn't freed while a sysfs attribute function is accessing it. So use mddev->lock to protect the setting of ->private to NULL, and take that lock when checking ->private for NULL and de-referencing it in the sysfs access functions. This only applies to the read ('show') side of access. Write access will be handled separately. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-06 09:32:55 +11:00
NeilBrown	afa0f557cb	md: rename ->stop to ->free Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	5aa61f427e	md: split detach operation out from ->stop. Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	64590f45dd	md: make merge_bvec_fn more robust in face of personality changes. There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	5c675f83c6	md: make ->congested robust against personality changes. There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	ea664c8245	md/raid5: need_this_block: tidy/fix last condition. That last condition is unclear and over cautious. There are two related issues here. If a partial write is destined for a missing device, then either RMW or RCW can work. We must read all the available block. Only then can the missing blocks be calculated, and then the parity update performed. If RMW is not an option, then there is a complication even without partial writes. If we would need to read a missing device to perform the reconstruction, then we must first read every block so the missing device data can be computed. This is the case for RAID6 (Which currently does not support RMW) and for times when we don't trust the parity (after a crash) and so are in the process of resyncing it. So make these two cases more clear and separate, and perform the relevant tests more thoroughly. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:51 +11:00
NeilBrown	a9d56950f7	md/raid5: need_this_block: start simplifying the last two conditions. Both the last two cases are only relevant if something has failed and something needs to be written (but not over-written), and if it is OK to pre-read blocks at this point. So factor out those tests and explain them. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:51 +11:00
NeilBrown	a79cfe12c6	md/raid5: separate out the easy conditions in need_this_block. Some of the conditions in need_this_block have very straight forward motivation. Separate those out and document them. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:51 +11:00
NeilBrown	2c58f06e6f	md/raid5: separate large if clause out of fetch_block(). fetch_block() has a very large and hard to read 'if' condition. Separate it into its own function so that it can be made more readable. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:51 +11:00
Jes Sorensen	ad3ab8b608	md: do_release_stripe(): No need to call md_wakeup_thread() twice `67f455486d` introduced a call to md_wakeup_thread() when adding to the delayed_list. However the md thread is woken up unconditionally just below. Remove the unnecessary wakeup call. Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:51 +11:00
NeilBrown	b1b02fe97f	md/raid5: fix another livelock caused by non-aligned writes. If a non-page-aligned write is destined for a device which is missing/faulty, we can deadlock. As the target device is missing, a read-modify-write cycle is not possible. As the write is not for a full-page, a recontruct-write cycle is not possible. This should be handled by logic in fetch_block() which notices there is a non-R5_OVERWRITE write to a missing device, and so loads all blocks. However since commit `67f455486d`, that code requires STRIPE_PREREAD_ACTIVE before it will active, and those circumstances never set STRIPE_PREREAD_ACTIVE. So: in handle_stripe_dirtying, if neither rmw or rcw was possible, set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set after a suitable delay. Fixes: `67f455486d` Cc: stable@vger.kernel.org (v3.16+) Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-02 16:57:17 +11:00
NeilBrown	108cef3aa4	md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. It is critical that fetch_block() and handle_stripe_dirtying() are consistent in their analysis of what needs to be loaded. Otherwise raid5 can wait forever for a block that won't be loaded. Currently when writing to a RAID5 that is resyncing, to a location beyond the resync offset, handle_stripe_dirtying chooses a reconstruct-write cycle, but fetch_block() assumes a read-modify-write, and a lockup can happen. So treat that case just like RAID6, just as we do in handle_stripe_dirtying. RAID6 always does reconstruct-write. This bug was introduced when the behaviour of handle_stripe_dirtying was changed in 3.7, so the patch is suitable for any kernel since, though it will need careful merging for some versions. Cc: stable@vger.kernel.org (v3.7+) Fixes: `a7854487cd` Reported-by: Henry Cai <henryplusplus@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-12-03 16:07:58 +11:00
NeilBrown	f72ffdd686	md: remove unwanted white space from md.c My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
Markus Stockhausen	b8e6a15a1a	md/raid5: fix init_stripe() inconsistencies raid5: fix init_stripe() inconsistencies 1) remove_hash() is not necessary. We will only be called right after get_free_stripe(). There we have already a call to remove_hash(). 2) Tracing prints out the sector of the freed stripe and not the sector that we want to initialize. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	3fd83717e4	md: use set_bit/clear_bit instead of shift/mask for bi_flags changes. Using {set,clear}_bit is more consistent than shifting and masking. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-09 10:07:04 +11:00
NeilBrown	8e0e99ba64	md/raid5: disable 'DISCARD' by default due to safety concerns. It has come to my attention (thanks Martin) that 'discard_zeroes_data' is only a hint. Some devices in some cases don't do what it says on the label. The use of DISCARD in RAID5 depends on reads from discarded regions being predictably zero. If a write to a previously discarded region performs a read-modify-write cycle it assumes that the parity block was consistent with the data blocks. If all were zero, this would be the case. If some are and some aren't this would not be the case. This could lead to data corruption after a device failure when data needs to be reconstructed from the parity. As we cannot trust 'discard_zeroes_data', ignore it by default and so disallow DISCARD on all raid4/5/6 arrays. As many devices are trustworthy, and as there are benefits to using DISCARD, add a module parameter to over-ride this caution and cause DISCARD to work if discard_zeroes_data is set. If a site want to enable DISCARD on some arrays but not on others they should select DISCARD support at the filesystem level, and set the raid456 module parameter. raid456.devices_handle_discard_safely=Y As this is a data-safety issue, I believe this patch is suitable for -stable. DISCARD support for RAID456 was added in 3.7 Cc: Shaohua Li <shli@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org (3.7+) Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Fixes: `620125f2bf` Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-02 13:45:00 +10:00
NeilBrown	9c4bdf697c	md/raid6: avoid data corruption during recovery of double-degraded RAID6 During recovery of a double-degraded RAID6 it is possible for some blocks not to be recovered properly, leading to corruption. If a write happens to one block in a stripe that would be written to a missing device, and at the same time that stripe is recovering data to the other missing device, then that recovered data may not be written. This patch skips, in the double-degraded case, an optimisation that is only safe for single-degraded arrays. Bug was introduced in 2.6.32 and fix is suitable for any kernel since then. In an older kernel with separate handle_stripe5() and handle_stripe6() functions the patch must change handle_stripe6(). Cc: stable@vger.kernel.org (2.6.32+) Fixes: `6c0069c0ae` Cc: Yuri Tikhonov <yur@emcraft.com> Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in> Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423 Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Dan Williams <dan.j.williams@intel.com>	2014-08-18 14:49:46 +10:00
NeilBrown	a40687ff73	md/raid5: avoid livelock caused by non-aligned writes. If a stripe in a raid6 array received a write to each data block while the array is degraded, and if any of these writes to a missing device are not page-aligned, then a live-lock happens. In this case the P and Q blocks need to be read so that the part of the missing block which is not being updated by the write can be constructed. Due to a logic error, these blocks are not loaded, so the update cannot proceed and the stripe is 'handled' repeatedly in an infinite loop. This bug is unlikely as most writes are page aligned. However as it can lead to a livelock it is suitable for -stable. It was introduced in 3.16. Cc: stable@vger.kernel.org (v3.16) Fixed: `67f455486d` Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-18 14:49:41 +10:00
Linus Torvalds	8d0304e69d	Assorted md fixes for 3.16 Mostly performance improvements with a few corner-case bug fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAU5fevznsnt1WYoG5AQKzNxAAgjZPg6LZgl0wEHJi09ykqLK9dSkglz2B RY5IEWPEMcsGJQbSX7fEME5WSHkqKBIHx43xmdzijPCHyWLnF4vQRubNAi/Wo8PC yfJyI5hMrG8+0rNKnmXWeG+NLGj7l7j+wkExe27VBlu2+6ATvIUCa34kjOWFDJJx xAN9/t22XGly776SiuLXTpMhA0pRWH4sqijJhvM5EqPlsEUjGxvQU/URwUPYzhYr iCbvY64uKB1Crh+vn7yP1IUWc77aDOoUPIxO50m0GXOQFU/wbQta1NAFmMeu8tIo 3QRlGzP+jRJFBV7jLii4nHTMSVxBmD2zMppxVtDoq/egWhyZlJhh2Kd5sUKo2r9V 8bGQm7e1JlsDsRMVXxSlx3TebZDfJhtnFNhYb6JPlpWM5EYjCzXXfJUl3/6LfKX6 BKJ800xvabr5TFkVubF9H80LTQxaO+pQoS6PSs8hnwneqtybdbt9o7V59yluMGao IwaG/BY4jlY+76YnK32Kbr1eofUNYXBLg45mgSkrhUVqv9HLHe0CDnYAzYs68aur zG8a1rhA9IJtBmWwql2Gm8lKOnDFM6onvSmaYntsabJ++eahaNDpL5ck+R4kO8mJ pyI1N6GQDjtLdONk4NmH36P+CuCZje6KwRbVXvsJIBA1042tkZfU0zdqgt5QhFZj Ma9ZCn9wZFo= =MaM/ -----END PGP SIGNATURE----- Merge tag 'md/3.16' of git://neil.brown.name/md Pull md updates from Neil Brown: "Assorted md fixes for 3.16 Mostly performance improvements with a few corner-case bug fixes" * tag 'md/3.16' of git://neil.brown.name/md: raid5: speedup sync_request processing md/raid5: deadlock between retry_aligned_read with barrier io raid5: add an option to avoid copy data from bio to stripe cache md/bitmap: remove confusing code from filemap_get_page. raid5: avoid release list until last reference of the stripe md: md_clear_badblocks should return an error code on failure. md/raid56: Don't perform reads to support writes until stripe is ready. md: refuse to change shape of array if it is active but read-only	2014-06-11 08:33:41 -07:00
Eivind Sarto	053f5b6525	raid5: speedup sync_request processing The raid5 sync_request() processing calls handle_stripe() within the context of the resync-thread. The resync-thread issues the first set of read requests and this adds execution latency and slows down the scheduling of the next sync_request(). The current rebuild/resync speed of raid5 is not much faster than what rotational HDDs can sustain. Testing the following patch on a 6-drive array, I can increase the rebuild speed from 100 MB/s to 175 MB/s. The sync_request() now just sets STRIPE_HANDLE and releases the stripe. This creates some more parallelism between the resync-thread and raid5 kernel daemon. Signed-off-by: Eivind Sarto <esarto@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-06-10 11:02:01 +10:00
hui jiao	2844dc32ea	md/raid5: deadlock between retry_aligned_read with barrier io A chunk aligned read increases counter active_aligned_reads and decreases it after sub-device handle it successfully. But when a read error occurs, the read redispatched by raid5d, and the active_aligned_reads will not be decreased until we can grab a stripe head in retry_aligned_read. Now suppose, a barrier io comes, set conf->quiesce to 2, and wait until both active_stripes and active_aligned_reads are zero. The retried chunk aligned read gets stuck at get_active_stripe waiting until conf->quiesce becomes 0. Retry_aligned_read and barrier io are waiting each other now. One possible solution is that we ignore conf->quiesce, let the retried aligned read finish. I reproduced this deadlock and test this patch on centos6.0 Signed-off-by: NeilBrown <neilb@suse.de>	2014-06-05 17:18:19 +10:00
Shaohua Li	d592a99691	raid5: add an option to avoid copy data from bio to stripe cache The stripe cache has two goals: 1. cache data, so next time if data can be found in stripe cache, disk access can be avoided. 2. stable data. data is copied from bio to stripe cache and calculated parity. data written to disk is from stripe cache, so if upper layer changes bio data, data written to disk isn't impacted. In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES can guarantee 2 too. For 1, it's not common too. block plug mechanism will dispatch a bunch of sequentail small requests together. And since I'm using SSD, I'm using small chunk size. It's rare case stripe cache is really useful. So I'd like to avoid the copy from bio to stripe cache and it's very helpful for performance. In my 1M randwrite tests, avoid the copy can increase the performance more than 30%. Of course, this shouldn't be enabled by default. It's reported enabling BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to control it. Neilb: changed BUG_ON to WARN_ON Removed some assignments from raid5_build_block which are now not needed. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-05-29 16:59:47 +10:00
Eivind Sarto	cf170f3fa4	raid5: avoid release list until last reference of the stripe The (lockless) release_list reduces lock contention, but there is excessive queueing and dequeuing of stripes on this list. A stripe will currently be queued on the release_list with a stripe reference count > 1. This can cause the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount without doing any other useful processing of the stripe. The are two cases when the stripe can be put on the release_list multiple times before it is actually handled by the kernel thread(s). 1) make_request() activates the stripe processing in 4k increments. When a write request is large enough to span multiple chunks of a stripe_head, the first 4k chunk adds the stripe to the plug list. The next 4k chunk that is processed for the same stripe puts the stripe on the release_list with a refcount=2. This can cause the kernel thread to process and decrement the stripe before the stripe us unplugged, which again will put it back on the release_list. 2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe refcount is set to the number of active IO (for each chunk). The stripe is released as each IO complete, and can be queued and dequeued multiple times on the release_list, until its refcount finally reached zero. This simple patch will ensure a stripe is only queued on the release_list when its refcount=1 and is ready to be handled by the kernel thread(s). I added some instrumentation to raid5 and counted the number of times striped were queued on the release_list for a variety of write IO sizes. Without this patch the number of times stripes got queued on the release_list was 100-500% higher than with the patch. The excess queuing will increase with the IO size. The patch also improved throughput by 5-10%. Signed-off-by: Eivind Sarto <esarto@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-05-29 16:59:46 +10:00
NeilBrown	67f455486d	md/raid56: Don't perform reads to support writes until stripe is ready. If it is found that we need to pre-read some blocks before a write can succeed, we normally set STRIPE_DELAYED and don't actually perform the read until STRIPE_PREREAD_ACTIVE subsequently gets set. However for a degraded RAID6 we currently perform the reads as soon as we see that a write is pending. This significantly hurts throughput. So: - when handle_stripe_dirtying find a block that it wants on a device that is failed, set STRIPE_DELAY, instead of doing nothing, and - when fetch_block detects that a read might be required to satisfy a write, only perform the read if STRIPE_PREREAD_ACTIVE is set, and if we would actually need to read something to complete the write. This also helps RAID5, though less often as RAID5 supports a read-modify-write cycle. For RAID5 the read is performed too early only if the write is not a full 4K aligned write (i.e. no an R5_OVERWRITE). Also clean up a couple of horrible bits of formatting. Reported-by: Patrik Horník <patrik@dsl.sk> Signed-off-by: NeilBrown <neilb@suse.de>	2014-05-29 16:59:46 +10:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Shaohua Li	c7a6d35e46	raid5: fix a race of stripe count check I hit another BUG_ON with `e240c1839d`. In __get_priority_stripe(), stripe count equals to 0 initially. Between atomic_inc and BUG_ON, get_active_stripe() finds the stripe. So the stripe count isn't 1 any more. V2: keeps the BUG_ON suggested by Neil. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-17 17:05:28 +10:00
Shaohua Li	e240c1839d	raid5: get_active_stripe avoids device_lock For sequential workload (or request size big workload), get_active_stripe can find cached stripe. In this case, we always hold device_lock, which exposes a lot of lock contention for such workload. If stripe count isn't 0, we don't need hold the lock actually, since we just increase its count. And this is the hot code path for such workload. Unfortunately we must delete the BUG_ON. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 14:42:42 +10:00
Shaohua Li	27c0f68f07	raid5: make_request does less prepare wait In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a lot of contention for sequential workload (or big request size workload). For such workload, each bio includes several stripes. So we can just do prepare_to_wait/finish_wait once for the whold bio instead of every stripe. This reduces the lock contention completely for such workload. Random workload might have the similar lock contention too, but I didn't see it yet, maybe because my stroage is still not fast enough. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 14:42:38 +10:00
Oleg Nesterov	789b5e0315	md/raid5: Fix CPU hotplug callback registration Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Interestingly, the raid5 code can actually prevent double initialization and hence can use the following simplified form of callback registration: register_cpu_notifier(&foobar_cpu_notifier); get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); put_online_cpus(); A hotplug operation that occurs between registering the notifier and calling get_online_cpus(), won't disrupt anything, because the code takes care to perform the memory allocations only once. So reorganize the code in raid5 this way to fix the deadlock with callback registration. Cc: linux-raid@vger.kernel.org Cc: stable@vger.kernel.org (v2.6.32+) Fixes: `36d1c6476b` Signed-off-by: Oleg Nesterov <oleg@redhat.com> [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-02-13 13:46:45 +11:00
Linus Torvalds	53d8ab29f8	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...	2014-01-30 11:40:10 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
NeilBrown	7da9d450ab	md/raid5: close recently introduced race in stripe_head management. As release_stripe and __release_stripe decrement ->count and then manipulate ->lru both under ->device_lock, it is important that get_active_stripe() increments ->count and clears ->lru also under ->device_lock. However we currently list_del_init ->lru under the lock, but increment the ->count outside the lock. This can lead to races and list corruption. So move the atomic_inc(&sh->count) up inside the ->device_lock protected region. Note that we still increment ->count without device lock in the case where get_free_stripe() was called, and in fact don't take ->device_lock at all in that path. This is safe because if the stripe_head can be found by get_free_stripe, then the hash lock assures us the no-one else could possibly be calling release_stripe() at the same time. Fixes: `566c09c534` Cc: stable@vger.kernel.org (3.13) Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-22 11:45:03 +11:00
NeilBrown	9f97e4b128	md/raid5: fix long-standing problem with bitmap handling on write failure. Before a write starts we set a bit in the write-intent bitmap. When the write completes we clear that bit if the write was successful to all devices. However if the write wasn't fully successful we should not clear the bit. If the faulty drive is subsequently re-added, the fact that the bit is still set ensure that we will re-write the data that is missing. This logic is mediated by the STRIPE_DEGRADED flag - we only clear the bitmap bit when this flag is not set. Currently we correctly set the flag if a write starts when some devices are failed or missing. But we do not set the flag if some device failed during the write attempt. This is wrong and can result in clearing the bit inappropriately. So: set the flag when a write fails. This bug has been present since bitmaps were introduces, so the fix is suitable for any -stable kernel. Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-16 09:35:38 +11:00
NeilBrown	5af9bef72c	md/raid5: fix a recently broken BUG_ON(). commit `6d183de407` md/raid5: fix newly-broken locking in get_active_stripe. simplified a BUG_ON, but removed too much so now it sometimes fires when it shouldn't. When the STRIPE_EXPANDING flag is set, the stripe_head might be on a special list while multiple stripe_heads are collected, or it might not be on any list, even a 'free' list when the refcount is zero. As long as STRIPE_EXPANDING is set, it will be found and added back to a list eventually. So both of the BUG_ONs which test for the ->lru being empty or not need to avoid the case where STRIPE_EXPANDING is set. The patch which broke this was marked for -stable, so this patch needs to be applied to any branch that received `6d183de4` Fixes: `6d183de407` Cc: stable@vger.kernel.org (any release to which above was applied) Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:07 +11:00
NeilBrown	1cc03eb932	md/raid5: Fix possible confusion when multiple write errors occur. commit `5d8c71f9e5` md: raid5 crash during degradation Fixed a crash in an overly simplistic way which could leave R5_WriteError or R5_MadeGood set in the stripe cache for devices for which it is no longer relevant. When those devices are removed and spares added the flags are still set and can cause incorrect behaviour. commit `14a75d3e07` md/raid5: preferentially read from replacement device if possible. Fixed the same bug if a more effective way, so we can now revert the original commit. Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though) Fixes: `5d8c71f9e5` Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:07 +11:00
Kent Overstreet	c78afc6261	bcache/md: Use raid stripe size Now that we've got code for raid5/6 stripe awareness, bcache just needs to know about the stripes and when writing partial stripes is expensive - we probably don't want to enable this optimization for raid1 or 10, even though they have stripes. So add a flag to queue_limits. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-08 13:05:09 -08:00
Jens Axboe	b28bc9b38c	Linux 3.13-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU= =/4/R -----END PGP SIGNATURE----- Merge tag 'v3.13-rc6' into for-3.14/core Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c	2013-12-31 09:51:02 -07:00
NeilBrown	6d183de407	md/raid5: fix newly-broken locking in get_active_stripe. commit `566c09c534` raid5: relieve lock contention in get_active_stripe() modified the locking in get_active_stripe() reducing the range protected by the (highly contended) device_lock. Unfortunately it reduced the range too much opening up some races. One race can occur if get_priority_stripe runs between the test on sh->count and device_lock being taken. This will mean that sh->lru is not empty while get_active_stripe thinks ->count is zero resulting in a 'BUG' firing. Another race happens if __release_stripe is called immediately after sh->count is tested and found to be non-zero. If STRIPE_HANDLE is not set, get_active_stripe should increment ->active_stripes when it increments ->count from 0, but as it didn't think it was 0, it doesn't. Extending device_lock to cover the test on sh->count close these races. While we are here, fix the two BUG tests: -If count is zero, then lru really must not be empty, or we've lock the stripe_head somehow - no other tests are relevant. -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so testing it is pointless. Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com> Reviewed-by: Shaohua Li <shli@kernel.org> Fixes: `566c09c534` Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:15 +11:00
NeilBrown	0c775d5208	md/raid5: fix new memory-reference bug in alloc_thread_groups. In alloc_thread_groups, worker_groups is a pointer to an array, not an array of pointers. So worker_groups[i] is wrong. It should be &(*worker_groups)[i] Found-by: coverity Fixes: `60aaf93385` Reported-by: Ben Hutchings <bhutchings@solarflare.com> Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-28 11:00:04 +11:00
Kent Overstreet	7988613b0e	block: Convert bio_for_each_segment() to bvec_iter More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>	2013-11-23 22:33:49 -08:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Linus Torvalds	6d6e352c80	md update for 3.13. Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3 HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0 ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4 g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu 0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl 3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9 fk0KGrg31VA= =8RCg -----END PGP SIGNATURE----- Merge tag 'md/3.13' of git://neil.brown.name/md Pull md update from Neil Brown: "Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync" * tag 'md/3.13' of git://neil.brown.name/md: md/raid5: Use conf->device_lock protect changing of multi-thread resources. md/raid5: Before freeing old multi-thread worker, it should flush them. md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. UAPI: include <asm/byteorder.h> in linux/raid/md_p.h raid1: Rewrite the implementation of iobarrier. raid1: Add some macros to make code clearly. raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. raid1: Add a field array_frozen to indicate whether raid in freeze state. md: Convert use of typedef ctl_table to struct ctl_table md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. md: fix some places where mddev_lock return value is not checked. raid5: Retry R5_ReadNoMerge flag when hit a read error. raid5: relieve lock contention in get_active_stripe() raid5: relieve lock contention in get_active_stripe() wait: add wait_event_cmd() md/raid5.c: add proper locking to error path of raid5_start_reshape. md: fix calculation of stacking limits on level change. raid5: Use slow_path to release stripe when mddev->thread is null	2013-11-20 13:05:25 -08:00
majianpeng	60aaf93385	md/raid5: Use conf->device_lock protect changing of multi-thread resources. When we change group_thread_cnt from sysfs entry, it can OOPS. The kernel messages are: [ 135.299021] BUG: unable to handle kernel NULL pointer dereference at (null) [ 135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.299107] PGD 0 [ 135.299122] Oops: 0000 [#1] SMP [ 135.299144] Modules linked in: netconsole e1000e ptp pps_core [ 135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24 [ 135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000 [ 135.299283] RIP: 0010:[<ffffffff815188ab>] [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.299323] RSP: 0018:ffff8800b77a5c48 EFLAGS: 00010002 [ 135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008 [ 135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00 [ 135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000 [ 135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00 [ 135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70 [ 135.299479] FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 [ 135.299510] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0 [ 135.299559] Stack: [ 135.299570] ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300 [ 135.299611] 000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8 [ 135.299654] 000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98 [ 135.299696] Call Trace: [ 135.299711] [<ffffffff8107383e>] ? __wake_up+0x4e/0x70 [ 135.299733] [<ffffffff81518f88>] raid5d+0x4c8/0x680 [ 135.299756] [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0 [ 135.299781] [<ffffffff81524c9f>] md_thread+0x11f/0x170 [ 135.299804] [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40 [ 135.299826] [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110 [ 135.299850] [<ffffffff81069656>] kthread+0xc6/0xd0 [ 135.299871] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 135.299899] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0 [ 135.299923] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90 [ 135.300005] RIP [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.300005] RSP <ffff8800b77a5c48> [ 135.300005] CR2: 0000000000000000 [ 135.300005] ---[ end trace 504854e5bb7562ed ]--- [ 135.300005] Kernel panic - not syncing: Fatal exception This is because raid5d() can be running when the multi-thread resources are changed via system. We see need to provide locking. mddev->device_lock is suitable, but we cannot simple call alloc_thread_groups under this lock as we cannot allocate memory while holding a spinlock. So change alloc_thread_groups() to allocate and return the data structures, then raid5_store_group_thread_cnt() can take the lock while updating the pointers to the data structures. This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x stable series. Fixes: `b721420e87` Cc: stable@vger.kernel.org (3.12) Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Shaohua Li <shli@kernel.org>	2013-11-19 15:19:18 +11:00
majianpeng	d206dcfa98	md/raid5: Before freeing old multi-thread worker, it should flush them. When changing group_thread_cnt from sysfs entry, the kernel can oops. The kernel messages are: [ 740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.961476] PGD b9013067 PUD b651e067 PMD 0 [ 740.961503] Oops: 0000 [#1] SMP [ 740.961525] Modules linked in: netconsole e1000e ptp pps_core [ 740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23 [ 740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000 [ 740.961673] RIP: 0010:[<ffffffff81062570>] [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.961708] RSP: 0018:ffff88013a247e08 EFLAGS: 00010086 [ 740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400 [ 740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680 [ 740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d [ 740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00 [ 740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0 [ 740.961861] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000 [ 740.961893] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0 [ 740.962001] Stack: [ 740.962001] 0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680 [ 740.962001] ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0 [ 740.962001] ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8 [ 740.962001] Call Trace: [ 740.962001] [<ffffffff810639c6>] worker_thread+0x206/0x3f0 [ 740.962001] [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0 [ 740.962001] [<ffffffff81069656>] kthread+0xc6/0xd0 [ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 740.962001] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0 [ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84 [ 740.962001] RIP [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.962001] RSP <ffff88013a247e08> [ 740.962001] CR2: 0000000000000008 [ 740.962001] ---[ end trace 39181460000748de ]--- [ 740.962001] Kernel panic - not syncing: Fatal exception This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH. A worker is queued to handle them. But before calling raid5_do_work, raid5d handles those stripes making conf->active_stripe = 0. So mddev_suspend() can return. We might then free old worker resources before the queued raid5_do_work() handled them. When it runs, it crashes. raid5d() raid5_store_group_thread_cnt() queue_work mddev_suspend() handle_strips active_stripe=0 free(old worker resources) process_one_work raid5_do_work To avoid this, we should only flush the worker resources before freeing them. This fixes a bug introduced in 3.12 so is suitable for the 3.12.x stable series. Cc: stable@vger.kernel.org (3.12) Fixes: `b721420e87` Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Shaohua Li <shli@kernel.org>	2013-11-19 15:19:18 +11:00
majianpeng	e59aa23f4c	md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. For R5_ReadNoMerge,it mean this bio can't merge with other bios or request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the same work. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-19 15:19:18 +11:00
NeilBrown	c91abf5a35	md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread and that only happen when it finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test. So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge that latest step in a reshape as it may not be fully committed yet. Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't wait have to wait for the speed to drop before we can abort. Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-19 15:19:17 +11:00
Bian Yu	edfa1f651e	raid5: Retry R5_ReadNoMerge flag when hit a read error. Because of block layer merge, one bio fails will cause other bios which belongs to the same request fails, so raid5_end_read_request will record all these bios as badblocks. If retry request with R5_ReadNoMerge flag to avoid bios merge, badblocks can only record sector which is bad exactly. test: hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean mdadm /dev/md0 -f /dev/sdd mdadm /dev/md0 -r /dev/sdd mdadm --zero-superblock /dev/sdd mdadm /dev/md0 -a /dev/sdd 1. Without this patch: cat /sys/block/md0/md/rd/bad_blocks 299776 256 299776 256 2. With this patch: cat /sys/block/md0/md/rd/bad_blocks 300000 8 300000 8 Signed-off-by: Bian Yu <bianyu@kedacom.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-19 15:18:24 +11:00
Shaohua Li	4bda556aea	raid5: relieve lock contention in get_active_stripe() track empty inactive list count, so md_raid5_congested() can use it to make decision. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-19 15:18:22 +11:00
Christoph Hellwig	b89241e8cd	llists: move llist_reverse_order from raid5 to llist.c Make this useful helper available for other users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-11-15 09:32:22 +09:00
Shaohua Li	566c09c534	raid5: relieve lock contention in get_active_stripe() get_active_stripe() is the last place we have lock contention. It has two paths. One is stripe isn't found and new stripe is allocated, the other is stripe is found. The first path basically calls __find_stripe and init_stripe. It accesses conf->generation, conf->previous_raid_disks, conf->raid_disks, conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded, conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except stripe_hashtbl and inactive_list, other fields are changed very rarely. With this patch, we split inactive_list and add new hash locks. Each free stripe belongs to a specific inactive list. Which inactive list is determined by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a lock_hash assigned. Stripe's inactive list is protected by a hash lock, which is determined by it's lock_hash too. The lock_hash is derivied from current stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl list too. The goal of the new hash locks introduced is we can only use the new locks in the first path of get_active_stripe(). Since we have several hash locks, lock contention is relieved significantly. The first path of get_active_stripe() accesses other fields, since they are changed rarely, changing them now need take conf->device_lock and all hash locks. For a slow path, this isn't a problem. If we need lock device_lock and hash lock, we always lock hash lock first. The tricky part is release_stripe and friends. We need take device_lock first. Neil's suggestion is we put inactive stripes to a temporary list and readd it to inactive_list after device_lock is released. In this way, we add stripes to temporary list with device_lock hold and remove stripes from the list with hash lock hold. So we don't allow concurrent access to the temporary list, which means we need allocate temporary list for all participants of release_stripe. One downside is free stripes are maintained in their inactive list, they can't across between the lists. By default, we have total 256 stripes and 8 lists, so each list will have 32 stripes. It's possible one list has free stripe but other list hasn't. The chance should be rare because stripes allocation are even distributed. And we can always allocate more stripes for cache, several mega bytes memory isn't a big deal. This completely removes the lock contention of the first path of get_active_stripe(). It slows down the second code path a little bit though because we now need takes two locks, but since the hash lock isn't contended, the overhead should be quite small (several atomic instructions). The second path of get_active_stripe() (basically sequential write or big request size randwrite) still has lock contentions. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-14 15:20:58 +11:00
NeilBrown	ba8805b973	md/raid5.c: add proper locking to error path of raid5_start_reshape. If raid5_start_reshape errors out, we need to reset all the fields that were updated (not just some), and need to use the seq_counter to ensure make_request() doesn't use an inconsitent state. Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-14 15:16:15 +11:00
majianpeng	ad4068de49	raid5: Use slow_path to release stripe when mddev->thread is null When release_stripe() is called in grow_one_stripe(), the mddev->thread is null. So it will omit one wakeup this thread to release stripe. For this condition, use slow_path to release stripe. Bug was introduced in 3.12 Cc: stable@vger.kernel.org (3.12+) Fixes: `773ca82fa1` Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-14 15:16:15 +11:00
Shaohua Li	d47648fcf0	raid5: avoid finding "discard" stripe SCSI discard will damage discard stripe bio setting, eg, some fields are changed. If the stripe is reused very soon, we have wrong bios setting. We remove discard stripe from hash list, so next time the strip will be fully initialized. Suitable for backport to 3.7+. Cc: <stable@vger.kernel.org> (3.7+) Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-10-24 13:00:24 +11:00
Shaohua Li	37c61ff31e	raid5: set bio bi_vcnt 0 for discard request SCSI layer will add new payload for discard request. If two bios are merged to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse SCSI and cause oops. Suitable for backport to 3.7+ Cc: stable@vger.kernel.org (v3.7+) Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com>	2013-10-24 12:57:36 +11:00
Shaohua Li	bfc90cb093	raid5: only wakeup necessary threads If there are not enough stripes to handle, we'd better not always queue all available work_structs. If one worker can only handle small or even none stripes, it will impact request merge and create lock contention. With this patch, the number of work_struct running will depend on pending stripes number. Note: some statistics info used in the patch are accessed without locking protection. This should doesn't matter, we just try best to avoid queue unnecessary work_struct. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-09-02 10:31:29 +10:00
NeilBrown	4d77e3ba88	md/raid5: flush out all pending requests before proceeding with reshape. Some requests - particularly 'discard' and 'read' are handled differently depending on whether a reshape is active or not. It is harmless to assume reshape is active if it isn't but wrong to act as though reshape is not active when it is. So when we start reshape - after making clear to all requests that reshape has started - use mddev_suspend/mddev_resume to flush out all requests. This will ensure that no requests will be assuming the absence of reshape once it really starts. Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 16:58:44 +10:00
NeilBrown	c46501b2de	md/raid5: use seqcount to protect access to shape in make_request. make_request() access various shape parameters (raid_disks, chunk_size etc) which might be changed by raid5_start_reshape(). If the later is called at and awkward time during the form, the wrong stripe_head might be used. So introduce a 'seqcount' and after finding a stripe_head make sure there is no reason to expect that we got the wrong one. Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 16:58:36 +10:00
Shaohua Li	b721420e87	raid5: sysfs entry to control worker thread number Add a sysfs entry to control running workqueue thread number. If group_thread_cnt is set to 0, we will disable workqueue offload handling of stripes. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 16:56:52 +10:00
Shaohua Li	851c30c9ba	raid5: offload stripe handle to workqueue This is another attempt to create multiple threads to handle raid5 stripes. This time I use workqueue. raid5 handles request (especially write) in stripe unit. A stripe is page size aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a state machine for the corresponding stripe, which includes reading some disks of the stripe, calculating parity, and writing some disks of the stripe. The state machine is running in raid5d thread currently. Since there is only one thread, it doesn't scale well for high speed storage. An obvious solution is multi-threading. To get better performance, we have some requirements: a. locality. stripe corresponding to request submitted from one cpu is better handled in thread in local cpu or local node. local cpu is preferred but some times could be a bottleneck, for example, parity calculation is too heavy. local node running has wide adaptability. b. configurablity. Different setup of raid5 array might need diffent configuration. Especially the thread number. More threads don't always mean better performance because of lock contentions. My original implementation is creating some kernel threads. There are interfaces to control which cpu's stripe each thread should handle. And userspace can set affinity of the threads. This provides biggest flexibility and configurability. But it's hard to use and apparently a new thread pool implementation is disfavor. Recent workqueue improvement is quite promising. unbound workqueue will be bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to do affinity setting. For example, we can only include one HT sibling in affinity. Since work is non-reentrant by default, and we can control running thread number by limiting dispatched work_struct number. In this patch, I created several stripe worker group. A group is a numa node. stripes from cpus of one node will be added to a group list. Workqueue thread of one node will only handle stripes of worker group of the node. In this way, stripe handling has numa node locality. And as I said, we can control thread number by limiting dispatched work_struct number. The work_struct callback function handles several stripes in one run. A typical work queue usage is to run one unit in each work_struct. In raid5 case, the unit is a stripe. But we can't do that: a. Though handling a stripe doesn't need lock because of reference accounting and stripe isn't in any list, queuing a work_struct for each stripe will make workqueue lock contended very heavily. b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we might dispatch request. If each work_struct only handles one stripe, such block plug is meaningless. This implementation can't do very fine grained configuration. But the numa binding is most popular usage model, should be enough for most workloads. Note: since we have only one stripe queue, switching to multi-thread might decrease request size dispatching down to low level layer. The impact depends on thread number, raid configuration and workload. So multi-thread raid5 might not be proper for all setups. Changes V1 -> V2: 1. remove WQ_NON_REENTRANT 2. disabling multi-threading by default 3. Add more descriptions in changelog Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 16:46:38 +10:00
Shaohua Li	d265d9dc1d	raid5: fix stripe release order patch "make release_stripe lockless" changes the order stripes are released. Originally I thought block layer can take care of request merge, but it appears there are still some requests not merged. It's easy to fix the order. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 16:36:26 +10:00
Shaohua Li	773ca82fa1	raid5: make release_stripe lockless release_stripe still has big lock contention. We just add the stripe to a llist without taking device_lock. We let the raid5d thread to do the real stripe release, which must hold device_lock anyway. In this way, release_stripe doesn't hold any locks. The side effect is the released stripes order is changed. But sounds not a big deal, stripes are never handled in order. And I thought block layer can already do nice request merge, which means order isn't that important. I kept the unplug release batch, which is unnecessary with this patch from lock contention avoid point of view, and actually if we delete it, the stripe_head release_list and lru can share storage. But the unplug release batch is also helpful for request merge. We probably can delay wakeup raid5d till unplug, but I'm still afraid of the case which raid5d is running. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-08-28 11:55:53 +10:00
NeilBrown	f94c0b6658	md/raid5: fix interaction of 'replace' and 'recovery'. If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over. This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.replacing is set as well. In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened. For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test. The latter ensure that now IO has been flagged but not started. The former checks if any parity calculation has been flagged by not started. We must wait for both of these to complete before triggering the 'replace'. Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished. Finally if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning. This bug was introduced in commit `9a3e1101b8` (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix. Cc: stable@vger.kernel.org (3.3+) Reported-by: qindehua <13691222965@163.com> Tested-by: qindehua <qindehua@163.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-25 16:46:57 +10:00
NeilBrown	fdcfbbb653	md/raid5: allow 5-device RAID6 to be reshaped to 4-device. There is a bug in 'check_reshape' for raid5.c To checks that the new minimum number of devices is large enough (which is good), but it does so also after the reshape has started (bad). This is bad because - the calculation is now wrong as mddev->raid_disks has changed already, and - it is pointless because it is now too late to stop. So only perform that test when reshape has not been committed to. Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-04 16:42:52 +10:00
Jingoo Han	b29bebd66d	md: replace strict_strto() with kstrto() The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:26 +10:00
Linus Torvalds	82ea4be61f	A few bugfixes for md Some tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69 +zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4 jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ 72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/ ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X aoj69nQ3i/4= =8iy3 -----END PGP SIGNATURE----- Merge tag 'md-3.10-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.	2013-06-13 10:13:29 -07:00
H. Peter Anvin	5026d7a9b2	md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 14:49:54 +10:00
Kent Overstreet	4997b72ee6	raid5: Initialize bi_vcnt The patch that converted raid5 to use bio_reset() forgot to initialize bi_vcnt. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: NeilBrown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Tested-by: Ilia Mirkin <imirkin@alum.mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-05-30 08:44:39 +02:00
Linus Torvalds	4de13d7aa8	Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...	2013-05-08 10:13:35 -07:00
NeilBrown	c0b32972fb	md/raid5: avoid an extra write when writing to a known-bad-block. If we write to a known-bad-block it will be flags as having a ReadError by analyse_stripe, but the write will proceed anyway (as it should). Then the read-error handling will kick in an write again, then re-read. We don't need that 'write-again', so set R5_ReWrite so it looks like it has already been done. Then we will just get the re-read, which we want. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-24 11:42:42 +10:00
majianpeng	6f608040ce	md/raid5: Change or of some order to improve efficiency. As the function call is the most expensive of these tests it should be done later in the chain so that it can be avoided in some cases. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-24 11:42:41 +10:00
Linus Torvalds	0a82a8d132	Revert "block: add missing block_bio_complete() tracepoint" This reverts commit `3a366e614d`. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-18 09:00:26 -07:00
Jens Axboe	64f8de4da7	Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. `e3620a3ad5` ("MD RAID5: Avoid accessing gendisk or queue structs when not available") `2f6db2a707` ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-02 10:04:39 +02:00
Linus Torvalds	22c3f2fff6	A few bugfixes for md - recent regressions in raid5 - recent regressions in dmraid - a few instances of CONFIG_MULTICORE_RAID456 linger Several tagged for -stable -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUUzCwDnsnt1WYoG5AQJKMhAAsi2XhqLC4Dx19J8MTF6+cjfynWCxF2SC 3mMcVZm6yxSowixb1Ht72CyssWdJAi4vgaw0aLNH7b3CbPDZfTSfqLP4tSvyPfod aDcFDdd/RhHjDpJqZ52Tyc6QzBfyhwu+s9R+a78TSL47ZMjZpz1QpshG8Sm9JYTs z72VlIZeglzhWmzO1FInsL/oT/Hwr9IfpmJpuXBQQObDn3BgvZLuzZyCi35upqrM 711ei7CKaN0s/jKcWclNRtgUrr10XsgQ6PugOZbli09CC8ushHwvXe/VmxoQFg2+ Sj14YSfYAY+1QpOiuYc+knrWc7CtPGHgUqBzOoYWMxi9Lqpo5xhD1vkRsFhXxMSg GVnAnh/RXl7bGzGWaRv8twG4vU+qYOlEPNgO6/079AxCOrrNrstYrgjBxBSWuxrB 0UIFQGT69zA5G3cLbIRrXUxO8oIVeEx92YV1TOcgLKP5OXlp/0I8ajnA9b8KoPZa He04GdPlZMXTLAqq9MaQRdS0XzX8YQDWbUebqe+w5NW46sLbckkmxaNZs7fOYAfG CNHfeRsLp5v0oNbhNyCDSjxqH6uYwKCdCqmDxo6A+fmjmDruHQmZoAK8YISUtPtx u4M82jW6Z/xOg4pomxMl4SxzCDhy1pM8PYzyx7Mj82C4XBR8CkrQTP8XD+FQL2Ih KhId4tJzx6Q= =Rycs -----END PGP SIGNATURE----- Merge tag 'md-3.9-fixes' of git://neil.brown.name/md Pull md fixes from NeilBrown: "A few bugfixes for md - recent regressions in raid5 - recent regressions in dmraid - a few instances of CONFIG_MULTICORE_RAID456 linger Several tagged for -stable" * tag 'md-3.9-fixes' of git://neil.brown.name/md: md: remove CONFIG_MULTICORE_RAID456 entirely md/raid5: ensure sync and DISCARD don't happen at the same time. MD: Prevent sysfs operations on uninitialized kobjects MD RAID5: Avoid accessing gendisk or queue structs when not available md/raid5: schedule_construction should abort if nothing to do.	2013-03-23 15:49:49 -07:00
Kent Overstreet	2f6db2a707	raid5: use bio_reset() Had to shuffle the code around a bit (where bi_rw and bi_end_io were set), but shouldn't really be anything tricky here Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de>	2013-03-23 14:15:35 -07:00
Kent Overstreet	aa8b57aa3d	block: Use bio_sectors() more consistently Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>	2013-03-23 14:15:30 -07:00
Kent Overstreet	f73a1c7d11	block: Add bio_end_sector() Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2013-03-23 14:15:29 -07:00
NeilBrown	f8dfcffd04	md/raid5: ensure sync and DISCARD don't happen at the same time. A number of problems can occur due to races between resync/recovery and discard. - if sync_request calls handle_stripe() while a discard is happening on the stripe, it might call handle_stripe_clean_event before all of the individual discard requests have completed (so some devices are still locked, but not all). Since commit `ca64cae960` md/raid5: Make sure we clear R5_Discard when discard is finished. this will cause R5_Discard to be cleared for the parity device, so handle_stripe_clean_event() will not be called when the other devices do become unlocked, so their ->written will not be cleared. This ultimately leads to a WARN_ON in init_stripe and a lock-up. - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward time for resync, it can lead to s->uptodate being less than disks in handle_parity_checks5(), which triggers a BUG (because it is one). So: - keep R5_Discard on the parity device until all other devices have completed their discard request - make sure we don't try to have a 'discard' and a 'sync' action at the same time. This involves a new stripe flag to we know when a 'discard' is happening, and the use of R5_Overlap on the parity disk so when a discard is wanted while a sync is active, so we know to wake up the discard at the appropriate time. Discard support for RAID5 was added in 3.7, so this is suitable for any -stable kernel since 3.7. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-03-20 13:20:59 +11:00
Jonathan Brassow	e3620a3ad5	MD RAID5: Avoid accessing gendisk or queue structs when not available MD RAID5: Fix kernel oops when RAID4/5/6 is used via device-mapper Commit `a9add5d` (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver. However, when device-mapper is used to create RAID4/5/6 arrays, the mddev->gendisk and mddev->queue fields are not setup. Therefore, calling things like trace_block_bio_remap will cause a kernel oops. This patch conditionalizes those calls on whether the proper fields exist to make the calls. (Device-mapper will call trace_block_bio_remap on its own.) This patch is suitable for the 3.8.y stable kernel. Cc: stable@vger.kernel.org (v3.8+) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-03-20 13:16:57 +11:00
NeilBrown	ce7d363aaf	md/raid5: schedule_construction should abort if nothing to do. Since commit `1ed850f356` md/raid5: make sure to_read and to_write never go negative. It has been possible for handle_stripe_dirtying to be called when there isn't actually any work to do. It then calls schedule_reconstruction() which will set R5_LOCKED on the parity block(s) even when nothing else is happening. This then causes problems in do_release_stripe(). So add checks to schedule_reconstruction() so that if it doesn't find anything to do, it just aborts. This bug was introduced in v3.7, so the patch is suitable for -stable kernels since then. Cc: stable@vger.kernel.org (v3.7+) Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-03-20 12:16:51 +11:00
Linus Torvalds	a5e0d73163	md updates for 3.9 mostly little bugfixes. Only "feature" is a new RAID10 layout which slightly improves the number of sets of devices that can concurrently fail, without data loss. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUTPm+znsnt1WYoG5AQLLsw/+PMqr8roC4twgxTWV1NRbU8NtOcRi9Rj9 uvBS63uYAaLdi/D3UBKFYczmNCu9knuXbcp9SgFDxH7LlthQsWN/GYnif06pPo3w 9Agu5M8c062TJEG1vrnX6FhPO6pNgrWFr3h+CKkTiD3179i9DoQpP8LXQToeyMtI YRMQf/zCkxYtDvWAP0iwsEWtw8cf+q9I/uGPhQ1L+DnZapXYdbtnqWBRz9q6mrDt orcGrP41aZHvnOHUaTbwmaorCKkf/Ys4SMaGenrSFpnpQMypt7VgNuwHC59LxvJT 5eiFG/26zIsv7Wk0jv/TvFP5qzUPo0/PFkd5ug0ArvbVRiXS2cMJDwQvMdO1toxD i5Bb+P9DptadvoWhOTgIpxnG77yRH45wJvyJOk+ZfS1/IO87nCRa3d0yiNOU5e2/ o0VdXPZRr72sdKKTK6kQuYfwCPb+Z2Pz6Q8BJdk6GxlmTXyP6sKhIgwUX86534fE LrOxfK8qV+GetVu3X02RoX2CyJJRQHXyXmbHuSzXuo/JiOYtDigAydwNZChvf+tf OoMY9K8vgNbhnGsUG6la7XPvZ+6dZMjdnxp2HB99Ml5A3PWZd75i5T6IHHxIQFbD C3z9PWTWP+hK4k15DEyjlELtsE9WduGTXG4kUcf328xJ/7lj4VIImVugdCz+1B6z +HlI6BiLwzY= =YdVD -----END PGP SIGNATURE----- Merge tag 'md-3.9' of git://neil.brown.name/md Pull md updates from NeilBrown: "Mostly little bugfixes. Only "feature" is a new RAID10 layout which slightly improves the number of sets of devices that can concurrently fail, without data loss." * tag 'md-3.9' of git://neil.brown.name/md: md: expedite metadata update when switching read-auto -> active md: remove CONFIG_MULTICORE_RAID456 md/raid1,raid10: fix deadlock with freeze_array() md/raid0: improve error message when converting RAID4-with-spares to RAID0 md: raid0: fix error return from create_stripe_zones. md: fix two bugs when attempting to resize RAID0 array. DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) MD RAID10: Minor non-functional code changes md: raid1,10: Handle REQ_WRITE_SAME flag in write bios md: protect against crash upon fsync on ro array	2013-03-05 17:22:08 -08:00
Linus Torvalds	ee89f81252	Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch\|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front\|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...	2013-02-28 12:52:24 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
NeilBrown	51acbcec6c	md: remove CONFIG_MULTICORE_RAID456 This doesn't seem to actually help and we have an alternate multi-threading approach waiting in the wings, so just get rid of this config option and associated code. As a bonus, we remove one use of CONFIG_EXPERIMENTAL Cc: Dan Williams <djbw@fb.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-28 09:08:34 +11:00
Tejun Heo	3a366e614d	block: add missing block_bio_complete() tracepoint bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Linus Torvalds	ea88eeac0c	md update for 3.8 Mostly just little fixes. Probably biggest part is AVX accelerated RAID6 calculations. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh 2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0 jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO /eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11 b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211 oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX pDR/ey2CB6E= =uK52 -----END PGP SIGNATURE----- Merge tag 'md-3.8' of git://neil.brown.name/md Pull md update from Neil Brown: "Mostly just little fixes. Probably biggest part is AVX accelerated RAID6 calculations." * tag 'md-3.8' of git://neil.brown.name/md: md/raid5: add blktrace calls md/raid5: use async_tx_quiesce() instead of open-coding it. md: Use ->curr_resync as last completed request when cleanly aborting resync. lib/raid6: build proper files on corresponding arch lib/raid6: Add AVX2 optimized gen_syndrome functions lib/raid6: Add AVX2 optimized recovery functions md: Update checkpoint of resync/recovery based on time. md:Add place to update ->recovery_cp. md.c: re-indent various 'switch' statements. md: close race between removing and adding a device. md: removed unused variable in calc_sb_1_csm.	2012-12-18 09:32:44 -08:00
NeilBrown	a9add5d92b	md/raid5: add blktrace calls This makes it easier to trace what raid5 is doing. Signed-off-by: NeilBrown <neilb@suse.de>	2012-12-18 10:22:21 +11:00
Linus Torvalds	9228ff9038	Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block Pull block driver update from Jens Axboe: "Now that the core bits are in, here are the driver bits for 3.8. The branch contains: - A huge pile of drbd bits that were dumped from the 3.7 merge window. Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved. - A few cleanups from Akinobu Mita for drbd and cciss. - Queue improvement for loop from Lukas. This grew into adding a generic interface for waiting/checking an even with a specific lock, allowing this to be pulled out of md and now loop and drbd is also using it. - A few fixes for xen back/front block driver from Roger Pau Monne. - Partition improvements from Stephen Warren, allowing partiion UUID to be used as an identifier." * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits) drbd: update Kconfig to match current dependencies drbd: Fix drbdsetup wait-connect, wait-sync etc... commands drbd: close race between drbd_set_role and drbd_connect drbd: respect no-md-barriers setting also when changed online via disk-options drbd: Remove obsolete check drbd: fixup after wait_even_lock_irq() addition to generic code loop: Limit the number of requests in the bio list wait: add wait_event_lock_irq() interface xen-blkfront: free allocated page xen-blkback: move free persistent grants code block: partition: msdos: provide UUIDs for partitions init: reduce PARTUUID min length to 1 from 36 block: store partition_meta_info.uuid as a string cciss: use check_signature() cciss: cleanup bitops usage drbd: use copy_highpage drbd: if the replication link breaks during handshake, keep retrying drbd: check return of kmalloc in receive_uuids drbd: Broadcast sync progress no more often than once per second drbd: don't try to clear bits once the disk has failed ...	2012-12-17 13:39:11 -08:00
Linus Torvalds	a2013a13e6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial branch from Jiri Kosina: "Usual stuff -- comment/printk typo fixes, documentation updates, dead code elimination." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) HOWTO: fix double words typo x86 mtrr: fix comment typo in mtrr_bp_init propagate name change to comments in kernel source doc: Update the name of profiling based on sysfs treewide: Fix typos in various drivers treewide: Fix typos in various Kconfig wireless: mwifiex: Fix typo in wireless/mwifiex driver messages: i2o: Fix typo in messages/i2o scripts/kernel-doc: check that non-void fcts describe their return value Kernel-doc: Convention: Use a "Return" section to describe return values radeon: Fix typo and copy/paste error in comments doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c various: Fix spelling of "asynchronous" in comments. Fix misspellings of "whether" in comments. eisa: Fix spelling of "asynchronous". various: Fix spelling of "registered" in comments. doc: fix quite a few typos within Documentation target: iscsi: fix comment typos in target/iscsi drivers treewide: fix typo of "suport" in various comments and Kconfig treewide: fix typo of "suppport" in various comments ...	2012-12-13 12:00:02 -08:00
NeilBrown	749586b7d3	md/raid5: use async_tx_quiesce() instead of open-coding it. handle_stripe_expansion contains: if (tx) { async_tx_ack(tx); dma_wait_for_async_tx(tx); } which is very similar to the body of async_tx_quiesce(), except that the later handles an error from dma_wait_for_async_tx() (admittedly by panicing, but that decision belongs in the dma code, not the md code). So just us async_tx_quiesce(). Acked-by: Dan Williams <djbw@fb.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-12-13 19:52:32 +11:00
Lukas Czerner	eed8c02e68	wait: add wait_event_lock_irq() interface New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit moves the private wait_event_lock_irq() macro from MD to regular wait includes, introduces new macro wait_event_lock_irq_cmd() instead of using the old method with omitting cmd parameter which is ugly and makes a use of new macros in the MD. It also introduces the _interruptible_ variant. The use of new interface is when one have a special lock to protect data structures used in the condition, or one also needs to invoke "cmd" before putting it to sleep. All new macros are expected to be called with the lock taken. The lock is released before sleep and is reacquired afterwards. We will leave the macro with the lock held. Note to DM: IMO this should also fix theoretical race on waitqueue while using simultaneously wait_event_lock_irq() and wait_event() because of lack of locking around current state setting and wait queue removal. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-11-30 11:47:57 +01:00
NeilBrown	ca64cae960	md/raid5: Make sure we clear R5_Discard when discard is finished. commit `9e44476851` MD: raid5 avoid unnecessary zero page for trim change raid5 to clear R5_Discard when the complete request is handled rather than when submitting the per-device discard request. However it did not clear R5_Discard for the parity device. This means that if the stripe_head was reused before it expired from the cache, the setting would be wrong and a hang would result. Also if the R5_Uptodate bit happens to be set, R5_Discard again won't be cleared. But R5_Uptodate really should be clear at this point. So make sure R5_Discard is cleared in all cases, and clear R5_Uptodate when a 'discard' completes. Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-22 09:14:13 +11:00
NeilBrown	ef5b7c69b7	md/raid5: move resolving of reconstruct_state earlier in stripe_handle. The chunk of code in stripe_handle which responds to a *_result value in reconstruct_state is really the completion of some processing that happened outside of handle_stripe (possibly asynchronously) and so should be one of the first things done in handle_stripe(). After the next patch it will be important that it happens before handle_stripe_clean_event(), as that will clear some dev->flags bit that this code tests. Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-22 09:14:09 +11:00
NeilBrown	4ac6875eeb	md/raid5: round discard alignment up to power of 2. blkdev_issue_discard currently assumes that the granularity is a power of 2. So in raid5, round the chosen number up to avoid embarrassment. Cc: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-20 19:42:56 +11:00
Masanari Iida	83f0d77a7f	md: Fix typo in drivers/md Correct spelling typo in drivers/md. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-10-29 22:57:50 +01:00
Linus Torvalds	9db908806b	md updates for 3.7 "discard" support, some dm-raid improvements and other assorted bits and pieces. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+ +kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b /f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t /DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN 47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY 34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1 DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+ /ULNzkRU1w4= =XRmL -----END PGP SIGNATURE----- Merge tag 'md-3.7' of git://neil.brown.name/md Pull md updates from NeilBrown: - "discard" support, some dm-raid improvements and other assorted bits and pieces. * tag 'md-3.7' of git://neil.brown.name/md: (29 commits) md: refine reporting of resync/reshape delays. md/raid5: be careful not to resize_stripes too big. md: make sure manual changes to recovery checkpoint are saved. md/raid10: use correct limit variable md: writing to sync_action should clear the read-auto state. Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races md/raid5: make sure to_read and to_write never go negative. md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. md/raid5: protect debug message against NULL derefernce. md/raid5: add some missing locking in handle_failed_stripe. MD: raid5 avoid unnecessary zero page for trim MD: raid5 trim support md/bitmap:Don't use IS_ERR to judge alloc_page(). md/raid1: Don't release reference to device while handling read error. raid: replace list_for_each_continue_rcu with new interface add further __init annotations to crypto/xor.c DM RAID: Fix for "sync" directive ineffectiveness DM RAID: Fix comparison of index and quantity for "rebuild" parameter DM RAID: Add rebuild capability for RAID10 DM RAID: Move 'rebuild' checking code to its own function ...	2012-10-13 13:22:01 -07:00
NeilBrown	e56108d65f	md/raid5: be careful not to resize_stripes too big. When a RAID5 is reshaping, conf->raid_disks is increased before mddev->delta_disks becomes zero. This can result in check_reshape calling resize_stripes with a number that is too large. This particularly happens when md_check_recovery calls ->check_reshape(). If we use ->previous_raid_disks, we don't risk this. Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 14:24:13 +11:00
Jianpeng Ma	7f7583d420	Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races Now that multiple threads can handle stripes, it is safer to use an atomic64_t for resync_mismatches, to avoid update races. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 14:17:59 +11:00
NeilBrown	1ed850f356	md/raid5: make sure to_read and to_write never go negative. to_read and to_write are part of the result of analysing a stripe before handling it. Their use is to avoid some loops and tests if the values are known to be zero. Thus it is not a problem if they are a little bit larger than they should be. So decrementing them in handle_failed_stripe serves little value, and due to races it could cause some loops to be skipped incorrectly. So remove those decrements. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:50:13 +11:00
Alexander Lyakas	a7854487cd	md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Suggested-by: Yair Hershko <yair@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:50:12 +11:00
NeilBrown	b97390aec4	md/raid5: protect debug message against NULL derefernce. The pr_debug in add_stripe_bio could race with something changing *bip, so it is best to hold the lock until after the pr_debug. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:50:12 +11:00
NeilBrown	143c4d0573	md/raid5: add some missing locking in handle_failed_stripe. We really should hold the stripe_lock while accessing 'toread' else we could race with add_stripe_bio and corrupt a list. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:50:12 +11:00
Shaohua Li	9e44476851	MD: raid5 avoid unnecessary zero page for trim We want to avoid zero discarded dev page, because it's useless for discard. But if we don't zero it, another read/write hit such page in the cache and will get inconsistent data. To avoid zero the page, we don't set R5_UPTODATE flag after construction is done. In this way, discard write request is still issued and finished, but read will not hit the page. If the stripe gets accessed soon, we need reread the stripe, but since the chance is low, the reread isn't a big deal. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:49:49 +11:00
Shaohua Li	620125f2bf	MD: raid5 trim support Discard for raid4/5/6 has limitation. If discard request size is small, we do discard for one disk, but we need calculate parity and write parity disk. To correctly calculate parity, zero_after_discard must be guaranteed. Even it's true, we need do discard for one disk but write another disks, which makes the parity disks wear out fast. This doesn't make sense. So an efficient discard for raid4/5/6 should discard all data disks and parity disks, which requires the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than chunk_size, such pattern is almost impossible in practice. So in this patch, I only handle the case that A's size equals to chunk_size. That is discard request should be aligned to stripe size and its size is multiple of stripe size. Since we can only handle request with specific alignment and size (or part of the request fitting stripes), we can't guarantee zero_after_discard even zero_after_discard is true in low level drives. The block layer doesn't send down correctly aligned requests even correct discard alignment is set, so I must filter out. For raid4/5/6 parity calculation, if data is 0, parity is 0. So if zero_after_discard is true for all disks, data is consistent after discard. Otherwise, data might be lost. Let's consider a scenario: discard a stripe, write data to one disk and write parity disk. The stripe could be still inconsistent till then depending on using data from other data disks or parity disks to calculate new parity. If the disk is broken, we can't restore it. So in this patch, we only enable discard support if all disks have zero_after_discard. If discard fails in one disk, we face the similar inconsistent issue above. The patch will make discard follow the same path as normal write request. If discard fails, a resync will be scheduled to make the data consistent. This isn't good to have extra writes, but data consistency is important. If a subsequent read/write request hits raid5 cache of a discarded stripe, the discarded dev page should have zero filled, so the data is consistent. This patch will always zero dev page for discarded request stripe. This isn't optimal because discard request doesn't need such payload. Next patch will avoid it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:49:05 +11:00
Shaohua Li	4ed8731d8e	MD: change the parameter of md thread Change the thread parameter, so the thread can carry extra info. Next patch will use it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:34:00 +11:00
NeilBrown	cb13ff69d6	md/raid5: add missing spin_lock_init. commit `b17459c050` raid5: add a per-stripe lock added a spin_lock to the 'stripe_head' struct. Unfortunately there are two places where this struct is allocated but the spin lock was only initialised in one of them. So add the missing spin_lock_init. Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-24 16:27:20 +10:00
NeilBrown	e5c86471f9	md/raid5: fix calculate of 'degraded' when a replacement becomes active. When a replacement device becomes active, we mark the device that it replaces as 'faulty' so that it can subsequently get removed. However 'calc_degraded' only pays attention to the primary device, not the replacement, so the array appears to become degraded, which is wrong. So teach 'calc_degraded' to consider any replacement if a primary device is faulty. This is suitable for -stable as an incorrect 'degraded' value can confuse md and could lead to data corruption. This is only relevant for 3.3 and later. Cc: stable@vger.kernel.org Reported-by: Robin Hill <robin@robinhill.me.uk> Reported-by: John Drescher <drescherjm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-19 12:52:30 +10:00
NeilBrown	a852d7b8a0	Revert "md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE." This reverts commit `895e3c5c58`. While this patch seemed like a good idea and did help some workloads, it hurts other workloads. Large sequential O_DIRECT writes were faster, Small random O_DIRECT writes were slower. Other changes (batching RAID5 writes) have improved the sequential writes using a different mechanism, so the net result of this patch is definitely negative. So revert it. Reported-by: Shaohua Li <shli@kernel.org> Tested-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-19 12:48:30 +10:00
Linus Torvalds	25aa6a7ae4	Additional md update for 3.6 This contains a few patches that depend on plugging changes in the block layer so needs to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAUBnKUznsnt1WYoG5AQJOQA/+M7RoVnF63+TbGIqdNDotuF8FxvudCZBl Ou2yG47EOPtWf/RoqPyfpydDgdjyXsk4T5TfXoc0hsXVr4shCYo51uT9K34TMSDJ 2GzGWuyugRJFyvxW7PBgM+zFWlcVdgUGcwsdmIUMtHRz8Q10TqO5fE22RNLkhwOl fvGCK1KYnQqlG87DbulHWMo22vyZVic8jBqFSw55CPuuFMSJMxCw0rOPUnvk5Q8v jWzZzuUqrM8iiOxTDHsbCA0IleCbGl/m0tgk02Vj4tkCvz9N/xzQW2se0H6uECiK k8odbAiNBOh1q135sa7ASrBzxT+JqSiQ25rLheTEzzNxjFv6/NlntXmYu6HB+lD3 DoHAvRjgMxiLCdisW6TJb10NItitXwE/HSpQOVRxyYtINdzmhIDaCccgfN8ZMkho nmE/uzO+CAoCFpZC2C/nY8D0BZs5fw4hgDAsci66mvs+88dy+SoA4AbyNEMAusOS tiL8ZEjnYXvxTh3JFaMIaqQd6PkbahmtEtvorwXsUYUdY0ybkcs2FYVksvkgYdyW WlejOZVurY2i5biqck3UqjesxeJA5TMAlAUQR7vXu1Fa9fYFXZbqJom/KnPRTfek xerCWPMbhuzmcyEjUOGfjs6GFEnEmRT6Q6fN3CBaQMS2Q/z+6AkTOXKVl5Fhvoyl aeu1m8nZLuI= =ovN2 -----END PGP SIGNATURE----- Merge tag 'md-3.6' of git://neil.brown.name/md Pull additional md update from NeilBrown: "This contains a few patches that depend on plugging changes in the block layer so needed to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid." * tag 'md-3.6' of git://neil.brown.name/md: md/dm-raid: DM_RAID should select MD_RAID10 md/raid1: submit IO from originating thread instead of md thread. raid5: raid5d handle stripe in batch way raid5: make_request use batch stripe release	2012-08-02 11:34:40 -07:00
Shaohua Li	46a06401f6	raid5: raid5d handle stripe in batch way Let raid5d handle stripe in batch way to reduce conf->device_lock locking. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:33:15 +10:00
Shaohua Li	8811b5968f	raid5: make_request use batch stripe release make_request() does stripe release for every stripe and the stripe usually has count 1, which makes previous release_stripe() optimization not work. In my test, this release_stripe() becomes the heaviest pleace to take conf->device_lock after previous patches applied. Below patch makes stripe release batch. All the stripes will be released in unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe lru. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-02 08:33:00 +10:00
Linus Torvalds	eff0d13f38	Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous	2012-08-01 09:06:47 -07:00
NeilBrown	0021b7bc04	md: remove plug_cnt feature of plugging. This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
majianpeng	895e3c5c58	md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. 'sync' writes set both REQ_SYNC and REQ_NOIDLE. O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE. We currently assume that a REQ_SYNC request will not be followed by more requests and so set STRIPE_PREREAD_ACTIVE to expedite the request. This is appropriate for sync requests, but not for O_DIRECT requests. So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE rather than REQ_SYNC. This is consistent with the documented meaning of REQ_NOIDLE: __REQ_NOIDLE, /* don't anticipate more IO after this one */ Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:05:44 +10:00
majianpeng	3f9e7c140e	raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:04:21 +10:00
Shaohua Li	b17459c050	raid5: add a per-stripe lock Add a per-stripe lock to protect stripe specific data. The purpose is to reduce lock contention of conf->device_lock. stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio list of the stripe is always serialized by this lock, so adding bio to the lists (add_stripe_bio()) and removing bio from the lists (like ops_run_biofill()) not race. If bio in ->read, ->written ... list are not shared by multiple stripes, we don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will protect them. If the bio are shared, there are two protections: 1. bi_phys_segments acts as a reference count 2. traverse the list uses r5_next_bio, which makes traverse never access bio not belonging to the stripe Let's have an example: \| stripe1 \| stripe2 \| stripe3 \| ...bio1......\|bio2\|bio3\|....bio4..... stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2 because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and stripe3 will end_bio bio4. before add_stripe_bio() addes a bio to a stripe, we already increament the bio bi_phys_segments, so don't worry other stripes release the bio. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 16:01:31 +10:00
Shaohua Li	7eaf7e8eb3	raid5: remove unnecessary bitmap write optimization Neil pointed out the bitmap write optimization in handle_stripe_clean_event() is unnecessary, because the chance one stripe gets written twice in the mean time is rare. We can always do a bitmap_startwrite when a write request is added to a stripe and bitmap_endwrite after write request is done. Delete the optimization. With it, we can delete some cases of device_lock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 16:01:31 +10:00
Shaohua Li	e7836bd6f6	raid5: lockless access raid5 overrided bi_phys_segments Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold, which is unnecessary, We can make it lockless actually. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 16:01:31 +10:00
Shaohua Li	4eb788df67	raid5: reduce chance release_stripe() taking device_lock release_stripe() is a place conf->device_lock is heavily contended. We take the lock even stripe count isn't 1, which isn't required. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 16:01:31 +10:00
NeilBrown	b357f04a67	md: fix up plugging (again). The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug before queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 17:45:31 +10:00
Shaohua Li	fab363b5ff	raid5: delayed stripe fix There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but the two bits have relationship. A delayed stripe can be moved to hold list only when preread active stripe count is below IO_THRESHOLD. If a stripe has both the bits set, such stripe will be in delayed list and preread count not 0, which will make such stripe never leave delayed list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:57:19 +10:00
majianpeng	2e8ac30312	md/raid456: When read error cannot be recovered, record bad block We may not be able to fix a bad block if: - the array is degraded - the over-write fails. In these cases we currently eject the device, but we should record a bad block if possible. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:57:02 +10:00
NeilBrown	0232605d98	md: make 'name' arg to md_register_thread non-optional. Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:56:52 +10:00
NeilBrown	5f066c632f	md/raid5: fix refcount problem when blocked_rdev is set. commit `43220aa0f2` md/raid5: fix a hang on device failure. fixed a hang, but introduced a refcounting in-balance so that if the presence of bad-blocks ever caused an rdev to be 'blocked' we would increment the refcount on the rdev and never decrement it. So added the needed rdev_dec_pending when md_wait_for_blocked_rdev is not called. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:13:29 +10:00
majianpeng	1850753d2e	md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement nr_pending so we lose the reference we hold on the rdev. So atomic_inc it first to maintain the reference. This bug was introduced by commit `73e92e51b7` md/raid5. Don't write to known bad block on doubtful devices. which appeared in 3.0, so patch is suitable for stable kernels since then. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:11:54 +10:00
majianpeng	6c0544e255	md/raid5: Do not add data_offset before call to is_badblock In chunk_aligned_read() we are adding data_offset before calling is_badblock. But is_badblock also adds data_offset, so that is bad. So move the addition of data_offset to after the call to is_badblock. This bug was introduced by commit `31c176ecdf` md/raid5: avoid reading from known bad blocks. which first appeared in 3.0. So that patch is suitable for any -stable kernel from 3.0.y onwards. However it will need minor revision for most of those (as the comment didn't appear until recently). Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:09:57 +10:00
NeilBrown	5cfb22a1f8	md/raid5: prefer replacing failed devices over want-replacement devices. If a RAID5 has both a failed device and a device marked as 'WantReplacement', then we should preferentially replace the failed device. However the current code replaces whichever is found first. So split into 2 loops, check fail failed/missing first, and only check for WantReplacement if nothing is failed or missing. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 11:46:53 +10:00
NeilBrown	da7613b8b0	md/raid5: improve removal of extra devices after reshape. After a reshape which reduced the number of devices we need to disconnect the extra devices. The code for this doesn't currently handle 'replacement' devices. It is very unlikely that such devices will be present, but it is safest to handle them anyway. So simplify the handling. Just clear In_sync and leave it to remove_and_add_spaces (which will be called soon) to do the real works. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:33 +10:00
NeilBrown	30b67645fa	md/raid5: Allow reshape while a bitmap is present. We always should have allowed this. A raid5 reshape doesn't change the size of the bitmap, so not need to restrict it. Also add a test to make sure we don't try to start a reshape on a failed array. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:28 +10:00
NeilBrown	a4a6125a07	md: allow array to be resized while bitmap is present. Now that bitmaps can be resized, we can allow an array to be resized while the bitmap is present. This only covers resizing that involves changing the effective size of member devices, not resizing that changes the number of devices. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:27 +10:00
Shaohua Li	bc0934f047	raid5: support sync request REQ_SYNC is ignored in current raid5 code. Block layer does use it to do policy, for example ioscheduler. This patch adds it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:05 +10:00
Shaohua Li	cceeca43b5	raid5: remove unused variables The two variables are useless. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:04 +10:00
NeilBrown	b5254dd5fd	md/raid5: allow for change in data_offset while managing a reshape. The important issue here is incorporating the different in data_offset into calculations concerning when we might need to over-write data that is still thought to be valid. To this end we find the minimum offset difference across all devices and add that where appropriate. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:01 +10:00
NeilBrown	05616be5e1	md/raid5: Use correct data_offset for all IO. As there can now be two different data_offsets - an 'old' and a 'new' - we need to carefully choose between them. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
NeilBrown	c6563a8c38	md: add possibility to change data-offset for devices. When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
NeilBrown	2c810cddc4	md: allow a reshape operation to be reversed. Currently a reshape operation always progresses from the start of the array to the end unless the number of devices is being reduced, in which case it progressed in the opposite direction. To reverse a partial reshape which changes the number of devices you can stop the array and re-assemble with the raid-disks numbers reversed and it will undo. However for a reshape that does not change the number of devices it is not possible to reverse the reshape in the middle - you have to wait until it completes. So add a 'reshape_direction' attribute with is either 'forwards' or 'backwards' and can be explicitly set when delta_disks is zero. This will become more important when we allow the data_offset to change in a reshape. Then the explicit statement of what direction is being used will be more useful. This can be enabled in raid5 trivially as it already supports reverse reshape and just needs to use a different trigger to request it. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
majianpeng	c6d2e084c7	md/raid5: Fix a bug about judging if the operation is syncing or replacing When create a raid5 using assume-clean and echo check or repair to sync_action.Then component disks did not operated IO but the raid check/resync faster than normal. Because the judgement in function analyse_stripe(): if (do_recovery \|\| sh->sector >= conf->mddev->recovery_cp) s->syncing = 1; else s->replacing = 1; When check or repair,the recovery_cp == MaxSectore,so syncing equal zero not one. This bug was introduced by commit `9a3e1101b8` md/raid5: detect and handle replacements during recovery. so this patch is suitable for 3.3-stable. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-04-03 15:37:38 +10:00
NeilBrown	18b9837ea0	md/raid5: fix handling of bad blocks during recovery. 1/ We can only treat a known-bad-block like a read-error if we have the data that belongs in that block. So fix that test. 2/ If we cannot recovery a stripe due to insufficient data, don't tell "md_done_sync" that the sync failed unless we really did fail something. If we successfully record bad blocks, that is success. Reported-by: "majianpeng" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-04-03 15:36:17 +10:00
NeilBrown	dafb20fa34	md: tidy up rdev_for_each usage. md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicity list_for_each entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:39 +11:00
NeilBrown	dc10c643e8	md: allow re-add to failed arrays. When an array is failed (some data inaccessible) then there is no point attempting to add a spare as it could not possibly be recovered. However that may be value in re-adding a recently removed device. e.g. if there is a write-intent-bitmap and it is clear, then access to the data could be restored by this action. So don't reject a re-add to a failed array for RAID10 and RAID5 (the only arrays types that check for a failed array). Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:37 +11:00
majianpeng	41fe75f60b	md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read(). Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-13 11:21:25 +11:00
NeilBrown	9d4c7d8799	md/raid5: removed unused 'added_devices' variable. commit `908f4fbd26` removed the last user of this variable, so we should discard it completely. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-13 11:21:21 +11:00
NeilBrown	1e3fa9bd50	md/raid5: make sure reshape_position is cleared on error path. Leaving a valid reshape_position value in place could be confusing. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-13 11:21:18 +11:00
NeilBrown	3a6de2924a	md/raid5: Mark device want_replacement when we see a write error. Now that WantReplacement drives are replaced cleanly, mark a drive as WantReplacement when we see a write error. It might get failed soon so the WantReplacement flag is irrelevant, but if the write error is recorded in the bad block log, we still want to activate any spare that might be available. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:54 +11:00
NeilBrown	7bfec5f35c	md/raid5: If there is a spare and a want_replacement device, start replacement. When attempting to add a spare to a RAID[456] array, also consider adding it as a replacement for a want_replacement device. This requires that common md code attempt hot_add even when the array is not formally degraded. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	17045f52ac	md/raid5: recognise replacements when assembling array. If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	dd054fce88	md/raid5: handle activation of replacement device when recovery completes. When recovery completes - as reported by a call to ->spare_active, we clear In_sync on the original and set it on the replacement. Then when the original gets removed we move the replacement from 'replacement' to 'rdev'. This could race with other code that is looking at these pointers, so we use memory barriers and careful ordering to ensure that a reader might see one device twice, but never no devices. Then the readers guard against using both devices, which could only happen when writing. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	9a3e1101b8	md/raid5: detect and handle replacements during recovery. During recovery we want to write to the replacement but not the original. So we have two new flags - R5_NeedReplace if this stripe has a replacement that needs to be written at some stage - R5_WantReplace if NeedReplace, and the data is available, and a 'sync' has been requested on this stripe. We also distinguish between 'sync and replace' which need to read all other devices, and 'replace' which only needs to read the devices being replaced. Note that during resync we always write to any replacement device. It might not need to be written to, but as we don't read to compare, we have to write to be sure. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	977df36255	md/raid5: writes should get directed to replacement as well as original. When writing, we need to submit two writes, one to the original, and one to the replacement - if there is a replacement. If the write to the replacement results in a write error, we just fail the device. We only try to record write errors to the original. When writing for recovery, we shouldn't write to the original. This will be addressed in a subsequent patch that generally addresses recovery. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	657e3e4d88	md/raid5: allow removal for failed replacement devices. Enhance raid5_remove_disk to be able to remove ->replacement as well as ->rdev. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:52 +11:00
NeilBrown	14a75d3e07	md/raid5: preferentially read from replacement device if possible. If a replacement device is present and has been recovered far enough, then use it for reading into the stripe cache. If we get an error we don't try to repair it, we just fail the device. A replacement device that gives errors does not sound sensible. This requires removing the setting of R5_ReadError when we get a read error during a read that bypasses the cache. It was probably a bad idea anyway as we don't know that every block in the read caused an error, and it could cause ReadError to be set for the replacement device, which is bad. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:52 +11:00
NeilBrown	995c4275a7	md/raid5: remove redundant bio initialisations. We current initialise some fields of a bio when preparing a stripe_head, and again just before submitting the request. Remove the duplication by only setting the fields that lower level devices don't touch in raid5_build_block, and only set the changeable fields in ops_run_io. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:52 +11:00
NeilBrown	671488cc25	md/raid5: allow each slot to have an extra replacement device Just enhance data structures to record a second device per slot to be used as a 'replacement' device, replacing the original. We also have a second bio in each slot in each stripe_head. This will only be used when writing to the array - we need to write to both the original and the replacement at the same time, so will need two bios. For now, only try using the replacement drive for aligned-reads. In this case, we prefer the replacement if it has been recovered far enough, otherwise use the original. This includes a small enhancement. Previously we would only do aligned reads if the target device was fully recovered. Now we also do them if it has recovered far enough. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:52 +11:00
NeilBrown	b8321b68d1	md: change hot_remove_disk to take an rdev rather than a number. Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
NeilBrown	908f4fbd26	md/raid5: be more thorough in calculating 'degraded' value. When an array is being reshaped to change the number of devices, the two halves can be differently degraded. e.g. one could be missing a device and the other not. So we need to be more careful about calculating the 'degraded' attribute. Instead of just inc/dec at appropriate times, perform a full re-calculation examining both possible cases. This doesn't happen often so it not a big cost, and we already have most of the code to do it. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:50 +11:00
NeilBrown	30d7a48368	md/raid5: ensure correct assessment of drives during degraded reshape. While reshaping a degraded array (as when reshaping a RAID0 by first converting it to a degraded RAID4) we currently get confused about which devices are in_sync. In most cases we get it right, but in the region that is being reshaped we need to treat non-failed devices as in-sync when we have the data but haven't actually written it out yet. Reported-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 09:57:00 +11:00
Adam Kwolek	5d8c71f9e5	md: raid5 crash during degradation NULL pointer access causes crash in raid5 module. Signed-off-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-09 14:26:11 +11:00
NeilBrown	9283d8c5af	md/raid5: never wait for bad-block acks on failed device. Once a device is failed we really want to completely ignore it. It should go away soon anyway. In particular the presence of bad blocks on it should not cause us to block as we won't be trying to write there anyway. So as soon as we can check if a device is Faulty, do so and pretend that it is already gone if it is Faulty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-08 16:27:57 +11:00
Dan Williams	257a4b42af	md/raid5: STRIPE_ACTIVE has lock semantics, add barriers All updates that occur under STRIPE_ACTIVE should be globally visible when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but clear_bit() does not. This is suitable for 3.1-stable. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2011-11-08 16:22:06 +11:00

... 3 4 5 6 7 ...

840 Коммитов