From 1de6129f381b4907013ccea08a3bdea8c966d50a Mon Sep 17 00:00:00 2001 From: FUJITA Tomonori Date: Fri, 18 Dec 2009 12:37:48 +0100 Subject: [PATCH 01/17] block: remove Documentation/block/as-iosched.txt Commit 492af6350a5ccf087e4964104a276ed358811458 removed the AS IO scheduler, so remove its documentation too. Signed-off-by: FUJITA Tomonori Signed-off-by: Jens Axboe --- Documentation/block/00-INDEX | 2 - Documentation/block/as-iosched.txt | 172 ----------------------------- 2 files changed, 174 deletions(-) delete mode 100644 Documentation/block/as-iosched.txt diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX index 961a0513f8c3..a406286f6f3e 100644 --- a/Documentation/block/00-INDEX +++ b/Documentation/block/00-INDEX @@ -1,7 +1,5 @@ 00-INDEX - This file -as-iosched.txt - - Anticipatory IO scheduler barrier.txt - I/O Barriers biodoc.txt diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt deleted file mode 100644 index 738b72be128e..000000000000 --- a/Documentation/block/as-iosched.txt +++ /dev/null @@ -1,172 +0,0 @@ -Anticipatory IO scheduler -------------------------- -Nick Piggin 13 Sep 2003 - -Attention! Database servers, especially those using "TCQ" disks should -investigate performance with the 'deadline' IO scheduler. Any system with high -disk performance requirements should do so, in fact. - -If you see unusual performance characteristics of your disk systems, or you -see big performance regressions versus the deadline scheduler, please email -me. Database users don't bother unless you're willing to test a lot of patches -from me ;) its a known issue. - -Also, users with hardware RAID controllers, doing striping, may find -highly variable performance results with using the as-iosched. The -as-iosched anticipatory implementation is based on the notion that a disk -device has only one physical seeking head. A striped RAID controller -actually has a head for each physical device in the logical RAID device. - -However, setting the antic_expire (see tunable parameters below) produces -very similar behavior to the deadline IO scheduler. - -Selecting IO schedulers ------------------------ -Refer to Documentation/block/switching-sched.txt for information on -selecting an io scheduler on a per-device basis. - -Anticipatory IO scheduler Policies ----------------------------------- -The as-iosched implementation implements several layers of policies -to determine when an IO request is dispatched to the disk controller. -Here are the policies outlined, in order of application. - -1. one-way Elevator algorithm. - -The elevator algorithm is similar to that used in deadline scheduler, with -the addition that it allows limited backward movement of the elevator -(i.e. seeks backwards). A seek backwards can occur when choosing between -two IO requests where one is behind the elevator's current position, and -the other is in front of the elevator's position. If the seek distance to -the request in back of the elevator is less than half the seek distance to -the request in front of the elevator, then the request in back can be chosen. -Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. -This favors forward movement of the elevator, while allowing opportunistic -"short" backward seeks. - -2. FIFO expiration times for reads and for writes. - -This is again very similar to the deadline IO scheduler. 
The expiration -times for requests on these lists is tunable using the parameters read_expire -and write_expire discussed below. When a read or a write expires in this way, -the IO scheduler will interrupt its current elevator sweep or read anticipation -to service the expired request. - -3. Read and write request batching - -A batch is a collection of read requests or a collection of write -requests. The as scheduler alternates dispatching read and write batches -to the driver. In the case a read batch, the scheduler submits read -requests to the driver as long as there are read requests to submit, and -the read batch time limit has not been exceeded (read_batch_expire). -The read batch time limit begins counting down only when there are -competing write requests pending. - -In the case of a write batch, the scheduler submits write requests to -the driver as long as there are write requests available, and the -write batch time limit has not been exceeded (write_batch_expire). -However, the length of write batches will be gradually shortened -when read batches frequently exceed their time limit. - -When changing between batch types, the scheduler waits for all requests -from the previous batch to complete before scheduling requests for the -next batch. - -The read and write fifo expiration times described in policy 2 above -are checked only when in scheduling IO of a batch for the corresponding -(read/write) type. So for example, the read FIFO timeout values are -tested only during read batches. Likewise, the write FIFO timeout -values are tested only during write batches. For this reason, -it is generally not recommended for the read batch time -to be longer than the write expiration time, nor for the write batch -time to exceed the read expiration time (see tunable parameters below). - -When the IO scheduler changes from a read to a write batch, -it begins the elevator from the request that is on the head of the -write expiration FIFO. Likewise, when changing from a write batch to -a read batch, scheduler begins the elevator from the first entry -on the read expiration FIFO. - -4. Read anticipation. - -Read anticipation occurs only when scheduling a read batch. -This implementation of read anticipation allows only one read request -to be dispatched to the disk controller at a time. In -contrast, many write requests may be dispatched to the disk controller -at a time during a write batch. It is this characteristic that can make -the anticipatory scheduler perform anomalously with controllers supporting -TCQ, or with hardware striped RAID devices. Setting the antic_expire -queue parameter (see below) to zero disables this behavior, and the -anticipatory scheduler behaves essentially like the deadline scheduler. - -When read anticipation is enabled (antic_expire is not zero), reads -are dispatched to the disk controller one at a time. -At the end of each read request, the IO scheduler examines its next -candidate read request from its sorted read list. If that next request -is from the same process as the request that just completed, -or if the next request in the queue is "very close" to the -just completed request, it is dispatched immediately. Otherwise, -statistics (average think time, average seek distance) on the process -that submitted the just completed request are examined. 
If it seems -likely that that process will submit another request soon, and that -request is likely to be near the just completed request, then the IO -scheduler will stop dispatching more read requests for up to (antic_expire) -milliseconds, hoping that process will submit a new request near the one -that just completed. If such a request is made, then it is dispatched -immediately. If the antic_expire wait time expires, then the IO scheduler -will dispatch the next read request from the sorted read queue. - -To decide whether an anticipatory wait is worthwhile, the scheduler -maintains statistics for each process that can be used to compute -mean "think time" (the time between read requests), and mean seek -distance for that process. One observation is that these statistics -are associated with each process, but those statistics are not associated -with a specific IO device. So for example, if a process is doing IO -on several file systems on separate devices, the statistics will be -a combination of IO behavior from all those devices. - - -Tuning the anticipatory IO scheduler ------------------------------------- -When using 'as', the anticipatory IO scheduler there are 5 parameters under -/sys/block/*/queue/iosched/. All are units of milliseconds. - -The parameters are: -* read_expire - Controls how long until a read request becomes "expired". It also controls the - interval between which expired requests are served, so set to 50, a request - might take anywhere < 100ms to be serviced _if_ it is the next on the - expired list. Obviously request expiration strategies won't make the disk - go faster. The result basically equates to the timeslice a single reader - gets in the presence of other IO. 100*((seek time / read_expire) + 1) is - very roughly the % streaming read efficiency your disk should get with - multiple readers. - -* read_batch_expire - Controls how much time a batch of reads is given before pending writes are - served. A higher value is more efficient. This might be set below read_expire - if writes are to be given higher priority than reads, but reads are to be - as efficient as possible when there are no writes. Generally though, it - should be some multiple of read_expire. - -* write_expire, and -* write_batch_expire are equivalent to the above, for writes. - -* antic_expire - Controls the maximum amount of time we can anticipate a good read (one - with a short seek distance from the most recently completed request) before - giving up. Many other factors may cause anticipation to be stopped early, - or some processes will not be "anticipated" at all. Should be a bit higher - for big seek time devices though not a linear correspondence - most - processes have only a few ms thinktime. - -In addition to the tunables above there is a read-only file named est_time -which, when read, will show: - - - The probability of a task exiting without a cooperating task - submitting an anticipated IO. - - - The current mean think time. - - - The seek distance used to determine if an incoming IO is better. - From 4a63b030d75a063b910b2bab014d84837cb33eb7 Mon Sep 17 00:00:00 2001 From: Roel Kluin Date: Fri, 18 Dec 2009 12:38:11 +0100 Subject: [PATCH 02/17] drbd: fix test of unsigned in _drbd_fault_random() rsp->count is unsigned so the test does not work. 
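A minimal user-space sketch of the bug (hypothetical names, not part of the patch): an unsigned counter can never compare below zero, so the pre-decrement test is dead code, while testing for zero before the post-decrement fires exactly when the counter runs out:

	#include <stdio.h>

	int main(void)
	{
		unsigned int count = 0;	/* like rsp->count */

		/* Old test: --count wraps to UINT_MAX, and "< 0" is always
		 * false for an unsigned value, so the refresh never runs. */
		if (--count < 0)
			printf("refresh (unreachable)\n");

		count = 0;

		/* New test: fires when the counter has reached zero. */
		if (!count--)
			printf("refresh\n");

		return 0;
	}
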
Signed-off-by: Roel Kluin Cc: Lars Ellenberg Cc: Philipp Reisner Signed-off-by: Jens Axboe --- drivers/block/drbd/drbd_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 157d1e4343c2..c2594c137e65 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -3623,7 +3623,7 @@ _drbd_fault_random(struct fault_random_state *rsp) { long refresh; - if (--rsp->count < 0) { + if (!rsp->count--) { get_random_bytes(&refresh, sizeof(refresh)); rsp->state += refresh; rsp->count = FAULT_RANDOM_REFRESH; From 1db32c40600437c5e049796bd32f49f61244c6ef Mon Sep 17 00:00:00 2001 From: Vivek Goyal Date: Wed, 16 Dec 2009 17:52:57 -0500 Subject: [PATCH 03/17] cfq-iosched: Remove the check for same cfq group from allow_merge o allow_merge() already checks if submitting task is pointing to same cfqq as rq has been queued in. If everything is fine, we should not be having a task in one cgroup and having a pointer to cfqq in other cgroup. Well I guess in some situations it can happen and that is, when a random IO queue has been moved into root cgroup for group_isolation=0. In this case, tasks's cgroup/group is different from where actually cfqq is, but this is intentional and in this case merging should be allowed. The second situation is where due to close cooperator patches, multiple processes can be sharing a cfqq. If everything implemented right, we should not end up in a situation where tasks from different processes in different groups are sharing the same cfqq as we allow merging of cooperating queues only if they are in same group. Signed-off-by: Vivek Goyal Reviewed-by: Gui Jianfeng Signed-off-by: Jens Axboe --- block/cfq-iosched.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index e2f80463ed0d..a0e5347767d9 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1513,9 +1513,6 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq, struct cfq_io_context *cic; struct cfq_queue *cfqq; - /* Deny merge if bio and rq don't belong to same cfq group */ - if ((RQ_CFQQ(rq))->cfqg != cfq_get_cfqg(cfqd, 0)) - return false; /* * Disallow merge of a sync bio into an async request. */ From fb104db41e6e006c85ce1097f372cd1e10c1755c Mon Sep 17 00:00:00 2001 From: Vivek Goyal Date: Wed, 16 Dec 2009 17:52:58 -0500 Subject: [PATCH 04/17] cfq-iosched: Get rid of nr_groups o Currently code does not seem to be using cfqd->nr_groups. Get rid of it. 
Signed-off-by: Vivek Goyal Reviewed-by: Gui Jianfeng Signed-off-by: Jens Axboe --- block/cfq-iosched.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index a0e5347767d9..d9bfa09e68c1 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -208,8 +208,6 @@ struct cfq_data { /* Root service tree for cfq_groups */ struct cfq_rb_root grp_service_tree; struct cfq_group root_group; - /* Number of active cfq groups on group service tree */ - int nr_groups; /* * The priority currently being served @@ -842,7 +840,6 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg) __cfq_group_service_tree_add(st, cfqg); cfqg->on_st = true; - cfqd->nr_groups++; st->total_weight += cfqg->weight; } @@ -863,7 +860,6 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg) cfq_log_cfqg(cfqd, cfqg, "del_from_rr group"); cfqg->on_st = false; - cfqd->nr_groups--; st->total_weight -= cfqg->weight; if (!RB_EMPTY_NODE(&cfqg->rb_node)) cfq_rb_erase(&cfqg->rb_node, st); From 65b32a573eefa1cdd3cbe5ea59326308e6c3b9ad Mon Sep 17 00:00:00 2001 From: Vivek Goyal Date: Wed, 16 Dec 2009 17:52:59 -0500 Subject: [PATCH 05/17] cfq-iosched: Remove prio_change logic for workload selection o CFQ now internally divides cfq queues in therr workload categories. sync-idle, sync-noidle and async. Which workload to run depends primarily on rb_key offset across three service trees. Which is a combination of mulitiple things including what time queue got queued on the service tree. There is one exception though. That is if we switched the prio class, say we served some RT tasks and again started serving BE class, then with-in BE class we always started with sync-noidle workload irrespective of rb_key offset in service trees. This can provide better latencies for sync-noidle workload in the presence of RT tasks. o This patch gets rid of that exception and which workload to run with-in class always depends on lowest rb_key across service trees. The reason being that now we have multiple BE class groups and if we always switch to sync-noidle workload with-in group, we can potentially starve a sync-idle workload with-in group. Same is true for async workload which will be in root group. Also the workload-switching with-in group will become very unpredictable as it now depends whether some RT workload was running in the system or not. 
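The resulting selection rule can be pictured with a simplified, self-contained sketch (illustrative names and plain comparisons; the kernel uses time_before() and its own enum ordering):

	#include <stdbool.h>
	#include <stdio.h>

	enum wl { SYNC_NOIDLE, SYNC, ASYNC, WL_NR };

	/* lowest rb_key queued per workload type; 0 means the tree is empty */
	static unsigned long first_key[WL_NR];

	static enum wl choose_workload(void)
	{
		enum wl best = SYNC_NOIDLE;
		unsigned long lowest = 0;
		bool valid = false;
		int i;

		for (i = 0; i < WL_NR; i++) {
			if (!first_key[i])
				continue;
			if (!valid || first_key[i] < lowest) {
				lowest = first_key[i];
				best = i;
				valid = true;
			}
		}
		/* the lowest rb_key always wins; there is no special case
		 * after a priority-class switch */
		return best;
	}

	int main(void)
	{
		first_key[SYNC] = 120;
		first_key[SYNC_NOIDLE] = 250;
		printf("%d\n", choose_workload());	/* prints 1 == SYNC: oldest key */
		return 0;
	}
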
Signed-off-by: Vivek Goyal Reviewed-by: Gui Jianfeng Acked-by: Corrado Zoccolo Signed-off-by: Jens Axboe --- block/cfq-iosched.c | 48 ++++++++++++--------------------------------- 1 file changed, 12 insertions(+), 36 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index d9bfa09e68c1..8df4fe58f4e7 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -292,8 +292,7 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd); static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg, enum wl_prio_t prio, - enum wl_type_t type, - struct cfq_data *cfqd) + enum wl_type_t type) { if (!cfqg) return NULL; @@ -1146,7 +1145,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq, #endif service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq), - cfqq_type(cfqq), cfqd); + cfqq_type(cfqq)); if (cfq_class_idle(cfqq)) { rb_key = CFQ_IDLE_DELAY; parent = rb_last(&service_tree->rb); @@ -1609,7 +1608,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) { struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_group, cfqd->serving_prio, - cfqd->serving_type, cfqd); + cfqd->serving_type); if (!cfqd->rq_queued) return NULL; @@ -1956,8 +1955,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq) } static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, - struct cfq_group *cfqg, enum wl_prio_t prio, - bool prio_changed) + struct cfq_group *cfqg, enum wl_prio_t prio) { struct cfq_queue *queue; int i; @@ -1965,24 +1963,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, unsigned long lowest_key = 0; enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD; - if (prio_changed) { - /* - * When priorities switched, we prefer starting - * from SYNC_NOIDLE (first choice), or just SYNC - * over ASYNC - */ - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count) - return cur_best; - cur_best = SYNC_WORKLOAD; - if (service_tree_for(cfqg, prio, cur_best, cfqd)->count) - return cur_best; - - return ASYNC_WORKLOAD; - } - - for (i = 0; i < 3; ++i) { - /* otherwise, select the one with lowest rb_key */ - queue = cfq_rb_first(service_tree_for(cfqg, prio, i, cfqd)); + for (i = 0; i <= SYNC_WORKLOAD; ++i) { + /* select the one with lowest rb_key */ + queue = cfq_rb_first(service_tree_for(cfqg, prio, i)); if (queue && (!key_valid || time_before(queue->rb_key, lowest_key))) { lowest_key = queue->rb_key; @@ -1996,8 +1979,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd, static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg) { - enum wl_prio_t previous_prio = cfqd->serving_prio; - bool prio_changed; unsigned slice; unsigned count; struct cfq_rb_root *st; @@ -2025,24 +2006,19 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg) * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload * expiration time */ - prio_changed = (cfqd->serving_prio != previous_prio); - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type, - cfqd); + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type); count = st->count; /* - * If priority didn't change, check workload expiration, - * and that we still have other queues ready + * check workload expiration, and that we still have other queues ready */ - if (!prio_changed && count && - !time_after(jiffies, cfqd->workload_expires)) + if (count && !time_after(jiffies, cfqd->workload_expires)) return; /* otherwise select new workload type */ cfqd->serving_type = - 
cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio, prio_changed); - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type, - cfqd); + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio); + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type); count = st->count; /* From 7d4e9d0962cd0f6a30b01e256756dd10606dab30 Mon Sep 17 00:00:00 2001 From: Emese Revfy Date: Mon, 14 Dec 2009 00:59:30 +0100 Subject: [PATCH 06/17] drbd: Constify struct file_operations Signed-off-by: Emese Revfy Signed-off-by: Philipp Reisner --- drivers/block/drbd/drbd_int.h | 2 +- drivers/block/drbd/drbd_main.c | 2 +- drivers/block/drbd/drbd_proc.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index 2312d782fe99..c97558763430 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -1490,7 +1490,7 @@ void drbd_bump_write_ordering(struct drbd_conf *mdev, enum write_ordering_e wo); /* drbd_proc.c */ extern struct proc_dir_entry *drbd_proc; -extern struct file_operations drbd_proc_fops; +extern const struct file_operations drbd_proc_fops; extern const char *drbd_conn_str(enum drbd_conns s); extern const char *drbd_role_str(enum drbd_role s); diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 157d1e4343c2..af30ca97493c 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -151,7 +151,7 @@ wait_queue_head_t drbd_pp_wait; DEFINE_RATELIMIT_STATE(drbd_ratelimit_state, 5 * HZ, 5); -static struct block_device_operations drbd_ops = { +static const struct block_device_operations drbd_ops = { .owner = THIS_MODULE, .open = drbd_open, .release = drbd_release, diff --git a/drivers/block/drbd/drbd_proc.c b/drivers/block/drbd/drbd_proc.c index bdd0b4943b10..df8ad9660d8f 100644 --- a/drivers/block/drbd/drbd_proc.c +++ b/drivers/block/drbd/drbd_proc.c @@ -38,7 +38,7 @@ static int drbd_proc_open(struct inode *inode, struct file *file); struct proc_dir_entry *drbd_proc; -struct file_operations drbd_proc_fops = { +const struct file_operations drbd_proc_fops = { .owner = THIS_MODULE, .open = drbd_proc_open, .read = seq_read, From 49829ea74f790d3be2e803a617e714f5b9a5ed50 Mon Sep 17 00:00:00 2001 From: Roel Kluin Date: Tue, 15 Dec 2009 22:55:44 +0100 Subject: [PATCH 07/17] drbd: Fix test of unsigned in _drbd_fault_random() rsp->count is unsigned so the test does not work. 
Signed-off-by: Roel Kluin Signed-off-by: Andrew Morton Signed-off-by: Philipp Reisner --- drivers/block/drbd/drbd_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index af30ca97493c..ecc1612df676 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -3623,7 +3623,7 @@ _drbd_fault_random(struct fault_random_state *rsp) { long refresh; - if (--rsp->count < 0) { + if (!rsp->count--) { get_random_bytes(&refresh, sizeof(refresh)); rsp->state += refresh; rsp->count = FAULT_RANDOM_REFRESH; From 7b886f4f7a051dc88165684cbcddd98e22bd0203 Mon Sep 17 00:00:00 2001 From: Huang Weiyi Date: Wed, 9 Dec 2009 21:09:12 +0800 Subject: [PATCH 08/17] drbd: remove duplicated #include Remove duplicated #include('s) in drivers/block/drbd/drbd_worker.c Signed-off-by: Huang Weiyi Signed-off-by: Philipp Reisner --- drivers/block/drbd/drbd_worker.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c index ed8796f1112d..3e8b6bc0ec2f 100644 --- a/drivers/block/drbd/drbd_worker.c +++ b/drivers/block/drbd/drbd_worker.c @@ -34,7 +34,6 @@ #include #include #include -#include #include #include From 820cd61a28503598f4262c544082ccb33678b9fc Mon Sep 17 00:00:00 2001 From: Huang Weiyi Date: Sun, 13 Dec 2009 22:05:03 +0800 Subject: [PATCH 09/17] drbd: remove unused #include Remove unused #include ('s) in drivers/block/drbd/drbd_main.c drivers/block/drbd/drbd_receiver.c drivers/block/drbd/drbd_worker.c Signed-off-by: Huang Weiyi Signed-off-by: Philipp Reisner --- drivers/block/drbd/drbd_main.c | 1 - drivers/block/drbd/drbd_receiver.c | 1 - drivers/block/drbd/drbd_worker.c | 1 - 3 files changed, 3 deletions(-) diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index ecc1612df676..9348f33f6242 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -27,7 +27,6 @@ */ #include -#include #include #include #include diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index c548f24f54a1..259c1351b152 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -28,7 +28,6 @@ #include #include -#include #include #include #include diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c index 3e8b6bc0ec2f..b453c2bca3be 100644 --- a/drivers/block/drbd/drbd_worker.c +++ b/drivers/block/drbd/drbd_worker.c @@ -24,7 +24,6 @@ */ #include -#include #include #include #include From 9504e0864b58b4a304820dcf3755f1da80d5e72f Mon Sep 17 00:00:00 2001 From: "Martin K. Petersen" Date: Mon, 21 Dec 2009 15:55:51 +0100 Subject: [PATCH 10/17] block: Fix topology stacking for data and discard alignment The stacking code incorrectly scaled up the data offset in some cases causing misaligned devices to report alignment. Rewrite the stacking algorithm to remedy this and apply the same alignment principles to the discard handling. Signed-off-by: Martin K. 
Petersen Signed-off-by: Jens Axboe --- block/blk-settings.c | 93 +++++++++++++++++++++++++------------------- 1 file changed, 53 insertions(+), 40 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index 6ae118d6e193..e14fcbcedbfa 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -517,9 +517,8 @@ static unsigned int lcm(unsigned int a, unsigned int b) int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t offset) { - int ret; - - ret = 0; + sector_t alignment; + unsigned int top, bottom, granularity; t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors); @@ -537,6 +536,19 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->max_segment_size = min_not_zero(t->max_segment_size, b->max_segment_size); + granularity = max(b->physical_block_size, b->io_min); + alignment = b->alignment_offset - (offset & (granularity - 1)); + + if (t->alignment_offset != alignment) { + + top = max(t->physical_block_size, t->io_min) + + t->alignment_offset; + bottom = granularity + alignment; + + if (max(top, bottom) & (min(top, bottom) - 1)) + t->misaligned = 1; + } + t->logical_block_size = max(t->logical_block_size, b->logical_block_size); @@ -544,54 +556,55 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, b->physical_block_size); t->io_min = max(t->io_min, b->io_min); + t->io_opt = lcm(t->io_opt, b->io_opt); + t->no_cluster |= b->no_cluster; t->discard_zeroes_data &= b->discard_zeroes_data; - /* Bottom device offset aligned? */ - if (offset && - (offset & (b->physical_block_size - 1)) != b->alignment_offset) { + if (t->physical_block_size & (t->logical_block_size - 1)) { + t->physical_block_size = t->logical_block_size; t->misaligned = 1; - ret = -1; } - /* - * Temporarily disable discard granularity. It's currently buggy - * since we default to 0 for discard_granularity, hence this - * "failure" will always trigger for non-zero offsets. - */ -#if 0 - if (offset && - (offset & (b->discard_granularity - 1)) != b->discard_alignment) { - t->discard_misaligned = 1; - ret = -1; - } -#endif - - /* If top has no alignment offset, inherit from bottom */ - if (!t->alignment_offset) - t->alignment_offset = - b->alignment_offset & (b->physical_block_size - 1); - - if (!t->discard_alignment) - t->discard_alignment = - b->discard_alignment & (b->discard_granularity - 1); - - /* Top device aligned on logical block boundary? 
*/ - if (t->alignment_offset & (t->logical_block_size - 1)) { + if (t->io_min & (t->physical_block_size - 1)) { + t->io_min = t->physical_block_size; t->misaligned = 1; - ret = -1; } - /* Find lcm() of optimal I/O size and granularity */ - t->io_opt = lcm(t->io_opt, b->io_opt); - t->discard_granularity = lcm(t->discard_granularity, - b->discard_granularity); + if (t->io_opt & (t->physical_block_size - 1)) { + t->io_opt = 0; + t->misaligned = 1; + } - /* Verify that optimal I/O size is a multiple of io_min */ - if (t->io_min && t->io_opt % t->io_min) - ret = -1; + t->alignment_offset = lcm(t->alignment_offset, alignment) + & (max(t->physical_block_size, t->io_min) - 1); - return ret; + if (t->alignment_offset & (t->logical_block_size - 1)) + t->misaligned = 1; + + /* Discard alignment and granularity */ + if (b->discard_granularity) { + + alignment = b->discard_alignment - + (offset & (b->discard_granularity - 1)); + + if (t->discard_granularity != 0 && + t->discard_alignment != alignment) { + top = t->discard_granularity + t->discard_alignment; + bottom = b->discard_granularity + alignment; + + /* Verify that top and bottom intervals line up */ + if (max(top, bottom) & (min(top, bottom) - 1)) + t->discard_misaligned = 1; + } + + t->discard_granularity = max(t->discard_granularity, + b->discard_granularity); + t->discard_alignment = lcm(t->discard_alignment, alignment) & + (t->discard_granularity - 1); + } + + return t->misaligned ? -1 : 0; } EXPORT_SYMBOL(blk_stack_limits); From df9dc83d193959f990c981787279e8541fe9df8c Mon Sep 17 00:00:00 2001 From: Julia Lawall Date: Mon, 21 Dec 2009 16:27:49 -0800 Subject: [PATCH 11/17] drivers/block/DAC960.c: use DAC960_V2_Controller DAC960_LP_Controller and DAC960_V2_Controller have the same value, but elsewhere it is DAC960_V1_Controller or DAC960_V2_Controller that is used in the FirmwareType field. Signed-off-by: Julia Lawall Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- drivers/block/DAC960.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/DAC960.c b/drivers/block/DAC960.c index eb4fa1943944..ce1fa923c414 100644 --- a/drivers/block/DAC960.c +++ b/drivers/block/DAC960.c @@ -7101,7 +7101,7 @@ static struct DAC960_privdata DAC960_BA_privdata = { static struct DAC960_privdata DAC960_LP_privdata = { .HardwareType = DAC960_LP_Controller, - .FirmwareType = DAC960_LP_Controller, + .FirmwareType = DAC960_V2_Controller, .InterruptHandler = DAC960_LP_InterruptHandler, .MemoryWindowSize = DAC960_LP_RegisterWindowSize, }; From e019ef0c4f11a0acb79b89fb1e09c72197b42bbe Mon Sep 17 00:00:00 2001 From: H Hartley Sweeten Date: Mon, 21 Dec 2009 16:27:49 -0800 Subject: [PATCH 12/17] drivers/block/mg_disk.c: use resource_size() Use resource_size() for ioremap. The ioremap appears to be passing the incorrect size for the platform resource. Unfortunately, I can't locate a user in mainline to verify this. Using resource_size should be the correct fix. 
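In the kernel, resource_size(res) evaluates to res->end - res->start + 1, so the old call only passed a sensible length when the region happened to start at address 0. A small sketch of the difference (hypothetical addresses, not the real platform data):

	#include <stdio.h>

	struct res { unsigned long start, end; };

	/* same formula as the kernel's resource_size() helper */
	static unsigned long res_size(const struct res *r)
	{
		return r->end - r->start + 1;
	}

	int main(void)
	{
		struct res rsc = { .start = 0x40000000UL, .end = 0x4000ffffUL };

		printf("old argument: 0x%lx\n", rsc.end + 1);	/* 0x40010000, far too large */
		printf("correct size: 0x%lx\n", res_size(&rsc));	/* 0x10000 */
		return 0;
	}
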
Signed-off-by: H Hartley Sweeten Acked-by: unsik Kim Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- drivers/block/mg_disk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/mg_disk.c b/drivers/block/mg_disk.c index e0339aaa1815..02b2583df7fc 100644 --- a/drivers/block/mg_disk.c +++ b/drivers/block/mg_disk.c @@ -860,7 +860,7 @@ static int mg_probe(struct platform_device *plat_dev) err = -EINVAL; goto probe_err_2; } - host->dev_base = ioremap(rsc->start , rsc->end + 1); + host->dev_base = ioremap(rsc->start, resource_size(rsc)); if (!host->dev_base) { printk(KERN_ERR "%s:%d ioremap fail\n", __func__, __LINE__); From 6ec1480d8539c8e2e6ba7fbbeffe5adc640bfe98 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Mon, 21 Dec 2009 16:27:50 -0800 Subject: [PATCH 13/17] aoe: switch to the new bio_flush_dcache_pages() interface Cc: "Ed L. Cashin" Cc: David Woodhouse Cc: Ilya Loginov Cc: Ingo Molnar Cc: Peter Horton Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- drivers/block/aoe/aoecmd.c | 17 +---------------- 1 file changed, 1 insertion(+), 16 deletions(-) diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 13bb69d2abb3..64a223b0cc22 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -735,21 +735,6 @@ diskstats(struct gendisk *disk, struct bio *bio, ulong duration, sector_t sector part_stat_unlock(); } -/* - * Ensure we don't create aliases in VI caches - */ -static inline void -killalias(struct bio *bio) -{ - struct bio_vec *bv; - int i; - - if (bio_data_dir(bio) == READ) - __bio_for_each_segment(bv, bio, i, 0) { - flush_dcache_page(bv->bv_page); - } -} - void aoecmd_ata_rsp(struct sk_buff *skb) { @@ -871,7 +856,7 @@ aoecmd_ata_rsp(struct sk_buff *skb) if (buf->flags & BUFFL_FAIL) bio_endio(buf->bio, -EIO); else { - killalias(buf->bio); + bio_flush_dcache_pages(buf->bio); bio_endio(buf->bio, 0); } mempool_free(buf, d->bufpool); From 2f7a2d89a8b5915d89ad9ebeb0065db7d5831cea Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Mon, 28 Dec 2009 13:18:44 +0100 Subject: [PATCH 14/17] cfq-iosched: don't regard requests with long distance as close seek_mean could be very big sometimes, using it as close criteria is meaningless as this doen't improve any performance. So if it's big, let's fallback to default value. Reviewed-by: Corrado Zoccolo Signed-off-by: Shaohua Li Signed-off-by: Jens Axboe --- block/cfq-iosched.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 8df4fe58f4e7..918c7fd9aeb1 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1667,13 +1667,17 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, #define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, - struct request *rq) + struct request *rq, bool for_preempt) { sector_t sdist = cfqq->seek_mean; if (!sample_valid(cfqq->seek_samples)) sdist = CFQQ_SEEK_THR; + /* if seek_mean is big, using it as close criteria is meaningless */ + if (sdist > CFQQ_SEEK_THR && !for_preempt) + sdist = CFQQ_SEEK_THR; + return cfq_dist_from_last(cfqd, rq) <= sdist; } @@ -1701,7 +1705,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, * will contain the closest sector. 
*/ __cfqq = rb_entry(parent, struct cfq_queue, p_node); - if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq)) + if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq, false)) return __cfqq; if (blk_rq_pos(__cfqq->next_rq) < sector) @@ -1712,7 +1716,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, return NULL; __cfqq = rb_entry(node, struct cfq_queue, p_node); - if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq)) + if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq, false)) return __cfqq; return NULL; @@ -3112,7 +3116,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq, * if this request is as-good as one we would expect from the * current cfqq, let it preempt */ - if (cfq_rq_close(cfqd, cfqq, rq)) + if (cfq_rq_close(cfqd, cfqq, rq, true)) return true; return false; From 81744ee44ab2845c16ffd7d6f762f7b4a49a4750 Mon Sep 17 00:00:00 2001 From: "Martin K. Petersen" Date: Tue, 29 Dec 2009 08:35:35 +0100 Subject: [PATCH 15/17] block: Fix incorrect alignment offset reporting and update documentation queue_sector_alignment_offset returned the wrong value which caused partitions to report an incorrect alignment_offset. Since offset alignment calculation is needed several places it has been split into a separate helper function. The topology stacking function has been updated accordingly. Furthermore, comments have been added to clarify how the stacking function works. Signed-off-by: Martin K. Petersen Tested-by: Mike Snitzer Signed-off-by: Jens Axboe --- block/blk-settings.c | 44 +++++++++++++++++++++++++++++++----------- include/linux/blkdev.h | 11 +++++++++-- 2 files changed, 42 insertions(+), 13 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index e14fcbcedbfa..d52d4adc440b 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -505,20 +505,30 @@ static unsigned int lcm(unsigned int a, unsigned int b) /** * blk_stack_limits - adjust queue_limits for stacked devices - * @t: the stacking driver limits (top) - * @b: the underlying queue limits (bottom) + * @t: the stacking driver limits (top device) + * @b: the underlying queue limits (bottom, component device) * @offset: offset to beginning of data within component device * * Description: - * Merges two queue_limit structs. Returns 0 if alignment didn't - * change. Returns -1 if adding the bottom device caused - * misalignment. + * This function is used by stacking drivers like MD and DM to ensure + * that all component devices have compatible block sizes and + * alignments. The stacking driver must provide a queue_limits + * struct (top) and then iteratively call the stacking function for + * all component (bottom) devices. The stacking function will + * attempt to combine the values and ensure proper alignment. + * + * Returns 0 if the top and bottom queue_limits are compatible. The + * top device's block sizes and alignment offsets may be adjusted to + * ensure alignment with the bottom device. If no compatible sizes + * and alignments exist, -1 is returned and the resulting top + * queue_limits will have the misaligned flag set to indicate that + * the alignment_offset is undefined. 
*/ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t offset) { sector_t alignment; - unsigned int top, bottom, granularity; + unsigned int top, bottom; t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors); @@ -536,15 +546,18 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->max_segment_size = min_not_zero(t->max_segment_size, b->max_segment_size); - granularity = max(b->physical_block_size, b->io_min); - alignment = b->alignment_offset - (offset & (granularity - 1)); + alignment = queue_limit_alignment_offset(b, offset); + /* Bottom device has different alignment. Check that it is + * compatible with the current top alignment. + */ if (t->alignment_offset != alignment) { top = max(t->physical_block_size, t->io_min) + t->alignment_offset; - bottom = granularity + alignment; + bottom = max(b->physical_block_size, b->io_min) + alignment; + /* Verify that top and bottom intervals line up */ if (max(top, bottom) & (min(top, bottom) - 1)) t->misaligned = 1; } @@ -561,32 +574,39 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->no_cluster |= b->no_cluster; t->discard_zeroes_data &= b->discard_zeroes_data; + /* Physical block size a multiple of the logical block size? */ if (t->physical_block_size & (t->logical_block_size - 1)) { t->physical_block_size = t->logical_block_size; t->misaligned = 1; } + /* Minimum I/O a multiple of the physical block size? */ if (t->io_min & (t->physical_block_size - 1)) { t->io_min = t->physical_block_size; t->misaligned = 1; } + /* Optimal I/O a multiple of the physical block size? */ if (t->io_opt & (t->physical_block_size - 1)) { t->io_opt = 0; t->misaligned = 1; } + /* Find lowest common alignment_offset */ t->alignment_offset = lcm(t->alignment_offset, alignment) & (max(t->physical_block_size, t->io_min) - 1); + /* Verify that new alignment_offset is on a logical block boundary */ if (t->alignment_offset & (t->logical_block_size - 1)) t->misaligned = 1; /* Discard alignment and granularity */ if (b->discard_granularity) { + unsigned int granularity = b->discard_granularity; + offset &= granularity - 1; - alignment = b->discard_alignment - - (offset & (b->discard_granularity - 1)); + alignment = (granularity + b->discard_alignment - offset) + & (granularity - 1); if (t->discard_granularity != 0 && t->discard_alignment != alignment) { @@ -598,6 +618,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->discard_misaligned = 1; } + t->max_discard_sectors = min_not_zero(t->max_discard_sectors, + b->max_discard_sectors); t->discard_granularity = max(t->discard_granularity, b->discard_granularity); t->discard_alignment = lcm(t->discard_alignment, alignment) & diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 784a919aa0d0..59b832be3044 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1116,11 +1116,18 @@ static inline int queue_alignment_offset(struct request_queue *q) return q->limits.alignment_offset; } +static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t offset) +{ + unsigned int granularity = max(lim->physical_block_size, lim->io_min); + + offset &= granularity - 1; + return (granularity + lim->alignment_offset - offset) & (granularity - 1); +} + static inline int queue_sector_alignment_offset(struct request_queue *q, sector_t sector) { - return ((sector << 9) - q->limits.alignment_offset) - & (q->limits.io_min - 1); + 
return queue_limit_alignment_offset(&q->limits, sector << 9); } static inline int bdev_alignment_offset(struct block_device *bdev) From e79e95db5cffb2e01170d510686489c40937faa1 Mon Sep 17 00:00:00 2001 From: OGAWA Hirofumi Date: Tue, 29 Dec 2009 08:53:54 +0100 Subject: [PATCH 16/17] block: Honor the gfp_mask for alloc_page() in blkdev_issue_discard() Signed-off-by: OGAWA Hirofumi Signed-off-by: Jens Axboe --- block/blk-barrier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-barrier.c b/block/blk-barrier.c index 8873b9b439ff..8618d8996fea 100644 --- a/block/blk-barrier.c +++ b/block/blk-barrier.c @@ -402,7 +402,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, * our current implementations need. If we'll ever need * more the interface will need revisiting. */ - page = alloc_page(GFP_KERNEL | __GFP_ZERO); + page = alloc_page(gfp_mask | __GFP_ZERO); if (!page) goto out_free_bio; if (bio_add_pc_page(q, bio, page, sector_size, 0) < sector_size) From 9bd3f98821a83041e77ee25158b80b535d02d7b4 Mon Sep 17 00:00:00 2001 From: Gui Jianfeng Date: Wed, 30 Dec 2009 08:41:07 +0100 Subject: [PATCH 17/17] block: blk_rq_err_sectors cleanup blk_rq_err_sectors() seems useless, get rid of it. Signed-off-by: Gui Jianfeng Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 6 ------ 1 file changed, 6 deletions(-) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 59b832be3044..9b98173a8184 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -845,7 +845,6 @@ static inline struct request_queue *bdev_get_queue(struct block_device *bdev) * blk_rq_err_bytes() : bytes left till the next error boundary * blk_rq_sectors() : sectors left in the entire request * blk_rq_cur_sectors() : sectors left in the current segment - * blk_rq_err_sectors() : sectors left till the next error boundary */ static inline sector_t blk_rq_pos(const struct request *rq) { @@ -874,11 +873,6 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq) return blk_rq_cur_bytes(rq) >> 9; } -static inline unsigned int blk_rq_err_sectors(const struct request *rq) -{ - return blk_rq_err_bytes(rq) >> 9; -} - /* * Request issue related functions. */