Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While stress-running very-small container scenarios with the Kernel Memory
Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.
I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went directly on debugging. Turns out that it did not happen again in
more than 20 runs, making it a quite rare pattern.
But here is my analysis:
When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the oom
queue, in an rcu-locked condition:
if (!cfqq || cfqq == &cfqd->oom_cfqq)
[ ... ]
Next, we will release the rcu lock, and try to allocate a queue, retrying
if we succeed:
rcu_read_unlock();
spin_unlock_irq(cfqd->queue->queue_lock);
new_cfqq = kmem_cache_alloc_node(cfq_pool,
gfp_mask | __GFP_ZERO,
cfqd->queue->node);
spin_lock_irq(cfqd->queue->queue_lock);
if (new_cfqq)
goto retry;
We are unlocked at this point, but it should be fine, since we will
reacquire the rcu_read_lock when we retry.
Except of course, that we may not retry: the allocation may very well fail
and we'll keep on going through the flow:
The next branch is:
if (cfqq) {
[ ... ]
} else
cfqq = &cfqd->oom_cfqq;
And right before exiting, we'll issue rcu_read_unlock().
Being already unlocked, this is the likely source of our imbalance. Since
cfqq is either already NULL or made NULL in the first statement of the
outter branch, the only viable alternative here seems to be to return the
oom queue right away in case of allocation failure.
Please review the following patch and apply if you agree with my analysis.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unfortunately, at this point, there's no way to make the existing
statistics hierarchical without creating nasty surprises for the
existing users. Just create recursive counterpart of the existing
stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
To support hierarchical stats, it's necessary to remember stats from
dead children. Add cfqg->dead_stats and make a dying cfqg transfer
its stats to the parent's dead-stats.
The transfer happens form ->pd_offline_fn() and it is possible that
there are some residual IOs completing afterwards. Currently, we lose
these stats. Given that cgroup removal isn't a very high frequency
operation and the amount of residual IOs on offline are likely to be
nil or small, this shouldn't be a big deal and the complexity needed
to handle residual IOs - another callback and rather elaborate
synchronization to reach and lock the matching q - doesn't seem
justified.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
cfq_pd_reset_stats() and move the latter to where other pd methods are
defined. cfqg_stats_reset() will be used to implement hierarchical
stats.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for
summing up stats from multiple blkgs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support. The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.
Replace it with cfqg_parent() which returns the real parent. This
enables full blkcg hierarchy support for cfq-iosched. For example,
consider the following hierarchy.
root
/ \
A:500 B:250
/ \
AA:500 AB:1000
For simplicity, let's say all the leaf nodes have active tasks and are
on service tree. For each leaf node, vfraction would be
AA: (500 / 1500) * (500 / 750) =~ 0.2222
AB: (1000 / 1500) * (500 / 750) =~ 0.4444
B: (250 / 750) =~ 0.3333
and vdisktime will be distributed accordingly. For more detail,
please refer to Documentation/block/cfq-iosched.txt.
v2: cfq-iosched.txt updated to describe group scheduling as suggested
by Vivek.
v3: blkio-controller.txt updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
cfq_group_slice() calculates slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg->weight against
service_tree->total_weight. This currently works only because all
cfqgs are treated to be at the same level.
To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg->vfraction. As cfqg->vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
hierarchy considered, the slice can be simply calculated by
multiplying cfqg->vfraction to cfq_target_latency (with fixed point
shift factored in).
As vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce noticeable
behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Currently, cfqg charges are scaled directly according to cfqg->weight.
Regardless of the number of active cfqgs or the amount of active
weights, a given weight value always scales charge the same way. This
works fine as long as all cfqgs are treated equally regardless of
their positions in the hierarchy, which is what cfq currently
implements. It can't work in hierarchical settings because the
interpretation of a given weight value depends on where the weight is
located in the hierarchy.
This patch reimplements cfqg charge scaling so that it can be used to
support hierarchy properly. The scheme is fairly simple and
light-weight.
* When a cfqg is added to the service tree, v(disktime)weight is
calculated. It walks up the tree to root calculating the fraction
it has in the hierarchy. At each level, the fraction can be
calculated as
cfqg->weight / parent->level_weight
By compounding these, the global fraction of vdisktime the cfqg has
claim to - vfraction - can be determined.
* When the cfqg needs to be charged, the charge is scaled inversely
proportionally to the vfraction.
The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
representation as before; however, the smallest scaling factor is now
1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1
was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
scaling factor.
While this shifts the global scale of vdisktime a bit, it doesn't
change the relative relationships among cfqgs and the scheduling
result isn't different.
cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY
didn't have any relevance to vdisktime before and is unlikely to cause
any visible behavior difference now especially as the scale shift
isn't that large.
As the new scheme now makes proper distinction between cfqg->weight
and ->leaf_weight, reverse the weight aliasing for root cfqgs. For
root, both weights are now mapped to ->leaf_weight instead of the
other way around.
Because we're still using cfqg_flat_parent(), this patch shouldn't
change the scheduling behavior in any noticeable way.
v2: Beefed up comments on vfraction as requested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
To prepare for blkcg hierarchy support, add cfqg->nr_active and
->children_weight. cfqg->nr_active counts the number of active cfqgs
at the cfqg's level and ->children_weight is sum of weights of those
cfqgs. The level covers itself (cfqg->leaf_weight) and immediate
children.
The two values are updated when a cfqg enters and leaves the group
service tree. Unless the hierarchy is very deep, the added overhead
should be negligible.
Currently, the parent is determined using cfqg_flat_parent() which
makes the root cfqg the parent of all other cfqgs. This is to make
the transition to hierarchy-aware scheduling gradual. Scheduling
logic will be converted to use cfqg->children_weight without actually
changing the behavior. When everything is ready,
blkcg_weight_parent() will be replaced with proper parent function.
This patch doesn't introduce any behavior chagne.
v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
cfq blkcg is about to grow proper hierarchy handling, where a child
blkg's weight would nest inside the parent's. This makes tasks in a
blkg to compete against both tasks in the sibling blkgs and the tasks
of child blkgs.
We're gonna use the existing weight as the group weight which decides
the blkg's weight against its siblings. This patch introduces a new
weight - leaf_weight - which decides the weight of a blkg against the
child blkgs.
It's named leaf_weight because another way to look at it is that each
internal blkg nodes have a hidden child leaf node which contains all
its tasks and leaf_weight is the weight of the leaf node and handled
the same as the weight of the child blkgs.
This patch only adds leaf_weight fields and exposes it to userland.
The new weight isn't actually used anywhere yet. Note that
cfq-iosched currently offcially supports only single level hierarchy
and root blkgs compete with the first level blkgs - ie. root weight is
basically being used as leaf_weight. For root blkgs, the two weights
are kept in sync for backward compatibility.
v2: cfqd->root_group->leaf_weight initialization was missing from
cfq_init_queue() causing divide by zero when
!CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Currently we attach a character "S" or "A" to the cfqq<pid>, to represent
whether queues is sync or async. Add one more character "N" to represent
whether it is sync-noidle queue or sync queue. So now three different
type of queues will look as follows.
cfq1234S --> sync queus
cfq1234SN --> sync noidle queue
cfq1234A --> Async queue
Previously S/A classification was being printed only if group scheduling
was enabled. This patch also makes sure that this classification is
displayed even if group idling is disabled.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use of local varibale "n" seems to be unnecessary. Remove it. This brings
it inline with function __cfq_group_st_add(), which is also doing the
similar operation of adding a group to a rb tree.
No functionality change here.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
choose_service_tree() selects/sets both wl_class and wl_type. Rename it to
choose_wl_class_and_type() to make it very clear.
cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
it with choose_st(). So rename it to cfq_choose_wl_type() to make
it clear what does it do.
Just renaming. No functionality change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At quite a few places we use the keyword "service_tree". At some places,
especially local variables, I have abbreviated it to "st".
Also at couple of places moved binary operator "+" from beginning of line
to end of previous line, as per Tejun's feedback.
v2:
Reverted most of the service tree name change based on Jeff Moyer's feedback.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some more renaming. Again making the code uniform w.r.t use of
wl_class/class to represent IO class (RT, BE, IDLE) and using
wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).
At places this patch shortens the string "workload" to "wl".
Renamed "saved_workload" to "saved_wl_type". Renamed
"saved_serving_class" to "saved_wl_class".
For uniformity with "saved_wl_*" variables, renamed "serving_class"
to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".
Again, just trying to improve upon code uniformity and improve
readability. No functional change.
v2:
- Restored the usage of keyword "service" based on Jeff Moyer's feedback.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we
are calling workloads belonging to these classes as "prio". This gets
very confusing as one starts to associate it with ioprio.
So this patch just does bunch of renaming so that reading code becomes
easier. All reference to RT, BE and IDLE workload are done using keyword
"class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made
using keyword "type".
This makes me feel much better while I am reading the code. There is no
functionality change due to this patch.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
request is queued in cfqq->fifo list. Looks it's possible we are moving a
request from one cfqq to another in request merge case. In such case, adjusting
the fifo list order doesn't make sense and is impossible if we don't iterate
the whole fifo list.
My test does hit one case the two cfqq are different, but didn't cause kernel
crash, maybe it's because fifo list isn't used frequently. Anyway, from the
code logic, this is buggy.
I thought we can re-enable the recusive merge logic after this is fixed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq may be built w/ or w/o blkcg support depending on
CONFIG_CFQ_CGROUP_IOSCHED. If blkcg support is disabled, most of
related code is ifdef'd out but some part is left dangling -
blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
calls are made on it.
Feeding zero filled policy to blkcg_policy_register() is incorrect and
triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
!CONFIG_CFQ_GROUP_IOSCHED.
------------[ cut here ]------------
WARNING: at block/blk-cgroup.c:867
Modules linked in:
Modules linked in:
CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
Krnl Code: 00000000003d76be: a7280001 lhi %r2,1
00000000003d76c2: a7f4ffdf brc 15,3d7680
#00000000003d76c6: a7f40001 brc 15,3d76c8
>00000000003d76ca: a7c8ffea lhi %r12,-22
00000000003d76ce: a7f4ffce brc 15,3d766a
00000000003d76d2: a7f40001 brc 15,3d76d4
00000000003d76d6: a7c80000 lhi %r12,0
00000000003d76da: a7f4ffc2 brc 15,3d765e
Call Trace:
([<0000000000b6f000>] initcall_debug+0x0/0x4)
[<0000000000989e8a>] cfq_init+0x62/0xd4
[<00000000001000ba>] do_one_initcall+0x3a/0x170
[<000000000096fb60>] kernel_init+0x214/0x2bc
[<0000000000623202>] kernel_thread_starter+0x6/0xc
[<00000000006231fc>] kernel_thread_starter+0x0/0xc
no locks held by swapper/0/1.
Last Breaking-Event-Address:
[<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
---[ end trace b8ef4903fcbf9dd3 ]---
This patch fixes the problem by ensuring all blkcg support code is
inside CONFIG_CFQ_GROUP_IOSCHED.
* blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is
dropped from blkcg_policy_cfq decl.
* blkcg_deactivate_poilcy() invocation is moved inside ifdef. This
also makes the activation logic match cfq_init_queue().
* All blkcg_policy_[un]register() invocations are moved inside ifdef.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq_init() would return zero after kmem cache creation failure. Fix
so that it returns -ENOMEM.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
=4d4w
-----END PGP SIGNATURE-----
Merge tag 'v3.4-rc5' into for-3.5/core
The core branch is behind driver commits that we want to build
on for 3.5, hence I'm pulling in a later -rc.
Linux 3.4-rc5
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to keep blkcg_policy_ops separate. Collapse it into
blkcg_policy.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently blkg_policy_data carries policy specific data as char flex
array instead of being embedded in policy specific data. This was
forced by oddities around blkg allocation which are all gone now.
This patch makes blkg_policy_data embedded in policy specific data -
throtl_grp and cfq_group so that it's more conventional and consistent
with how io_cq is handled.
* blkcg_policy->pdata_size is renamed to ->pd_size.
* Functions which used to take void *pdata now takes struct
blkg_policy_data *pd.
* blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg().
* Dummy struct blkg_policy_data definition added. Dummy
pdata_to_blkg() definition was unused and inconsistent with the
non-dummy version - correct dummy pd_to_blkg() added.
* throtl and cfq updated accordingly.
* As dummy blkg_to_pd/pd_to_blkg() are provided,
blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd. Moved outside
ifdef block.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During the recent blkcg cleanup, most of blkcg API has changed to such
extent that mass renaming wouldn't cause any noticeable pain. Take
the chance and cleanup the naming.
* Rename blkio_cgroup to blkcg.
* Drop blkio / blkiocg prefixes and consistently use blkcg.
* Rename blkio_group to blkcg_gq, which is consistent with io_cq but
keep the blkg prefix / variable name.
* Rename policy method type and field names to signify they're dealing
with policy data.
* Rename blkio_policy_type to blkcg_policy.
This patch doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkio_group->path[] stores the path of the associated cgroup and is
used only for debug messages. Just format the path from blkg->cgroup
when printing debug messages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* All_q_list is unused. Drop all_q_{mutex|list}.
* @for_root of blkg_lookup_create() is always %false when called from
outside blk-cgroup.c proper. Factor out __blkg_lookup_create() so
that it doesn't check whether @q is bypassing and use the
underscored version for the @for_root callsite.
* blkg_destroy_all() is used only from blkcg proper and @destroy_root
is always %true. Make it static and drop @destroy_root.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All blkcg policies were assumed to be enabled on all request_queues.
Due to various implementation obstacles, during the recent blkcg core
updates, this was temporarily implemented as shooting down all !root
blkgs on elevator switch and policy [de]registration combined with
half-broken in-place root blkg updates. In addition to being buggy
and racy, this meant losing all blkcg configurations across those
events.
Now that blkcg is cleaned up enough, this patch replaces the temporary
implementation with proper per-queue policy activation. Each blkcg
policy should call the new blkcg_[de]activate_policy() to enable and
disable the policy on a specific queue. blkcg_activate_policy()
allocates and installs policy data for the policy for all existing
blkgs. blkcg_deactivate_policy() does the reverse. If a policy is
not enabled for a given queue, blkg printing / config functions skip
the respective blkg for the queue.
blkcg_activate_policy() also takes care of root blkg creation, and
cfq_init_queue() and blk_throtl_init() are updated accordingly.
This replaces blkcg_bypass_{start|end}() and update_root_blkg_pd()
unnecessary. Dropped.
v2: cfq_init_queue() was returning uninitialized @ret on root_group
alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With per-queue policy activation, root blkg creation will be moved to
blkcg core. Add q->root_blkg in preparation. For blk-throtl, this
replaces throtl_data->root_tg; however, cfq needs to keep
cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED.
This is to prepare for per-queue policy activation and doesn't cause
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add @pol to blkg_conf_prep() and let it return with queue lock held
(to be released by blkg_conf_finish()). Note that @pol isn't used
yet.
This is to prepare for per-queue policy activation and doesn't cause
any visible difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate
@pol->plid dynamically on registration. The maximum number of blkcg
policies which can be registered at the same time is defined by
BLKCG_MAX_POLS constant added to include/linux/blkdev.h.
Note that blkio_policy_register() now may fail. Policy init functions
updated accordingly and unnecessary ifdefs removed from cfq_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two functions were taking "enum blkio_policy_id plid". Make them
take "const struct blkio_policy_type *pol" instead.
This is to prepare for per-queue policy activation and doesn't cause
any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that all stat handling code lives in policy implementations,
there's no need to encode policy ID in cft->private.
* Export blkcg_prfill_[rw]stat() from blkcg, remove
blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which
use hard-code BLKIO_POLICY_PROP.
* Use cft->private for offset of the target field directly and drop
BLKCG_STAT_{PRIV|POL|OFF}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all conf and stat fields are moved into policy specific
blkio_policy_data->pdata areas, there's no reason to use
blkio_policy_data itself in prfill functions. Pass around @pd->pdata
instead of @pd.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_conf->weight is owned by cfq and has no reason to be
defined in blkcg core. Replace it with cfq_group->dev_weight and let
conf setting functions directly set it. If dev_weight is zero, the
cfqg doesn't have device specific weight configured.
Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename
blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want
per-policy storage in blkio_cgroup but just mark the ownership of the
field for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_stats contains only fields used by cfq and has no reason
to be defined in blkcg core.
* Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.
* blkg_policy_data->stats is replaced with cfq_group->stats.
blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
instead.
* All related macros / functions are renamed so that they have cfqg_
prefix and the unnecessary @pol arguments are dropped.
* All stat functions now take cfq_group * instead of blkio_group *.
* lockdep assertion on queue lock dropped. Elevator runs under queue
lock by default. There isn't much to be gained by adding lockdep
assertions at stat function level.
* cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
so that cfqg->stats can be reset.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkio_group_stats_cpu is used to count dispatch stats using per-cpu
counters. This is used by both blk-throtl and cfq-iosched but the
sharing is rather silly.
* cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates
those stats while holding queue_lock.
* blk-throtl needs per-cpu dispatch stats but only service_bytes and
serviced. It doesn't make use of sectors.
This patch makes cfq add and use global stats for service_bytes,
serviced and sectors, removes per-cpu sectors counter and moves
per-cpu stat printing code to blk-throttle.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
As with conf/stats file handling code, there's no reason for stat
update code to live in blkcg core with policies calling into update
them. The current organization is both inflexible and complex.
This patch moves stat update code to specific policies. All
blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
stats are collapsed into their cfq_blkiocg_update_*_stats()
counterparts. blkiocg_update_dispatch_stats() is used by both
policies and duplicated as throtl_update_dispatch_stats() and
cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later.
Signed-off-by: Tejun Heo <tj@kernel.org>
block/cfq.h contains some functions which interact with blkcg;
however, this is only part of it and cfq-iosched.c already has quite
some #ifdef CONFIG_CFQ_GROUP_IOSCHED. With conf/stat handling being
moved to specific policies, having these relay functions isolated in
cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for
now. Let's split blkcg support properly later if necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
blkcg conf/stat handling is convoluted in that details which belong to
specific policy implementations are all out in blkcg core and then
policies hook into core layer to access and manipulate confs and
stats. This sadly achieves both inflexibility (confs/stats can't be
modified without messing with blkcg core) and complexity (all the
call-ins and call-backs).
The previous patches restructured conf and stat handling code such
that they can be separated out. This patch relocates the file
handling part. All conf/stat file handling code which belongs to
BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
BKLIO_POLICY_THROTL code to blk-throtl.c.
The move is verbatim except for blkio_update_group_{weight|bps|iops}()
callbacks which relays conf changes to policies. The configuration
settings are handled in policies themselves so the relaying isn't
necessary. Conf setting functions are modified to directly call
per-policy update functions and the relaying mechanism is dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
In cfq, when we calculate a time slice for a process(or a cfqq to
be precise), we have to consider the cfq_target_latency so that all the
sync request have an estimated latency(300ms) and it is controlled by
cfq_target_latency. But in some hadoop test, we have found that if
there are many processes doing sequential read(24 for example), the
throughput is bad because every process can only work for about 25ms
and the cfqq is switched. That leads to a higher disk seek. We can
achive the good throughput by setting low_latency=0, but then some
read's latency is too much for the application.
So this patch makes cfq_target_latency tunable through sysfs so that
we can tune it and find some magic number which is not bad for both
the throughput and the read latency.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq caches the associated cfqq's for a given cic. The cache needs to
be flushed if the cic's ioprio or blkcg has changed. It is currently
done by requiring the changing action to set the respective
ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
which involves iterating through all the affected icqs.
All cfq wants to know is whether ioprio and/or blkcg have changed
since the last flush and can be easily achieved by just remembering
the current ioprio and blkcg ID in cic.
This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
use the remembered value instead, and updates cfq_set_request() path
such that, instead of using icq_get_changed(), the current values are
compared against the remembered ones and trigger appropriate flush
action if not. Condition tests are moved inside both _changed
functions which are now named check_ioprio_changed() and
check_blkcg_changed().
ioprio.h::task_ioprio*() can't be used anymore and replaced with
open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that io_cq is managed by block core and guaranteed to exist for
any in-flight request, it is easier and carries more information to
pass around cfq_io_cq than io_context.
This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
cfq_get_queue() to take @cic instead of @ioc. This change removes a
duplicate cfq_cic_lookup() from cfq_find_alloc_queue().
This change enables the use of cic-cached ioprio in the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement bio_blkio_cgroup() which returns the blkcg associated with
the bio if exists or %current's blkcg, and use it in blk-throttle and
cfq-iosched propio. This makes both cgroup policies honor task
association for the bio instead of always assuming %current.
As nobody is using bio_set_task() yet, this doesn't introduce any
behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current. Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.
For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup. This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.
Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs). As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.
This patch implements bio_associate_current() which associates the
specified bio with %current. The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio. bio
release puts the associated ioc and blkcg.
It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.
elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.
This doesn't update block cgroup policies yet. Further patches will
implement the support.
-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
rq_ioc() to fix build breakage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops
unncessary RCU locking replacing it with plain blkcg locking as
necessary.
* blkiocg_pre_destroy() already perform proper locking and don't need
RCU. Dropped.
* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
lock. This isn't a hot path.
* Now unnecessary synchronize_rcu() from queue exit paths removed.
This makes q->nr_blkgs unnecessary. Dropped.
* RCU annotation on blkg->q removed.
-v2: Vivek pointed out that blkg_lookup_create() still needs to be
called under rcu_read_lock(). Updated.
-v3: After the update, stats_lock locking in blkio_read_blkg_stats()
shouldn't be using _irq variant as it otherwise ends up enabling
irq while blkcg->lock is locked. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>