cgroup: add documentation about unified hierarchy
Unified hierarchy will be the new version of cgroup interface. This patch adds Documentation/cgroups/unified-hierarchy.txt which describes the design and rationales of unified hierarchy. v2: Grammatical updates as per Randy Dunlap's review. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org>
This commit is contained in:
Родитель
842b597ee0
Коммит
6573157800
|
@ -0,0 +1,359 @@
|
|||
|
||||
Cgroup unified hierarchy
|
||||
|
||||
April, 2014 Tejun Heo <tj@kernel.org>
|
||||
|
||||
This document describes the changes made by unified hierarchy and
|
||||
their rationales. It will eventually be merged into the main cgroup
|
||||
documentation.
|
||||
|
||||
CONTENTS
|
||||
|
||||
1. Background
|
||||
2. Basic Operation
|
||||
2-1. Mounting
|
||||
2-2. cgroup.subtree_control
|
||||
2-3. cgroup.controllers
|
||||
3. Structural Constraints
|
||||
3-1. Top-down
|
||||
3-2. No internal tasks
|
||||
4. Other Changes
|
||||
4-1. [Un]populated Notification
|
||||
4-2. Other Core Changes
|
||||
4-3. Per-Controller Changes
|
||||
4-3-1. blkio
|
||||
4-3-2. cpuset
|
||||
4-3-3. memory
|
||||
5. Planned Changes
|
||||
5-1. CAP for resource control
|
||||
|
||||
|
||||
1. Background
|
||||
|
||||
cgroup allows an arbitrary number of hierarchies and each hierarchy
|
||||
can host any number of controllers. While this seems to provide a
|
||||
high level of flexibility, it isn't quite useful in practice.
|
||||
|
||||
For example, as there is only one instance of each controller, utility
|
||||
type controllers such as freezer which can be useful in all
|
||||
hierarchies can only be used in one. The issue is exacerbated by the
|
||||
fact that controllers can't be moved around once hierarchies are
|
||||
populated. Another issue is that all controllers bound to a hierarchy
|
||||
are forced to have exactly the same view of the hierarchy. It isn't
|
||||
possible to vary the granularity depending on the specific controller.
|
||||
|
||||
In practice, these issues heavily limit which controllers can be put
|
||||
on the same hierarchy and most configurations resort to putting each
|
||||
controller on its own hierarchy. Only closely related ones, such as
|
||||
the cpu and cpuacct controllers, make sense to put on the same
|
||||
hierarchy. This often means that userland ends up managing multiple
|
||||
similar hierarchies repeating the same steps on each hierarchy
|
||||
whenever a hierarchy management operation is necessary.
|
||||
|
||||
Unfortunately, support for multiple hierarchies comes at a steep cost.
|
||||
Internal implementation in cgroup core proper is dazzlingly
|
||||
complicated but more importantly the support for multiple hierarchies
|
||||
restricts how cgroup is used in general and what controllers can do.
|
||||
|
||||
There's no limit on how many hierarchies there may be, which means
|
||||
that a task's cgroup membership can't be described in finite length.
|
||||
The key may contain any varying number of entries and is unlimited in
|
||||
length, which makes it highly awkward to handle and leads to addition
|
||||
of controllers which exist only to identify membership, which in turn
|
||||
exacerbates the original problem.
|
||||
|
||||
Also, as a controller can't have any expectation regarding what shape
|
||||
of hierarchies other controllers would be on, each controller has to
|
||||
assume that all other controllers are operating on completely
|
||||
orthogonal hierarchies. This makes it impossible, or at least very
|
||||
cumbersome, for controllers to cooperate with each other.
|
||||
|
||||
In most use cases, putting controllers on hierarchies which are
|
||||
completely orthogonal to each other isn't necessary. What usually is
|
||||
called for is the ability to have differing levels of granularity
|
||||
depending on the specific controller. In other words, hierarchy may
|
||||
be collapsed from leaf towards root when viewed from specific
|
||||
controllers. For example, a given configuration might not care about
|
||||
how memory is distributed beyond a certain level while still wanting
|
||||
to control how CPU cycles are distributed.
|
||||
|
||||
Unified hierarchy is the next version of cgroup interface. It aims to
|
||||
address the aforementioned issues by having more structure while
|
||||
retaining enough flexibility for most use cases. Various other
|
||||
general and controller-specific interface issues are also addressed in
|
||||
the process.
|
||||
|
||||
|
||||
2. Basic Operation
|
||||
|
||||
2-1. Mounting
|
||||
|
||||
Currently, unified hierarchy can be mounted with the following mount
|
||||
command. Note that this is still under development and scheduled to
|
||||
change soon.
|
||||
|
||||
mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
|
||||
|
||||
All controllers which are not bound to other hierarchies are
|
||||
automatically bound to unified hierarchy and show up at the root of
|
||||
it. Controllers which are enabled only in the root of unified
|
||||
hierarchy can be bound to other hierarchies at any time. This allows
|
||||
mixing unified hierarchy with the traditional multiple hierarchies in
|
||||
a fully backward compatible way.
|
||||
|
||||
|
||||
2-2. cgroup.subtree_control
|
||||
|
||||
All cgroups on unified hierarchy have a "cgroup.subtree_control" file
|
||||
which governs which controllers are enabled on the children of the
|
||||
cgroup. Let's assume a hierarchy like the following.
|
||||
|
||||
root - A - B - C
|
||||
\ D
|
||||
|
||||
root's "cgroup.subtree_control" file determines which controllers are
|
||||
enabled on A. A's on B. B's on C and D. This coincides with the
|
||||
fact that controllers on the immediate sub-level are used to
|
||||
distribute the resources of the parent. In fact, it's natural to
|
||||
assume that resource control knobs of a child belong to its parent.
|
||||
Enabling a controller in a "cgroup.subtree_control" file declares that
|
||||
distribution of the respective resources of the cgroup will be
|
||||
controlled. Note that this means that controller enable states are
|
||||
shared among siblings.
|
||||
|
||||
When read, the file contains a space-separated list of currently
|
||||
enabled controllers. A write to the file should contain a
|
||||
space-separated list of controllers with '+' or '-' prefixed (without
|
||||
the quotes). Controllers prefixed with '+' are enabled and '-'
|
||||
disabled. If a controller is listed multiple times, the last entry
|
||||
wins. The specific operations are executed atomically - either all
|
||||
succeed or fail.
|
||||
|
||||
|
||||
2-3. cgroup.controllers
|
||||
|
||||
Read-only "cgroup.controllers" file contains a space-separated list of
|
||||
controllers which can be enabled in the cgroup's
|
||||
"cgroup.subtree_control" file.
|
||||
|
||||
In the root cgroup, this lists controllers which are not bound to
|
||||
other hierarchies and the content changes as controllers are bound to
|
||||
and unbound from other hierarchies.
|
||||
|
||||
In non-root cgroups, the content of this file equals that of the
|
||||
parent's "cgroup.subtree_control" file as only controllers enabled
|
||||
from the parent can be used in its children.
|
||||
|
||||
|
||||
3. Structural Constraints
|
||||
|
||||
3-1. Top-down
|
||||
|
||||
As it doesn't make sense to nest control of an uncontrolled resource,
|
||||
all non-root "cgroup.subtree_control" files can only contain
|
||||
controllers which are enabled in the parent's "cgroup.subtree_control"
|
||||
file. A controller can be enabled only if the parent has the
|
||||
controller enabled and a controller can't be disabled if one or more
|
||||
children have it enabled.
|
||||
|
||||
|
||||
3-2. No internal tasks
|
||||
|
||||
One long-standing issue that cgroup faces is the competition between
|
||||
tasks belonging to the parent cgroup and its children cgroups. This
|
||||
is inherently nasty as two different types of entities compete and
|
||||
there is no agreed-upon obvious way to handle it. Different
|
||||
controllers are doing different things.
|
||||
|
||||
The cpu controller considers tasks and cgroups as equivalents and maps
|
||||
nice levels to cgroup weights. This works for some cases but falls
|
||||
flat when children should be allocated specific ratios of CPU cycles
|
||||
and the number of internal tasks fluctuates - the ratios constantly
|
||||
change as the number of competing entities fluctuates. There also are
|
||||
other issues. The mapping from nice level to weight isn't obvious or
|
||||
universal, and there are various other knobs which simply aren't
|
||||
available for tasks.
|
||||
|
||||
The blkio controller implicitly creates a hidden leaf node for each
|
||||
cgroup to host the tasks. The hidden leaf has its own copies of all
|
||||
the knobs with "leaf_" prefixed. While this allows equivalent control
|
||||
over internal tasks, it's with serious drawbacks. It always adds an
|
||||
extra layer of nesting which may not be necessary, makes the interface
|
||||
messy and significantly complicates the implementation.
|
||||
|
||||
The memory controller currently doesn't have a way to control what
|
||||
happens between internal tasks and child cgroups and the behavior is
|
||||
not clearly defined. There have been attempts to add ad-hoc behaviors
|
||||
and knobs to tailor the behavior to specific workloads. Continuing
|
||||
this direction will lead to problems which will be extremely difficult
|
||||
to resolve in the long term.
|
||||
|
||||
Multiple controllers struggle with internal tasks and came up with
|
||||
different ways to deal with it; unfortunately, all the approaches in
|
||||
use now are severely flawed and, furthermore, the widely different
|
||||
behaviors make cgroup as whole highly inconsistent.
|
||||
|
||||
It is clear that this is something which needs to be addressed from
|
||||
cgroup core proper in a uniform way so that controllers don't need to
|
||||
worry about it and cgroup as a whole shows a consistent and logical
|
||||
behavior. To achieve that, unified hierarchy enforces the following
|
||||
structural constraint:
|
||||
|
||||
Except for the root, only cgroups which don't contain any task may
|
||||
have controllers enabled in their "cgroup.subtree_control" files.
|
||||
|
||||
Combined with other properties, this guarantees that, when a
|
||||
controller is looking at the part of the hierarchy which has it
|
||||
enabled, tasks are always only on the leaves. This rules out
|
||||
situations where child cgroups compete against internal tasks of the
|
||||
parent.
|
||||
|
||||
There are two things to note. Firstly, the root cgroup is exempt from
|
||||
the restriction. Root contains tasks and anonymous resource
|
||||
consumption which can't be associated with any other cgroup and
|
||||
requires special treatment from most controllers. How resource
|
||||
consumption in the root cgroup is governed is up to each controller.
|
||||
|
||||
Secondly, the restriction doesn't take effect if there is no enabled
|
||||
controller in the cgroup's "cgroup.subtree_control" file. This is
|
||||
important as otherwise it wouldn't be possible to create children of a
|
||||
populated cgroup. To control resource distribution of a cgroup, the
|
||||
cgroup must create children and transfer all its tasks to the children
|
||||
before enabling controllers in its "cgroup.subtree_control" file.
|
||||
|
||||
|
||||
4. Other Changes
|
||||
|
||||
4-1. [Un]populated Notification
|
||||
|
||||
cgroup users often need a way to determine when a cgroup's
|
||||
subhierarchy becomes empty so that it can be cleaned up. cgroup
|
||||
currently provides release_agent for it; unfortunately, this mechanism
|
||||
is riddled with issues.
|
||||
|
||||
- It delivers events by forking and execing a userland binary
|
||||
specified as the release_agent. This is a long deprecated method of
|
||||
notification delivery. It's extremely heavy, slow and cumbersome to
|
||||
integrate with larger infrastructure.
|
||||
|
||||
- There is single monitoring point at the root. There's no way to
|
||||
delegate management of a subtree.
|
||||
|
||||
- The event isn't recursive. It triggers when a cgroup doesn't have
|
||||
any tasks or child cgroups. Events for internal nodes trigger only
|
||||
after all children are removed. This again makes it impossible to
|
||||
delegate management of a subtree.
|
||||
|
||||
- Events are filtered from the kernel side. A "notify_on_release"
|
||||
file is used to subscribe to or suppress release events. This is
|
||||
unnecessarily complicated and probably done this way because event
|
||||
delivery itself was expensive.
|
||||
|
||||
Unified hierarchy implements an interface file "cgroup.populated"
|
||||
which can be used to monitor whether the cgroup's subhierarchy has
|
||||
tasks in it or not. Its value is 0 if there is no task in the cgroup
|
||||
and its descendants; otherwise, 1. poll and [id]notify events are
|
||||
triggered when the value changes.
|
||||
|
||||
This is significantly lighter and simpler and trivially allows
|
||||
delegating management of subhierarchy - subhierarchy monitoring can
|
||||
block further propagation simply by putting itself or another process
|
||||
in the subhierarchy and monitor events that it's interested in from
|
||||
there without interfering with monitoring higher in the tree.
|
||||
|
||||
In unified hierarchy, the release_agent mechanism is no longer
|
||||
supported and the interface files "release_agent" and
|
||||
"notify_on_release" do not exist.
|
||||
|
||||
|
||||
4-2. Other Core Changes
|
||||
|
||||
- None of the mount options is allowed.
|
||||
|
||||
- remount is disallowed.
|
||||
|
||||
- rename(2) is disallowed.
|
||||
|
||||
- The "tasks" file is removed. Everything should at process
|
||||
granularity. Use the "cgroup.procs" file instead.
|
||||
|
||||
- The "cgroup.procs" file is not sorted. pids will be unique unless
|
||||
they got recycled in-between reads.
|
||||
|
||||
- The "cgroup.clone_children" file is removed.
|
||||
|
||||
|
||||
4-3. Per-Controller Changes
|
||||
|
||||
4-3-1. blkio
|
||||
|
||||
- blk-throttle becomes properly hierarchical.
|
||||
|
||||
|
||||
4-3-2. cpuset
|
||||
|
||||
- Tasks are kept in empty cpusets after hotplug and take on the masks
|
||||
of the nearest non-empty ancestor, instead of being moved to it.
|
||||
|
||||
- A task can be moved into an empty cpuset, and again it takes on the
|
||||
masks of the nearest non-empty ancestor.
|
||||
|
||||
|
||||
4-3-3. memory
|
||||
|
||||
- use_hierarchy is on by default and the cgroup file for the flag is
|
||||
not created.
|
||||
|
||||
|
||||
5. Planned Changes
|
||||
|
||||
5-1. CAP for resource control
|
||||
|
||||
Unified hierarchy will require one of the capabilities(7), which is
|
||||
yet to be decided, for all resource control related knobs. Process
|
||||
organization operations - creation of sub-cgroups and migration of
|
||||
processes in sub-hierarchies may be delegated by changing the
|
||||
ownership and/or permissions on the cgroup directory and
|
||||
"cgroup.procs" interface file; however, all operations which affect
|
||||
resource control - writes to a "cgroup.subtree_control" file or any
|
||||
controller-specific knobs - will require an explicit CAP privilege.
|
||||
|
||||
This, in part, is to prevent the cgroup interface from being
|
||||
inadvertently promoted to programmable API used by non-privileged
|
||||
binaries. cgroup exposes various aspects of the system in ways which
|
||||
aren't properly abstracted for direct consumption by regular programs.
|
||||
This is an administration interface much closer to sysctl knobs than
|
||||
system calls. Even the basic access model, being filesystem path
|
||||
based, isn't suitable for direct consumption. There's no way to
|
||||
access "my cgroup" in a race-free way or make multiple operations
|
||||
atomic against migration to another cgroup.
|
||||
|
||||
Another aspect is that, for better or for worse, the cgroup interface
|
||||
goes through far less scrutiny than regular interfaces for
|
||||
unprivileged userland. The upside is that cgroup is able to expose
|
||||
useful features which may not be suitable for general consumption in a
|
||||
reasonable time frame. It provides a relatively short path between
|
||||
internal details and userland-visible interface. Of course, this
|
||||
shortcut comes with high risk. We go through what we go through for
|
||||
general kernel APIs for good reasons. It may end up leaking internal
|
||||
details in a way which can exert significant pain by locking the
|
||||
kernel into a contract that can't be maintained in a reasonable
|
||||
manner.
|
||||
|
||||
Also, due to the specific nature, cgroup and its controllers don't
|
||||
tend to attract attention from a wide scope of developers. cgroup's
|
||||
short history is already fraught with severely mis-designed
|
||||
interfaces, unnecessary commitments to and exposing of internal
|
||||
details, broken and dangerous implementations of various features.
|
||||
|
||||
Keeping cgroup as an administration interface is both advantageous for
|
||||
its role and imperative given its nature. Some of the cgroup features
|
||||
may make sense for unprivileged access. If deemed justified, those
|
||||
must be further abstracted and implemented as a different interface,
|
||||
be it a system call or process-private filesystem, and survive through
|
||||
the scrutiny that any interface for general consumption is required to
|
||||
go through.
|
||||
|
||||
Requiring CAP is not a complete solution but should serve as a
|
||||
significant deterrent against spraying cgroup usages in non-privileged
|
||||
programs.
|
Загрузка…
Ссылка в новой задаче