268 строки
10 KiB
ReStructuredText
268 строки
10 KiB
ReStructuredText
=========================
|
|
BPF Graph Data Structures
|
|
=========================
|
|
|
|
This document describes implementation details of new-style "graph" data
|
|
structures (linked_list, rbtree), with particular focus on the verifier's
|
|
implementation of semantics specific to those data structures.
|
|
|
|
Although no specific verifier code is referred to in this document, the document
|
|
assumes that the reader has general knowledge of BPF verifier internals, BPF
|
|
maps, and BPF program writing.
|
|
|
|
Note that the intent of this document is to describe the current state of
|
|
these graph data structures. **No guarantees** of stability for either
|
|
semantics or APIs are made or implied here.
|
|
|
|
.. contents::
|
|
:local:
|
|
:depth: 2
|
|
|
|
Introduction
|
|
------------
|
|
|
|
The BPF map API has historically been the main way to expose data structures
|
|
of various types for use within BPF programs. Some data structures fit naturally
|
|
with the map API (HASH, ARRAY), others less so. Consequentially, programs
|
|
interacting with the latter group of data structures can be hard to parse
|
|
for kernel programmers without previous BPF experience.
|
|
|
|
Luckily, some restrictions which necessitated the use of BPF map semantics are
|
|
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
|
|
BPF allocator, it is now possible to implement BPF data structures whose API
|
|
and semantics more closely match those exposed to the rest of the kernel.
|
|
|
|
Two such data structures - linked_list and rbtree - have many verification
|
|
details in common. Because both have "root"s ("head" for linked_list) and
|
|
"node"s, the verifier code and this document refer to common functionality
|
|
as "graph_api", "graph_root", "graph_node", etc.
|
|
|
|
Unless otherwise stated, examples and semantics below apply to both graph data
|
|
structures.
|
|
|
|
Unstable API
|
|
------------
|
|
|
|
Data structures implemented using the BPF map API have historically used BPF
|
|
helper functions - either standard map API helpers like ``bpf_map_update_elem``
|
|
or map-specific helpers. The new-style graph data structures instead use kfuncs
|
|
to define their manipulation helpers. Because there are no stability guarantees
|
|
for kfuncs, the API and semantics for these data structures can be evolved in
|
|
a way that breaks backwards compatibility if necessary.
|
|
|
|
Root and node types for the new data structures are opaquely defined in the
|
|
``uapi/linux/bpf.h`` header.
|
|
|
|
Locking
|
|
-------
|
|
|
|
The new-style data structures are intrusive and are defined similarly to their
|
|
vanilla kernel counterparts:
|
|
|
|
.. code-block:: c
|
|
|
|
struct node_data {
|
|
long key;
|
|
long data;
|
|
struct bpf_rb_node node;
|
|
};
|
|
|
|
struct bpf_spin_lock glock;
|
|
struct bpf_rb_root groot __contains(node_data, node);
|
|
|
|
The "root" type for both linked_list and rbtree expects to be in a map_value
|
|
which also contains a ``bpf_spin_lock`` - in the above example both global
|
|
variables are placed in a single-value arraymap. The verifier considers this
|
|
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
|
|
the same map_value and will enforce that the correct lock is held when
|
|
verifying BPF programs that manipulate the tree. Since this lock checking
|
|
happens at verification time, there is no runtime penalty.
|
|
|
|
Non-owning references
|
|
---------------------
|
|
|
|
**Motivation**
|
|
|
|
Consider the following BPF code:
|
|
|
|
.. code-block:: c
|
|
|
|
struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
|
|
|
|
bpf_spin_lock(&lock);
|
|
|
|
bpf_rbtree_add(&tree, n); /* PASSED */
|
|
|
|
bpf_spin_unlock(&lock);
|
|
|
|
From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
|
|
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
|
|
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
|
|
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
|
|
The BPF program must pass off ownership before exiting - either via
|
|
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
|
|
``bpf_rbtree_add``.
|
|
|
|
(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
|
|
"ownership is acquired" and "ownership is passed", respectively)
|
|
|
|
What should the verifier do with ``n`` after ownership is passed off? If the
|
|
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
|
|
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
|
|
the object is no longer valid. The underlying memory may have been reused for
|
|
some other allocation, unmapped, etc.
|
|
|
|
When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
|
|
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
|
|
but that would result in programs with useful, common coding patterns being
|
|
rejected, e.g.:
|
|
|
|
.. code-block:: c
|
|
|
|
int x;
|
|
struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
|
|
|
|
bpf_spin_lock(&lock);
|
|
|
|
bpf_rbtree_add(&tree, n); /* PASSED */
|
|
x = n->data;
|
|
n->data = 42;
|
|
|
|
bpf_spin_unlock(&lock);
|
|
|
|
Both the read from and write to ``n->data`` would be rejected. The verifier
|
|
can do better, though, by taking advantage of two details:
|
|
|
|
* Graph data structure APIs can only be used when the ``bpf_spin_lock``
|
|
associated with the graph root is held
|
|
|
|
* Both graph data structures have pointer stability
|
|
|
|
* Because graph nodes are allocated with ``bpf_obj_new`` and
|
|
adding / removing from the root involves fiddling with the
|
|
``bpf_{list,rb}_node`` field of the node struct, a graph node will
|
|
remain at the same address after either operation.
|
|
|
|
Because the associated ``bpf_spin_lock`` must be held by any program adding
|
|
or removing, if we're in the critical section bounded by that lock, we know
|
|
that no other program can add or remove until the end of the critical section.
|
|
This combined with pointer stability means that, until the critical section
|
|
ends, we can safely access the graph node through ``n`` even after it was used
|
|
to pass ownership.
|
|
|
|
The verifier considers such a reference a *non-owning reference*. The ref
|
|
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
|
|
Both terms currently only have meaning in the context of graph nodes and API.
|
|
|
|
**Details**
|
|
|
|
Let's enumerate the properties of both types of references.
|
|
|
|
*owning reference*
|
|
|
|
* This reference controls the lifetime of the pointee
|
|
|
|
* Ownership of pointee must be 'released' by passing it to some graph API
|
|
kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
|
|
|
|
* If not released before program ends, verifier considers program invalid
|
|
|
|
* Access to the pointee's memory will not page fault
|
|
|
|
*non-owning reference*
|
|
|
|
* This reference does not own the pointee
|
|
|
|
* It cannot be used to add the graph node to a graph root, nor ``free``'d via
|
|
``bpf_obj_drop``
|
|
|
|
* No explicit control of lifetime, but can infer valid lifetime based on
|
|
non-owning ref existence (see explanation below)
|
|
|
|
* Access to the pointee's memory will not page fault
|
|
|
|
From verifier's perspective non-owning references can only exist
|
|
between spin_lock and spin_unlock. Why? After spin_unlock another program
|
|
can do arbitrary operations on the data structure like removing and ``free``-ing
|
|
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
|
|
``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
|
|
Or the memory could go away.
|
|
|
|
To prevent this logic violation all non-owning references are invalidated by the
|
|
verifier after a critical section ends. This is necessary to ensure the "will
|
|
not page fault" property of non-owning references. So if the verifier hasn't
|
|
invalidated a non-owning ref, accessing it will not page fault.
|
|
|
|
Currently ``bpf_obj_drop`` is not allowed in the critical section, so
|
|
if there's a valid non-owning ref, we must be in a critical section, and can
|
|
conclude that the ref's memory hasn't been dropped-and- ``free``'d or
|
|
dropped-and-reused.
|
|
|
|
Any reference to a node that is in an rbtree _must_ be non-owning, since
|
|
the tree has control of the pointee's lifetime. Similarly, any ref to a node
|
|
that isn't in rbtree _must_ be owning. This results in a nice property:
|
|
graph API add / remove implementations don't need to check if a node
|
|
has already been added (or already removed), as the ownership model
|
|
allows the verifier to prevent such a state from being valid by simply checking
|
|
types.
|
|
|
|
However, pointer aliasing poses an issue for the above "nice property".
|
|
Consider the following example:
|
|
|
|
.. code-block:: c
|
|
|
|
struct node_data *n, *m, *o, *p;
|
|
n = bpf_obj_new(typeof(*n)); /* 1 */
|
|
|
|
bpf_spin_lock(&lock);
|
|
|
|
bpf_rbtree_add(&tree, n); /* 2 */
|
|
m = bpf_rbtree_first(&tree); /* 3 */
|
|
|
|
o = bpf_rbtree_remove(&tree, n); /* 4 */
|
|
p = bpf_rbtree_remove(&tree, m); /* 5 */
|
|
|
|
bpf_spin_unlock(&lock);
|
|
|
|
bpf_obj_drop(o);
|
|
bpf_obj_drop(p); /* 6 */
|
|
|
|
Assume the tree is empty before this program runs. If we track verifier state
|
|
changes here using numbers in above comments:
|
|
|
|
1) n is an owning reference
|
|
|
|
2) n is a non-owning reference, it's been added to the tree
|
|
|
|
3) n and m are non-owning references, they both point to the same node
|
|
|
|
4) o is an owning reference, n and m non-owning, all point to same node
|
|
|
|
5) o and p are owning, n and m non-owning, all point to the same node
|
|
|
|
6) a double-free has occurred, since o and p point to same node and o was
|
|
``free``'d in previous statement
|
|
|
|
States 4 and 5 violate our "nice property", as there are non-owning refs to
|
|
a node which is not in an rbtree. Statement 5 will try to remove a node which
|
|
has already been removed as a result of this violation. State 6 is a dangerous
|
|
double-free.
|
|
|
|
At a minimum we should prevent state 6 from being possible. If we can't also
|
|
prevent state 5 then we must abandon our "nice property" and check whether a
|
|
node has already been removed at runtime.
|
|
|
|
We prevent both by generalizing the "invalidate non-owning references" behavior
|
|
of ``bpf_spin_unlock`` and doing similar invalidation after
|
|
``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
|
|
|
|
* takes an arbitrary node argument
|
|
|
|
* removes it from the data structure
|
|
|
|
* returns an owning reference to the removed node
|
|
|
|
May result in a state where some other non-owning reference points to the same
|
|
node. So ``remove``-type kfuncs must be considered a non-owning reference
|
|
invalidation point as well.
|