10 Bounds declarations
David Tarditi редактировал(а) эту страницу 2023-07-07 17:36:42 -07:00

Overview

Programmers declare bounds for variables and members with _Array_ptr, _Nt_array_ptr, or incomplete checked array types as part of variable and member declarations. The bounds for a variable or member are declared by following the declarator for the variable or member with a : and a bounds expression. The bounds declaration precedes any initializer.

There are 4 kinds of bounds expressions:

  1. count(e): the number of array elements that a pointer points to.
  2. byte_count(e): the number of bytes that a pointer points to.
  3. bounds(lower, upper): the lower and upper bound for a pointer. For an _Array_ptr variable, memory at or above lower and below upper can be accessed. For an _Nt_array_ptr variable, the element beginning at upper can be read or a null value can be assign there.
  4. bounds(unknown): the pointer has unknown bounds and cannot be used to access memory.

Count and byte_count are syntactic sugar for bounds. Bounds expressions are evaluated as part of bounds checking memory accesses, so they are not allowed to modify memory. They cannot be or contain assignment expressions, pre/post increment/decrement expressions, or function calls.

Bounds do not need to be declared for variables and members with complete checked array types (where the dimension size is given). The compiler infers the bounds from the declared size.

When a bounds expression is declared for a parameter, it can refer to any other parameter in the parameter list. The entire parameter list is parsed and then type checking of bounds expressions and checking of other well-formedness conditions is done.

Programmers can also declare bounds for the return value of a function. The return bounds are declared by following the function declarator with a ':' and a bounds expression. The return bounds can refer to any parameter. They can also refer to the special symbol _Return_value, which is the return value of the function.

Examples

// Function taking pointer to 5 integers.
int f(_Array_ptr<int> p : count(5));

// Equivalent declaration using byte_count.
int f(_Array_ptr<int> p : byte_count(5 * sizeof(int));

// Function taking pointer to len integers.
int g(_Array_ptr<int> p : count(len), int len);

// Equivalent declaration using an array type.
// Parameters with array types are treated as
// having pointer types.
int g(int p _Checked[] : count(len), int len);

// Function returning pointer to len integers.
_Array_ptr<int> h(int len) : count(len);

// Global variable pointing to 5 integers, initialized to
// null.
_Array_ptr<int> r : count(5) = 0;  

// Variable length array declaration.
struct VarArr {
    int len;
    int arr _Checked[] : count(len);
};

Keeping bounds declarations valid

A bounds declaration for a variable or an object member is valid:

  • If the declared bounds are within the bounds of an object at runtime.
  • Or the variable or object member is null. Null pointers cannot be used to access memory, so the bounds possibly being out-of-range of an object does not matter.

A bounds declaration must be valid for the lifetime of its variable or object member. All variables with bounds declarations must be initialized at declaration time. This also includes struct variables that have members with bounds. Struct variables can be initialized by declaring a value for the first member of the struct. The remaining values will be initialized to default values, which for pointers is null:

VarArr va = { 0 };

Heap-allocated data must be allocated with calloc or the programmer must ensure that it is initialized before it is used (we will eventually add checking that enforces this, but it isn't done yet).

Declared bounds must be valid after the end of every statement, unless the statement is in bundled scope. Declared bounds must be valid at the end of bundled scopes. If there is an assignment to a variable with a bounds declaration, or an assignment to a variable used by an in-scope bounds declaration, the programmer must ensure the declared bounds are valid at the end of the statement or boundled scope.

The compiler checks statically that declared bounds are valid. The static checking algorithm is described in Bounds declaration checking. It is not finished yet, so currently the compiler issues a warning if it cannot prove that bounds are valid. Eventually the compiler will issue an error. A programmer can use a dynamic runtime check to eliminate the warning (or error).

A program can create a pointer to a variable or member by taking its address. For all practical purposes, we do not allow pointers to variables or members that have bounds or are used in bounds to be created. This prevents bounds declaration checking from being bypassed via pointers. The result of an address-of operation for a variable or member that has bounds or is used in bounds and that does not have array type (or is a pointer derived from such an operation), cannot be used as:

  • the right-hand side of an assignment or as an argument to a call. The pointer value could escape to an unknown use.
  • In an lvalue expression that is assigned to. This would modify the variable or member.
_Array_ptr<int> p : count(len) = ...
_Array_ptr<int> r : count(len) = *&p;     // legal
_Ptr<Array_ptr<int>> pp = cond ? &p : &r; // not allowed.
_Ptr<int> plen = &len;                    // not allowed
*&p = r;                                  // not allowed

If a programmer uses unchecked pointers to modify a variable or member with bounds, it is the programmer's responsibility to maintain the validity of the bounds declaration. For example, a programmer could take the address of a member of a structure without bounds or not used in bounds, and use it to reach and modify an adjacent member in the structure. This cannot be done using checked pointers because the write to the adjacent member will not be allowed (it will fail the bounds check), but it could be done using unchecked pointers.

Narrowing and widening of bounds

It is OK to narrow the bounds of a pointer. This is useful when a function only operates over a portion of data. Consider a function that checks that integrity of data in a buffer. The checksum is stored in the first 4 bytes of a buffer and the data is in the remainder of the buffer.

int convert_to_int(Array_ptr<unsigned char> start : count(len), int len);

int checksum(_Array_ptr<unsigned char> start : bounds(start, end),
             _Array_ptr<unsigned char> end);

int verify_data(_Array_ptr<unsigned char> buf : bounds(buf, end),
                _Array_ptr<unsigned char> end) {
  if (buf + 4 > end)
    return 0;

  int target = convert_to_int(buf, 4);
  int computed = checksum(buf + 4, end);
  return computed == target;
}

Bounds of _Nt_array_ptr pointers can be widened if the value at the upper bound is non-null (note that this widening is not yet implemented in the compiler!).

_Nt_array_ptr<char> p : count(0);
if (*p) {
  _Nt_array_ptr<char> r : count(1) = p;
   ...
}

Bounds declaration checking

Bounds declarations are checked at compile-time at variable declarations, after assignments, and at function calls. The validity of the bounds declaration must follow from the constructed bounds for expressions and additional program facts gathered by dataflow analysis. The facts may include equality facts as well as relational facts.

If the compiler cannot prove (or disprove) the validity of the bounds, a programmer can insert a dynamic runtime check to avoid a compiler warning. The static checking algorithm that Checked C uses is incomplete, and there will be facts about bounds that are true but cannot be proved at compile time.

For a variable declaration, the compiler constructs the bounds for the initializer expression. It then checks that those bounds imply declared bounds:

_Array_ptr<int> x : count(10) = ...
_Array_ptr<int> y : count(5) = x; 

For assignments, the compiler checks that the bounds of variables and members are valid after expression statements. In the simplest case, the bounds of the right-hand side expression of an assignment imply the bounds for a variable or a member.

_Array_ptr<int> x : count(10) = ...
_Array_ptr<int> y : count(5) = ...
y = x;

C allows multiple assignments within a single statement. In more complex cases, a variable with a bounds and variables used in bounds are updated.

_Array_ptr<int> x : count(x_len) = ...
_Array_ptr<int> y : count(y_len) = ...

 x = y , x_len = y_len;

For function calls, the compiler checks that the bounds of argument expressions imply parameter bounds. The compiler does this by "translating" the parameter bounds into bounds at the call site. It substitutes the actual arguments into the parameter bounds. If the arguments have side-effects, the compiler introduces (symbolic) temporaries.

Reasoning about bounds expressions.

The bounds declaration checking algorithm reasons about:

  1. Expressions that are equivalent (always compute the same value.)
  2. Constant-sized ranges. These are bounds that can be expressed in the form (e1 + c, e1 + d), where c and d are compile-time constants.

For determining expression equivalence, the compiler starts with a list of sets of expressions that are known to evaluate to equal values (for example, given the assignment x = y, the compiler knows after the assignment that x == y). Two expressions are equivalent if:

  1. They are syntactically identical.
  2. They are both in the same set of expressions known to be equal.
  3. They constant-fold to the same value.
  4. The expressions both use the same operator, each of their operands is equivalent, and the corresponding operands have matching types.
  5. One expression is parenthesized and its subexpression is equivalent to the other expression.
  6. One expression is a value-preserving operation and its subexpression is equivalent to the other expression. Value-preserving operations include:
  • Parentheses.
  • C-style casts that preserve bitwise values (such as casts between pointer types).
  • * and & applied to function and pointer types.

To see if one constant-sized range (e1 + c1, e1 + d1) implies the validity (or falseness) of another constant-sized range (e2 + c2 , e2 + d2), the compiler checks if the base expressions are equivalent. If they are not equivalent, no conclusion can be drawn about truth or falseness. If they are equivalent, the compiler checks that c1 <= c2 and d1 >= d2.

Bounds declaration checking after assignments

To check bounds declarations after assignments, we begin with bounds that are true in the program state before the assignments (the "before state"). We then translate that information to the program state after the assignments . That way we have bounds in the "after" state that we can use to reason about the validity of bounds declarations there.

This done in 4 steps:

  1. Gather the set of required bounds for any variables or members that are assigned to in the statement. If an assignment is made to a variable or member with bounds, the bounds are included in the set. Also, if an assignment is made to any variable used in a in-scope bounds declaration, those bounds are included in the set. These bounds must be true about the "after" program state.
  2. Compute the bounds for the right-hand sides of any assignments. These are bounds that true in the "before" program state.
  3. Update the bounds obtained in step 2 based on the effects of assignments in the statement. These are now bounds true in the "after" program state.
  4. Check that the updated bounds along with dataflow facts imply the required bounds.

Step 3 involves two steps:

  1. Some assignments have right-hand sides that are invertible expressions. The prior value of the expression can be calculated from the new value. For example, given x = x + 1, where x is an unsigned integer, the prior value of x is x - 1. For those assignments, the original variable is replaced with the inverted expression.
  2. For all other assignments, any bounds using a variable updated by those assignment are changed to bounds(unknown)