mirror of https://github.com/microsoft/clang-1.git
Convert region-design document to HTML.
git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@68366 91177308-0d34-0410-b5e6-96231b3b80d8
This commit is contained in:
Parent
8a2073a856
Commit
d76e0a6c22
|
@@ -0,0 +1,232 @@
|
|||
<html>
|
||||
<head>
|
||||
<title>Static Analyzer Design Document: Memory Regions</title>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>Static Analyzer Design Document: Memory Regions</h1>
|
||||
|
||||
<h3>Authors</h3>
|
||||
|
||||
<p>Ted Kremenek, <tt>kremenek at apple</tt><br>
|
||||
Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>
|
||||
|
||||
<h2 id="intro">Introduction</h2>
|
||||
|
||||
<p>The path-sensitive analysis engine in libAnalysis employs an extensible API
for abstractly modeling the memory of an analyzed program. This API uses the
concept of "memory regions" to abstractly model chunks of program memory, such
as program variables and dynamically allocated memory like that returned by
'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
subtyping relationships, field and array offsets into larger chunks of memory,
and so on.</p>
|
||||
|
||||
<p>The region API consists of two components:</p>
|
||||
|
||||
<ul>
<li>A taxonomy and representation of regions themselves within the analyzer
engine. The primary definitions and interfaces are described in <tt><a href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
At the root of the region hierarchy is the class <tt>MemRegion</tt>, with
specific subclasses refining the region concept for variables, heap-allocated
memory, and so forth.</li>
<li>The modeling of the binding of values to regions. For example, modeling the
value stored to a local variable <tt>x</tt> consists of recording the binding
between the region for <tt>x</tt> (which represents the raw memory associated
with <tt>x</tt>) and the value stored to <tt>x</tt>. This binding relationship
is captured with the notion of "symbolic stores."</li>
</ul>
|
||||
|
||||
<p>Symbolic stores, which can be thought of as representing the relation
<tt>regions -> values</tt>, are implemented by subclasses of the
<tt>StoreManager</tt> class (<tt><a href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
particular StoreManager implementation has complete flexibility concerning the
following:</p>

<ul>
<li><em>How</em> to model the binding between regions and values</li>
<li><em>What</em> bindings are recorded</li>
</ul>
|
||||
|
||||
<p>Together, these two points allow different StoreManagers to trade off
analysis precision against scalability when reasoning about program memory.
Meanwhile, the core path-sensitive engine makes no assumptions about either
point, and queries a StoreManager about the bindings to a memory region through
a generic interface that all StoreManagers share. If a particular StoreManager
cannot reason about the potential bindings of a given memory region (e.g.,
'<tt>BasicStoreManager</tt>' does not reason about fields of structures), then
the StoreManager can simply return 'unknown' (represented by
'<tt>UnknownVal</tt>') for a particular region binding. This separation of
concerns not only isolates the core analysis engine from the details of
reasoning about program memory but also makes it easy for a client of the
path-sensitive engine to swap in different StoreManager implementations that
internally reason about program memory in very different ways.</p>
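<p>The separation described above can be illustrated with a small,
self-contained C++ sketch. It is not the analyzer's actual interface: every
name in it (<tt>ToyRegion</tt>, <tt>ToyValue</tt>, <tt>ToyStoreManager</tt>,
<tt>ToyFieldStore</tt>, <tt>ToyBasicStore</tt>, <tt>storeThenLoad</tt>) is
hypothetical, values are reduced to plain integers, and a <tt>known</tt> flag
stands in for <tt>UnknownVal</tt>.</p>

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for illustration; not the analyzer's real classes.
struct ToyRegion {
  std::string name;  // e.g. "x" or "s.f"
  bool isField;      // true when the region is a field of a structure
};

struct ToyValue {
  bool known;  // false plays the role of UnknownVal
  int data;    // meaningful only when known is true
};

// The generic interface that engine-side code is written against.
class ToyStoreManager {
public:
  virtual ~ToyStoreManager() {}
  virtual void bind(const ToyRegion &r, int v) = 0;
  virtual ToyValue retrieve(const ToyRegion &r) const = 0;
};

// Records a binding for every region it is given.
class ToyFieldStore : public ToyStoreManager {
  std::map<std::string, int> bindings;

public:
  void bind(const ToyRegion &r, int v) override { bindings[r.name] = v; }
  ToyValue retrieve(const ToyRegion &r) const override {
    auto it = bindings.find(r.name);
    if (it == bindings.end())
      return ToyValue{false, 0};
    return ToyValue{true, it->second};
  }
};

// Less precise: drops bindings to fields and answers "unknown" for them.
class ToyBasicStore : public ToyFieldStore {
public:
  void bind(const ToyRegion &r, int v) override {
    if (!r.isField)
      ToyFieldStore::bind(r, v);
  }
  ToyValue retrieve(const ToyRegion &r) const override {
    if (r.isField)
      return ToyValue{false, 0};
    return ToyFieldStore::retrieve(r);
  }
};

// Engine-side helper: depends only on the shared interface.
ToyValue storeThenLoad(ToyStoreManager &sm, const ToyRegion &r, int v) {
  sm.bind(r, v);
  return sm.retrieve(r);
}

// The two stores disagree only on what they know about fields.
void checkStores() {
  ToyBasicStore basic;
  ToyFieldStore precise;
  ToyRegion field = {"s.f", true};
  assert(!storeThenLoad(basic, field, 7).known);
  assert(storeThenLoad(precise, field, 7).data == 7);
}
```

<p>In this sketch <tt>storeThenLoad</tt> never needs to know which store it
was handed, which is the property that lets a client swap implementations.</p>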
|
||||
|
||||
<p>The rest of this document is divided into two parts. We first discuss region
|
||||
taxonomy and the semantics of regions. We then discuss the StoreManager
|
||||
interface, and details of how the currently available StoreManager classes
|
||||
implement region bindings.</p>
|
||||
|
||||
<h2 id="regions">Memory Regions and Region Taxonomy</h2>
|
||||
|
||||
<h3>Pointers</h3>
|
||||
|
||||
<p>Before discussing memory regions, we first discuss pointers, since memory
regions are essentially used to represent pointer values.</p>
|
||||
|
||||
<p>A pointer is a kind of value. A pointer value has two semantic aspects. One
is its physical value, which is an address or location. The other is the type
of the memory object residing at that address.</p>
|
||||
|
||||
<p>Memory regions are designed to abstract these two properties of a pointer.
The physical value of a pointer is represented by a MemRegion pointer. The
rvalue type of the region corresponds to the type of the pointee object.</p>
|
||||
|
||||
<p>One complication is that we could have different view regions on the same
memory chunk. They represent the same memory location, but have different
abstract locations, i.e., different MemRegion pointers. Thus we need to
canonicalize the abstract locations to get a unique abstract location for one
physical location.</p>
|
||||
|
||||
<p>Furthermore, these different view regions may or may not represent memory
objects of different types. Some differently named types are semantically the
same; for example, 'struct s' and 'my_type' are the same type.</p>
|
||||
|
||||
<pre>
struct s;
typedef struct s my_type;
</pre>
|
||||
|
||||
<p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
|
||||
|
||||
<pre>
void *p;
int *q = (int*) p;
char *r = (char*) p;
</pre>
|
||||
|
||||
<p>Thus we need to canonicalize the MemRegion that is used for binding and
retrieval.</p>
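<p>The distinction between the two examples can be checked mechanically. The
following C++ sketch uses the standard <tt>std::is_same</tt> trait as an
analogy for comparing canonical types; the helper names
<tt>sameCanonicalType</tt> and <tt>checkTypes</tt> are hypothetical and are
not part of the analyzer.</p>

```cpp
#include <cassert>
#include <type_traits>

struct s;                  // an incomplete type is enough for this check
typedef struct s my_type;  // sugar: another name for the same type

// True when A and B are the same type once typedef sugar is removed.
template <typename A, typename B>
constexpr bool sameCanonicalType() {
  return std::is_same<A, B>::value;
}

// Checked at compile time.
static_assert(sameCanonicalType<struct s, my_type>(),
              "a typedef does not introduce a new type");
static_assert(!sameCanonicalType<int, char>(),
              "int and char are distinct types");
static_assert(!sameCanonicalType<int *, char *>(),
              "so are the corresponding pointer types");

// The same check at run time.
void checkTypes() {
  assert((sameCanonicalType<my_type, struct s>()));
}
```

<p>Views that differ only by typedef sugar should therefore end up as one
region after canonicalization, while the <tt>int</tt> and <tt>char</tt> views
stay distinct.</p>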
|
||||
|
||||
<h3>Symbolic Regions</h3>
|
||||
|
||||
<p>A symbolic region is a mapping of the concept of symbolic values into the
domain of regions. It is the way that we represent symbolic pointers. Whenever
a symbolic pointer value is needed, a symbolic region is created to represent
it.</p>
|
||||
|
||||
<p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
type information associated with a symbolic region. In this case, a
TypedViewRegion is created to layer the type information on top of the symbolic
region. The reason we do not carry type information with the symbolic region is
that some symbolic regions have no type. To be consistent, we don't let any of
them carry type information.</p>
|
||||
|
||||
<p>Like a symbolic pointer, a symbolic region may be NULL, has an unknown
extent, and represents a generic chunk of memory.</p>
|
||||
|
||||
<p><em><b>NOTE</b>: We plan to stop using loc::SymbolVal in RegionStore and to
remove it gradually.</em></p>
|
||||
|
||||
<p>Symbolic regions get their rvalue types in the following ways:</p>
|
||||
|
||||
<ul>
<li>Through the parameter or global variable that points to it, e.g.:
<pre>
void f(struct s* p) {
  ...
}
</pre>

<p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
s</tt>.</p></li>

<li>Through explicit or implicit casts, e.g.:
<pre>
void f(void* p) {
  struct s* q = (struct s*) p;
  ...
}
</pre>
</li>
</ul>
|
||||
|
||||
<p>We attach the type information to the symbolic region lazily. For the first
case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
actually used to access the pointee memory object, that is, when the element or
field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
created when visiting the <tt>CastExpr</tt>.</p>
|
||||
|
||||
<p>The reason for doing lazy typing is that symbolic regions are sometimes used
only for location comparison, which needs no type information.</p>
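<p>A minimal sketch of lazy typing, under the assumption that a typed view is
just a record pointing at the untyped region beneath it. All names
(<tt>ToySymbolicRegion</tt>, <tt>ToyTypedView</tt>,
<tt>ToySymbolicPointer</tt>, <tt>checkLazyTyping</tt>) are hypothetical and
only illustrate the idea; they are not the analyzer's classes.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical model; the names below are illustrative only.
struct ToySymbolicRegion {
  unsigned symbolId;  // stands in for the wrapped symbol; carries no type
};

struct ToyTypedView {
  const ToySymbolicRegion *base;  // the untyped region underneath
  std::string type;               // type layered on top of it
};

class ToySymbolicPointer {
  ToySymbolicRegion region;
  std::string pointeeType;             // known from the declaration or cast
  std::unique_ptr<ToyTypedView> view;  // built only on first pointee access

public:
  ToySymbolicPointer(unsigned id, const std::string &type)
      : region{id}, pointeeType(type) {}

  // Location comparison needs no type, so no view is created.
  bool sameLocation(const ToySymbolicPointer &other) const {
    return region.symbolId == other.region.symbolId;
  }

  // Accessing the pointee (e.g. a field) forces the typed view into existence.
  const ToyTypedView &accessPointee() {
    if (!view)
      view.reset(new ToyTypedView{&region, pointeeType});
    return *view;
  }

  bool hasTypedView() const { return view != nullptr; }
};

void checkLazyTyping() {
  ToySymbolicPointer p(1, "struct s"), q(1, "struct s");
  assert(p.sameLocation(q) && !p.hasTypedView());
  assert(p.accessPointee().type == "struct s" && p.hasTypedView());
}
```

<p>Comparing <tt>p</tt> and <tt>q</tt> leaves both untyped; only the later
pointee access pays for building the view.</p>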
|
||||
|
||||
<h3>Pointer Casts</h3>
|
||||
|
||||
<p>Pointer casts allow people to impose different 'views' onto a chunk of
|
||||
memory.</p>
|
||||
|
||||
<p>Usually we have two kinds of casts. One kind casts down within the type
hierarchy. It imposes more specific views onto more generic memory regions.
The other kind casts up within the type hierarchy. It strips away more
specific views on top of the more generic memory regions.</p>
|
||||
|
||||
<p>We simulate down casts by layering another <tt>TypedViewRegion</tt> on top
of the original region. We simulate up casts by stripping away the top
<tt>TypedViewRegion</tt>. Down casts are usually simple. For up casts, if
there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
region. If the underlying region is of a different type than the cast-to type,
we flag an error state.</p>
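<p>The cast rules above can be sketched as a stack of views over a base
region. This is one possible reading of those rules, not the analyzer's
implementation: the names <tt>ToyViewedRegion</tt>, <tt>CastResult</tt>,
<tt>castDown</tt>, <tt>castUp</tt> and <tt>checkCasts</tt> are hypothetical,
types are compared as plain strings, and leaving the region unchanged on error
is an assumption of the sketch.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class CastResult { Ok, Error };

// Hypothetical model: a base region plus a stack of typed views on top.
class ToyViewedRegion {
  std::string baseType;            // type of the underlying region
  std::vector<std::string> views;  // layered views, most specific last

public:
  explicit ToyViewedRegion(const std::string &type) : baseType(type) {}

  // Type currently seen through the top-most view (or the base region).
  const std::string &currentType() const {
    return views.empty() ? baseType : views.back();
  }

  // Down cast: layer another typed view on top of the region.
  void castDown(const std::string &to) { views.push_back(to); }

  // Up cast: strip the top view. With nothing to strip, the region is
  // returned unchanged. If what lies underneath does not have the cast-to
  // type, report the error state and leave the region as it was.
  CastResult castUp(const std::string &to) {
    if (views.empty())
      return CastResult::Ok;
    const std::string &below =
        views.size() > 1 ? views[views.size() - 2] : baseType;
    if (below != to)
      return CastResult::Error;
    views.pop_back();
    return CastResult::Ok;
  }

  std::size_t viewCount() const { return views.size(); }
};

void checkCasts() {
  ToyViewedRegion r("void");  // pointee of 'void *p'
  r.castDown("struct s");     // (struct s*) p
  assert(r.currentType() == "struct s");
  assert(r.castUp("int") == CastResult::Error);  // underneath is void
  assert(r.castUp("void") == CastResult::Ok && r.viewCount() == 0);
  assert(r.castUp("void") == CastResult::Ok);    // nothing left to strip
}
```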
|
||||
|
||||
<p>For toll-free bridging casts, we return the original region.</p>
|
||||
|
||||
<p>We can set up a partial order for pointer types, with the most general type
|
||||
<tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as
|
||||
its root node.</p>
|
||||
|
||||
<p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
the pointee region of <tt>void *p</tt> has its root position at the root node
of the tree. The <tt>VarRegion</tt> of <tt>int x</tt> has its root position at
the 'int type' node.</p>
|
||||
|
||||
<p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
removes a <tt>TypedViewRegion</tt>.</p>
|
||||
|
||||
<p>Do we want to allow moving up beyond the root position? This happens
when:</p>
<pre>
int x;
void *p = &amp;x;
</pre>
|
||||
|
||||
<p>The region of <tt>x</tt> has its root position at the 'int*' node. The cast
to void* moves that region up to the 'void*' node. I propose not to allow such
casts, and to assign the region of <tt>x</tt> to <tt>p</tt>.</p>
|
||||
|
||||
<h3>Region Bindings</h3>
|
||||
|
||||
<p>The following region kinds can be bound to values: VarRegion,
CompoundLiteralRegion, StringRegion, ElementRegion, FieldRegion, and
ObjCIvarRegion.</p>
|
||||
|
||||
<p>When binding regions, we perform canonicalization on element regions and field
|
||||
regions. This is because we can have different views on the same region, some
|
||||
of which are essentially the same view with different sugar type names.</p>
|
||||
|
||||
<p>To canonicalize a region, we get the canonical types for all TypedViewRegions
|
||||
along the way up to the root region, and make new TypedViewRegions with those
|
||||
canonical types.</p>
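<p>The canonicalization step can be sketched as follows. The sketch assumes a
region is described by a root plus the type names of the views layered on it,
and that typedef sugar is resolved through a simple alias table. All names
(<tt>ToyTypedefTable</tt>, <tt>canonicalType</tt>, <tt>ToyViewChain</tt>,
<tt>canonicalize</tt>, <tt>sameRegion</tt>, <tt>checkCanonicalization</tt>) are
hypothetical; the analyzer itself works with the compiler's type
representation rather than strings.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical typedef table: alias name -> name of the aliased type.
// Assumed acyclic, as typedef chains in a real program are.
typedef std::map<std::string, std::string> ToyTypedefTable;

// Follow aliases until a name that is not a typedef is reached.
std::string canonicalType(const ToyTypedefTable &table, std::string name) {
  ToyTypedefTable::const_iterator it;
  while ((it = table.find(name)) != table.end())
    name = it->second;
  return name;
}

// A root region plus the types of the views layered on it, innermost first.
struct ToyViewChain {
  std::string root;
  std::vector<std::string> viewTypes;
};

// Rebuild the chain with every view type replaced by its canonical type.
ToyViewChain canonicalize(const ToyTypedefTable &table,
                          const ToyViewChain &in) {
  ToyViewChain out;
  out.root = in.root;
  for (const std::string &type : in.viewTypes)
    out.viewTypes.push_back(canonicalType(table, type));
  return out;
}

bool sameRegion(const ToyViewChain &a, const ToyViewChain &b) {
  return a.root == b.root && a.viewTypes == b.viewTypes;
}

void checkCanonicalization() {
  ToyTypedefTable table;
  table["my_type"] = "struct s";
  ToyViewChain sugared = {"sym1", {"my_type"}};
  ToyViewChain plain = {"sym1", {"struct s"}};
  assert(!sameRegion(sugared, plain));
  assert(sameRegion(canonicalize(table, sugared), canonicalize(table, plain)));
}
```

<p>Before canonicalization the two chains compare as different regions; after
it they coincide, so a binding made through one view is found through the
other.</p>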
|
||||
|
||||
<p>For Objective-C and C++, perhaps another canonicalization rule should be
|
||||
added: for FieldRegion, the least derived class that has the field is used as
|
||||
the type of the super region of the FieldRegion.</p>
|
||||
|
||||
<p>All binding and retrieval is done on the canonicalized regions.</p>
|
||||
|
||||
<p>Canonicalization is transparent outside the region store manager; more
specifically, code outside the Bind() and Retrieve() methods is unaware of it.
We don't need to consider region canonicalization when performing pointer
casts.</p>
|
||||
|
||||
<h3>Constraint Manager</h3>
|
||||
|
||||
<p>The constraint manager reasons about the abstract location of memory objects.
|
||||
We can have different views on a region, but none of these views changes the
|
||||
location of that object. Thus we should get the same abstract location for those
|
||||
regions.</p>
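<p>One way to picture this requirement is to key every constraint on the
region reached after stripping all views. The sketch below is an illustration
under that assumption; <tt>ToyRegion</tt>, <tt>abstractLocation</tt>,
<tt>ToyConstraints</tt> and <tt>checkConstraints</tt> are hypothetical names,
and "known non-null" stands in for whatever facts a real constraint manager
tracks.</p>

```cpp
#include <cassert>
#include <set>

// Hypothetical model: a view region points at the region it is a view of.
struct ToyRegion {
  const ToyRegion *viewOf;  // null when the region is not a view
};

// Strip every view to reach the region that names the actual location.
const ToyRegion *abstractLocation(const ToyRegion *r) {
  while (r->viewOf)
    r = r->viewOf;
  return r;
}

// Toy constraint set: facts are keyed on the abstract location, so every
// view of one region shares them.
class ToyConstraints {
  std::set<const ToyRegion *> nonNull;

public:
  void assumeNonNull(const ToyRegion *r) {
    nonNull.insert(abstractLocation(r));
  }
  bool isKnownNonNull(const ToyRegion *r) const {
    return nonNull.count(abstractLocation(r)) != 0;
  }
};

void checkConstraints() {
  ToyRegion base = {nullptr};
  ToyRegion asInt = {&base}, asChar = {&base};
  ToyConstraints c;
  c.assumeNonNull(&asInt);
  assert(c.isKnownNonNull(&asChar));  // same location, different view
}
```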
|
||||
|
||||
</body>
|
||||
</html>
|
|
@@ -1,197 +0,0 @@
|
|||
Static Analyzer: 'Regions'
|
||||
--------------------------
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
The path-sensitive analysis engine in libAnalysis employs an extensible API
|
||||
for abstractly modeling the memory of an analyzed program. This API employs
|
||||
the concept of "memory regions" to abstractly model chunks of program memory
|
||||
such as program variables and dynamically allocated memory such as those
|
||||
returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
|
||||
modeling subtyping relationships, field and array offsets into larger chunks
|
||||
of memory, and so on.
|
||||
|
||||
The region API consists of two components. The first is the taxonomy and
|
||||
representation of regions themselves within the analyzer engine. The primary
|
||||
definitions and interfaces are described in
|
||||
'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
|
||||
hierarchy is the class 'MemRegion' with specific subclasses refining the
|
||||
region concept for variables, heap allocated memory, and so forth.
|
||||
|
||||
The second component in the region API is the modeling of the binding of
|
||||
values to regions. For example, modeling the value stored to a local variable
|
||||
'x' consists of recording the binding between the region for 'x' (which
|
||||
represents the raw memory associated with 'x') and the value stored to 'x'.
|
||||
This binding relationship is captured with the notion of "symbolic stores."
|
||||
|
||||
Symbolic stores, which can be thought of as representing the relation 'regions
-> values', are implemented by subclasses of the StoreManager class (Store.h).
A particular StoreManager implementation has complete flexibility concerning
(a) *how* to model the binding between regions and values and (b) *what*
bindings are recorded. Together, these two points allow different StoreManagers
to trade off analysis precision against scalability when reasoning about
program memory. Meanwhile, the core path-sensitive engine makes no assumptions
about (a) or (b), and queries a StoreManager about the bindings to a memory
region through a generic interface that all StoreManagers share. If a
particular StoreManager cannot reason about the potential bindings of a given
memory region (e.g., 'BasicStoreManager' does not reason about fields of
structures), then the StoreManager can simply return 'unknown' (represented by
'UnknownVal') for a particular region binding. This separation of concerns not
only isolates the core analysis engine from the details of reasoning about
program memory but also makes it easy for a client of the path-sensitive engine
to swap in different StoreManager implementations that internally reason about
program memory in very different ways.
|
||||
|
||||
The rest of this document is divided into two parts. We first discuss region
|
||||
taxonomy and the semantics of regions. We then discuss the StoreManager
|
||||
interface, and details of how the currently available StoreManager classes
|
||||
implement region bindings.
|
||||
|
||||
MEMORY REGIONS and REGION TAXONOMY
|
||||
|
||||
POINTERS
|
||||
|
||||
Before discussing memory regions, we first discuss pointers, since memory
regions are essentially used to represent pointer values.
|
||||
|
||||
A pointer is a kind of value. A pointer value has two semantic aspects. One is
its physical value, which is an address or location. The other is the type of
the memory object residing at that address.
|
||||
|
||||
Memory regions are designed to abstract these two properties of the
|
||||
pointer. The physical value of a pointer is represented by MemRegion
|
||||
pointers. The rvalue type of the region corresponds to the type of the pointee
|
||||
object.
|
||||
|
||||
One complication is that we could have different view regions on the same
memory chunk. They represent the same memory location, but have different
abstract locations, i.e., different MemRegion pointers. Thus we need to
canonicalize the abstract locations to get a unique abstract location for one
physical location.
|
||||
|
||||
Furthermore, these different view regions may or may not represent memory
|
||||
objects of different types. Some different types are semantically the same,
|
||||
for example, 'struct s' and 'my_type' are the same type.
|
||||
struct s;
|
||||
typedef struct s my_type;
|
||||
|
||||
But 'char' and 'int' are not the same type in the code below:
|
||||
void *p;
|
||||
int *q = (int*) p;
|
||||
char *r = (char*) p;
|
||||
|
||||
Thus we need to canonicalize the MemRegion which is used in binding and
|
||||
retrieving.
|
||||
|
||||
SYMBOLIC REGIONS
|
||||
|
||||
A symbolic region is a map of the concept of symbolic values into the domain
|
||||
of regions. It is the way that we represent symbolic pointers. Whenever a
|
||||
symbolic pointer value is needed, a symbolic region is created to represent
|
||||
it.
|
||||
|
||||
A symbolic region has no type. It wraps a SymbolData. But sometimes we have
type information associated with a symbolic region. In this case, a
TypedViewRegion is created to layer the type information on top of the
symbolic region. The reason we do not carry type information with the symbolic
region is that some symbolic regions have no type. To be consistent, we don't
let any of them carry type information.
|
||||
|
||||
Like a symbolic pointer, a symbolic region may be NULL, has an unknown extent,
and represents a generic chunk of memory.
|
||||
|
||||
NOTE: We plan to stop using loc::SymbolVal in RegionStore and to remove it
gradually.
|
||||
|
||||
Symbolic regions get their rvalue types in the following ways:
|
||||
* through the parameter or global variable that points to it, e.g.:
|
||||
|
||||
void f(struct s* p) {
|
||||
...
|
||||
}
|
||||
|
||||
The symbolic region pointed to by 'p' has type 'struct s'.
|
||||
|
||||
* through explicit or implicit casts, e.g.:
|
||||
void f(void* p) {
|
||||
struct s* q = (struct s*) p;
|
||||
...
|
||||
}
|
||||
|
||||
We attach the type information to the symbolic region lazily. For the first
|
||||
case above, we create the TypedViewRegion only when the pointer is actually
|
||||
used to access the pointee memory object, that is when the element or field
|
||||
region is created. For the cast case, the TypedViewRegion is created when
|
||||
visiting the CastExpr.
|
||||
|
||||
The reason for doing lazy typing is that symbolic regions are sometimes only
|
||||
used to do location comparison.
|
||||
|
||||
Pointer Casts
|
||||
|
||||
Pointer casts allow people to impose different 'views' onto a chunk of memory.
|
||||
|
||||
Usually we have two kinds of casts. One kind casts down within the type
hierarchy. It imposes more specific views onto more generic memory regions.
The other kind casts up within the type hierarchy. It strips away more
specific views on top of the more generic memory regions.
|
||||
|
||||
We simulate down casts by layering another TypedViewRegion on top of the
original region. We simulate up casts by stripping away the top
TypedViewRegion. Down casts are usually simple. For up casts, if there is no
TypedViewRegion to be stripped, we return the original region. If the
underlying region is of a different type than the cast-to type, we flag an
error state.
|
||||
|
||||
For toll-free bridging casts, we return the original region.
|
||||
|
||||
We can set up a partial order for pointer types, with the most general type
|
||||
'void*' at the top. The partial order forms a tree with 'void*' as its root
|
||||
node.
|
||||
|
||||
Every MemRegion has a root position in the type tree. For example, the pointee
|
||||
region of 'void *p' has its root position at the root node of the tree.
|
||||
VarRegion of 'int x' has its root position at the 'int type' node.
|
||||
|
||||
TypedViewRegion is used to move the region down or up in the tree. Moving
|
||||
down in the tree adds a TypedViewRegion. Moving up in the tree removes a
|
||||
TypedViewRegion.
|
||||
|
||||
Do we want to allow moving up beyond the root position? This happens when:
|
||||
int x;
|
||||
void *p = &x;
|
||||
|
||||
The region of 'x' has its root position at the 'int*' node. The cast to void*
moves that region up to the 'void*' node. I propose not to allow such casts,
and to assign the region of 'x' to 'p'.
|
||||
|
||||
Region Bindings
|
||||
|
||||
The following region kinds can be bound to values: VarRegion,
CompoundLiteralRegion, StringRegion, ElementRegion, FieldRegion, and
ObjCIvarRegion.
|
||||
|
||||
When binding regions, we perform canonicalization on element regions and field
|
||||
regions. This is because we can have different views on the same region, some
|
||||
of which are essentially the same view with different sugar type names.
|
||||
|
||||
To canonicalize a region, we get the canonical types for all TypedViewRegions
|
||||
along the way up to the root region, and make new TypedViewRegions with those
|
||||
canonical types.
|
||||
|
||||
For ObjC and C++, perhaps another canonicalization rule should be added: for
|
||||
FieldRegion, the least derived class that has the field is used as the type
|
||||
of the super region of the FieldRegion.
|
||||
|
||||
All binding and retrieval is done on the canonicalized regions.
|
||||
|
||||
Canonicalization is transparent outside the region store manager; more
specifically, code outside the Bind() and Retrieve() methods is unaware of it.
We don't need to consider region canonicalization when performing pointer
casts.
|
||||
|
||||
Constraint Manager
|
||||
|
||||
The constraint manager reasons about the abstract location of memory
|
||||
objects. We can have different views on a region, but none of these views
|
||||
changes the location of that object. Thus we should get the same abstract
|
||||
location for those regions.
|