Convert region-design document to HTML.

git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@68366 91177308-0d34-0410-b5e6-96231b3b80d8
This commit is contained in:
Ted Kremenek 2009-04-03 01:38:55 +00:00
Родитель 8a2073a856
Коммит d76e0a6c22
2 изменённых файлов: 232 добавлений и 197 удалений

232
docs/AnalyzerRegions.html Normal file
Просмотреть файл

@ -0,0 +1,232 @@
<html>
<head>
<title>Static Analyzer Design Document: Memory Regions</title>
</head>
<body>
<h1>Static Analyzer Design Document: Memory Regions</h1>
<h3>Authors</h3>
<p>Ted Kremenek, <tt>kremenek at apple</tt><br>
Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>
<h2 id="intro">Introduction</h2>
<p>The path-sensitive analysis engine in libAnalysis employs an extensible API
for abstractly modeling the memory of an analyzed program. This API employs the
concept of "memory regions" to abstractly model chunks of program memory such as
program variables and dynamically allocated memory such as those returned from
'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
subtyping relationships, field and array offsets into larger chunks of memory,
and so on.</p>
<p>The region API consists of two components:</p>
<ul> <li>A taxonomy and representation of regions themselves within the analyzer
engine. The primary definitions and interfaces are described in <tt><a
href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
At the root of the region hierarchy is the class <tt>MemRegion</tt> with
specific subclasses refining the region concept for variables, heap allocated
memory, and so forth.</li> <li>The modeling of binding of values to regions. For
example, modeling the value stored to a local variable <tt>x</tt> consists of
recording the binding between the region for <tt>x</tt> (which represents the
raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This
binding relationship is captured with the notion of &quot;symbolic
stores.&quot;</li> </ul>
<p>Symbolic stores, which can be thought of as representing the relation
<tt>regions -> values</tt>, are implemented by subclasses of the
<tt>StoreManager</tt> class (<tt><a
href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
particular StoreManager implementation has complete flexibility concerning the
following:
<ul>
<li><em>How</em> to model the binding between regions and values</li>
<li><em>What</em> bindings are recorded
</ul>
<p>Together, both points allow different StoreManagers to tradeoff between
different levels of analysis precision and scalability concerning the reasoning
of program memory. Meanwhile, the core path-sensitive engine makes no
assumptions about either points, and queries a StoreManager about the bindings
to a memory region through a generic interface that all StoreManagers share. If
a particular StoreManager cannot reason about the potential bindings of a given
memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields
of structures) then the StoreManager can simply return 'unknown' (represented by
'<tt>UnknownVal</tt>') for a particular region-binding. This separation of
concerns not only isolates the core analysis engine from the details of
reasoning about program memory but also facilities the option of a client of the
path-sensitive engine to easily swap in different StoreManager implementations
that internally reason about program memory in very different ways.</pp>
<p>The rest of this document is divided into two parts. We first discuss region
taxonomy and the semantics of regions. We then discuss the StoreManager
interface, and details of how the currently available StoreManager classes
implement region bindings.</p>
<h2 id="regions">Memory Regions and Region Taxonomy</h2>
<h3>Pointers</h3>
<p>Before talking about the memory regions, we would talk about the pointers
since memory regions are essentially used to represent pointer values.</p>
<p>The pointer is a type of values. Pointer values have two semantic aspects.
One is its physical value, which is an address or location. The other is the
type of the memory object residing in the address.</p>
<p>Memory regions are designed to abstract these two properties of the pointer.
The physical value of a pointer is represented by MemRegion pointers. The rvalue
type of the region corresponds to the type of the pointee object.</p>
<p>One complication is that we could have different view regions on the same
memory chunk. They represent the same memory location, but have different
abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the
abstract locations to get a unique abstract location for one physical
location.</p>
<p>Furthermore, these different view regions may or may not represent memory
objects of different types. Some different types are semantically the same,
for example, 'struct s' and 'my_type' are the same type.</p>
<pre>
struct s;
typedef struct s my_type;
</pre>
<p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
<pre>
void *p;
int *q = (int*) p;
char *r = (char*) p;
</pre
<p>Thus we need to canonicalize the MemRegion which is used in binding and
retrieving.</p>
<h3>Symbolic Regions</h3>
<p>A symbolic region is a map of the concept of symbolic values into the domain
of regions. It is the way that we represent symbolic pointers. Whenever a
symbolic pointer value is needed, a symbolic region is created to represent
it.</p>
<p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
type information associated with a symbolic region. For this case, a
TypedViewRegion is created to layer the type information on top of the symbolic
region. The reason we do not carry type information with the symbolic region is
that the symbolic regions can have no type. To be consistent, we don't let them
to carry type information.</p>
<p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
and represents a generic chunk of memory.</p>
<p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it
gradually.</em></p>
<p>Symbolic regions get their rvalue types through the following ways:</p>
<ul>
<li>Through the parameter or global variable that points to it, e.g.:
<pre>
void f(struct s* p) {
...
}
</pre>
<p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
s</tt>.</p></li>
<li>Through explicit or implicit casts, e.g.:
<pre>
void f(void* p) {
struct s* q = (struct s*) p;
...
}
</pre>
</li>
</ul>
<p>We attach the type information to the symbolic region lazily. For the first
case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
actually used to access the pointee memory object, that is when the element or
field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
created when visiting the <tt>CastExpr</tt>.</p>
<p>The reason for doing lazy typing is that symbolic regions are sometimes only
used to do location comparison.</p>
<h3>Pointer Casts</h3>
<p>Pointer casts allow people to impose different 'views' onto a chunk of
memory.</p>
<p>Usually we have two kinds of casts. One kind of casts cast down with in the
type hierarchy. It imposes more specific views onto more generic memory regions.
The other kind of casts cast up with in the type hierarchy. It strips away more
specific views on top of the more generic memory regions.</p>
<p>We simulate the down casts by layering another <tt>TypedViewRegion</tt> on
top of the original region. We simulate the up casts by striping away the top
<tt>TypedViewRegion</tt>. Down casts is usually simple. For up casts, if the
there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
region. If the underlying region is of the different type than the cast-to type,
we flag an error state.</p>
<p>For toll-free bridging casts, we return the original region.</p>
<p>We can set up a partial order for pointer types, with the most general type
<tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as
its root node.</p>
<p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
the pointee region of <tt>void *p</tt> has its root position at the root node of
the tree. <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the 'int
type' node.</p>
<p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
removes a <Tt>TypedViewRegion</tt>.</p>
<p>Do we want to allow moving up beyond the root position? This happens
when:</p> <pre> int x; void *p = &amp;x; </pre>
<p>The region of <tt>x</tt> has its root position at 'int*' node. the cast to
void* moves that region up to the 'void*' node. I propose to not allow such
casts, and assign the region of <tt>x</tt> for <tt>p</tt>.<p>
<h3>Region Bindings</h3>
<p>The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p>
<p>When binding regions, we perform canonicalization on element regions and field
regions. This is because we can have different views on the same region, some
of which are essentially the same view with different sugar type names.</p>
<p>To canonicalize a region, we get the canonical types for all TypedViewRegions
along the way up to the root region, and make new TypedViewRegions with those
canonical types.</p>
<p>For Objective-C and C++, perhaps another canonicalization rule should be
added: for FieldRegion, the least derived class that has the field is used as
the type of the super region of the FieldRegion.</p>
<p>All bindings and retrievings are done on the canonicalized regions.</p>
<p>Canonicalization is transparent outside the region store manager, and more
specifically, unaware outside the Bind() and Retrieve() method. We don't need to
consider region canonicalization when doing pointer cast.</p>
<h3>Constraint Manager</h3>
<p>The constraint manager reasons about the abstract location of memory objects.
We can have different views on a region, but none of these views changes the
location of that object. Thus we should get the same abstract location for those
regions.</p>
</body>
</html>

Просмотреть файл

@ -1,197 +0,0 @@
Static Analyzer: 'Regions'
--------------------------
INTRODUCTION
The path-sensitive analysis engine in libAnalysis employs an extensible API
for abstractly modeling the memory of an analyzed program. This API employs
the concept of "memory regions" to abstractly model chunks of program memory
such as program variables and dynamically allocated memory such as those
returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
modeling subtyping relationships, field and array offsets into larger chunks
of memory, and so on.
The region API consists of two components. The first is the taxonomy and
representation of regions themselves within the analyzer engine. The primary
definitions and interfaces are described in
'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
hierarchy is the class 'MemRegion' with specific subclasses refining the
region concept for variables, heap allocated memory, and so forth.
The second component in the region API is the modeling of the binding of
values to regions. For example, modeling the value stored to a local variable
'x' consists of recording the binding between the region for 'x' (which
represents the raw memory associated with 'x') and the value stored to 'x'.
This binding relationship is captured with the notion of "symbolic stores."
Symbolic stores, which can be thought of as representing the relation 'regions
-> values', are implemented by subclasses of the StoreManager class (Store.h).
A particular StoreManager implementation has complete flexibility concerning
(a) *how* to model the binding between regions and values and (b) *what*
bindings are recorded. Together, both points allow different StoreManagers to
tradeoff between different levels of analysis precision and scalability
concerning the reasoning of program memory. Meanwhile, the core path-sensitive
engine makes no assumptions about (a) or (b), and queries a StoreManager about
the bindings to a memory region through a generic interface that all
StoreManagers share. If a particular StoreManager cannot reason about the
potential bindings of a given memory region (e.g., 'BasicStoreManager' does
not reason about fields of structures) then the StoreManager can simply return
'unknown' (represented by 'UnknownVal') for a particular region-binding. This
separation of concerns not only isolates the core analysis engine from the
details of reasoning about program memory but also facilities the option of a
client of the path-sensitive engine to easily swap in different StoreManager
implementations that internally reason about program memory in very different
ways.
The rest of this document is divided into two parts. We first discuss region
taxonomy and the semantics of regions. We then discuss the StoreManager
interface, and details of how the currently available StoreManager classes
implement region bindings.
MEMORY REGIONS and REGION TAXONOMY
POINTERS
Before talking about the memory regions, we would talk about the pointers
since memory regions are essentially used to represent pointer values.
The pointer is a type of values. Pointer values have two semantic aspects. One
is its physical value, which is an address or location. The other is the type
of the memory object residing in the address.
Memory regions are designed to abstract these two properties of the
pointer. The physical value of a pointer is represented by MemRegion
pointers. The rvalue type of the region corresponds to the type of the pointee
object.
One complication is that we could have different view regions on the same
memory chunk. They represent the same memory location, but have different
abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
the abstract locations to get a unique abstract location for one physical
location.
Furthermore, these different view regions may or may not represent memory
objects of different types. Some different types are semantically the same,
for example, 'struct s' and 'my_type' are the same type.
struct s;
typedef struct s my_type;
But 'char' and 'int' are not the same type in the code below:
void *p;
int *q = (int*) p;
char *r = (char*) p;
Thus we need to canonicalize the MemRegion which is used in binding and
retrieving.
SYMBOLIC REGIONS
A symbolic region is a map of the concept of symbolic values into the domain
of regions. It is the way that we represent symbolic pointers. Whenever a
symbolic pointer value is needed, a symbolic region is created to represent
it.
A symbolic region has no type. It wraps a SymbolData. But sometimes we have
type information associated with a symbolic region. For this case, a
TypedViewRegion is created to layer the type information on top of the
symbolic region. The reason we do not carry type information with the symbolic
region is that the symbolic regions can have no type. To be consistent, we
don't let them to carry type information.
Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
and represents a generic chunk of memory.
NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
gradually.
Symbolic regions get their rvalue types through the following ways:
* through the parameter or global variable that points to it, e.g.:
void f(struct s* p) {
...
}
The symbolic region pointed to by 'p' has type 'struct s'.
* through explicit or implicit casts, e.g.:
void f(void* p) {
struct s* q = (struct s*) p;
...
}
We attach the type information to the symbolic region lazily. For the first
case above, we create the TypedViewRegion only when the pointer is actually
used to access the pointee memory object, that is when the element or field
region is created. For the cast case, the TypedViewRegion is created when
visiting the CastExpr.
The reason for doing lazy typing is that symbolic regions are sometimes only
used to do location comparison.
Pointer Casts
Pointer casts allow people to impose different 'views' onto a chunk of memory.
Usually we have two kinds of casts. One kind of casts cast down with in the
type hierarchy. It imposes more specific views onto more generic memory
regions. The other kind of casts cast up with in the type hierarchy. It strips
away more specific views on top of the more generic memory regions.
We simulate the down casts by layering another TypedViewRegion on top of the
original region. We simulate the up casts by striping away the top
TypedViewRegion. Down casts is usually simple. For up casts, if the there is
no TypedViewRegion to be stripped, we return the original region. If the
underlying region is of the different type than the cast-to type, we flag an
error state.
For toll-free bridging casts, we return the original region.
We can set up a partial order for pointer types, with the most general type
'void*' at the top. The partial order forms a tree with 'void*' as its root
node.
Every MemRegion has a root position in the type tree. For example, the pointee
region of 'void *p' has its root position at the root node of the tree.
VarRegion of 'int x' has its root position at the 'int type' node.
TypedViewRegion is used to move the region down or up in the tree. Moving
down in the tree adds a TypedViewRegion. Moving up in the tree removes a
TypedViewRegion.
Do we want to allow moving up beyond the root position? This happens when:
int x;
void *p = &x;
The region of 'x' has its root position at 'int*' node. the cast to void*
moves that region up to the 'void*' node. I propose to not allow such casts,
and assign the region of 'x' for 'p'.
Region Bindings
The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.
When binding regions, we perform canonicalization on element regions and field
regions. This is because we can have different views on the same region, some
of which are essentially the same view with different sugar type names.
To canonicalize a region, we get the canonical types for all TypedViewRegions
along the way up to the root region, and make new TypedViewRegions with those
canonical types.
For ObjC and C++, perhaps another canonicalization rule should be added: for
FieldRegion, the least derived class that has the field is used as the type
of the super region of the FieldRegion.
All bindings and retrievings are done on the canonicalized regions.
Canonicalization is transparent outside the region store manager, and more
specifically, unaware outside the Bind() and Retrieve() method. We don't need
to consider region canonicalization when doing pointer cast.
Constraint Manager
The constraint manager reasons about the abstract location of memory
objects. We can have different views on a region, but none of these views
changes the location of that object. Thus we should get the same abstract
location for those regions.