update overview and readme
Parent
839f2d9e2a
Commit
6f0dd9b851
114
Overview.md
@@ -1,67 +1,33 @@
# Overview

At a high level, the parser accepts source code as an input, and
produces a syntax tree as an output.
The syntax tree produced by the parser guarantees two key attributes:
1. **All source information is held in full fidelity.** This means that the tree contains every piece of
information found in the source text: every grammatical construct, every lexical token, and everything
else in between, including whitespace and comments. The syntax tree also represents errors in source code
when the program is incomplete or malformed, by recording skipped or missing tokens in the tree.
2. **A syntax tree obtained from the parser is completely round-trippable back to the text it was parsed from.**
From any syntax node, it is possible to get the text representation of the subtree rooted at that node.
This means that syntax trees can be used as a way to construct and edit source text.
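
The round-trip property can be illustrated with a minimal sketch. The classes `SketchToken` and `SketchNode` and the method `fullText()` are hypothetical names invented for this example, not the parser's actual API; the sketch assumes that every token stores the whitespace and comments that precede it.

```php
<?php
// Hypothetical, simplified model of a full-fidelity tree (not the parser's real classes).
final class SketchToken
{
    public string $leadingTrivia; // whitespace and comments before the token
    public string $text;          // the token text; '' for the end-of-file token

    public function __construct(string $leadingTrivia, string $text)
    {
        $this->leadingTrivia = $leadingTrivia;
        $this->text = $text;
    }

    public function fullText(): string
    {
        return $this->leadingTrivia . $this->text;
    }
}

final class SketchNode
{
    /** @var array<int, SketchNode|SketchToken> children in source order */
    public array $children;

    public function __construct(array $children)
    {
        $this->children = $children;
    }

    // The text of a subtree is the concatenation of its children's text.
    public function fullText(): string
    {
        $text = '';
        foreach ($this->children as $child) {
            $text .= $child->fullText();
        }
        return $text;
    }
}

$source = "echo  1; // done\n";

// Hand-built tree for $source; a real parser would produce this for you.
$tree = new SketchNode([
    new SketchNode([                                  // echo statement
        new SketchToken('', 'echo'),
        new SketchNode([new SketchToken('  ', '1')]), // expression
        new SketchToken('', ';'),
    ]),
    new SketchToken(" // done\n", ''),                // end-of-file token holds the trailing trivia
]);

$roundTripped = $tree->fullText();
$statementText = $tree->children[0]->fullText();
```

Because nothing is discarded, `$roundTripped` is byte-for-byte equal to `$source`, and any subtree (here the echo statement) can be rendered on its own.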

If you're familiar with Roslyn and TypeScript, many of the concepts presented here will be familiar
(albeit adapted to account for the unique runtime characteristics of PHP).

## Syntax Tree
A syntax tree is literally a tree data structure, where non-terminal structural
elements parent other elements. Each syntax tree is made up of Nodes (represented by circles),
Tokens (represented by squares), and trivia (not represented below, but attached to each Token).

![image](https://cloud.githubusercontent.com/assets/762848/19092929/e10e60aa-8a3d-11e6-8b90-51eabe5d1d8e.png)

Syntax trees have two key attributes.

1. The first attribute is that Syntax trees hold all the source information in full fidelity.
This means that the syntax tree contains every piece of information
found in the source text, every grammatical construct, every lexical
token, and everything else in between including whitespace, comments,
and preprocessor directives. For example, each literal mentioned in
the source is represented exactly as it was typed. The syntax trees
also represent errors in source code when the program is incomplete
or malformed, by representing skipped or missing tokens in the syntax tree.

2. This enables the second attribute of syntax trees. A syntax tree obtained
from the parser is completely round-trippable back to the text it was parsed
from. From any syntax node, it is possible to get the text representation of
the sub-tree rooted at that node. This means that syntax trees can be used
as a way to construct and edit source text. By creating a tree you have by
implication created the equivalent text, and by editing a syntax tree,
making a new tree out of changes to an existing tree, you have effectively
edited the text.

The syntax tree is composed of Nodes (represented by circles),
Tokens (represented by squares), and Trivia (not represented directly, but attached to
individual Tokens).
## Key Concepts
The **Syntax Tree** produced is literally a tree data structure, where non-terminal structural elements parent other
elements. Each syntax tree is made up of **Nodes** (non-terminal elements) and
**Tokens** (terminal elements).

Associated with each Node and Token are **Positional Information**, **Errors**, and **Comment + Whitespace Trivia**.

All trees guarantee a set of **Invariants** - properties of the tree that always hold true, no matter what the
input. This set of invariants provides a consistent foundation
that makes it easier to ensure the tree is "structurally sound", and confidently reason about the tree
as we continue to build up our understanding. For instance, one such invariant is that the original text
(including whitespace and comments) should always be reproducible from a Node. See [Invariants](Invariants.md)
for a complete list.

## Tree Elements
### Nodes
Syntax nodes are one of the primary elements of syntax trees. These nodes represent
syntactic constructs such as declarations, statements, clauses, and expressions.
Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
The set of node classes is not extensible.

All syntax nodes are non-terminal nodes in the syntax tree, which means they always have
other nodes and tokens as children. As a child of another node, each node has a parent node
that can be accessed through the Parent property. Because nodes and trees are immutable,
the parent of a node never changes. The root of the tree has a null parent.

Each node has a ChildNodes method, which returns a list of child nodes in sequential order
based on their position in the source text. This list does not contain tokens. Each node also
has a collection of Descendant methods - such as DescendantNodes, DescendantTokens, or
DescendantTrivia - that return a list of all the nodes, tokens, or trivia that exist in
the sub-tree rooted at that node.
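
The descendant traversal described above can be sketched as a pre-order walk. The classes `TreeNode` and `TreeToken`, their constructors, and the kind names are assumptions made for this illustration; only the idea of node-only and token-only traversals comes from the text.

```php
<?php
// Hypothetical tree model used only to illustrate descendant traversal.
final class TreeToken
{
    public string $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }
}

final class TreeNode
{
    public string $kind;
    /** @var array<int, TreeNode|TreeToken> children in source order */
    public array $children;

    public function __construct(string $kind, array $children)
    {
        $this->kind = $kind;
        $this->children = $children;
    }

    // Yields every node below this one, in source (pre-order) order. Tokens are skipped.
    public function getDescendantNodes(): Generator
    {
        foreach ($this->children as $child) {
            if ($child instanceof TreeNode) {
                yield $child;
                yield from $child->getDescendantNodes();
            }
        }
    }

    // Yields every token below this one, in source order.
    public function getDescendantTokens(): Generator
    {
        foreach ($this->children as $child) {
            if ($child instanceof TreeNode) {
                yield from $child->getDescendantTokens();
            } else {
                yield $child;
            }
        }
    }
}

// Hand-built tree for the statement: $a + 1;
$root = new TreeNode('ExpressionStatement', [
    new TreeNode('BinaryExpression', [
        new TreeNode('Variable', [new TreeToken('$a')]),
        new TreeToken('+'),
        new TreeNode('NumericLiteral', [new TreeToken('1')]),
    ]),
    new TreeToken(';'),
]);

$kinds = [];
foreach ($root->getDescendantNodes() as $node) {
    $kinds[] = $node->kind;
}

$texts = [];
foreach ($root->getDescendantTokens() as $token) {
    $texts[] = $token->text;
}
```

Generators keep the walk lazy, so a caller that stops early never visits the rest of the tree. Collect results with `foreach` (or `iterator_to_array($generator, false)`), because `yield from` reuses integer keys.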

In addition, each syntax node subclass exposes all the same children through
properties. For example, a BinaryExpressionSyntax node class has three additional properties
specific to binary operators: Left, OperatorToken, and Right.

Some syntax nodes have optional children. For example, an IfStatementSyntax has an optional
ElseClauseSyntax. If the child is not present, the property returns null.
Each category of syntax nodes is represented by a separate class derived from `Node`.

### Tokens
Syntax tokens are the terminals of the language grammar, representing the smallest syntactic
@@ -72,24 +38,23 @@ For efficiency purposes, unlike syntax nodes, there is only one structure for al
kinds of tokens with a mix of properties that have meaning depending on the kind
of token that is being represented.

### Trivia
Syntax trivia represent the parts of the source text that are largely insignificant for
normal understanding of the code, such as whitespace, comments, and preprocessor directives.
Because trivia are not part of the normal language syntax and can appear anywhere between
### Whitespace and Comment Trivia
Because whitespace and comment trivia are not part of the normal language syntax and can appear anywhere between
any two tokens, they are not included in the syntax tree as a child of a node. Yet, because
they are important when implementing a feature like refactoring and to maintain full
fidelity with the source text, they do exist as part of the syntax tree.

You can access trivia by inspecting a token's LeadingTrivia.
When source text is parsed, sequences of trivia are associated with tokens.
You can access trivia by inspecting a token's LeadingWhitespaceAndComments. When source text is parsed,
sequences of trivia are associated with tokens.
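
As a rough sketch of how trivia can be associated with tokens, the toy lexer below attaches each run of whitespace and comments to the token that follows it. The function `lexWithLeadingTrivia`, its regular expression, and the array shape are assumptions for illustration; the real lexer recognizes far more token kinds.

```php
<?php
/**
 * Toy lexer (hypothetical): splits $source into tokens, each carrying the
 * whitespace and comments that precede it. The final entry is an empty
 * end-of-file token that holds any trailing trivia.
 *
 * @return array<int, array{leadingWhitespaceAndComments: string, text: string}>
 */
function lexWithLeadingTrivia(string $source): array
{
    // Group 1: any run of whitespace, // comments, and /* */ comments.
    // Group 2: an identifier/number (optionally $-prefixed), one punctuation
    //          character, or the end of input.
    $pattern = '~\G((?:\s+|//[^\n]*|/\*.*?\*/)*)(\$?\w+|[^\s\w]|\z)~s';

    $tokens = [];
    $offset = 0;
    while (preg_match($pattern, $source, $m, 0, $offset)) {
        $tokens[] = [
            'leadingWhitespaceAndComments' => $m[1],
            'text' => $m[2],
        ];
        $offset += strlen($m[0]);
        if ($m[2] === '') {
            break; // reached the end-of-file token
        }
    }
    return $tokens;
}

$source = '/* sum */ $a = 1; // set' . "\n";
$tokens = lexWithLeadingTrivia($source);

// Concatenating trivia and text of every token reproduces the input.
$rebuilt = '';
foreach ($tokens as $token) {
    $rebuilt .= $token['leadingWhitespaceAndComments'] . $token['text'];
}
```

Attaching trivia only to the following token means each character of the source belongs to exactly one token, which is what keeps the tree round-trippable without making trivia children of nodes.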

### Kinds
Each node, token, or trivia has a RawKind property (represented by a numeric literal),
that identifies the exact syntax element represented.
### Positional Information
Each node, token, or trivia knows its position within the source text and the number of
characters it consists of. A text position is represented as a 32-bit integer, which is
a zero-based byte index into the string. The width is a count of characters,
also represented as an integer. A width of zero refers to a location between two characters.

The RawKind property allows for easy disambiguation of syntax node types that share the
same node class. For tokens and trivia, this property is the only way to distinguish
one type of element from another.
For efficiency purposes, the position refers to the absolute position within the text,
and a helper function is available if you require Line/Column information.
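
Converting an absolute position to a line and column can be sketched as follows. The function name `lineColumnFromPosition` and its return shape are assumptions for this example, not the helper the parser ships; it treats the position as a zero-based byte index and returns zero-based line and column values.

```php
<?php
/**
 * Hypothetical helper: converts a zero-based byte position into a zero-based
 * line and column by counting the newlines that precede it.
 *
 * @return array{line: int, column: int}
 */
function lineColumnFromPosition(string $text, int $position): array
{
    if ($position < 0 || $position > strlen($text)) {
        throw new OutOfRangeException("Position $position is outside the text.");
    }

    $before = (string) substr($text, 0, $position);
    $line = substr_count($before, "\n");

    // The column is the distance from the last newline before the position.
    $lastNewline = strrpos($before, "\n");
    $column = $lastNewline === false ? $position : $position - $lastNewline - 1;

    return ['line' => $line, 'column' => $column];
}

$text = "a = 1;\nbb = 2;\n";
$start = lineColumnFromPosition($text, 0);
$secondLine = lineColumnFromPosition($text, 7);  // the first "b"
$equalsSign = lineColumnFromPosition($text, 10); // the "=" on the second line
```

Storing only absolute positions keeps each element small; line and column are derived on demand, at the cost of a scan over the preceding text.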

### Errors
Even when the source text contains syntax errors, a full syntax tree that is round-trippable
@@ -101,23 +66,12 @@ insert a missing token into the syntax tree in the location that the token was e
A missing token represents the actual token that was expected, but it has an empty span.

Second, the parser may skip tokens until it finds one where it can continue parsing.
In this case, the skipped tokens are attached as a trivia node with
the kind SkippedTokens.
In this case, the skipped tokens are attached as a skipped token in the tree.

Note that the parser produces trees in a tolerant fashion, and will not produce errors for
all incorrect constructs (e.g. including a non-constant expression as the default value of
a method parameter). Instead, it attaches these errors on a post-parse walk of the tree.
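
A post-parse walk that reports missing and skipped tokens might look like the sketch below. The classes `ErrToken` and `ErrNode`, the function `collectDiagnostics`, and the message wording are hypothetical; the sketch only assumes, as described above, that a missing token has an empty span and that skipped tokens remain in the tree.

```php
<?php
// Hypothetical model: a token is normal, missing (expected but absent), or skipped.
final class ErrToken
{
    public const NORMAL = 'normal';
    public const MISSING = 'missing';
    public const SKIPPED = 'skipped';

    public string $state;
    public string $kind;
    public int $start;
    public int $length; // 0 for a missing token: it occupies no source text

    public function __construct(string $state, string $kind, int $start, int $length)
    {
        $this->state = $state;
        $this->kind = $kind;
        $this->start = $start;
        $this->length = $length;
    }
}

final class ErrNode
{
    /** @var array<int, ErrNode|ErrToken> children in source order */
    public array $children;

    public function __construct(array $children)
    {
        $this->children = $children;
    }
}

/**
 * Walks the tree after parsing and returns one message per missing or skipped token.
 *
 * @param ErrNode|ErrToken $element
 * @return string[]
 */
function collectDiagnostics($element): array
{
    if ($element instanceof ErrToken) {
        if ($element->state === ErrToken::MISSING) {
            return ["'{$element->kind}' expected at position {$element->start}"];
        }
        if ($element->state === ErrToken::SKIPPED) {
            return ["Unexpected '{$element->kind}' at position {$element->start}"];
        }
        return [];
    }

    $diagnostics = [];
    foreach ($element->children as $child) {
        $diagnostics = array_merge($diagnostics, collectDiagnostics($child));
    }
    return $diagnostics;
}

// Hand-built tree for the malformed source "echo 1\n]".
$missingSemicolon = new ErrToken(ErrToken::MISSING, ';', 6, 0);
$tree = new ErrNode([
    new ErrNode([
        new ErrToken(ErrToken::NORMAL, 'echo', 0, 4),
        new ErrToken(ErrToken::NORMAL, '1', 5, 1),
        $missingSemicolon,
    ]),
    new ErrToken(ErrToken::SKIPPED, ']', 7, 1),
]);

$diagnostics = collectDiagnostics($tree);
```

Keeping error reporting in a separate walk lets the parser stay tolerant: it always returns a complete tree, and consumers decide when, and whether, to compute diagnostics.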

### Positional Information
Each node, token, or trivia knows its position within the source text and the number of
characters it consists of. A text position is represented as a 32-bit integer, which is
a zero-based Unicode character index. A TextSpan object is the beginning position and a
count of characters, both represented as integers. If TextSpan has a zero length, it refers
to a location between two characters.

The position refers to the absolute position within the text, but a helper function is available
if you require Line/Column information.

## Next Steps
Check out the [Documentation](GettingStarted.md) section for more information on how to consume
Check out the [Readme](Readme.md) for more information on how to consume
the parser, or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.

@@ -54,7 +54,9 @@ foreach ($astNode->getDescendantNodes() as $descendant) {
}
```

> Note: The API is still a work in progress, and will evolve according to user feedback.
> Note: [the API](ApiDocumentation.md) is not yet finalized, so please file issues to let us know what functionality you want exposed,
and we'll see what we can do! Also please file any bugs with unexpected behavior in the parse tree. We're still
in our early stages, and any feedback you have is much appreciated :smiley:.

## Design Goals
* Error tolerant design - in IDE scenarios, code is, by definition, incomplete. In the case that invalid code is entered, the
@@ -111,8 +113,6 @@ own machine to see for yourself.
## Learn more
**:dart: [Design Goals](#design-goals)** - learn about the design goals of the project (features, performance metrics, and more).

**:sunrise_over_mountains: [Syntax Overview](Overview.md)** - learn about the composition and key properties of the syntax tree.

**:seedling: [Documentation](GettingStarted.md#getting-started)** - learn how to reference the parser from your project, and how to perform
operations on the AST to answer questions about your code.