This commit is contained in:
Sara Itani 2017-01-16 22:52:14 -08:00
Parent 839f2d9e2a
Commit 6f0dd9b851
2 changed files with 37 additions and 83 deletions

View file

@ -1,67 +1,33 @@
# Overview
At a high level, the parser accepts source code as an input, and
produces a syntax tree as an output.
The syntax tree produced by the parser ensures two key attributes:
1. **All source information is held in full fidelity.** This means that the tree contains every piece of
information found in the source text, every grammatical construct, every lexical token, and everything
else in between including whitespace and comments. The syntax trees also represent errors in source code
when the program is incomplete or malformed, by representing skipped or missing tokens in the syntax tree.
2. **A syntax tree obtained from the parser is completely round-trippable back to the text it was parsed from.**
From any syntax node, it is possible to get the text representation of the subtree rooted at that node.
This means that syntax trees can be used as a way to construct and edit source text.
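
The round-trip property can be illustrated with a minimal sketch. The `SketchToken` and `SketchNode` classes below are hypothetical stand-ins written for this example (they are not the parser's actual classes), and the sketch assumes PHP 8.0 or later and that each token stores its leading whitespace and comments alongside its own text.

```php
<?php
// Hypothetical, simplified types for illustration; not the parser's actual classes.
final class SketchToken
{
    public function __construct(
        public string $leadingTrivia, // whitespace and comments preceding the token
        public string $text           // the token's own characters
    ) {}

    public function getFullText(): string
    {
        return $this->leadingTrivia . $this->text;
    }
}

final class SketchNode
{
    /** @param array<SketchNode|SketchToken> $children children in source order */
    public function __construct(public array $children) {}

    // The text of a subtree is the concatenation of its children's text.
    public function getFullText(): string
    {
        $text = '';
        foreach ($this->children as $child) {
            $text .= $child->getFullText();
        }
        return $text;
    }
}

$source = "<?php /* greet */ echo  'hi' ;";

// A hand-built tree for $source; a real parser would produce this from the text.
$tree = new SketchNode([
    new SketchToken('', '<?php'),
    new SketchNode([
        new SketchToken(' /* greet */ ', 'echo'),
        new SketchNode([new SketchToken('  ', "'hi'")]),
        new SketchToken(' ', ';'),
    ]),
]);

var_dump($tree->getFullText() === $source);
```

Because every character of the source belongs either to a token or to that token's leading trivia, concatenating the leaves in order reproduces the text for any subtree, not only for the root.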
If you're familiar with Roslyn and TypeScript, many of the concepts presented here will be familiar
(albeit adapted to account for the unique runtime characteristics of PHP).
## Syntax Tree
A syntax tree is literally a tree data structure, where non-terminal structural
elements parent other elements. Each syntax tree is made up of Nodes (represented by circles),
Tokens (represented by squares), and trivia (not represented, below, but attached to each Token).
![image](https://cloud.githubusercontent.com/assets/762848/19092929/e10e60aa-8a3d-11e6-8b90-51eabe5d1d8e.png)
Syntax trees have two key attributes.
1. The first attribute is that Syntax trees hold all the source information in full fidelity.
This means that the syntax tree contains every piece of information
found in the source text, every grammatical construct, every lexical
token, and everything else in between including whitespace, comments,
and preprocessor directives. For example, each literal mentioned in
the source is represented exactly as it was typed. The syntax trees
also represent errors in source code when the program is incomplete
or malformed, by representing skipped or missing tokens in the syntax tree.
2. This enables the second attribute of syntax trees. A syntax tree obtained
from the parser is completely round-trippable back to the text it was parsed
from. From any syntax node, it is possible to get the text representation of
the sub-tree rooted at that node. This means that syntax trees can be used
as a way to construct and edit source text. By creating a tree you have by
implication created the equivalent text, and by editing a syntax tree,
making a new tree out of changes to an existing tree, you have effectively
edited the text.
The syntax tree is composed of Nodes (represented by circles),
Tokens (represented by squares), and Trivia (not represented directly, but attached to
individual Tokens).
## Key Concepts
The **Syntax Tree** produced is literally a tree data structure, where non-terminal structural elements parent other
elements. Each syntax tree is made up of **Nodes** (non-terminal elements) and
**Tokens** (terminal elements).
Additionally, each Node and Token has associated **Positional Information**, **Errors**, and **Comment + Whitespace Trivia**.
All trees guarantee a set of **Invariants** - properties of the tree that always hold true, no matter what the
input. This set of invariants provides a consistent foundation
that makes it easier to ensure the tree is "structurally sound", and confidently reason about the tree
as we continue to build up our understanding. For instance, one such invariant is that the original text
(including whitespace and comments) should always be reproducible from a Node. See [Invariants](Invariants.md)
for a complete list.
## Tree Elements
### Nodes
Syntax nodes are one of the primary elements of syntax trees. These nodes represent
syntactic constructs such as declarations, statements, clauses, and expressions.
Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
The set of node classes is not extensible.
All syntax nodes are non-terminal nodes in the syntax tree, which means they always have
other nodes and tokens as children. As a child of another node, each node has a parent node
that can be accessed through the Parent property. Because nodes and trees are immutable,
the parent of a node never changes. The root of the tree has a null parent.
Each node has a ChildNodes method, which returns a list of child nodes in sequential order
based on its position in the source text. This list does not contain tokens. Each node also
has a collection of Descendant methods - such as DescendantNodes, DescendantTokens, or
DescendantTrivia - that represent a list of all the nodes, tokens, or trivia that exist in
the sub-tree rooted by that node.
In addition, each syntax node subclass exposes all the same children through
properties. For example, a BinaryExpressionSyntax node class has three additional properties
specific to binary operators: Left, OperatorToken, and Right.
Some syntax nodes have optional children. For example, an IfStatementSyntax has an optional
ElseClauseSyntax. If the child is not present, the property returns null.
Each category of syntax nodes is represented by a separate class derived from `Node`.
### Tokens
Syntax tokens are the terminals of the language grammar, representing the smallest syntactic
@ -72,24 +38,23 @@ For efficiency purposes, unlike syntax nodes, there is only one structure for al
kinds of tokens with a mix of properties that have meaning depending on the kind
of token that is being represented.
### Trivia
Syntax trivia represent the parts of the source text that are largely insignificant for
normal understanding of the code, such as whitespace, comments, and preprocessor directives.
Because trivia are not part of the normal language syntax and can appear anywhere between
### Whitespace and Comment Trivia
Because whitespace and comment trivia are not part of the normal language syntax and can appear anywhere between
any two tokens, they are not included in the syntax tree as a child of a node. Yet, because
they are important when implementing a feature like refactoring and to maintain full
fidelity with the source text, they do exist as part of the syntax tree.
You can access trivia by inspecting a token's LeadingTrivia.
When source text is parsed, sequences of trivia are associated with tokens.
You can access trivia by inspecting a token's LeadingWhitespaceAndComments. When source text is parsed,
sequences of trivia are associated with tokens.
### Kinds
Each node, token, or trivia has a RawKind property (represented by a numeric literal),
that identifies the exact syntax element represented.
### Positional Information
Each node, token, or trivia knows its position within the source text and the number of
characters it consists of. A text position is represented as a 32-bit integer, which is
a zero-based byte index into the string. The width corresponds to a count of characters,
represented as integers. Zero-length refers to a location between two characters.
The RawKind property allows for easy disambiguation of syntax node types that share the
same node class. For tokens and trivia, this property is the only way to distinguish
one type of element from another.
For efficiency purposes, the position refers to the absolute position within the text,
and a helper function is available if you require Line/Column information.
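
As a rough sketch of what such a helper might do, the function below converts an absolute byte position into a zero-based line and column. The name `sketchLineColumnFromPosition` and its return shape are assumptions made for this example; the parser's own helper may differ.

```php
<?php
// Hypothetical helper for illustration; the parser's own helper may differ in name and shape.

/**
 * @return array{line: int, column: int} zero-based line and zero-based byte column
 */
function sketchLineColumnFromPosition(int $position, string $text): array
{
    // Everything before the position determines the line and column.
    $before = substr($text, 0, $position);
    $line = substr_count($before, "\n");
    $lastNewline = strrpos($before, "\n");
    $column = $lastNewline === false ? $position : $position - $lastNewline - 1;

    return ['line' => $line, 'column' => $column];
}

$text = "<?php\necho 1;\n";
$position = strpos($text, 'echo'); // absolute byte offset of the "echo" keyword
$result = sketchLineColumnFromPosition($position, $text);
```

Storing only the absolute offset keeps each element small; the line and column are derived on demand from the text.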
### Errors
Even when the source text contains syntax errors, a full syntax tree that is round-trippable
@ -101,23 +66,12 @@ insert a missing token into the syntax tree in the location that the token was e
A missing token represents the actual token that was expected, but it has an empty span.
Second, the parser may skip tokens until it finds one where it can continue parsing.
In this case, the skipped tokens are attached as a trivia node with
the kind SkippedTokens.
In this case, the skipped tokens are attached to the tree as skipped tokens.
Note that the parser produces trees in a tolerant fashion, and will not produce errors for
all incorrect constructs (e.g. including a non-constant expression as the default value of
a method parameter). Instead, it attaches these errors on a post-parse walk of the tree.
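
The two recovery strategies described above (inserting a missing token and skipping tokens) can be sketched as follows. This is a simplified model under stated assumptions: `SketchTok`, `sketchExpect`, and `sketchSkipUntil` are hypothetical names invented for this example, not the parser's API, and the code assumes PHP 8.0 or later.

```php
<?php
// Hypothetical, simplified model for illustration; not the parser's actual API.
final class SketchTok
{
    public function __construct(
        public string $kind,
        public int $start,
        public int $length,
        public bool $isMissing = false,
        public bool $isSkipped = false
    ) {}
}

/**
 * Consume the next token if it has the expected kind; otherwise return a
 * zero-length "missing" token at the current position without consuming input.
 *
 * @param SketchTok[] $tokens
 */
function sketchExpect(array $tokens, int &$index, string $kind, int $endOfFile): SketchTok
{
    $next = $tokens[$index] ?? null;
    if ($next !== null && $next->kind === $kind) {
        $index++;
        return $next;
    }
    $position = $next !== null ? $next->start : $endOfFile;
    return new SketchTok($kind, $position, 0, isMissing: true);
}

/**
 * Mark tokens as skipped until one of the expected kind (or end of input) is reached.
 *
 * @param SketchTok[] $tokens
 * @return SketchTok[] the skipped tokens, in source order
 */
function sketchSkipUntil(array $tokens, int &$index, string $kind): array
{
    $skipped = [];
    while (isset($tokens[$index]) && $tokens[$index]->kind !== $kind) {
        $tokens[$index]->isSkipped = true;
        $skipped[] = $tokens[$index];
        $index++;
    }
    return $skipped;
}

// Case 1: "echo 1" has no terminating semicolon (end of file at offset 6).
$tokens = [new SketchTok('echo', 0, 4), new SketchTok('number', 5, 1)];
$index = 0;
sketchExpect($tokens, $index, 'echo', 6);
sketchExpect($tokens, $index, 'number', 6);
$missing = sketchExpect($tokens, $index, 'semicolon', 6);

// Case 2: "echo 1 ) ;" has a stray ")" before the semicolon.
$tokens = [
    new SketchTok('echo', 0, 4),
    new SketchTok('number', 5, 1),
    new SketchTok('closeParen', 7, 1),
    new SketchTok('semicolon', 9, 1),
];
$index = 2; // resume after "echo 1"
$skipped = sketchSkipUntil($tokens, $index, 'semicolon');
$semicolon = sketchExpect($tokens, $index, 'semicolon', 10);
```

In both cases no source characters are dropped: the missing token has an empty span, and the skipped tokens remain in the tree, so the round-trip property still holds.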
### Positional Information
Each node, token, or trivia knows its position within the source text and the number of
characters it consists of. A text position is represented as a 32-bit integer, which is
a zero-based Unicode character index. A TextSpan object is the beginning position and a
count of characters, both represented as integers. If TextSpan has a zero length, it refers
to a location between two characters.
The position refers to the absolute position within the text, but a helper function is available
if you require Line/Column information.
## Next Steps
Check out the [Documentation](GettingStarted.md) section for more information on how to consume
Check out the [Readme](Readme.md) for more information on how to consume
the parser, or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.

View file

@ -54,7 +54,9 @@ foreach ($astNode->getDescendantNodes() as $descendant) {
}
```
> Note: The API is still a work in progress, and will evolve according to user feedback.
> Note: [the API](ApiDocumentation.md) is not yet finalized, so please file issues to let us know what functionality you want exposed,
and we'll see what we can do! Also, please file any bugs with unexpected behavior in the parse tree. We're still
in our early stages, and any feedback you have is much appreciated :smiley:.
## Design Goals
* Error tolerant design - in IDE scenarios, code is, by definition, incomplete. In the case that invalid code is entered, the
@ -111,8 +113,6 @@ own machine to see for yourself.
## Learn more
**:dart: [Design Goals](#design-goals)** - learn about the design goals of the project (features, performance metrics, and more).
**:sunrise_over_mountains: [Syntax Overview](Overview.md)** - learn about the composition and key properties of the syntax tree.
**:seedling: [Documentation](GettingStarted.md#getting-started)** - learn how to reference the parser from your project, and how to perform
operations on the AST to answer questions about your code.