write overview page
This commit is contained in:
Parent
e077478046
Commit
20561f1acf
|
@ -91,12 +91,20 @@ foreach ($childNodes as $childNode) {
|
|||
// }
|
||||
```
|
||||
|
||||
> Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed,
|
||||
and we'll see what we can do! Also please file any bugs with unexpected behavior in the parse tree. We're still
|
||||
in our early stages, and any feedback you have is much appreciated :smiley:.
|
||||
|
||||
## Play around with the AST!
|
||||
In order to help you get a sense for the features and shape of the tree,
|
||||
we've also included a `PHP Syntax Visualizer Extension` that makes use of the parser
|
||||
to provide error tooltips.
|
||||
1. Download the VSIX
|
||||
2. Point it to your PHP Path
|
||||
3. Disable other extensions in the workspace to ensure minimal interference
|
||||
we've also included a [Syntax Visualizer Tool](syntax-visualizer/client#php-parser-syntax-visualizer-tool)
|
||||
that makes use of the parser to both visualize the tree and provide error tooltips.
|
||||
![image](https://cloud.githubusercontent.com/assets/762848/21635753/3f8c0cb8-d214-11e6-8424-e200d63abc18.png)
|
||||
|
||||
If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
|
||||
![image](https://cloud.githubusercontent.com/assets/762848/21705272/d5f2f7d8-d373-11e6-9688-46ead75b2fd3.png)
|
||||
|
||||
If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
|
||||
|
||||
## Next Steps
|
||||
Check out the [Syntax Overview](Overview.md) section for more information on key attributes of the parse tree,
|
||||
or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.
|
||||
|
|
120
HowItWorks.md
|
@ -1,4 +1,17 @@
|
|||
# How it Works
|
||||
> Note: Make sure you read the [Overview](Overview.md) section first to get a sense for some of
|
||||
the high-level principles.
|
||||
|
||||
This approach borrows heavily from the designs of Roslyn and TypeScript. However,
|
||||
it needs to be adapted because PHP doesn't offer the
|
||||
same runtime characteristics as .NET and JS.
|
||||
|
||||
The syntax tree is produced via a two-step process:
|
||||
1. The lexer reads in text, and produces the resulting Tokens.
|
||||
2. The parser reads in Tokens, to construct the final syntax tree.
|
||||
|
||||
Under the covers, the lexer is actually driven by the parser to reduce potential memory
|
||||
consumption and make it easier to perform lookaheads when building up the parse tree.
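As a rough illustration of a lexer that is driven by its consumer, the sketch below uses a PHP generator so that each token is produced only when the parser asks for it. The function name `lexOnDemand`, the whitespace-only "grammar", and the array shape of a token are all assumptions made for this example, not the project's actual lexer.

```php
<?php
// Hypothetical sketch: a whitespace-splitting "lexer" written as a generator,
// so tokens are only produced when the consumer asks for them.
function lexOnDemand(string $text): Generator
{
    $offset = 0;
    $end = strlen($text);
    while ($offset < $end) {
        $fullStart = $offset;
        $offset += strspn($text, " \t\n", $offset);   // leading whitespace (trivia)
        $start = $offset;
        $offset += strcspn($text, " \t\n", $offset);  // the token text itself
        yield ['fullStart' => $fullStart, 'start' => $start, 'length' => $offset - $fullStart];
    }
}

// The "parser" side: pull one token, then look ahead only when needed.
$lexer = lexOnDemand("if  x");
$first = $lexer->current();      // lexes exactly one token
$lexer->next();
$lookahead = $lexer->current();  // lexes the second token on demand
```

Because nothing is lexed ahead of time, memory use stays proportional to the tokens the parser actually keeps, and a lookahead is just one more pull from the generator.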
|
||||
|
||||
## Lexer
|
||||
The lexer produces tokens out of PHP source, based on the following lexical grammar:
|
||||
|
@ -11,17 +24,12 @@ flexibility to use our own lightweight token representation (see below) from the
|
|||
a conversion. This initial implementation is available in `src/Lexer.php`, but has been deprecated in favor of
|
||||
`src/PhpTokenizer.php`.
|
||||
|
||||
Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations). Ultimately,
|
||||
Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations) -
|
||||
we found that PHP doesn't provide an efficient way to extract character codes without multiple conversions after the initial
|
||||
file-read.
|
||||
file-read.
|
||||
|
||||
> **"Model" vs "Representation"**
|
||||
> * Model := general information exposed, how we will interact with it
|
||||
> * Representation := underlying data structures
|
||||
|
||||
|
||||
### Tokens (Model)
|
||||
Tokens take the following form:
|
||||
### Tokens
|
||||
Tokens hold onto the following information:
|
||||
```
|
||||
Token: {
|
||||
Kind: Id, // the classification of the token
|
||||
|
@ -31,7 +39,6 @@ Token: {
|
|||
}
|
||||
```
|
||||
|
||||
### Tokens (Representation)
|
||||
#### Helper functions
|
||||
In order to be as efficient as possible, we do not store full content in memory.
|
||||
Instead, each token is uniquely defined by four integers, and we take advantage of helper
|
||||
|
@ -39,8 +46,9 @@ functions to extract further information.
|
|||
* `GetTriviaForToken`
|
||||
* `GetFullTextForToken`
|
||||
* `GetTextForToken`
|
||||
* See code for an up-to-date list
|
||||
|
||||
#### Data structures
|
||||
#### Notes
|
||||
At this point in time, the Representation has not yet diverged from the Model. Tokens
|
||||
are currently represented as a `Token` object, with four properties - `$kind`, `$fullStart`,
|
||||
`$start`, and `$length`. However, objects (and arrays, and ...) are super expensive in PHP
|
||||
|
@ -88,7 +96,7 @@ does not preclude us from presenting a more reasonable API for consumers of the
|
|||
override the property getters / setters on Node.
|
||||
|
||||
|
||||
### Invariants
|
||||
#### Invariants
|
||||
In order to ensure that the parser evolves in a healthy manner over time,
|
||||
we define and continuously test the set of invariants defined below:
|
||||
* The sum of the lengths of all of the tokens is equivalent to the length of the document
|
||||
|
@ -99,11 +107,49 @@ we define and continuously test the set of invariants defined below:
|
|||
* `GetTriviaForToken` returns a string of length equivalent to `(Start - FullStart)`
|
||||
* `GetFullTextForToken` returns a string of length equivalent to `Length`
|
||||
* `GetTextForToken` returns a string of length equivalent to `Length - (Start - FullStart)`
|
||||
* See the code for an up-to-date list...
|
||||
* See `tests/LexicalInvariantsTest.php` for an up-to-date list...
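To make the helper invariants above concrete, here is a minimal sketch of a token defined by four integers, with helpers that slice the document on demand. The class name `TokenSketch`, the lowercase function names, and their signatures are assumptions for illustration; only the four properties and the length relationships come from the text above (note that `Length` is taken to include leading trivia, as the invariants imply).

```php
<?php
// Hypothetical sketch; the real Token class and helper signatures may differ.
final class TokenSketch
{
    public $kind;
    public $fullStart;  // where the leading trivia begins
    public $start;      // where the token text begins
    public $length;     // full length, trivia included

    public function __construct(int $kind, int $fullStart, int $start, int $length)
    {
        $this->kind = $kind;
        $this->fullStart = $fullStart;
        $this->start = $start;
        $this->length = $length;
    }
}

function getTriviaForToken(TokenSketch $t, string $doc): string
{
    return substr($doc, $t->fullStart, $t->start - $t->fullStart);
}

function getFullTextForToken(TokenSketch $t, string $doc): string
{
    return substr($doc, $t->fullStart, $t->length);
}

function getTextForToken(TokenSketch $t, string $doc): string
{
    return substr($doc, $t->start, $t->length - ($t->start - $t->fullStart));
}

// Two tokens covering the whole document; the kind ids are arbitrary.
$doc = "  foo /*c*/ bar";
$tokens = [
    new TokenSketch(1, 0, 2, 5),
    new TokenSketch(1, 5, 12, 10),
];
```

With this shape, the first invariant (token lengths sum to the document length) can be checked with a single `array_sum` over the `length` properties.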
|
||||
|
||||
## Parser
|
||||
### Node (Model)
|
||||
Nodes include the following information:
|
||||
The parser reads in Tokens provided by the lexer to produce the resulting Syntax Tree.
|
||||
The parser uses a combination of top-down and bottom-up parsing. In particular, most constructs
|
||||
are parsed in a top-down fashion, which keeps the code simple and readable/maintainable
|
||||
(by humans :wink:) over time. The one exception to this is expressions, which are parsed
|
||||
bottom-up. We also hold onto our current `ParseContext`, which lets us know, for instance, whether
|
||||
we are parsing `ClassMembers`, or `TraitMembers`, or something else; holding onto this `ParseContext`
|
||||
enables us to provide better error handling (described below).
|
||||
|
||||
For instance, let's take the simple example of an `if-statement`. We know to start parsing the
|
||||
`if-statement` because we'll see an `if` keyword token. We also know from the
|
||||
[PHP grammar](https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-if-statement),
|
||||
that an if-statement can be defined as follows:
|
||||
```
|
||||
if-statement:
|
||||
if ( expression ) statement elseif-clauses-1opt else-clause-1opt
|
||||
```
|
||||
|
||||
The resultant parsing logic will look something like this. Notice that we anticipate the next token or
|
||||
set of tokens based on our current context. This is top-down parsing.
|
||||
```php
|
||||
function parseIfStatement($parent) {
|
||||
$n = new IfStatement();
|
||||
$n->ifKeyword = eat("if");
|
||||
$n->openParen = eat("(");
|
||||
$n->expression = parseExpression();
|
||||
$n->closeParen = eat(")");
|
||||
$n->statement = parseStatement();
|
||||
$n->parent = $parent;
|
||||
return $n;
|
||||
}
|
||||
```
|
||||
|
||||
Expressions (produced by `parseExpression`), on the other hand, are parsed bottom-up. That is, rather than attempting
|
||||
to anticipate the next token, we read one token at a time, and construct a resulting tree based on
|
||||
operator precedence. See `parseBinaryExpression` in `src/Parser.php` for the full details.
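For readers who have not seen precedence-driven parsing before, the sketch below shows one common scheme, precedence climbing, over a pre-split list of token strings. It is not the code in `src/Parser.php`; the function name, the precedence table, and the array-based tree are assumptions, and whether the real parser uses exactly this scheme is not confirmed here.

```php
<?php
// Hypothetical sketch of precedence-driven expression parsing.
const PRECEDENCE = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];

function parseBinaryExpressionSketch(array $tokens, int &$pos, int $minPrecedence = 1)
{
    // In this sketch every operand is a single token.
    $left = $tokens[$pos++];

    while ($pos < count($tokens)) {
        $operator = $tokens[$pos];
        $precedence = PRECEDENCE[$operator] ?? 0;
        if ($precedence < $minPrecedence) {
            break; // the operator belongs to an enclosing expression
        }
        $pos++;
        // Left-associative: the right operand may only absorb tighter operators.
        $right = parseBinaryExpressionSketch($tokens, $pos, $precedence + 1);
        $left = ['left' => $left, 'operator' => $operator, 'right' => $right];
    }

    return $left;
}

$pos = 0;
$tree = parseBinaryExpressionSketch(['1', '+', '2', '*', '3'], $pos);
```

Here `$tree` groups `2 * 3` under the right-hand side of `+`, because `*` has the higher precedence in the table.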
|
||||
|
||||
See the Error-handling section below for more information on how `ParseContext` is used.
|
||||
|
||||
### Nodes
|
||||
Nodes hold onto the following information:
|
||||
```
|
||||
Node: {
|
||||
Kind: Id,
|
||||
|
@ -112,8 +158,10 @@ Node: {
|
|||
}
|
||||
```
|
||||
|
||||
### Node (Representation)
|
||||
> TODO - discerning between Model and Representation
|
||||
#### Notes
|
||||
In order to reduce memory usage, we plan to remove the NodeKind property, and instead rely solely on
|
||||
subclasses in order to represent the Node's kind. This should reduce memory usage by ~16 bytes per
|
||||
Node.
|
||||
|
||||
### Abstract Syntax Tree
|
||||
An example tree is below. The tree Nodes (represented by circles), and Tokens (represented by squares)
|
||||
|
@ -131,17 +179,16 @@ WIDTH(T) -> T.Width
|
|||
```
|
||||
|
||||
|
||||
### Invariants
|
||||
#### Invariants
|
||||
* Invariants for all Tokens hold true
|
||||
* The tree contains every token
|
||||
* The span of any node is the sum of the spans of its child nodes and tokens
|
||||
* The tree length exactly matches the file length
|
||||
* Every leaf node of the tree is a token
|
||||
* Every Node contains at least one Token
|
||||
* See `tests/ParserInvariantsTest.php` for an up-to-date list...
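A couple of the tree invariants above can be sketched as a recursive width check. The array shapes used here (a token has `length`, a node has `children`) and the function names are assumptions for the example; the project's own checks live in the test file referenced above.

```php
<?php
// Hypothetical shapes: a token is ['kind' => ..., 'fullStart' => int, 'length' => int],
// a node is ['kind' => ..., 'children' => [...]].
function isToken(array $element): bool
{
    return !array_key_exists('children', $element);
}

// Full width of an element: a token's own length, or the sum of a node's children.
function fullWidth(array $element): int
{
    if (isToken($element)) {
        return $element['length'];
    }
    if (count($element['children']) === 0) {
        throw new LogicException('Invariant violated: every Node must contain at least one Token');
    }
    return array_sum(array_map('fullWidth', $element['children']));
}

// "a = 1;" as a tiny tree; each token's length includes its leading whitespace.
$source = "a = 1;";
$tree = ['kind' => 'ExpressionStatement', 'children' => [
    ['kind' => 'AssignmentExpression', 'children' => [
        ['kind' => 'Name', 'fullStart' => 0, 'length' => 1],
        ['kind' => 'Equals', 'fullStart' => 1, 'length' => 2],
        ['kind' => 'IntegerLiteral', 'fullStart' => 3, 'length' => 2],
    ]],
    ['kind' => 'Semicolon', 'fullStart' => 5, 'length' => 1],
]];

$treeLengthMatchesFile = fullWidth($tree) === strlen($source);
```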
|
||||
|
||||
### Building up the Tree
|
||||
|
||||
#### Error Tokens
|
||||
### Error Tokens
|
||||
We define two types of `Error` tokens:
|
||||
* **Skipped Tokens:** extra token that no one knows how to deal with
|
||||
* **Missing Tokens:** Grammar expects a token to be there, but it does not exist
|
||||
|
@ -302,18 +349,25 @@ the number of edge cases by limiting the granularity of node-reuse. In the case
|
|||
we believe a reasonable balance is to limit granularity to a list `ParseContext`.
|
||||
|
||||
## Open Questions
|
||||
This approach, however, makes a few assumptions that we should validate upfront, if possible,
|
||||
in order to minimize potential risk:
|
||||
* [ ] **Assumption 1:** This approach will work on a wide range of user development environment configurations.
|
||||
* [ ] **Assumption 2:** PHP can be sufficiently optimized to support aforementioned parser performance goals.
|
||||
* [ ] **Assumption 3:** PHP 7 grammar is a superset of PHP5 grammar.
|
||||
* [ ] **Assumption 4:** The PHP grammar described in `php/php-langspec` is complete.
|
||||
* Anything else?
|
||||
|
||||
Some open Qs:
|
||||
* need some examples of large PHP applications to help benchmark
|
||||
* would PHP 5 provide sufficient perf?
|
||||
* what sort of data structures do we need? Ideally we'd throw everything into a struct. Anything better?
|
||||
Open Questions:
|
||||
* Are there more large PHP applications we should benchmark against? We are currently testing against
|
||||
the frameworks in the `validation` folder, and more suggestions are welcome.
|
||||
* what are the most memory-efficient data-structures we could use? See Node and Token Notes sections above
|
||||
for our current thoughts on this, but we hope we can do better than that, so ideas are very much welcome.
|
||||
* Can PHP be sufficiently optimized to support the aforementioned parser performance goals? Performance
|
||||
is pretty okay at the moment, and there's more we could do to optimize it. But we are certainly
|
||||
running up against major challenges when it comes to memory.
|
||||
* How well will this approach work across a wide range of user development environment configurations?
|
||||
* Anything else?
|
||||
|
||||
Previously open questions:
|
||||
* would PHP 5 provide sufficient performance? No - the memory management and performance in PHP5 is so
|
||||
far behind that of PHP7 that it wouldn't really make sense to support it. Check out Nikic's blog to get an
|
||||
idea of just how stark the difference is: https://nikic.github.io/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html
|
||||
* Is the PHP grammar described in `php/php-langspec` complete? Complete enough - we've submitted some PRs to
|
||||
improve the spec, but overall we haven't run into any major impediments.
|
||||
* Is the PHP 7 grammar a superset of the PHP5 grammar? It's close enough that we can afford to patch
|
||||
the cases where it's not.
|
||||
|
||||
## Real world validation strategy
|
||||
* benchmark against other parsers (investigate any instance of disagreement)
|
||||
|
|
|
@ -0,0 +1,122 @@
|
|||
# Overview
|
||||
|
||||
At a high level, the parser accepts source code as an input, and
|
||||
produces a syntax tree as an output.
|
||||
|
||||
If you're familiar with Roslyn and TypeScript, many of the concepts presented here will be familiar
|
||||
(albeit adapted to account for the unique runtime characteristics of PHP).
|
||||
|
||||
## Syntax Tree
|
||||
A syntax tree is literally a tree data structure, where non-terminal structural
|
||||
elements parent other elements. Each syntax tree is made up of Nodes (represented by circles),
|
||||
Tokens (represented by squares), and Trivia (not shown below, but attached to each Token).
|
||||
|
||||
![image](https://cloud.githubusercontent.com/assets/762848/19092929/e10e60aa-8a3d-11e6-8b90-51eabe5d1d8e.png)
|
||||
|
||||
Syntax trees have two key attributes.
|
||||
1. The first attribute is that Syntax trees hold all the source information in full fidelity.
|
||||
This means that the syntax tree contains every piece of information
|
||||
found in the source text, every grammatical construct, every lexical
|
||||
token, and everything else in between including whitespace, comments,
|
||||
and preprocessor directives. For example, each literal mentioned in
|
||||
the source is represented exactly as it was typed. The syntax trees
|
||||
also represent errors in source code when the program is incomplete
|
||||
or malformed, by representing skipped or missing tokens in the syntax tree.
|
||||
|
||||
2. This enables the second attribute of syntax trees. A syntax tree obtained
|
||||
from the parser is completely round-trippable back to the text it was parsed
|
||||
from. From any syntax node, it is possible to get the text representation of
|
||||
the sub-tree rooted at that node. This means that syntax trees can be used
|
||||
as a way to construct and edit source text. By creating a tree you have by
|
||||
implication created the equivalent text, and by editing a syntax tree,
|
||||
making a new tree out of changes to an existing tree, you have effectively
|
||||
edited the text.
|
||||
|
||||
The syntax tree is composed of Nodes (represented by circles),
|
||||
Tokens (represented by squares), and Trivia (not represented directly, but attached to
|
||||
individual Tokens)
|
||||
|
||||
|
||||
|
||||
### Nodes
|
||||
Syntax nodes are one of the primary elements of syntax trees. These nodes represent
|
||||
syntactic constructs such as declarations, statements, clauses, and expressions.
|
||||
Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
|
||||
The set of node classes is not extensible.
|
||||
|
||||
All syntax nodes are non-terminal nodes in the syntax tree, which means they always have
|
||||
other nodes and tokens as children. As a child of another node, each node has a parent node
|
||||
that can be accessed through the Parent property. Because nodes and trees are immutable,
|
||||
the parent of a node never changes. The root of the tree has a null parent.
|
||||
|
||||
Each node has a ChildNodes method, which returns a list of child nodes in sequential order
|
||||
based on its position in the source text. This list does not contain tokens. Each node also
|
||||
has a collection of Descendant methods - such as DescendantNodes, DescendantTokens, or
|
||||
DescendantTrivia - that represent a list of all the nodes, tokens, or trivia that exist in
|
||||
the sub-tree rooted by that node.
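The traversal methods described above could look roughly like the sketch below. The class names `TreeNode` and `LeafToken`, the `get` prefixes, and the use of a generator are assumptions for illustration; the actual API is not yet finalized and may differ.

```php
<?php
// Hypothetical classes for illustration only.
class LeafToken
{
    public $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }
}

class TreeNode
{
    public $parent = null;
    public $children = [];

    public function __construct(array $children)
    {
        $this->children = $children;
        foreach ($children as $child) {
            if ($child instanceof TreeNode) {
                $child->parent = $this;
            }
        }
    }

    /** Child nodes in source order; tokens are skipped. */
    public function getChildNodes(): array
    {
        return array_values(array_filter($this->children, function ($child) {
            return $child instanceof TreeNode;
        }));
    }

    /** All tokens in the sub-tree rooted at this node, in source order. */
    public function getDescendantTokens(): Generator
    {
        foreach ($this->children as $child) {
            if ($child instanceof TreeNode) {
                yield from $child->getDescendantTokens();
            } else {
                yield $child;
            }
        }
    }
}

// A binary expression `$a + 1`: two operand nodes around an operator token.
$left = new TreeNode([new LeafToken('$a')]);
$right = new TreeNode([new LeafToken('1')]);
$binary = new TreeNode([$left, new LeafToken('+'), $right]);

$texts = [];
foreach ($binary->getDescendantTokens() as $token) {
    $texts[] = $token->text;
}
```

Concatenating `$texts` walks the tokens in source order, which is the basis for reproducing the text of any sub-tree.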
|
||||
|
||||
In addition, each syntax node subclass exposes all the same children through
|
||||
properties. For example, a BinaryExpressionSyntax node class has three additional properties
|
||||
specific to binary operators: Left, OperatorToken, and Right.
|
||||
|
||||
Some syntax nodes have optional children. For example, an IfStatementSyntax has an optional
|
||||
ElseClauseSyntax. If the child is not present, the property returns null.
|
||||
|
||||
### Tokens
|
||||
Syntax tokens are the terminals of the language grammar, representing the smallest syntactic
|
||||
fragments of the code. They are never parents of other nodes or tokens. Syntax tokens
|
||||
consist of keywords, identifiers, literals, and punctuation.
|
||||
|
||||
For efficiency purposes, unlike syntax nodes, there is only one structure for all
|
||||
kinds of tokens with a mix of properties that have meaning depending on the kind
|
||||
of token that is being represented.
|
||||
|
||||
### Trivia
|
||||
Syntax trivia represent the parts of the source text that are largely insignificant for
|
||||
normal understanding of the code, such as whitespace, comments, and preprocessor directives.
|
||||
Because trivia are not part of the normal language syntax and can appear anywhere between
|
||||
any two tokens, they are not included in the syntax tree as a child of a node. Yet, because
|
||||
they are important when implementing a feature like refactoring and to maintain full
|
||||
fidelity with the source text, they do exist as part of the syntax tree.
|
||||
|
||||
You can access trivia by inspecting a token's LeadingTrivia.
|
||||
When source text is parsed, sequences of trivia are associated with tokens.
|
||||
|
||||
### Kinds
|
||||
Each node, token, or trivia has a RawKind property (represented by a numeric literal),
|
||||
that identifies the exact syntax element represented.
|
||||
|
||||
The RawKind property allows for easy disambiguation of syntax node types that share the
|
||||
same node class. For tokens and trivia, this property is the only way to distinguish
|
||||
one type of element from another.
|
||||
|
||||
### Errors
|
||||
Even when the source text contains syntax errors, a full syntax tree that is round-trippable
|
||||
to the source is exposed. When the parser encounters code that does not conform to the
|
||||
defined syntax of the language, it uses one of two techniques to create a syntax tree.
|
||||
|
||||
First, if the parser expects a particular kind of token, but does not find it, it may
|
||||
insert a missing token into the syntax tree in the location that the token was expected.
|
||||
A missing token represents the actual token that was expected, but it has an empty span.
|
||||
|
||||
Second, the parser may skip tokens until it finds one where it can continue parsing.
|
||||
In this case, the tokens that were skipped are attached as a trivia node with
|
||||
the kind SkippedTokens.
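The two recovery techniques can be sketched with a small token stream helper. Everything here is hypothetical: the class `TokenStreamSketch`, the array shape of a token, and the `missing` flag are assumptions for the example, and the real parser's `eat` may be shaped differently.

```php
<?php
// Hypothetical helper for illustration; tokens are plain arrays here.
class TokenStreamSketch
{
    private $tokens;
    private $index = 0;
    private $endOfText;

    public function __construct(array $tokens, int $endOfText)
    {
        $this->tokens = $tokens;
        $this->endOfText = $endOfText;
    }

    /**
     * Consume the next token if it has the expected kind. Otherwise return a
     * zero-length "missing" token at the current position and consume nothing.
     */
    public function eat(string $expectedKind): array
    {
        $next = $this->tokens[$this->index] ?? null;
        if ($next !== null && $next['kind'] === $expectedKind) {
            $this->index++;
            return $next;
        }
        $position = $next === null ? $this->endOfText : $next['start'];
        return ['kind' => $expectedKind, 'start' => $position, 'length' => 0, 'missing' => true];
    }

    /** Skip tokens until one of the given kinds is next; return what was skipped. */
    public function skipUntil(array $kinds): array
    {
        $skipped = [];
        while ($this->index < count($this->tokens)
            && !in_array($this->tokens[$this->index]['kind'], $kinds, true)) {
            $skipped[] = $this->tokens[$this->index++];
        }
        return $skipped;
    }
}

// Source text: 'if ($x' (the closing paren is missing).
$stream = new TokenStreamSketch([
    ['kind' => 'if', 'start' => 0, 'length' => 2],
    ['kind' => '(', 'start' => 3, 'length' => 1],
    ['kind' => 'variable', 'start' => 4, 'length' => 2],
], 6);

$ifKeyword = $stream->eat('if');
$openParen = $stream->eat('(');
$expression = $stream->eat('variable');
$closeParen = $stream->eat(')');  // not found: a zero-length missing token
```

Because the missing token has an empty span and skipped tokens are kept rather than dropped, the tree still covers the source text exactly.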
|
||||
|
||||
Note that the parser produces trees in a tolerant fashion, and will not produce errors for
|
||||
all incorrect constructs (e.g. including a non-constant expression as the default value of
|
||||
a method parameter). Instead, it attaches these errors on a post-parse walk of the tree.
|
||||
|
||||
### Positional Information
|
||||
Each node, token, or trivia knows its position within the source text and the number of
|
||||
characters it consists of. A text position is represented as a 32-bit integer, which is
|
||||
a zero-based Unicode character index. A TextSpan object is the beginning position and a
|
||||
count of characters, both represented as integers. If TextSpan has a zero length, it refers
|
||||
to a location between two characters.
|
||||
|
||||
The position refers to the absolute position within the text, but a helper function is available
|
||||
if you require Line/Column information.
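As a sketch of what such a helper might do, the function below converts an absolute position into zero-based line and column numbers. The name `lineColumnFromPosition` and the return shape are assumptions, and the sketch treats positions as byte offsets into the string, which may not match how the parser counts characters.

```php
<?php
// Hypothetical helper; the name and return shape are assumptions.
function lineColumnFromPosition(int $position, string $text): array
{
    // Everything before the position decides which line we are on.
    $before = substr($text, 0, $position);
    $line = substr_count($before, "\n");

    // The column is the distance from the last newline (or from the start).
    $lastNewline = strrpos($before, "\n");
    $column = $lastNewline === false ? $position : $position - $lastNewline - 1;

    return ['line' => $line, 'column' => $column];
}

$location = lineColumnFromPosition(4, "ab\ncd");  // points at the "d"
```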
|
||||
|
||||
## Next Steps
|
||||
Check out the [Documentation](GettingStarted.md) section for more information on how to consume
|
||||
the parser, or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.
|
|
@ -43,10 +43,6 @@ so each language server operation should be < 50 ms to leave room for all the
|
|||
* Written in PHP - make it as easy as possible for the PHP community to consume and contribute.
|
||||
|
||||
## Current Status and Approach
|
||||
This approach borrows heavily from the designs of Roslyn and TypeScript. However,
|
||||
it will need to be adapted because PHP doesn't offer the
|
||||
same runtime characteristics as .NET and JS.
|
||||
|
||||
To ensure a sufficient level of correctness at every step of the way, the
|
||||
parser is being developed using the following incremental approach:
|
||||
|
||||
|
@ -61,7 +57,7 @@ Error Nodes. Write tests for all invariants.
|
|||
* [ ] _**Performance:**_ profile, benchmark against large PHP applications
|
||||
* [ ] **Phase 6:** Finalize API to make it as easy as possible for people to consume.
|
||||
|
||||
> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out [How It Works](HowItWorks.md) for all the fun technical details.
|
||||
> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out the [Overview](Overview.md) to learn more about key properties of the Syntax Tree and [How It Works](HowItWorks.md) for all the fun technical details.
|
||||
|
||||
<hr>
|
||||
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
|
||||
|
|