This commit is contained in:
Sara Itani 2017-01-11 14:58:34 -08:00
Parent e077478046
Commit 20561f1acf
4 changed files: 224 additions and 44 deletions

View file

@ -91,12 +91,20 @@ foreach ($childNodes as $childNode) {
// }
```
> Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed,
and we'll see what we can do! Also please file any bugs with unexpected behavior in the parse tree. We're still
in our early stages, and any feedback you have is much appreciated :smiley:.
## Play around with the AST!
In order to help you get a sense for the features and shape of the tree,
we've also included a `PHP Syntax Visualizer Extension` that makes use of the parser
to provide error tooltips.
1. Download the VSIX
2. Point it to your PHP Path
3. Disable other extensions in the workspace to ensure minimal interference
we've also included a [Syntax Visualizer Tool](syntax-visualizer/client#php-parser-syntax-visualizer-tool)
that makes use of the parser to both visualize the tree and provide error tooltips.
![image](https://cloud.githubusercontent.com/assets/762848/21635753/3f8c0cb8-d214-11e6-8424-e200d63abc18.png)
If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
![image](https://cloud.githubusercontent.com/assets/762848/21705272/d5f2f7d8-d373-11e6-9688-46ead75b2fd3.png)
If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
## Next Steps
Check out the [Syntax Overview](Overview.md) section for more information on key attributes of the parse tree,
or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.

View file

@ -1,4 +1,17 @@
# How it Works
> Note: Make sure you read the [Overview](Overview.md) section first to get a sense for some of
the high-level principles.
This approach borrows heavily from the designs of Roslyn and TypeScript. However,
it needs to be adapted because PHP doesn't offer the
same runtime characteristics as .NET and JS.
The syntax tree is produced via a two step process:
1. The lexer reads in text, and produces the resulting Tokens.
2. The parser reads in Tokens, to construct the final syntax tree.
Under the covers, the lexer is actually driven by the parser to reduce potential memory
consumption and make it easier to perform lookaheads when building up the parse tree.
## Lexer
The lexer produces tokens out of PHP, based on the following lexical grammar:
@ -11,17 +24,12 @@ flexibility to use our own lightweight token representation (see below) from the
a conversion. This initial implementation is available in `src/Lexer.php`, but has been deprecated in favor of
`src/PhpTokenizer.php`.
Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations). Ultimately,
Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations) -
we found that PHP doesn't provide an efficient way to extract character codes without multiple conversions after the initial
file-read.
file-read.
> **"Model" vs "Representation"**
> * Model := general information exposed, how we will interact with it
> * Representation := underlying data structures
### Tokens (Model)
Tokens take the following form:
### Tokens
Tokens hold onto the following information:
```
Token: {
Kind: Id, // the classification of the token
@ -31,7 +39,6 @@ Token: {
}
```
### Tokens (Representation)
#### Helper functions
In order to be as efficient as possible, we do not store full content in memory.
Instead, each token is uniquely defined by four integers, and we take advantage of helper
@ -39,8 +46,9 @@ functions to extract further information.
* `GetTriviaForToken`
* `GetFullTextForToken`
* `GetTextForToken`
* See code for an up-to-date list
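As a rough illustration of the four-integer idea, the sketch below stores only offsets and recovers trivia and text from the document on demand. The class name `TokenSketch` and the `...Sketch` helper names are hypothetical stand-ins for illustration, not the parser's actual API, and the meaning of `$length` (trivia plus text) is inferred from the invariants listed in this document.

```php
<?php
// Illustrative sketch only: class and function names are hypothetical.
final class TokenSketch
{
    public $kind;      // classification of the token
    public $fullStart; // offset where the token's leading trivia begins
    public $start;     // offset where the token's own text begins
    public $length;    // length of leading trivia plus token text

    public function __construct(int $kind, int $fullStart, int $start, int $length)
    {
        $this->kind = $kind;
        $this->fullStart = $fullStart;
        $this->start = $start;
        $this->length = $length;
    }
}

// Leading trivia: everything between FullStart and Start.
function getTriviaForTokenSketch(TokenSketch $token, string $document): string
{
    return substr($document, $token->fullStart, $token->start - $token->fullStart);
}

// Full text: trivia plus token text.
function getFullTextForTokenSketch(TokenSketch $token, string $document): string
{
    return substr($document, $token->fullStart, $token->length);
}

// Token text only: full length minus the trivia width.
function getTextForTokenSketch(TokenSketch $token, string $document): string
{
    $triviaWidth = $token->start - $token->fullStart;
    return substr($document, $token->start, $token->length - $triviaWidth);
}

// Usage: two spaces of trivia followed by the text "foo".
$document = "  foo";
$token = new TokenSketch(1, 0, 2, 5);
```

Because the token holds no strings of its own, the document text has to be passed to each helper, which is the trade-off this representation makes for lower memory use.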
#### Data structures
#### Notes
At this point in time, the Representation has not yet diverged from the Model. Tokens
are currently represented as a `Token` object, with four properties - `$kind`, `$fullStart`,
`$start`, and `$length`. However, objects (and arrays, and ...) are super expensive in PHP
@ -88,7 +96,7 @@ does not preclude us from presenting a more reasonable API for consumers of the
override the property getters / setters on Node.
### Invariants
#### Invariants
In order to ensure that the parser evolves in a healthy manner over time,
we define and continuously test the set of invariants defined below:
* The sum of the lengths of all of the tokens is equivalent to the length of the document
@ -99,11 +107,49 @@ we define and continuously test the set of invariants defined below:
* `GetTriviaForToken` returns a string of length equivalent to `(Start - FullStart)`
* `GetFullTextForToken` returns a string of length equivalent to `Length`
* `GetTextForToken` returns a string of length equivalent to `Length - (Start - FullStart)`
* See the code for an up-to-date list...
* See `tests/LexicalInvariantsTest.php` for an up-to-date list...
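To make the invariants concrete, here is a minimal sketch that checks two of them against a toy whitespace-splitting lexer. Both `toyLexSketch` and `lexicalInvariantViolationsSketch` are hypothetical helpers written for this illustration; the real checks live in the test file referenced above and may be structured quite differently.

```php
<?php
// Toy lexer (hypothetical): each token is [fullStart, start, length],
// where leading whitespace counts as trivia.
function toyLexSketch(string $text): array
{
    $tokens = [];
    $pos = 0;
    $end = strlen($text);
    while ($pos < $end) {
        $fullStart = $pos;
        while ($pos < $end && ctype_space($text[$pos])) {
            $pos++; // consume leading trivia
        }
        $start = $pos;
        while ($pos < $end && !ctype_space($text[$pos])) {
            $pos++; // consume token text
        }
        $tokens[] = [$fullStart, $start, $pos - $fullStart];
    }
    return $tokens;
}

// Returns a list of human-readable violations (empty when all checks pass).
function lexicalInvariantViolationsSketch(array $tokens, string $text): array
{
    $violations = [];
    $sum = 0;
    foreach ($tokens as [$fullStart, $start, $length]) {
        $sum += $length;
        $trivia = (string) substr($text, $fullStart, $start - $fullStart);
        $tokenText = (string) substr($text, $start, $length - ($start - $fullStart));
        if (strlen($trivia) + strlen($tokenText) !== $length) {
            $violations[] = "trivia + text length mismatch at offset $fullStart";
        }
    }
    if ($sum !== strlen($text)) {
        $violations[] = "token lengths sum to $sum, document length is " . strlen($text);
    }
    return $violations;
}

$text = "  foo bar\n";
$tokens = toyLexSketch($text);
$violations = lexicalInvariantViolationsSketch($tokens, $text);
```

Note that the trailing newline ends up as the trivia of a final, empty-text token, which is what keeps the length sum equal to the document length.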
## Parser
### Node (Model)
Nodes include the following information:
The parser reads in Tokens provided by the lexer to produce the resulting Syntax Tree.
The parser uses a combination of top-down and bottom-up parsing. In particular, most constructs
are parsed in a top-down fashion, which keeps the code simple and readable/maintainable
(by humans :wink:) over time. The one exception to this is expressions, which are parsed
bottom-up. We also hold onto our current `ParseContext`, which lets us know, for instance, whether
we are parsing `ClassMembers`, or `TraitMembers`, or something else; holding onto this `ParseContext`
enables us to provide better error handling (described below).
For instance, let's take the simple example of an `if-statement`. We know to start parsing the
`if-statement` because we'll see an `if` keyword token. We also know from the
[PHP grammar](https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-if-statement),
that an if-statement can be defined as follows:
```
if-statement:
if ( expression ) statement elseif-clauses-1opt else-clause-1opt
```
The resultant parsing logic will look something like this. Notice that we anticipate the next token or
set of tokens based on our current context. This is top-down parsing.
```php
function parseIfStatement($parent) {
$n = new IfStatement();
$n->ifKeyword = eat("if");
$n->openParen = eat("(");
$n->expression = parseExpression();
$n->closeParen = eat(")");
$n->statement = parseStatement();
$n->parent = $parent;
return $n;
}
```
Expressions (produced by `parseExpression`), on the other hand, are parsed bottom-up. That is, rather than attempting
to anticipate the next token, we read one token at a time, and construct a resulting tree based on
operator precedence properties. See the `parseBinaryExpression` in `src/Parser.php` for full information.
See the Error-handling section below for more information on how `ParseContext` is used.
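To give a feel for precedence-driven expression parsing, here is a small precedence-climbing sketch over a flat list of operand and operator strings. It is one common way to build a tree from operator precedence and is not taken from `parseBinaryExpression`; the precedence table, the function name, and the array-based tree shape are all assumptions made for illustration.

```php
<?php
// Hypothetical precedence table; higher numbers bind more tightly.
const PRECEDENCE_SKETCH = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];

// Parses operands and left-associative binary operators from $tokens,
// starting at $pos, and returns a nested [left, operator, right] array.
function parseBinaryExpressionSketch(array $tokens, int &$pos, int $minPrecedence = 0)
{
    $left = $tokens[$pos++]; // assume the next token is an operand
    while ($pos < count($tokens)) {
        $operator = $tokens[$pos];
        $precedence = PRECEDENCE_SKETCH[$operator] ?? -1;
        if ($precedence <= $minPrecedence) {
            break; // the operator belongs to an enclosing expression
        }
        $pos++; // consume the operator
        $right = parseBinaryExpressionSketch($tokens, $pos, $precedence);
        $left = [$left, $operator, $right];
    }
    return $left;
}

$pos = 0;
$tree = parseBinaryExpressionSketch(['1', '+', '2', '*', '3'], $pos);
```

In this sketch `*` binds tighter than `+`, so `2 * 3` becomes the right operand of the addition rather than `1 + 2` becoming the left operand of the multiplication.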
### Nodes
Nodes hold onto the following information:
```
Node: {
Kind: Id,
@ -112,8 +158,10 @@ Node: {
}
```
### Node (Representation)
> TODO - discerning between Model and Representation
#### Notes
In order to reduce memory usage, we plan to remove the NodeKind property, and instead rely solely on
subclasses in order to represent the Node's kind. This should reduce memory usage by ~16 bytes per
Node.
### Abstract Syntax Tree
An example tree is below. The tree Nodes (represented by circles), and Tokens (represented by squares)
@ -131,17 +179,16 @@ WIDTH(T) -> T.Width
```
### Invariants
#### Invariants
* Invariants for all Tokens hold true
* The tree contains every token
* The span of any node is the sum of the spans of its child nodes and tokens
* The tree length exactly matches the file length
* Every leaf node of the tree is a token
* Every Node contains at least one Token
* See `tests/ParserInvariantsTest.php` for an up-to-date list...
### Building up the Tree
#### Error Tokens
### Error Tokens
We define two types of `Error` tokens:
* **Skipped Tokens:** extra token that no one knows how to deal with
* **Missing Tokens:** Grammar expects a token to be there, but it does not exist
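The two error-token types can be sketched roughly as follows. The array-based token shape and the `eatSketch` / `skipTokenSketch` helpers are hypothetical and only illustrate the idea: a missing token is a zero-length placeholder that consumes no input, while a skipped token is consumed and flagged.

```php
<?php
// Hypothetical token shape: ['kind' => string, 'text' => string, ...flags].

// Consumes the next token if it has the expected kind; otherwise returns a
// zero-length "missing" placeholder and leaves the position untouched.
function eatSketch(array $tokens, int &$pos, string $expectedKind): array
{
    if ($pos < count($tokens) && $tokens[$pos]['kind'] === $expectedKind) {
        return $tokens[$pos++] + ['missing' => false];
    }
    return ['kind' => $expectedKind, 'text' => '', 'missing' => true];
}

// Consumes the next token unconditionally and flags it as skipped.
function skipTokenSketch(array $tokens, int &$pos): array
{
    $token = $tokens[$pos++];
    $token['skipped'] = true;
    return $token;
}

// Usage: the token stream for `if $a )`, which lacks the opening paren.
$tokens = [
    ['kind' => 'if', 'text' => 'if'],
    ['kind' => 'variable', 'text' => '$a'],
    ['kind' => ')', 'text' => ')'],
];
$pos = 0;
$ifKeyword = eatSketch($tokens, $pos, 'if');
$openParen = eatSketch($tokens, $pos, '(');
```

Since the missing token has an empty text, concatenating every token's text still reproduces the original input exactly.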
@ -302,18 +349,25 @@ the number of edge cases by limiting the granularity of node-reuse. In the case
we believe a reasonable balance is to limit granularity to a list `ParseContext`.
## Open Questions
This approach, however, makes a few assumptions that we should validate upfront, if possible,
in order to minimize potential risk:
* [ ] **Assumption 1:** This approach will work on a wide range of user development environment configurations.
* [ ] **Assumption 2:** PHP can be sufficiently optimized to support aforementioned parser performance goals.
* [ ] **Assumption 3:** PHP 7 grammar is a superset of PHP5 grammar.
* [ ] **Assumption 4:** The PHP grammar described in `php/php-langspec` is complete.
* Anything else?
Some open Qs:
* need some examples of large PHP applications to help benchmark
* would PHP 5 provide sufficient perf?
* what sort of data structures do we need? Ideally we'd throw everything into a struct. Anything better?
Open Questions:
* Need some examples of large PHP applications to help benchmark? We are currently testing against
the frameworks in the `validation` folder, and more suggestions are welcome.
* what are the most memory-efficient data-structures we could use? See Node and Token Notes sections above
for our current thoughts on this, but we hope we can do better than that, so ideas are very much welcome.
* Can PHP be sufficiently optimized to support the aforementioned parser performance goals? Performance
is pretty okay at the moment, and there's more we could do to optimize that. But we are certainly
running up against major challenges when it comes to memory.
* How well will this approach work on a wide range of user development environment configurations?
* Anything else?
Previously open questions:
* Would PHP 5 provide sufficient performance? No - the memory management and performance in PHP5 are so
far behind those of PHP7 that it wouldn't really make sense to support it. Check out Nikic's blog to get an
idea for just how stark the difference is: https://nikic.github.io/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html
* Is the PHP grammar described in `php/php-langspec` complete? Complete enough - we've submitted some PRs to
improve the spec, but overall we haven't run into any major impediments.
* Is the PHP 7 grammar a superset of the PHP5 grammar? It's close enough that we can afford to patch
the cases where it's not.
## Real world validation strategy
* benchmark against other parsers (investigate any instance of disagreement)

Overview.md (normal file, 122 lines)
View file

@ -0,0 +1,122 @@
# Overview
At a high level, the parser accepts source code as an input, and
produces a syntax tree as an output.
If you're familiar with Roslyn and TypeScript, many of the concepts presented here will be familiar
(albeit adapted, to account for the unique runtime characteristics of PHP.)
## Syntax Tree
A syntax tree is literally a tree data structure, where non-terminal structural
elements parent other elements. Each syntax tree is made up of Nodes (represented by circles),
Tokens (represented by squares), and trivia (not represented, below, but attached to each Token).
![image](https://cloud.githubusercontent.com/assets/762848/19092929/e10e60aa-8a3d-11e6-8b90-51eabe5d1d8e.png)
Syntax trees have two key attributes.
1. The first attribute is that Syntax trees hold all the source information in full fidelity.
This means that the syntax tree contains every piece of information
found in the source text, every grammatical construct, every lexical
token, and everything else in between including whitespace, comments,
and preprocessor directives. For example, each literal mentioned in
the source is represented exactly as it was typed. The syntax trees
also represent errors in source code when the program is incomplete
or malformed, by representing skipped or missing tokens in the syntax tree.
2. This enables the second attribute of syntax trees. A syntax tree obtained
from the parser is completely round-trippable back to the text it was parsed
from. From any syntax node, it is possible to get the text representation of
the sub-tree rooted at that node. This means that syntax trees can be used
as a way to construct and edit source text. By creating a tree you have by
implication created the equivalent text, and by editing a syntax tree,
making a new tree out of changes to an existing tree, you have effectively
edited the text.
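As a minimal sketch of round-tripping, the text of any sub-tree can be recovered by concatenating the full text (trivia plus token text) of its tokens in order. The array-based tree shape and the `getFullTextSketch` name are assumptions made for this illustration and do not reflect the parser's actual Node and Token classes.

```php
<?php
// Hypothetical shapes: a token is ['fullText' => string],
// a node is ['kind' => string, 'children' => [node or token, ...]].
function getFullTextSketch(array $element): string
{
    if (isset($element['fullText'])) {
        return $element['fullText']; // token: trivia plus text
    }
    $text = '';
    foreach ($element['children'] as $child) {
        $text .= getFullTextSketch($child); // node: concatenate children in order
    }
    return $text;
}

// Tree for the source text '$a = 1; // done'
$expression = ['kind' => 'AssignmentExpression', 'children' => [
    ['fullText' => '$a'],
    ['fullText' => ' ='],
    ['fullText' => ' 1'],
]];
$tree = ['kind' => 'SourceFile', 'children' => [
    ['kind' => 'ExpressionStatement', 'children' => [
        $expression,
        ['fullText' => ';'],
    ]],
    ['fullText' => ' // done'], // end-of-file token carrying the trailing comment as trivia
]];
```

Calling `getFullTextSketch($expression)` yields just the assignment, which shows how text can be obtained for the sub-tree rooted at any node.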
The syntax tree is composed of Nodes (represented by circles),
Tokens (represented by squares), and Trivia (not represented directly, but attached to
individual Tokens)
### Nodes
Syntax nodes are one of the primary elements of syntax trees. These nodes represent
syntactic constructs such as declarations, statements, clauses, and expressions.
Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
The set of node classes is not extensible.
All syntax nodes are non-terminal nodes in the syntax tree, which means they always have
other nodes and tokens as children. As a child of another node, each node has a parent node
that can be accessed through the Parent property. Because nodes and trees are immutable,
the parent of a node never changes. The root of the tree has a null parent.
Each node has a ChildNodes method, which returns a list of child nodes in sequential order
based on its position in the source text. This list does not contain tokens. Each node also
has a collection of Descendant methods - such as DescendantNodes, DescendantTokens, or
DescendantTrivia - that represent a list of all the nodes, tokens, or trivia that exist in
the sub-tree rooted by that node.
In addition, each syntax node subclass exposes all the same children through
properties. For example, a BinaryExpressionSyntax node class has three additional properties
specific to binary operators: Left, OperatorToken, and Right.
Some syntax nodes have optional children. For example, an IfStatementSyntax has an optional
ElseClauseSyntax. If the child is not present, the property returns null.
### Tokens
Syntax tokens are the terminals of the language grammar, representing the smallest syntactic
fragments of the code. They are never parents of other nodes or tokens. Syntax tokens
consist of keywords, identifiers, literals, and punctuation.
For efficiency purposes, unlike syntax nodes, there is only one structure for all
kinds of tokens with a mix of properties that have meaning depending on the kind
of token that is being represented.
### Trivia
Syntax trivia represent the parts of the source text that are largely insignificant for
normal understanding of the code, such as whitespace, comments, and preprocessor directives.
Because trivia are not part of the normal language syntax and can appear anywhere between
any two tokens, they are not included in the syntax tree as a child of a node. Yet, because
they are important when implementing a feature like refactoring and to maintain full
fidelity with the source text, they do exist as part of the syntax tree.
You can access trivia by inspecting a token's LeadingTrivia.
When source text is parsed, sequences of trivia are associated with tokens.
### Kinds
Each node, token, or trivia has a RawKind property (represented by a numeric literal),
that identifies the exact syntax element represented.
The RawKind property allows for easy disambiguation of syntax node types that share the
same node class. For tokens and trivia, this property is the only way to distinguish
one type of element from another.
### Errors
Even when the source text contains syntax errors, a full syntax tree that is round-trippable
to the source is exposed. When the parser encounters code that does not conform to the
defined syntax of the language, it uses one of two techniques to create a syntax tree.
First, if the parser expects a particular kind of token, but does not find it, it may
insert a missing token into the syntax tree in the location that the token was expected.
A missing token represents the actual token that was expected, but it has an empty span.
Second, the parser may skip tokens until it finds one where it can continue parsing.
In this case, the skipped tokens are attached as a trivia node with
the kind SkippedTokens.
Note that the parser produces trees in a tolerant fashion, and will not produce errors for
all incorrect constructs (e.g. including a non-constant expression as the default value of
a method parameter). Instead, it attaches these errors on a post-parse walk of the tree.
### Positional Information
Each node, token, or trivia knows its position within the source text and the number of
characters it consists of. A text position is represented as a 32-bit integer, which is
a zero-based Unicode character index. A TextSpan object is the beginning position and a
count of characters, both represented as integers. If TextSpan has a zero length, it refers
to a location between two characters.
The position refers to the absolute position within the text, but a helper function is available
if you require Line/Column information.
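A helper of that kind could look roughly like the sketch below. The name `getLineColumnSketch` is hypothetical, lines and columns are zero-based here, and the position is treated as a byte offset into the PHP string, which is a simplification relative to the character index described above.

```php
<?php
// Converts an absolute offset into a zero-based line and column.
// Hypothetical helper; treats $position as a byte offset.
function getLineColumnSketch(string $text, int $position): array
{
    $before = substr($text, 0, $position);
    $line = substr_count($before, "\n"); // newlines before the position
    $lastNewline = strrpos($before, "\n");
    $column = $lastNewline === false
        ? $position                     // still on the first line
        : $position - $lastNewline - 1; // offset from the start of the current line
    return ['line' => $line, 'column' => $column];
}

$location = getLineColumnSketch("ab\ncd", 4); // position of the character "d"
```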
## Next Steps
Check out the [Documentation](GettingStarted.md) section for more information on how to consume
the parser, or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.

View file

@ -43,10 +43,6 @@ so each language server operation should be < 50 ms to leave room for all the
* Written in PHP - make it as easy as possible for the PHP community to consume and contribute.
## Current Status and Approach
This approach borrows heavily from the designs of Roslyn and TypeScript. However,
it will need to be adapted because PHP doesn't offer the
same runtime characteristics as .NET and JS.
To ensure a sufficient level of correctness at every step of the way, the
parser is being developed using the following incremental approach:
@ -61,7 +57,7 @@ Error Nodes. Write tests for all invariants.
* [ ] _**Performance:**_ profile, benchmark against large PHP applications
* [ ] **Phase 6:** Finalize API to make it as easy as possible for people to consume.
> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out [How It Works](HowItWorks.md) for all the fun technical details.
> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out the [Overview](Overview.md) to learn more about key properties of the Syntax Tree and [How It Works](HowItWorks.md) for all the fun technical details.
<hr>
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).