write overview page

2017-01-11 14:58:34 -08:00 · 2017-01-11 14:58:34 -08:00 · 20561f1acf
--- a/GettingStarted.md
+++ b/GettingStarted.md
@ -91,12 +91,20 @@ foreach ($childNodes as $childNode) {
 //   }
 ```

+> Note: the API is not yet finalized, so please file issues to let us know what functionality you want exposed,
+and we'll see what we can do! Also please file any bugs with unexpected behavior in the parse tree. We're still
+in our early stages, and any feedback you have is much appreciated :smiley:.
+
 ## Play around with the AST!
 In order to help you get a sense for the features and shape of the tree, 
-we've also included a `PHP Syntax Visualizer Extension` that makes use of the parser
-to provide error tooltips. 
-1. Download the VSIX
-2. Point it to your PHP Path
-3. Disable other extensions in the workspace to ensure minimal interference
+we've also included a [Syntax Visualizer Tool](syntax-visualizer/client#php-parser-syntax-visualizer-tool)
+that makes use of the parser to both visualize the tree and provide error tooltips.
+![image](https://cloud.githubusercontent.com/assets/762848/21635753/3f8c0cb8-d214-11e6-8424-e200d63abc18.png)

-If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
+![image](https://cloud.githubusercontent.com/assets/762848/21705272/d5f2f7d8-d373-11e6-9688-46ead75b2fd3.png)
+
+If you see something that looks off, please file an issue, or better yet, contribute as a test case. See [Contributing.md](Contributing.md) for more details.
+
+## Next Steps
+Check out the [Syntax Overview](Overview.md) section for more information on key attributes of the parse tree, 
+or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.
--- a/HowItWorks.md
+++ b/HowItWorks.md
@ -1,4 +1,17 @@
 # How it Works
+> Note: Make sure you read the [Overview](Overview.md) section first to get a sense for some of
+the high-level principles.
+
+This approach borrows heavily from the designs of Roslyn and TypeScript. However,
+it needs to be adapted because PHP doesn't offer the 
+same runtime characteristics as .NET and JS.
+
+The syntax tree is produced via a two-step process:
+1. The lexer reads in text, and produces the resulting Tokens.
+2. The parser reads in Tokens, to construct the final syntax tree.
+
+Under the covers, the lexer is actually driven by the parser to reduce potential memory
+consumption and make it easier to perform lookaheads when building up the parse tree. 

 ## Lexer
 The lexer produces tokens out of PHP, based on the following lexical grammar:
@ -11,17 +24,12 @@ flexibility to use our own lightweight token representation (see below) from the
 a conversion. This initial implementation is available in `src/Lexer.php`, but has been deprecated in favor of 
 `src/PhpTokenizer.php`.

-Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations). Ultimately,
+Ultimately, the biggest challenge with the initial approach was performance (especially with Unicode representations) - 
 we found that PHP doesn't provide an efficient way to extract character codes without multiple conversions after the initial
-file-read. 
+file-read.

-> **"Model" vs "Representation"**
-> * Model := general information exposed, how we will intaract with it
-> * Representation := underlying data structures
-
-
-### Tokens (Model)
-Tokens take the following form:
+### Tokens
+Tokens hold onto the following information: 
 ```
 Token: {
    Kind: Id, // the classification of the token
@ -31,7 +39,6 @@ Token: {
 }
 ```

-### Tokens (Representation)
 #### Helper functions
 In order to be as efficient as possible, we do not store full content in memory.
 Instead, each token is uniquely defined by four integers, and we take advantage of helper
@ -39,8 +46,9 @@ functions to extract further information.
 * `GetTriviaForToken`
 * `GetFullTextForToken`
 * `GetTextForToken`
+* See code for an up-to-date list

-#### Data structures
+#### Notes
 At this point in time, the Representation has not yet diverged from the Model. Tokens
 are currently represented as a `Token` object, with four properties - `$kind`, `$fullStart`,
 `$start`, and `$length`. However, objects (and arrays, and ...) are super expensive in PHP
@ -88,7 +96,7 @@ does not preclude us from presenting a more reasonable API for consumers of the
 override the property getters / setters on Node.


-### Invariants
+#### Invariants
 In order to ensure that the parser evolves in a healthy manner over time, 
 we define and continuously test the set of invariants defined below:
 * The sum of the lengths of all of the tokens is equivalent to the length of the document
@ -99,11 +107,49 @@ we define and continuously test the set of invariants defined below:
 * `GetTriviaForToken` returns a string of length equivalent to `(Start - FullStart)`
 * `GetFullTextForToken` returns a string of length equivalent to `Length`
 * `GetTextForToken` returns a string of length equivalent to `Length - (Start - FullStart)`
-* See the code for an up-to-date list...
+* See `tests/LexicalInvariantsTest.php` for an up-to-date list...
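+
+As a rough illustration of the helper functions and length invariants above, here is a minimal sketch. The class layout, the `KIND_ECHO_KEYWORD` constant, and the sample document are assumptions made for this example; see `src/` for the actual implementation.
+
+```php
+<?php
+// Hypothetical token kind, used only in this sketch.
+const KIND_ECHO_KEYWORD = 1;
+
+// A token defined by four integers, as described in the Tokens section.
+class Token {
+    public $kind;      // classification of the token
+    public $fullStart; // offset where the leading trivia begins
+    public $start;     // offset where the token text begins
+    public $length;    // full length, including leading trivia
+
+    public function __construct(int $kind, int $fullStart, int $start, int $length) {
+        $this->kind = $kind;
+        $this->fullStart = $fullStart;
+        $this->start = $start;
+        $this->length = $length;
+    }
+}
+
+// Leading trivia: length is (Start - FullStart).
+function getTriviaForToken(Token $token, string $document): string {
+    return substr($document, $token->fullStart, $token->start - $token->fullStart);
+}
+
+// Trivia plus text: length is Length.
+function getFullTextForToken(Token $token, string $document): string {
+    return substr($document, $token->fullStart, $token->length);
+}
+
+// Text only: length is Length - (Start - FullStart).
+function getTextForToken(Token $token, string $document): string {
+    return substr($document, $token->start, $token->length - ($token->start - $token->fullStart));
+}
+
+$document = "<?php  echo";
+// Two spaces of leading trivia at offsets 5-6, then "echo" at offsets 7-10.
+$echo = new Token(KIND_ECHO_KEYWORD, 5, 7, 6);
+```
+
+Because the token stores only offsets, the text is recovered on demand from the document string rather than being held in memory per token.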

 ## Parser
-### Node (Model)
-Nodes include the following information:
+The parser reads in Tokens provided by the lexer to produce the resulting Syntax Tree.
+The parser uses a combination of top-down and bottom-up parsing. In particular, most constructs
+are parsed in a top-down fashion, which keeps the code simple and readable/maintainable
+(by humans :wink:) over time. The one exception to this is expressions, which are parsed
+bottom-up. We also hold onto our current `ParseContext`, which lets us know, for instance, whether 
+we are parsing `ClassMembers`, or `TraitMembers`, or something else; holding onto this `ParseContext`
+enables us to provide better error handling (described below).
+
+For instance, let's take the simple example of an `if-statement`. We know to start parsing the
+ `if-statement` because we'll see an `if` keyword token. We also know from the
+[PHP grammar](https://github.com/php/php-langspec/blob/master/spec/19-grammar.md#user-content-grammar-if-statement),
+that an if-statement can be defined as follows:
+```
+if-statement:
+   if   (   expression   )   statement   elseif-clauses-1opt   else-clause-1opt
+```
+
+The resultant parsing logic will look something like this. Notice that we anticipate the next token or
+set of tokens based on our current context. This is top-down parsing.
+```php
+function parseIfStatement($parent) {
+    $n = new IfStatement();
+    $n->ifKeyword = eat("if");
+    $n->openParen = eat("(");
+    $n->expression = parseExpression();
+    $n->closeParen = eat(")");
+    $n->statement = parseStatement();
+    $n->parent = $parent;
+    return $n;
+}
+```
+
+Expressions (produced by `parseExpression`), on the other hand, are parsed bottom-up. That is, rather than attempting
+to anticipate the next token, we read one token at a time, and construct a resulting tree based on 
+operator precedence properties. See `parseBinaryExpression` in `src/Parser.php` for full information.
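+
+To give a feel for precedence-driven tree construction, here is a small sketch using precedence climbing, one common technique for this. It is an illustration only and not necessarily how `parseBinaryExpression` is written; the function names, the tiny precedence table, and the use of plain strings as tokens are all assumptions.
+
+```php
+<?php
+// Hypothetical precedence table; 0 means "not a binary operator".
+function precedenceOf(string $token): int {
+    $table = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];
+    return $table[$token] ?? 0;
+}
+
+// Reads one token at a time from a flat list such as ['1', '+', '2', '*', '3']
+// and builds nested arrays, grouping tighter-binding operators first.
+function parseBinaryExpressionSketch(array $tokens, int &$pos, int $minPrecedence = 0) {
+    $left = $tokens[$pos++]; // an operand in this sketch
+    while ($pos < count($tokens) && precedenceOf($tokens[$pos]) > $minPrecedence) {
+        $operator = $tokens[$pos++];
+        // Anything binding tighter than $operator is folded into the right operand.
+        $right = parseBinaryExpressionSketch($tokens, $pos, precedenceOf($operator));
+        $left = ['left' => $left, 'operator' => $operator, 'right' => $right];
+    }
+    return $left;
+}
+
+$pos = 0;
+$tree = parseBinaryExpressionSketch(['1', '+', '2', '*', '3'], $pos);
+// $tree groups as 1 + (2 * 3), because '*' has higher precedence than '+'.
+```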
+
+See the Error-handling section below for more information on how `ParseContext` is used. 
+
+### Nodes
+Nodes hold onto the following information:
 ```
 Node: {
  Kind: Id,
@ -112,8 +158,10 @@ Node: {
 }
 ```

-### Node (Representation)
-> TODO - discerning between Model and Representation
+#### Notes
+In order to reduce memory usage, we plan to remove the NodeKind property, and instead rely solely on
+subclasses in order to represent the Node's kind. This should reduce memory usage by ~16 bytes per
+Node. 

 ### Abstract Syntax Tree
 An example tree is below. The tree Nodes (represented by circles), and Tokens (represented by squares)
@ -131,17 +179,16 @@ WIDTH(T) -> T.Width
 ```


-### Invariants
+#### Invariants
 * Invariants for all Tokens hold true 
 * The tree contains every token
 * span of any node is sum of spans of child nodes and tokens
 * The tree length exactly matches the file length
 * Every leaf node of the tree is a token
 * Every Node contains at least one Token
+* See `tests/ParserInvariantsTest.php` for an up-to-date list...

-### Building up the Tree
-
-#### Error Tokens
+### Error Tokens
 We define two types of `Error` tokens:
 * **Skipped Tokens:** extra token that no one knows how to deal with
 * **Missing Tokens:** Grammar expects a token to be there, but it does not exist
@ -302,18 +349,25 @@ the number of edge cases by limiting the granularity of node-reuse. In the case
 we believe a reasonable balance is to limit granularity to a list `ParseContext`. 

 ## Open Questions
-This approach, however, makes a few assumptions that we should validate upfront, if possible,
-in order to minimize potential risk:
-* [ ] **Assumption 1:** This approach will work on a wide range of user development environment configurations.
-* [ ] **Assumption 2:** PHP can be sufficiently optimized to support aforementioned parser performance goals.
-* [ ] **Assumption 3:** PHP 7 grammar is a superset of PHP5 grammar.
-* [ ] **Assumption 4:** The PHP grammar described in `php/php-langspec` is complete.
-* Anything else?
-
-Some open Qs:
-  * need some examples of large PHP applications to help benchmark
-  * would PHP 5 provide sufficient perf?
-  * what sort of data structures do we need? Ideally we'd throw everything into a struct. Anything better?
+Open Questions:
+  * Which large PHP applications should we benchmark against? We are currently testing against
+  the frameworks in the `validation` folder, and more suggestions are welcome.
+  * What are the most memory-efficient data structures we could use? See the Node and Token Notes sections above
+   for our current thoughts on this, but we hope we can do better than that, so ideas are very much welcome.
+  * Can PHP be sufficiently optimized to support the aforementioned parser performance goals? Performance
+  is pretty okay at the moment, and there's more we could do to optimize it. But we are certainly
+  running up against major challenges when it comes to memory.
+  * How well will this approach work on a wide range of user development environment configurations?
+  * Anything else?
+  
+Previously open questions:
+* Would PHP 5 provide sufficient performance? No - the memory management and performance in PHP 5 are so far
+behind those of PHP 7 that it wouldn't really make sense to support it. Check out Nikic's blog to get an
+idea of just how stark the difference is: https://nikic.github.io/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html
+* Is the PHP grammar described in `php/php-langspec` complete? Complete enough - we've submitted some PRs to
+improve the spec, but overall we haven't run into any major impediments.
+* Is the PHP 7 grammar a superset of the PHP 5 grammar? It's close enough that we can afford to patch
+the cases where it's not. 

 ## Real world validation strategy
 * benchmark against other parsers (investigate any instance of disagreement)
--- a/Overview.md
+++ b/Overview.md
@ -0,0 +1,122 @@
+# Overview
+
+At a high level, the parser accepts source code as an input, and
+produces a syntax tree as an output.
+
+If you're familiar with Roslyn and TypeScript, many of the concepts presented here will be familiar
+(albeit adapted to account for the unique runtime characteristics of PHP).
+
+## Syntax Tree
+A syntax tree is literally a tree data structure, where non-terminal structural 
+elements parent other elements. Each syntax tree is made up of Nodes (represented by circles), 
+Tokens (represented by squares), and Trivia (not represented below, but attached to each Token).
+
+![image](https://cloud.githubusercontent.com/assets/762848/19092929/e10e60aa-8a3d-11e6-8b90-51eabe5d1d8e.png)
+
+Syntax trees have two key attributes.
+1. The first attribute is that syntax trees hold all the source information in full fidelity.
+This means that the syntax tree contains every piece of information 
+found in the source text, every grammatical construct, every lexical 
+token, and everything else in between including whitespace, comments, 
+and preprocessor directives. For example, each literal mentioned in 
+the source is represented exactly as it was typed. The syntax trees 
+also represent errors in source code when the program is incomplete 
+or malformed, by representing skipped or missing tokens in the syntax tree.
+
+2. This enables the second attribute of syntax trees. A syntax tree obtained 
+from the parser is completely round-trippable back to the text it was parsed 
+from. From any syntax node, it is possible to get the text representation of 
+the sub-tree rooted at that node. This means that syntax trees can be used 
+as a way to construct and edit source text. By creating a tree you have by 
+implication created the equivalent text, and by editing a syntax tree, 
+making a new tree out of changes to an existing tree, you have effectively 
+edited the text.
+
+The syntax tree is composed of Nodes (represented by circles), 
+Tokens (represented by squares), and Trivia (not represented directly, but attached to 
+individual Tokens).
+
+
+
+### Nodes
+Syntax nodes are one of the primary elements of syntax trees. These nodes represent 
+syntactic constructs such as declarations, statements, clauses, and expressions. 
+Each category of syntax nodes is represented by a separate class derived from SyntaxNode. 
+The set of node classes is not extensible.
+
+All syntax nodes are non-terminal nodes in the syntax tree, which means they always have 
+other nodes and tokens as children. As a child of another node, each node has a parent node
+ that can be accessed through the Parent property. Because nodes and trees are immutable, 
+ the parent of a node never changes. The root of the tree has a null parent.
+
+Each node has a ChildNodes method, which returns a list of child nodes in sequential order 
+based on its position in the source text. This list does not contain tokens. Each node also
+has a collection of Descendant methods - such as DescendantNodes, DescendantTokens, or 
+DescendantTrivia - that represent a list of all the nodes, tokens, or trivia that exist in 
+the sub-tree rooted by that node.
+
+In addition, each syntax node subclass exposes all the same children through 
+properties. For example, a BinaryExpressionSyntax node class has three additional properties 
+specific to binary operators: Left, OperatorToken, and Right.
+
+Some syntax nodes have optional children. For example, an IfStatementSyntax has an optional 
+ElseClauseSyntax. If the child is not present, the property returns null.
+
+### Tokens
+Syntax tokens are the terminals of the language grammar, representing the smallest syntactic 
+fragments of the code. They are never parents of other nodes or tokens. Syntax tokens 
+consist of keywords, identifiers, literals, and punctuation.
+
+For efficiency purposes, unlike syntax nodes, there is only one structure for all 
+kinds of tokens with a mix of properties that have meaning depending on the kind 
+of token that is being represented.
+
+### Trivia
+Syntax trivia represent the parts of the source text that are largely insignificant for 
+normal understanding of the code, such as whitespace, comments, and preprocessor directives.
+Because trivia are not part of the normal language syntax and can appear anywhere between 
+any two tokens, they are not included in the syntax tree as a child of a node. Yet, because 
+they are important when implementing a feature like refactoring and to maintain full 
+fidelity with the source text, they do exist as part of the syntax tree.
+
+You can access trivia by inspecting a token's LeadingTrivia. 
+When source text is parsed, sequences of trivia are associated with tokens. 
+
+### Kinds
+Each node, token, or trivia has a RawKind property (represented by a numeric literal), 
+that identifies the exact syntax element represented.
+
+The RawKind property allows for easy disambiguation of syntax node types that share the 
+same node class. For tokens and trivia, this property is the only way to distinguish 
+one type of element from another.
+
+### Errors
+Even when the source text contains syntax errors, a full syntax tree that is round-trippable
+to the source is exposed. When the parser encounters code that does not conform to the 
+defined syntax of the language, it uses one of two techniques to create a syntax tree.
+
+First, if the parser expects a particular kind of token, but does not find it, it may 
+insert a missing token into the syntax tree in the location that the token was expected. 
+A missing token represents the actual token that was expected, but it has an empty span.
+
+Second, the parser may skip tokens until it finds one where it can continue parsing. 
+In this case, the skipped tokens are attached as a trivia node with
+the kind SkippedTokens.
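+
+The missing-token technique can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser's actual code: `eatSketch`, the array-based token shape, and the `missing` flag are all hypothetical.
+
+```php
+<?php
+// Hypothetical token shape for this sketch: ['kind' => ..., 'start' => ..., 'length' => ...].
+function eatSketch(array $tokens, int &$pos, string $expectedKind): array {
+    $next = $tokens[$pos] ?? null;
+    if ($next !== null && $next['kind'] === $expectedKind) {
+        $pos++; // consume the token we expected
+        return $next;
+    }
+    // Not found: produce a missing token of the expected kind with an empty span,
+    // located where the token should have appeared. The position is not advanced.
+    if ($next !== null) {
+        $offset = $next['start'];
+    } else {
+        $last = end($tokens);
+        $offset = $last ? $last['start'] + $last['length'] : 0;
+    }
+    return ['kind' => $expectedKind, 'start' => $offset, 'length' => 0, 'missing' => true];
+}
+
+// Tokens for the incomplete source text "if ($a" (whitespace omitted for brevity).
+$tokens = [
+    ['kind' => 'if', 'start' => 0, 'length' => 2],
+    ['kind' => '(', 'start' => 3, 'length' => 1],
+    ['kind' => 'variable', 'start' => 4, 'length' => 2],
+];
+$pos = 0;
+$ifKeyword = eatSketch($tokens, $pos, 'if');
+$openParen = eatSketch($tokens, $pos, '(');
+$variable = eatSketch($tokens, $pos, 'variable');
+$closeParen = eatSketch($tokens, $pos, ')'); // missing: zero-length token at offset 6
+```
+
+Since the missing token has an empty span, the tree still covers exactly the original text and remains round-trippable.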
+
+Note that the parser produces trees in a tolerant fashion, and will not produce errors for
+all incorrect constructs (e.g. including a non-constant expression as the default value of
+a method parameter). Instead, it attaches these errors on a post-parse walk of the tree.
+
+### Positional Information
+Each node, token, or trivia knows its position within the source text and the number of 
+characters it consists of. A text position is represented as a 32-bit integer, which is 
+a zero-based Unicode character index. A TextSpan object is the beginning position and a 
+count of characters, both represented as integers. If TextSpan has a zero length, it refers
+to a location between two characters.
+
+The position refers to the absolute position within the text, but a helper function is available
+if you require Line/Column information. 
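+
+As an illustration of how Line/Column information can be derived from an absolute position, here is a minimal sketch. The function name is hypothetical and is not the parser's helper; it also assumes byte offsets and `\n` line endings, so multi-byte characters would need extra handling.
+
+```php
+<?php
+// Converts a zero-based absolute position into a zero-based line and column.
+function lineColumnFromPositionSketch(string $text, int $position): array {
+    $before = substr($text, 0, $position);
+    $line = substr_count($before, "\n");      // newlines seen before the position
+    $lastNewline = strrpos($before, "\n");    // false when still on the first line
+    $column = $lastNewline === false ? $position : $position - $lastNewline - 1;
+    return ['line' => $line, 'column' => $column];
+}
+
+$source = "<?php\necho 1;";
+$location = lineColumnFromPositionSketch($source, 8); // position of the 'h' in "echo"
+```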
+
+## Next Steps
+Check out the [Documentation](GettingStarted.md) section for more information on how to consume
+the parser, or the [How It Works](HowItWorks.md) section if you want to dive deeper into the implementation.
--- a/README.md
+++ b/README.md
@ -43,10 +43,6 @@ so each language server operation should be < 50 ms to leave room for all the
 * Written in PHP - make it as easy as possible for the PHP community to consume and contribute.

 ## Current Status and Approach
-This approach borrows heavily from the designs of Roslyn and TypeScript. However,
-it will need to be adapted because PHP doesn't offer the 
-same runtime characteristics as .NET and JS.
-
 To ensure a sufficient level of correctness at every step of the way, the
 parser is being developed using the following incremental approach:

@ -61,7 +57,7 @@ Error Nodes. Write tests for all invariants.
  * [ ] _**Performance:**_ profile, benchmark against large PHP applications
 * [ ] **Phase 6:** Finalize API to make it as easy as possible for people to consume. 

-> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out [How It Works](HowItWorks.md) for all the fun technical details.
+> :rabbit: **Ready to see just how deep the rabbit hole goes?** Check out the [Overview](Overview.md) to learn more about key properties of the Syntax Tree and [How It Works](HowItWorks.md) for all the fun technical details.

 <hr>
 This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).