Merge branch 'master' of https://github.com/Microsoft/CNTK into amitaga/releaseModeCallStacks

This commit is contained in:
Amit Agarwal 2016-01-26 12:59:42 -08:00
Parent 621637e5b3 effb490be9
Commit bda9787dba
10 changed files with 3001 additions and 17 deletions

Binary data
Documentation/Documents/Configuration Files.docx

Binary file not shown.

File diff suppressed because it is too large.

Binary file not shown.


@@ -0,0 +1,69 @@
# ExternalBuffer in Matrix class
There are at least 4 different implementations of the Matrix class that have diverged over time with respect to how the external buffer case is handled. The external buffer case is when the Matrix class does not actually own its own memory and instead points to an external buffer that is managed separately. A deviceID of MANAGEDEXTERN used to be the way this was done; however, we have now moved to setting a flag m_externalBuffer in the common header to signify an external buffer. We have two instances of this in our code today:
1. Column Slices were implemented using this feature. The idea is that you only want to reference a portion of a full matrix, but for efficiency reasons don't want to copy the contents to a new matrix. In this case the slice can be modified just like a real matrix, and it is the programmer's responsibility to ensure that the lifetime of the underlying matrix is longer than that of any of its slices. NOTE: lifetime management is not taken care of for you, so be careful.
2. PTask Buffers - PTask is our solution for using multiple GPUs. It uses a filter-graph based approach for accelerating GPU applications. PTask executes a graph and calls each of its tasks as the inputs become available. In CNTK most of these inputs are numeric arrays with associated Matrix header metadata. We wrap the buffer in a Matrix shell with the external buffer flag set, and call the normal processing methods.
Both of these uses are similar, but slightly different as well. We believe that the same implementations can satisfy both sets of needs. So here are the definitions:
```c++
Matrix(const size_t numRows, const size_t numCols, ElemType *pArray, const size_t matrixFlags=matrixFlagNormal,
short deviceId=AUTOPLACEMATRIX, const size_t nnz=0);
```
* Matrix constructor that constructs a matrix from a buffer pointer and some flags. The behavior depends on the flags. In all cases the dimensions, format (from the matrixFlags), deviceId, and nnz (for sparse representations) are copied:
* matrixFlagDontOwnBuffer - in this case the pArray pointer is set as the m_pArray of the matrix and m_externalBuffer = true
* matrixFlagSetValueOnDevice - if set this signifies that the buffer is on the proper device, but needs to be copied to newly allocated space for the m_pArray, m_externalBuffer = false
* neither set - the buffer is on the CPU and device memory is allocated and then the buffer is copied over, m_externalBuffer = false
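For example, a minimal sketch of the external-buffer case (the buffer name and sizes are illustrative, and ElemType is assumed to be float):
```c++
#include <vector>

// Hedged sketch: wrap an externally managed buffer in a Matrix.
// 'data' must outlive 'wrapper'; with matrixFlagDontOwnBuffer set,
// m_externalBuffer = true and the Matrix destructor will not free it.
std::vector<float> data(256 * 1024);             // owned elsewhere in practice
Matrix<float> wrapper(256, 1024, data.data(),
                      matrixFlagDontOwnBuffer);  // deviceId defaults to AUTOPLACEMATRIX
```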
```c++
Matrix(const Matrix<ElemType>& deepCopyFrom, short deviceId=AUTOPLACEMATRIX); //copy constructor, deep copy
```
* Matrix constructor that constructs a matrix from an existing matrix. Dimensions, format, and other elements are also copied:
* deepCopyFrom - regardless of whether m_externalBuffer is set or not, a new buffer is allocated and the contents of deepCopyFrom are copied to the new buffer. m_externalBuffer = false;
* NOTE: use move constructor or SetValue with matrixFlagDontOwnBuffer if an externalBuffer at the destination is desired
```c++
Matrix<ElemType>& operator=(const Matrix<ElemType>& deepCopyFrom); //assignment operator, deep copy
```
* assignment operator copies from one matrix to another. In all cases, dimensions, format, and other members are copied; m_externalBuffer is left unchanged, and only the buffer contents are copied:
* destination normal, deepCopyFrom is external - the destination is resized as necessary and then the copy is performed.
* destination is external, deepCopyFrom can be either - if the destination would require a resize, an exception is thrown; otherwise the copy is performed.
```c++
Matrix(Matrix<ElemType>&& moveFrom); //move constructor, shallow copy
```
* constructor with move semantics copies from one matrix to another:
* moveFrom is bitwise copied to the newly created matrix, so it is an exact copy of the previous matrix (which is going to be discarded without destructors running)
```c++
Matrix<ElemType>& operator=(Matrix<ElemType>&& moveFrom); //move operator, shallow copy
```
* assignment operator with move semantics copies from one matrix to another:
* destination normal - In this case existing buffers are freed, and then everything is bitwise copied (including m_externalBuffer flag).
* destination is external - bitwise copy over everything (including m_externalBuffer flag)
```c++
void SetValue(const Matrix<ElemType>& deepCopyFrom);
```
* Straight copy from one buffer to another, irrespective of the m_externalBuffer flags, which remain unchanged. If the destination is not large enough, it will be resized. If a buffer size mismatch occurs and the destination is an external buffer, an exception is thrown.
```c++
void SetValue(const size_t numRows, const size_t numCols, ElemType *pArray,
const size_t matrixFlags=matrixFlagNormal, int deviceId=MANAGEDEXTERN);
```
* SetValue with a buffer pointer copies the contents of that buffer to the matrix, resizing the destination as necessary. Also sets the format (through a mask of the matrixFlags) and deviceId of the matrix:
* matrixFlagDontOwnBuffer set, destination normal - Free the contents of the current array buffer, replace pointer, dimensions, m_externalBuffer = true
* matrixFlagDontOwnBuffer set, destination external - replace pointer and dimensions, m_externalBuffer = true
* matrixFlagSetValueOnDevice set, destination normal - the buffer is on the proper device, resize destination as necessary, set the dimensions and copy buffer to the current array, m_externalBuffer = false
* matrixFlagSetValueOnDevice set, destination external - the buffer is on the proper device, throw if dimensions are incompatible, set the dimensions and copy buffer content to the current array location, m_externalBuffer = false
* no flags set, destination normal - the buffer is on the CPU, resize destination as necessary, set the dimensions and copy buffer to the current array, m_externalBuffer = false
* no flags set, destination external - the buffer is on the CPU, throw if dimensions are incompatible, set the dimensions and copy buffer content to the current array location, m_externalBuffer = false
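To make these combinations concrete, here is a hedged sketch of the two most common cases (buffer names are illustrative, ElemType is assumed to be float, and a simple (rows, cols) constructor is assumed):
```c++
#include <vector>

std::vector<float> cpuBuf(4 * 8);          // source buffer on the CPU

Matrix<float> owned(4, 8);
owned.SetValue(4, 8, cpuBuf.data());       // no flags: contents are copied,
                                           // m_externalBuffer = false
Matrix<float> alias(4, 8);
alias.SetValue(4, 8, cpuBuf.data(),
               matrixFlagDontOwnBuffer);   // pointer is adopted, not copied;
                                           // m_externalBuffer = true
```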

Binary file not shown.


@@ -0,0 +1,750 @@
# Model Editing Language
## Definition
The Model Editing Language (MEL) of the Computational Network ToolKit (CNTK) provides a means to modify an existing trained network using a set of provided commands. It provides a number of functions to modify the network and can use the Network Description Language (NDL) to define new elements. It looks similar to a scripting language in syntax, but it is not a programming “language”; it is a simple way to modify an existing network. The network must have been defined in a format that CNTK can read; currently only the CNTK computational network disk format is supported. This document assumes some knowledge of NDL; reading the NDL document prior to this one is recommended.
## Example
This section covers the features of MEL by example. If you would rather see the “programmer documentation”, skip to the MEL Reference section.
Here is a simple example of a MEL script:
```
model1 = LoadModel("c:\models\mymodel.dnn", format=cntk)
SetDefaultModel(model1)
DumpModel(model1, "c:\temp\originalModel.dmp", includeData = true)
#Lets create another hidden layer
Copy(L3.*, L4.*, copy=all)
#Now hook up the layer
SetInput(L4.*.T, 1, L3.RL) # Layer 3 output to Layer 4 input
SetInput(CE.*.T, 1, L4.RL) # Layer 4 output to Top layer input
#Add mean variance normalization using in-line NDL
meanVal = Mean(features)
invstdVal = InvStdDev(features)
inputVal = PerDimMeanVarNormalization(features,meanVal,invstdVal)
#make the features input now take the normalized input
SetInput(L1.BFF.FF.T, 1, inputVal)
#save model
SaveModel("c:\models\mymodel4HiddenWithMeanVarNorm.dnn")
```
This MEL script is using a network that was defined originally by the following NDL script:
```
# constants defined
# Sample, Hidden, and Label dimensions
SDim=784
HDim=256
LDim=10
features=Input(SDim, tag=feature)
labels=Input(LDim, tag=label)
# Layer operations
L1 = RBFF(features, HDim, SDim)
L2 = RBFF(L1, HDim, HDim)
L3 = RBFF(L2, HDim, HDim)
CE = SMBFF(L3, LDim, HDim, labels, tag=Criteria)
Err=ErrorPrediction(labels, CE.F, tag=Eval)
# rootNodes defined here
OutputNodes=(CE.F)
```
### Loading a model
The first command executed in a MEL script is usually a LoadModel() command. This function takes the name of a model file on disk and an optional parameter specifying the format of the model file. Currently only CNTK-format model files are accepted, and CNTK format is the default value. Programmers can write file converters to support more model formats.
```
model1 = LoadModel("c:\models\mymodel.dnn", format=cntk)
SetDefaultModel(model1)
```
model1 is the identifying name this model is given for use in the MEL script. This identifier is used in the next line to set this model as the default model. The default model defines which model will be assumed in all name references within the script, and the model to which any NDL (Network Description Language) commands will apply. This line isn't really necessary in this case, because the first model loaded becomes the default model without explicitly calling the SetDefaultModel() function.
### Viewing a model file
It is often necessary to view a model file to determine the names used in it. MEL uses node names in most commands to specify which node(s) should be modified. The Dump() command dumps the node names, and optionally values, to a file.
```
DumpModel(model1, "c:\temp\originalModel.dmp", includeData = true)
```
The parameters are the model name, the file name, and whether the dump should include data. The includeData optional parameter defaults to false. The dump looks something like this:
```
features=InputValue [784,32]
L1.BFF.B=LearnableParameter [256,1] NeedGradient=true
0.0127850091
-0.00473949127
0.0156492535
0.00529919751
####################################################################
L1.BFF.FF.P=Plus ( L1.BFF.FF.T , L1.BFF.B )
L1.BFF.FF.T=Times ( L1.BFF.W , normInput )
L1.BFF.W=LearnableParameter [256,784] NeedGradient=true
0.0174789988 0.0226208009 -0.00648776069 0.0346485041 -0.0449098013 -0.0233792514
0.0154407881 0.000157605857 0.0206625946 0.0491085015 0.00128563121
```
### Copy
The Copy command copies a node, or a group of nodes, from one location to another. This can be done within the same model or between different models:
```
#Lets create another hidden layer
Copy(L3.*, L4.*, copy=all)
```
The first parameter is the source of the copy and must exist; the second is the target and may or may not exist. If it does exist, the matching nodes will be overwritten by the copy. The optional parameter **copy** can be used to change this behavior. The options are: **all** (the default), which copies all node data and the links to other nodes that are also being copied, or **value**, which copies the node values only, leaving the connections between nodes (if any) unchanged.
In this command an entire layer is duplicated in the same model, creating a new L4 layer. The Copy() command copies the nodes and the connections between the copied nodes by default, so the optional parameter was not required in this case.
The L3 used in this copy command was originally defined in NDL as follows:
```
L3 = RBFF(L2, HDim, HDim)
```
So the new L4 layer will contain all the nodes L3 contains (RectifiedLinear, Plus, Times, W and B Parameters) all connected just as they were in the L3 layer.
### SetInput
To integrate this new layer into the model, the inputs and outputs must still be set properly. After the copy, any node whose connected nodes were not copied will have those connections set to an invalid value. These need to be fixed up in order to have a valid model. Attempting to save a model will first validate it, catching the case where some nodes were not reconnected.
You can change connections between nodes with the SetInput() command. This takes a node to modify, the input number to modify (zero-based), and the new value for that input. The following commands hook up the inputs and outputs for our copied nodes:
```
#Now hook up the layer
SetInput(L4.*.T, 1, L3.RL) # Layer 3 output to Layer 4 input
SetInput(CE.*.T, 1, L4.RL) # Layer 4 output to Top layer input
```
To connect our new L4 layer, we need to set the second input of the Times node (L4.BFF.FF.T) to L3.RL, which is the output of the L3 layer. The input number is zero-based, so the first input is 0 and the second input is 1.
Likewise we need to hook the output of the L4 layer nodes to the input of the top layer. Once again this ends up being a Times node (CE.BFF.FF.T).
### Name Matching
You may have noticed the use of the \* wildcard character in the commands presented to this point. These are name-matching wildcards, and are useful for matching a group of related nodes. Because of the hierarchical “dot naming” scheme used by NDL, it is easy to select all the nodes that a particular macro generated, because they all start with the same prefix. Nodes generated by NDL macros have the following structure:
```
[name]{.[macroName]}.[nameNode]
```
Where **name** is the name assigned in NDL, **macroName** is the name given to a macro called by the initial macro (and can be several layers deep), and **nameNode** is the name given to a single node in the final macro. For example, this macro in NDL:
```
L3 = RBFF(L2, HDim, HDim)
```
Generates the following nodes:
Name | Description
---|---
L3.RL | RectifiedLinear node
L3.BFF.B | Parameter node – used for bias
L3.BFF.W | Parameter node – used for weight
L3.BFF.FF.T | Times node
L3.BFF.FF.P | Plus node
These patterns can be used to access these nodes:
Pattern | Result
---|---
L3.\* | Select all the L3 nodes
L3.\*.P | Select the L3.BFF.FF.P node
\*.W | Select L3.BFF.W and any other node named W in the model
model1.L3.\* | All the L3 nodes in the model1 model
model1\[.\*\] | All the nodes in model1 (the .\* is optional)
There are also methods that will copy nodes based on the structure of the graph. Look for CopySubTree() in the reference section for details.
### Adding new nodes
New nodes can be added to an existing model the same way a model is originally defined: in NDL. There are two ways to do this. The simplest is to type the NDL definitions directly into the MEL script, like so:
```
#Add mean variance normalization using in-line NDL
meanVal = Mean(features)
invstdVal = InvStdDev(features)
inputVal = PerDimMeanVarNormalization(features,meanVal,invstdVal)
```
This is called in-line NDL and can be used for most tasks. This sequence of nodes does a mean variance normalization on the dataset. The new nodes will be placed in the current default model in the MEL script. In our example script, we only use one model, and it was set as the default model using the SetDefaultModel() command. If no model has been explicitly set as the default, the last loaded model is used. However, it is recommended that the SetDefaultModel() command be used to make it explicit.
Notice that the variable **features** used in the NDL is actually a node from the default model. It is legal to use nodes from the model in in-line NDL and vice versa. However, no name-matching '\*' patterns are allowed in NDL commands, and macros cannot be defined in in-line NDL.
An NDL macro can also be used from in-line NDL, as long as it appears in the default macros defined for the editing script, or it is defined in an NDL Snippet (see below).
### Connecting in-line NDL
The sequence of nodes used to do mean variance normalization is now in the model. However, we have to use the output of these NDL nodes to replace the previous input node that provided the features. This node is called features in this model, and we need to set the input of the L1 layer to be inputVal (the output of the NDL nodes we just created) instead. This is done, again, using the SetInput() command:
```
#make the features input now take the normalized input instead
SetInput(L1.BFF.FF.T, 1, inputVal)
```
Now the nodes have all been connected and the model is valid; a mean variance normalization step has just been added to the model. The Mean() and InvStdDev() nodes both execute before any training begins and are called pre-compute nodes. The mean and inverse standard deviation are calculated over the training data set, and then those values are used during training to normalize the data.
## NDL Snippets
NDL snippets are sections of NDL definitions that generate a new model. Any NDL construct that is legal in an NDL script can be used. This includes defining macros and other advanced NDL features. For example, instead of loading an existing NDL file, an NDL snippet could have been used to define the network structure. The NDL Snippet looks like:
```
model1=[
# constants defined
# Sample, Hidden, and Label dimensions
SDim=784
HDim=256
LDim=10
features=Input(SDim, tag=feature)
labels=Input(LDim, tag=label)
# Layer operations
L1 = RBFF(features, HDim, SDim)
L2 = RBFF(L1, HDim, HDim)
L3 = RBFF(L2, HDim, HDim)
CE = SMBFF(L3, LDim, HDim, labels, tag=Criteria)
Err=ErrorPrediction(labels, CE.F, tag=Eval)
# rootNodes defined here
OutputNodes=(CE.F)
]
```
When snippets are used, wildcard naming and use of symbols from another model are not allowed. The syntax rules are identical to those for creating an NDL script.
### SaveModel
After the model edits are complete, it's time to save the model:
```
#save model
SaveModel("c:\models\mymodel4HiddenWithMeanVarNorm.dnn")
```
This command saves the default model (still model1) to the specified path. model1 could have been specified as the first parameter, with the path as the second, to the same effect. Before the save happens, the model is validated to ensure it is valid. Should there be an error in the model, an error message will be displayed on the console and the model edit will terminate.
## MEL Reference
The Model Editing Language (MEL) provides a means to modify an existing CNTK network or trained model to create new networks and models. MEL allows nodes of a network to be copied, new nodes to be created, and node values to be duplicated, so that new networks can be based on previously done work.
### Commands
Commands in MEL are the operations that can be used to modify a network or model. Commands are represented in a function-call-like syntax:
`Command(parameter1, parameter2, optionalParameter=value)`
Commands do not return values, with the exception of the CreateModel() and LoadModel() commands, and some may have optional parameters. Parameters are delimited with commas. The commands are:
**Command Name** | **Example** | **Notes**
---|---|---
CreateModel | m1=CreateModel() | Returns a value
CreateModelWithName | CreateModelWithName(model1) | Alternate no return value
LoadModel | m1=LoadModel(“new.dnn”, format=cntk) | Returns a value
LoadModelWithName | LoadModelWithName(m1, “new.dnn”, format=cntk) | Alternate no return value
LoadNDLSnippet | LoadNDLSnippet(mNDL, “net.ndl”) |
SaveDefaultModel | SaveDefaultModel(“new.dnn”, format=cntk) |
SaveModelWithName | SaveModelWithName(m1, “new.dnn”, format=cntk) |
SetDefaultModel | SetDefaultModel(m1) |
UnloadModel | UnloadModel(m1) |
Dump\[Model\] | Dump(m1, “dump.txt”, includeData=false) | DumpModel is alternate name
DumpNode | DumpNode(node, “node.txt”, includeData=false) |
Copy\[Node\] | Copy(fromNode, toNode, copy=all) | CopyNode is alternate name
CopySubTree | CopySubTree(fromNode, toNetwork, toNodeNamePrefix, copy=all) |
Copy\[Node\]Inputs | CopyInputs(fromNode, toNode) | CopyNodeInputs is alternate name
Set\[Node\]Input | SetInput(fromNode, inputID, inputNode) | SetNodeInput is alternate name
Set\[Node\]Inputs | SetInputs(fromNode, inputNode1\[, inputNode2, inputNode3\]) | SetNodeInputs is alternate name, variable number of parameters |
SetProperty | SetProperty(toNode, propertyName, propertyValue) |
SetPropertyForSubTree | SetPropertyForSubTree(rootNode, propertyName, propertyValue) |
Remove\[Node\] | Remove(node\[, node2, node3, …\]) | Same as DeleteNode()
Delete\[Node\] | Delete(node\[, node2, node3, …\]) | Same as RemoveNode()
Rename | Rename(nodeOld, nodeNew) |
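As an illustrative sketch, several of these commands might be combined as follows (the file paths and the Err node name are hypothetical):
```
m1=LoadModel("old.dnn", format=cntk)
SetDefaultModel(m1)
DumpModel(m1, "old.dmp", includeData=false)  # inspect node names first
Delete(Err)                                  # remove the evaluation node
SaveModel(m1, "new.dnn")
```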
### Name Matching
MEL provides a way to perform a command on more than one node at a time. This is done through wildcard name matching. Because of the hierarchical “dot naming” scheme used by NDL, related nodes are easy to select with a wildcard name-matching scheme. Nodes generated by NDL macros have the following structure:
`{[modelName].}[name]{.[macroName]}.[nameNode]`
Element | Description
---|---
**modelName** | an optional prefix that defines which model should be applied to the rest of the name. If no **modelName** is specified, the current default model is assumed.
**name** | the name of the node in question or, if NDL was used to create the network, the top-level symbol used to identify the node (e.g. L3 in the following example)
**macroName** | the name given to a macro called by the initial macro; it can be several layers deep. Usually the names are the same as those of the macros called. A user is unlikely to know these names without dumping the network nodes, so wildcard name matching can be used instead of the **macroName**(s)
**nameNode** | the name given to a single node in the final macro.
For example, this macro in NDL:
```
L3 = RBFF(L2, HDim, HDim)
```
Generates the following nodes:
Name | Description
---|---
L3.RL | RectifiedLinear node
L3.BFF.B | Parameter node – used for bias
L3.BFF.W | Parameter node – used for weight
L3.BFF.FF.T | Times node
L3.BFF.FF.P | Plus node
The following wildcard patterns can be used to select nodes within a model. If a \[model\] prefix is not specified the default model is assumed:
Pattern | Example | Result
---|---|---
\[prefix\]\* | L3.\* | Select all the nodes starting with \[prefix\]
\[prefix\]\*\[suffix\] | L3.\*.P | Select all nodes with \[prefix\] and \[suffix\]
\*\[suffix\] | \*.W | Select all the nodes with \[suffix\]
\[model\].\[pattern\] | model1.L3.\* | Select all the nodes matching a pattern in \[model\]
\[model\]{\*} | model1.\* | Select all nodes in the model, \* is optional
There are also methods that will copy nodes based on the structure of the graph. Look for CopySubTree() in the reference section for details.
### Optional Parameters
Many commands have optional parameters that will change the behavior of the command. For example:
```
Copy(L1.\*, L2.\*, copy=all)
```
In this example all the nodes starting with "L1." are copied to nodes starting with "L2."; the values of the nodes, as well as any links between them (the network structure), are copied. If the destination "L2.\*" nodes already exist, they will be overwritten. The other option is copy=value, which is used when the desired network structure already exists and only the values contained in the nodes need to be copied. This can be used to copy the values of Parameter() nodes to a new model with identical structure.
Each command may have optional parameters; see the command reference section for details of the optional parameters accepted by each command.
### Stringize variables
MEL supports a “stringize” feature similar to the one supported by configuration files. Anywhere in a MEL script file, you can specify “$VarName$”, and this entire string will be replaced by the value of the variable called “VarName”. Note that the variables that are considered in scope for this purpose are the configuration variables that are visible from the configuration section where the path to this MEL script is specified (via the “editPath” parameter). For example, if the variables “OldModelPath” and “NewModelPath” were defined at the root level of the configuration file, the following would be a proper MEL script:
```
m1=LoadModel("$OldModelPath$",format=cntk)
# make change to model here
SaveModel(m1,"$NewModelPath$",format=cntk)
```
## NDL Integration
NDL (Network Description Language) can be used freely in MEL to create new nodes and integrate them into an existing model. Please refer to the NDL Section of the documentation to get the details on all the NDL Functions that are available. The NDL Functions can be used in two different ways in MEL. In-line and as a snippet.
### In-line NDL
In-line NDL is, as it sounds, NDL lines mixed in with MEL command calls. This is an easy way to define new nodes in a MEL script. In-line NDL only works on the model that is the default at the time the NDL function is encountered. The default model is set with the SetDefaultModel() command or, if no such command has been encountered, by the last LoadModel() or CreateModel() command. It is recommended that the SetDefaultModel() command appear before any in-line NDL to make it clear which model is being modified.
In-line NDL may use node names from the default model as parameters, and MEL commands may use NDL symbols as parameters. There are a number of restrictions on using in-line NDL:
1. \* names may not be used in in-line NDL; only fully qualified node names are accepted.
2. When used in MEL commands, NDL symbols apply only to the model that was the default at the time they were created
3. Macros may not be defined in in-line NDL (though they can in an NDL snippet)
4. Only macros defined in the default macro file referenced in the config file, or macros defined in an NDL snippet in the MEL Script may be used
5. NDL is processed when the next MEL command that requires it is encountered; only at that time are the new nodes fully created. If forward references to variables are used, they must be resolvable before the next MEL command that requires them.
### NDL Snippets
NDL snippets are sections of NDL definitions that generate a new model. Any NDL construct that is legal in an NDL script can be used, including defining macros and other advanced NDL features. The syntax for defining an NDL snippet is as follows:
```
[modelName]=[
#ndl commands go here
]
```
Upon completion of the snippet, modelName will be the name of the newly defined model. This model need not be fully defined; for example, the special nodes (i.e. criteria nodes) do not need to be defined in the model. However, all referenced variables must be defined in the snippet. It is often easier to use in-line NDL to define new nodes in MEL and to use NDL snippets to define any macros. Macros are defined in a global namespace, so they can be defined in any model and used from any other model, as the sketch below illustrates.
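For instance, a hedged sketch of a snippet whose only purpose is to hold a shared macro (the model name macroHolder is hypothetical):
```
macroHolder=[
# a macro defined here lives in the global namespace and can be
# used from in-line NDL or from any other model
RFF(x1, w1, b1)=RectifiedLinear(Plus(Times(w1,x1),b1))
]
```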
One possible use of an NDL snippet is to define an entirely new model, and then use MEL to populate the new model with values. Here is an example of how an NDL snippet could have been used to define the entire network structure:
```
model1=[
# constants defined
# Sample, Hidden, and Label dimensions
SDim=784
HDim=256
LDim=10
features=Input(SDim, tag=feature)
labels=Input(LDim, tag=label)
# Layer operations
L1 = RBFF(features, HDim, SDim)
L2 = RBFF(L1, HDim, HDim)
L3 = RBFF(L2, HDim, HDim)
CE = SMBFF(L3, LDim, HDim, labels, tag=Criteria)
Err=ErrorPrediction(labels, CE.F, tag=Eval)
# rootNodes defined here
OutputNodes=(CE.F)
]
```
When snippets are used, wildcard naming and use of symbols from another model are not allowed. The syntax rules are identical to those for creating an NDL script. Alternately, the LoadNDLSnippet() command can be used to load NDL from an external file.
## Comments
Comments in MEL are identical to those used in NDL and configuration files. The \# character signifies the beginning of a comment; everything after the \# is ignored. The \# must be preceded by whitespace or be at the beginning of the line to be interpreted as a comment. The following are valid comments:
```
# Layer operations
L1 = RBFF(features, HDim, SDim) # define the first layer
# the following variable is set to infinity and the # in 1#INF is not interpreted as a comment marker
var = 1#INF
```
## MEL Commands
This section contains the currently implemented MEL Command functions.
### CreateModel
Creates a new model which is empty.
`m1=CreateModel()`
#### Parameters
none
#### Returns
the new model
#### Notes
This command is one of only a few that return a value. If you prefer to easily distinguish between NDL functions (which always return a value) and MEL commands (which normally do not) you may wish to use the alternate CreateModelWithName() call, which takes the new model identifier as a parameter instead of returning it as a return value.
### CreateModelWithName
Creates a new model which is empty.
`CreateModelWithName(m1)`
#### Parameters
the identifier for the newly created model
#### Notes
The alternate form of the command is CreateModel() and returns a value. If you prefer to easily distinguish between NDL functions (which always return a value) and MEL commands (which normally do not) you may wish to use this version of the command.
### LoadModel
Load a model from a disk file and assign it a name. The format of the file may be specified as an optional parameter.
`m1=LoadModel(modelFileName, [format=cntk])`
#### Parameters
`modelFileName` – name of the model file, can be a full path name. If it contains spaces, it must be enclosed in double quotes.
#### Returns
model identifier for the model that will be loaded
#### Optional Parameters
`format=[cntk]` – Specifies the format of a file, defaults to cntk. Currently only the native CNTK format of model file is accepted. Other formats may be supported in the future.
#### Notes
This command is one of only a few that return a value. If you prefer to easily distinguish between NDL functions (which always return a value) and MEL commands (which normally do not) you may wish to use the alternate LoadModelWithName() call, which takes the new model identifier as a parameter instead of returning it as a return value.
### LoadModelWithName
Load a model from a disk file and assign it a name. The format of the file may be specified as an optional parameter.
`LoadModelWithName(model, modelFileName, [format=cntk])`
#### Parameters
`model` – identifier associated with the model that will be loaded.
`modelFileName` – name of the model file, can be a full path name. If it contains spaces, it must be enclosed in double quotes.
#### Optional Parameters
`format=[cntk]` – Specifies the format of a file, defaults to cntk. Currently only the native CNTK format of model file is accepted. Other formats may be supported in the future.
#### Notes
The alternate form of the command is LoadModel() and returns a value. If you prefer to easily distinguish between NDL functions (which always return a value) and MEL commands (which normally do not) you may wish to use this version of the command.
### LoadNDLSnippet
Load an NDL Snippet from a file, and process it, assigning the results to a symbol
`LoadNDLSnippet(model, ndlSnippetFileName[, section=first])`
#### Parameters
`model` – the identifier that will be used to reference this model.
`ndlSnippetFileName` – name of the file that contains the snippet we want to load
#### Optional Parameters
`section=sectionName` – name of the section that contains the snippet to load. If the entire file is the snippet, no section name should be specified.
### SaveModel
Save a model to disk in the specified model format
`SaveModel(model, modelFileName[, format=cntk])`
#### Parameters
`model` – the identifier of the model which will be saved
`modelFileName` – the file name to save the model as
#### Optional Parameters
`format=cntk` – the format of file to save. The only valid value currently is CNTK format, which is the default. It is expected that different formats will be added in the future
### SaveDefaultModel
Save the current default model to a file. The format can be specified with an optional parameter
`SaveDefaultModel(modelFileName, format=cntk)`
#### Parameters
`modelFileName` – name of the model file to save
#### Optional Parameters
`format=cntk` – the format of file to save. The only valid value currently is CNTK format, which is the default. It is expected that different formats will be added in the future
### UnloadModel
Unload the specified model from memory.
`UnloadModel(model)`
#### Parameters
`model` – model identifier.
#### Notes
In general it is unnecessary to unload a model explicitly since it will happen automatically at the end of the MEL script. It is also not recommended that you reuse a model identifier after unloading a model.
### Dump, DumpModel
Create a text file that represents the contents and structure of the Computational network.
`Dump(model, dumpFileName[, includeData=false])`
`DumpModel(model, dumpFileName[, includeData=false])`
#### Parameters
`model` – model identifier
`dumpFileName` – file name to save the output
#### Optional Parameters
`includeData=[true,false]` – (default = false) Include the data contained in a node. This will output the contents of nodes that contain matrix values.
### DumpNode
Create a text file that represents the contents of a node.
`DumpNode(node, dumpFileName[, includeData=false])`
#### Parameters
`node` – node Identifier, a wildcard name may be used to output multiple nodes in one call
`dumpFileName` – file name to save the output
#### Optional Parameters
`includeData=[true,false]` – (default = false) Include the data contained in a node. This will output the contents of nodes that contain matrix values.
### Copy, CopyNode
Copy a node, or a group of nodes from one location to another location. This can be done within the same model, or between different models. The copy can create new nodes or overwrite/update existing nodes. The network structure can be copied with multiple nodes, or just the values in the nodes.
`Copy(fromNode, toNode[, copy=all])`
`CopyNode(fromNode, toNode[, copy=all])`
#### Parameters
`fromNode` – node identifier we are copying from. This can also be a wildcard pattern.
`toNode` – node identifier we are copying to. This can also be a wildcard pattern, but must match the `fromNode` pattern. A copy from a single node to multiple nodes is also permitted.
#### Optional Parameters
`copy=[all,value]` – (default = all). Specifies how the copy will be performed:
copy | If destination node exists | If destination node does not exist
---|---|---
all | Copies over the values of the nodes and any links between them, overwriting the existing node values. Any node inputs that are not included in the copy set will remain unchanged. | Copies over the values of the nodes and any links between them, creating new nodes. All nodes whose inputs are included in the copy set will still be connected. All other nodes will have no inputs and will need to be set using SetInput()
value | Copies over the node contents; the node inputs will remain unchanged | Not a valid option; the nodes must exist to copy only values.
#### Examples
`Copy(L1.*, L2.*)` – copies all the nodes and the inputs in the L1.\* copy set to L2.\*. If the L2.\* nodes do not exist, they will be created
`Copy(L1.BFF.FF.W, model2.*.W, copy=value)` – copies the values in the L1.BFF.FF.W node to all the nodes in model2 that use the name W.
#### Notes
If an entire network is to be copied, it is easier to save the network first (possibly to a temporary location) and reload that model under a new name.
### CopySubTree
Copy all nodes in a subtree of a computational network from one location to another location. This can be done within the same model, or between different models.
`CopySubTree(fromRootNode, toRootNode[, copy=all])`
#### Parameters
`fromRootNode` – node identifier we are copying from. This can also be a wildcard pattern.
`toRootNode` – node identifier we are copying to. This can also be a wildcard pattern, but must match the fromRootNode pattern.
#### Optional Parameters
`copy=[all,value]` – (default = all). Specifies how the copy will be performed:
copy | If destination node exists | If destination node does not exist
---|---|---
all | Copies over the values of the nodes and any links between them, overwriting the existing node values. Any node inputs that are not included in the copy set will remain unchanged. | Copies over the values of the nodes and any links between them, creating new nodes. All nodes whose inputs are included in the copy set will still be connected. All other nodes will have no inputs and will need to be set using SetInput()
value | Copies over the node contents; the node inputs will remain unchanged | Not a valid option; the nodes must exist to copy only values.
#### Notes
If fromRootNode is a wildcard pattern, then toRootNode must be a similar wildcard pattern. The CopySubTree() command executes separately for each root node.
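A hedged example, reusing the L3 layer nodes from earlier (the destination name L3Copy is hypothetical):
```
# copy the subtree rooted at the L3 Plus node into model2 under a new root
CopySubTree(L3.BFF.FF.P, model2.L3Copy, copy=all)
```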
### SetInput, SetNodeInput
Set an input of a node to a value
`SetInput(node, inputNumber, inputNode)`
#### Parameters
`node` – node whose input we are modifying. This can also be a wildcard pattern.
`inputNumber` – a zero-based index to the input that will be set.
`inputNode` – node identifier for input node. This must be a single node.
#### Notes
SetInput() or SetInputs() are often required after a Copy() command in order to hook up all the copied nodes into the network.
### SetInputs, SetNodeInputs
Set all the inputs of a node. If only one input needs to be set use the SetInput() command instead.
`SetInputs(node, inputNode1[, inputNode2, inputNode3])`
#### Parameters
`node` – node whose input we are modifying.
`inputNode1`, `inputNode2`, `inputNode3` – node identifier for input node. The number of input parameters must match the number of inputs **node** requires.
#### Notes
SetInput() or SetInputs() are often required after a Copy() command in order to hook up all the copied nodes into the network.
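For example, to reconnect both inputs of a copied Times node in one call (a hedged sketch using the node names from the earlier layer-copy example):
```
# input 0 is the weight parameter, input 1 is the data flowing into the layer
SetInputs(L4.BFF.FF.T, L4.BFF.W, L3.RL)
```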
### SetProperty
Set the property of a node to a specific value.
`SetProperty(node, propertyName, propertyValue)`
#### Parameters
`node` – the node whose properties will be set
`propertyName` – name of the property to modify.
`propertyValue` – the value the Property will receive.
The acceptable propertyNames and propertyValues are as follows:
PropertyName | Description | PropertyValue
---|---|---
ComputeGradient / NeedsGradient | A flag that determines if a node participates in gradient calculations. Applies to Parameter nodes | true / false
Feature | Sets the node as a feature input. Applies to Input nodes | true / false
Label | Set the node as a label input. Applies to Input nodes | true / false
FinalCriterion / Criteria | Sets the node as one of the Criteria nodes of the network | true / false
Evaluation / Eval | Set the node as one of the evaluation nodes | true / false
Output | Set the node as one of the output nodes | true / false
#### Notes
Most of these properties can be set on nodes through alternate methods. All of these properties except ComputeGradient can be added (but not removed) using the special node syntax in NDL.
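A couple of hedged examples using the properties from the table above (node names follow the earlier examples):
```
SetProperty(CE.F, Output, true)               # add CE.F to the output nodes
SetProperty(L1.BFF.W, ComputeGradient, false) # exclude this parameter from training
```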
### SetPropertyForSubTree
Set a property to a specific value for all applicable nodes in a subtree.
`SetPropertyForSubTree(rootNode, propertyName, propertyValue)`
#### Parameters
`rootNode` – the node at the root of the subtree
`propertyName` – name of the property to modify.
`propertyValue` – the value the Property will receive.
The acceptable propertyNames and propertyValues for this command are as follows:
PropertyName | Description | PropertyValue
---|---|---
ComputeGradient / NeedsGradient | A flag that determines if a node participates in gradient calculations. Applies to Parameter nodes | true / false
#### Notes
The ComputeGradient property only applies to Parameter nodes in the subtree.
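For example, to freeze every Parameter node under the first layer (a hedged sketch; only Parameter nodes in the subtree are affected):
```
SetPropertyForSubTree(L1.RL, ComputeGradient, false)
```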
### Remove, RemoveNode, Delete, DeleteNode
Delete or remove node(s) from a model. All the alternate command names perform the same operation.
`Remove(node[, node2, node3, …])`
`Delete(node[, node2, node3, …])`
`RemoveNode(node[, node2, node3, …])`
`DeleteNode(node[, node2, node3, …])`
#### Parameters
`node` – the node to be removed. This can be a wildcard name.
`node2`, `node3` – additional optional nodes that will also be removed. These can be wildcards.
#### Notes
This command can leave unconnected nodes in a model which would need to be reconnected using the SetInput() or SetInputs() commands.
### Rename
Rename a node
`Rename(oldNode, newNode)`
#### Parameters
`oldNode` – the node name of the old node, wildcard naming may be used.
`newNode` – the node name for the new node, matching wildcard naming may be used if oldNode contains wildcards.
#### Notes
Renaming a node has no effect on the node's inputs; even when a name changes, the associations remain intact.
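For example (a hedged sketch), renaming a whole layer with matching wildcards:
```
# every node with the L3. prefix gets the L4. prefix; inputs stay connected
Rename(L3.*, L4.*)
```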

Binary file not shown.


@@ -0,0 +1,900 @@
# Network Description Language
## Definition
The Network Description Language (NDL) of the Computational Network ToolKit (CNTK) provides a simple way to define a network in a code-like fashion. It contains variables, macros, and other well-understood concepts. It looks similar to a scripting language in syntax, but it is not a programming “language”; it is a simple way to define a network.
## Example
This section covers the features of NDL by example. If you would rather see the “programmer documentation”, skip to the NDL Reference section.
Here is a simple example of a network definition:
```
SDim=784
HDim=256
LDim=10
B0=Parameter(HDim)
W0=Parameter(HDim, SDim)
features=Input(SDim)
labels=Input(LDim)
Times1=Times(W0, features)
Plus1=Plus(Times1, B0)
RL1=RectifiedLinear(Plus1)
B1=Parameter(LDim, 1)
W1=Parameter(LDim, HDim)
Times2=Times(W1, RL1)
Plus2=Plus(Times2, B1)
CrossEntropy=CrossEntropyWithSoftmax(labels, Plus2)
ErrPredict=ErrorPrediction(labels, Plus2)
FeatureNodes=(features)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(Plus2)
```
This is a simple neural network that consists of two layers.
### Variables
The first thing you will notice is the SDim, HDim, and LDim variables. Variable names can be any alphanumeric string (starting with a letter) and are case-insensitive.
```
SDim=784
HDim=256
LDim=10
```
These variables are set to scalar numeric values in this case and are used as parameters in the NDL Functions. These values are the dimensions of the data samples, hidden layers, and labels used in training. This particular setup is for the MNIST dataset, which is a collection of images that contain 784 pixels each. Each image is a handwritten digit (0-9), so there are 10 possible labels that can be applied to each image. The hidden matrix dimension is determined by the user depending on their needs.
### Parameters
Parameters are matrices that constitute the learned model upon completion of training. The model parameter matrices are used to modify the sample data into the desired output data and are updated as part of the learning process.
```
B0=Parameter(HDim)
W0=Parameter(HDim, SDim)
```
These lines set up the parameters that will be trained: W0 is the weight matrix and B0 is the bias matrix. Parameters are matrices and have two dimension parameters. If only one dimension is given, the other is assumed to be 1. By default Parameters are initialized with uniform random numbers, but other options exist (see the NDL Function definitions).
### Inputs
The inputs into the network are defined by the sample data and the labels associated with the samples.
```
features=Input(SDim)
labels=Input(LDim)
```
The features input will have the dimensions of the sample data, and the labels input will have the dimensions of the labels. The variables chosen here are for convenience and could be any valid variable name.
### Computation
The computation portion of the network takes the product of the weight matrix and the features matrix and adds the bias. It uses the matrix operators Times() and Plus().
```
Times1=Times(W0, features)
Plus1=Plus(Times1, B0)
RL1=RectifiedLinear(Plus1)
```
Following this computation we apply the energy function, in this case RectifiedLinear(), to the result. The Sigmoid() function is also available (see NDL Function definitions).
### Top Layer
The top layer in a network is where the neural network produces the probabilities that correspond to the labels provided in supervised learning. This network uses category labels; for the MNIST case these appear as an array of 10 floating-point values, all of which are zero except for the correct label category, which is 1.0.
```
CrossEntropy=CrossEntropyWithSoftmax(labels, Plus2)
```
Networks will often use the SoftMax function to obtain the probabilities for each label. The error between the actual label and the predicted probability is then computed using CrossEntropy. In CNTK these two actions can be combined into one function for efficiency. CrossEntropyWithSoftmax() takes the input, computes the SoftMax function, calculates the error from the actual value using CrossEntropy, and that error signal is used to update the parameters in the network via back propagation.
### Back Propagation
CNTK does not require you to specify anything additional for the back propagation portion of the network. For this example Stochastic Gradient Descent (SGD) is used as the learning algorithm. Each function in CNTK also has a derivative counterpart function and the system automatically does the back propagation update of the network parameters.
### Error Prediction
Predicted error rates are often computed during the training phase to validate that the system is improving as training progresses. This is handled in CNTK using the following function:
```
ErrPredict=ErrorPrediction(labels, Plus2)
```
The probabilities produced by the network are compared to the actual label and an error rate is computed; this is generally displayed by the system. Though useful, ErrorPrediction is not mandatory and can be left out of the network if desired.
### Defining special nodes
After defining the network, it's important to let CNTK know where the special nodes are in the network: for example, the input nodes (which are features and which are labels), the output nodes, evaluation nodes, and top-layer criteria nodes. CNTK supports multiple nodes for each type, so the values are arrays. The array syntax is comma-separated variable names surrounded by parentheses.
```
FeatureNodes=(features)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(Plus2)
```
## Macros
While creating a network using the syntax shown above is not all that difficult, it can get wordy when creating deep neural networks with many layers. To alleviate this problem, common definitions can be combined into macros. Macros can be defined as nested calls on a single line, or in a more function-like syntax, as seen in the following examples:
### Examples
Macro examples:
```
RFF(x1, w1, b1)=RectifiedLinear(Plus(Times(w1,x1),b1))
```
This one macro is equivalent to the Computation section shown earlier, but all on one line. The parameters used in macros are local to each macro.
```
FF(X1, W1, B1)
{
T=Times(W1,X1);
FF=Plus(T, B1);
}
```
This macro is a feed-forward computation without the energy function. It shows the alternate format of macros. Semicolons are optional, but can be used if desired. The variables and parameters used inside a macro are local to that macro. The return value of a macro is defined by a local variable that has the same name as the macro; in this case the FF() macro's return value will be the FF local variable. If no variable matches, the last variable defined in the macro is returned.
```
#Base Feed Forward network, includes Bias and weight parameters
BFF(in, rows, cols)
{
B=Parameter(rows)
W=Parameter(rows, cols)
BFF = FF(in, w, b)
}
```
This macro shows how parameters can be declared within a macro. It also shows the comment syntax: a \# as the first character of a line signifies a comment line. As in this example, a macro may call another macro; however, recursion is not supported.
```
RBFF(input,rowCount,colCount)
{
F = BFF(input, rowCount, colCount);
RBFF = RectifiedLinear(F);
}
```
This macro calls the previous macro, adding the RectifiedLinear() energy function to make a complete layer.
```
SMBFF(x,r,c, labels)
{
F = BFF(x,r,c);
SM = CrossEntropyWithSoftmax(labels, F)
}
```
This macro defines a full top layer, also using the BFF macro as in the other full-layer macro. In this case no variable matches the name of the macro, so the SM variable is used as the return value, since it's the last value defined in the macro.
### Using Macros
The following example uses the macros defined above:
```
# constants defined
# Sample, Hidden, and Label dimensions
SDim=784
HDim=256
LDim=10
features=Input(SDim)
labels=Input(LDim)
# Layer operations
L1 = RBFF(features, HDim, SDim)
CE = SMBFF(L1, LDim, HDim, labels)
Err=ErrorPrediction(labels, CE.F)
```
This shows the network definition equivalent to the original network, but using the above macros: much simpler to deal with and to understand. One new feature shown in this network definition is access to macro variables. ErrorPrediction() needs to access the feed-forward result before CrossEntropyWithSoftmax() is applied to it. The needed variable is local to the macro, but it can still be accessed via “dot” syntax. The return value of the macro was assigned to CE, so the local F variable defined in the macro can be accessed as CE.F. In the single-line form of macros there are no user-defined variable names, so this feature cannot be used.
## Optional Parameters
Optional parameters are a feature that allows additional parameters to be specified on functions. While optional parameters can be specified on any function or macro, they are limited to constant values, and the underlying function must support the passed optional parameters or there is no effect on the network. When used on a macro, the macro will have local variables defined that match the optional parameter names and values.
### Parameter initialization
One common use of these optional parameters is to define how parameters will be initialized:
```
B0=Parameter(HDim, init=zero)
W0=Parameter(HDim, SDim, init=uniform)
```
In this example the bias matrix will be zero-initialized, and the weight matrix will be initialized with uniform random numbers. Please consult the NDL Function reference to find which functions accept optional parameters.
### Tagging special values
As an alternative to providing an array of special nodes that are used as features, labels, criteria, etc., optional parameters can be used. So instead of:
```
FeatureNodes=(features)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(Plus2)
```
the network can be defined as:
```
# constants defined
# Sample, Hidden, and Label dimensions
SDim=784
HDim=256
LDim=10
features=Input(SDim, tag=feature)
labels=Input(LDim, tag=label)
# Layer operations
L1 = RBFF(features, HDim, SDim)
L2 = RBFF(L1, HDim, HDim)
L3 = RBFF(L2, HDim, HDim)
CE = SMBFF(L3, LDim, HDim, labels, tag=Criteria)
Err=ErrorPrediction(labels, CE.F, tag=Eval)
# rootNodes defined here
OutputNodes=(CE.F)
```
This avoids adding elements to the node arrays; instead, the tag optional parameter is set on the functions or macros that return the value fitting a specified category. In this case, since the output node is actually computed inside a macro, we must specify it explicitly.
## NDL Reference
### Variables
Variables are defined in NDL when they appear on the left of an equal sign (=). From that point on, the variable name is associated with the value it was assigned. Variables are immutable; assigning a new value to an existing variable is not supported.
Variable names may be any alphanumeric sequence that starts with a letter. The variables can contain a matrix or scalar value.
#### Reserved words
Any name that is also a function name is a reserved word and cannot be used for a variable. The special node names are also reserved and are as follows:
* `FeatureNodes`
* `LabelNodes`
* `CriteriaNodes`
* `EvalNodes`
* `OutputNodes`
These may not be used as variable names.
#### Dot names
When it is necessary to access a variable that is defined in a macro (see Macros below), it can be accessed using dot-names. If the following macro is called from code:
```
L1 = FF(features, HDim, SDim)
```
And the macro is defined as follows:
```
FF(X1, W1, B1)
{
T=Times(W1,X1);
FF=Plus(T, B1);
}
```
To access the result of the Times() function before the Plus() happens, use the following variable:
```
L1.T
```
This is the variable name used in the script, followed by a dot and the local name in the macro. This does require the user to know the names used in the macro, so having all macro definitions available is important. Since macros can be nested, dot names can be several layers deep if necessary.
### Functions
Functions are called using function call syntax similar to most programming languages:
```
Times1=Times(W0, features)
```
The function name is followed by parentheses containing the comma-separated parameter list. Each function returns a single value, which is identified by a variable.
### Macros
Macros are a combination of multiple Functions combined in a block. This can be done in a single-line nested fashion:
```
RFF(x1, w1, b1)=RectifiedLinear(Plus(Times(w1,x1),b1))
```
In this case the functions called will be evaluated from the innermost nested function call to the outermost.
The other method of defining macros uses a “programming block” style:
```
FF(X1, W1, B1)
{
T=Times(W1,X1);
FF=Plus(T, B1);
}
```
In this case the intermediate variables, which are local to the macro, can still be accessed from the outside using the dot syntax for variables.
### Optional Parameters
Many functions will have optional parameters that will change the behavior of the function. For example:
```
B0=Parameter(HDim, init=zero)
```
In this example the bias vector will be zero-initialized. The NDL Function reference specifies which optional parameters are accepted by each function.
#### Tags
Tags are a special case of optional parameters, and are discussed in the Special Nodes section.
### Special nodes
Special nodes need to be identified so that CNTK can automatically do back propagation updates of learnable parameters and identify inputs properly. These special nodes can be specified in two different ways: through the node arrays, or by use of special tags. If both methods are used, the values are combined.
#### Node Arrays
CNTK supports multiple nodes for each type, so all these values are arrays. However, in many cases there will be only a single node for each node type. The array syntax (parentheses) must be used when setting these special nodes, even if there is only one element. If more than one element is included, the entries are comma separated and surrounded by parentheses. For example:
```
FeatureNodes=(features)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(Plus2)
```
#### Tags
A special optional parameter is the “tag”. Tags can be used as a shortcut to identify special values in the network. For example, features and labels can be tagged as such when the inputs are defined, as follows:
```
F1=Input(SDim, tag=feature)
L1=Input(LDim, tag=label)
```
The acceptable tag names correspond to the special node types and are as follows:
Tag name | Meaning
---|---
feature | A feature input
label | A label input
criteria | criteria node, top level node
eval | evaluation node
output | output node
## NDL Functions
This section contains the currently implemented NDL functions. The CNTK is being expanded and additional functions will be available as development continues.
### Input, InputValue
Defines input data for the network. This defines the input that will be read from a datasource. The datasource information is specified in the configuration file separately, allowing the same network to be used with multiple datasets easily.
`Input(rows, [cols=1])`
`InputValue(rows, [cols=1])`
#### Parameters
`rows` – row dimension of the data.
`cols` – \[optional\] column dimension of the data. If this dimension is not specified, it is assumed to be 1.
#### Notes
Input nodes are normally tagged with their intended purpose so that CNTK can use the inputs appropriately. The following tags may be used as optional parameters to specify feature values and label values, respectively:
`tag=feature`
`tag=label`
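For example, a minimal sketch (assuming `SDim` and `LDim` are dimension variables defined elsewhere in the configuration):
```
features=Input(SDim, tag=feature)
labels=Input(LDim, tag=label)
```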
### ImageInput
Defines image input data for the network. This defines the input that will be read from a datasource. The datasource information is specified in the configuration file separately, allowing the same network to be used with multiple datasets easily.
`ImageInput(width, height, channels, [numImages=1])`
#### Parameters
`width` – width of the image data.
`height` – height of the image data.
`channels` – number of channels in the image data (i.e. RGB would have 3 channels)
`numImages` – \[optional\] number of images in each sample, defaults to 1
#### Notes
Each data element is expected to be in 32-bit (single) or 64-bit (double) floating point format. The order of the data, from least frequently changing to most frequently changing, is Image, Col, Row, Channel.
Input nodes are normally tagged with their intended purpose so that CNTK can use the inputs appropriately. The following tags may be used as optional parameters to specify feature values and label values, respectively:
`tag=feature`
`tag=label`
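For example, a minimal sketch for 32x32 RGB images (the CIFAR-like dimensions are purely illustrative):
```
features=ImageInput(32, 32, 3, tag=feature)
```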
### Parameter, LearnableParameter
Defines a parameter in the network that will be trained. Normally used for weight and bias matrices/vectors.
`Parameter(rows, \[cols\])`
`LearnableParameter(rows, \[cols\])`
#### Parameters
`rows` – number of rows in the parameter; this will normally be determined by the input size, a hidden weight/bias matrix size, or an output size.
`cols` – \[optional, defaults to 1\] number of columns in the parameter data. This is often left at the default value, to be determined by the minibatch size when processing the network.
#### Optional Parameters
`ComputeGradient=[true,false]` – Turns on (or off) automatic gradient calculation required for Stochastic Gradient Descent (SGD) training. Defaults to on.
`InitValueScale=number` – Initialization value scale for the parameter. Depending on the initialization technique, this number is used to determine the range of the random numbers used for initialization. Defaults to 0.05, producing random numbers in the range \(\lbrack -0.05, 0.05 \rbrack\).
`Init = [None, Zero, Uniform, Gaussian]` – Form of initialization for inputs
- None – No initialization is required; should only be used if the network will be initialized in some other way
- Zero – zero initialize the parameter matrix
- Uniform – Initializes the parameter matrix with random numbers based on the InitValueScale in the following range: \(\pm InitValueScale/\sqrt{\text{cols}}\)
- Gaussian – Initializes the parameter matrix with random numbers using a Gaussian distribution scaled to \(\pm 0.2 \cdot InitValueScale/\sqrt{\text{cols}}\)
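A minimal sketch of typical weight and bias declarations (the dimension names are illustrative):
```
W0=Parameter(HDim, SDim, init=uniform, initValueScale=1.0)
B0=Parameter(HDim, init=zero)
```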
### Sum
Calculate the sum of two matrices.
`Sum(add1, add2)`
#### Parameters
`add1`, `add2` – matrix values, must be the same dimensions.
#### Returns
`add1`+`add2`, the element-wise matrix sum of the parameters. The result of the sum is stored in the `add1` matrix (`add1 += add2`).
### Scale
Scale a matrix by a scalar value
`Scale(scaleFactor, matrix)`
#### Parameters
`scaleFactor` – floating point scalar scale factor
`matrix` – the matrix value to be scaled.
#### Returns
`scaleFactor * matrix`, the element-wise product of the scaleFactor and matrix
### Times
Calculate the product of two matrices.
`Times(mult1, mult2)`
#### Parameters
`mult1`, `mult2` – matrix values; `mult1.cols` must equal `mult2.rows`.
#### Returns
`mult1 * mult2`, the matrix product of the parameters
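For example, a minimal sketch (names illustrative) showing how the dimensions compose:
```
# W0 is [HDim x SDim] and features is [SDim x 1], so Times1 is [HDim x 1]
Times1=Times(W0, features)
```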
### Plus
Calculate the sum of two matrices.
`Plus(add1, add2)`
#### Parameters
`add1`, `add2` – matrix values, must be the same dimensions.
#### Returns
`add1+add2`, the element-wise matrix sum of the parameters
### Minus
Calculate the difference of two matrices.
`Minus(sub1, sub2)`
#### Parameters
`sub1`, `sub2` – matrix values, must be the same dimensions.
#### Returns
`sub1 - sub2`, the element-wise matrix difference of the parameters
### Negate
Negate the matrix.
`Negate(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`-(matrix)`, the element-wise negation of all elements of the matrix
### RectifiedLinear
Compute the RectifiedLinear operation on the matrix.
`RectifiedLinear(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`RectifiedLinear(matrix)`, the element-wise rectified linear operation of all elements of the matrix
### Sigmoid
Compute the Sigmoid of the matrix.
`Sigmoid(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`1 / (1 + e^(-matrix))`, the element-wise sigmoid of all elements of the matrix
### Tanh
Compute the Hyperbolic Tangent of the matrix elements.
`Tanh(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`tanh(matrix)` the element-wise hyperbolic tangent of all elements of the matrix
### Log
Compute the natural logarithm of the matrix elements.
`Log(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`log(matrix)`, the element-wise logarithm of all elements of the matrix
### Softmax
Compute the Softmax of the matrix.
`Softmax(matrix)`
#### Parameters
`matrix` – matrix value.
#### Returns
`softmax(matrix)` the softmax of the matrix
### SquareError
Compute the SquareError of the matrix.
`SquareError(m1, m2)`
#### Parameters
`m1` – first matrix to compare.
`m2` - second matrix to compare
#### Returns
The square error between the two matrices, returned in a 1x1 matrix.
### CrossEntropyWithSoftmax, CEWithSM
Compute the Softmax of the matrix, compare against the ground truth labels and compute the CrossEntropy error matrix.
`CrossEntropyWithSoftmax(labels, matrix)`
`CEWithSM(labels, matrix)`
#### Parameters
`labels` – the ground truth labels
`matrix` – matrix value.
#### Returns
the CrossEntropy error matrix
#### Notes
This node will often be tagged as a “Criteria” node to allow CNTK to identify the node producing the error matrix. To tag the appropriate node(s), the following optional parameter should be added to the call(s):
`tag=Criteria`
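For example, a minimal sketch (node names illustrative, matching the node-array example earlier in this document):
```
CrossEntropy=CrossEntropyWithSoftmax(labels, Plus2, tag=Criteria)
```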
### MatrixL1Reg, L1Reg
Compute the sum of the absolute value of the entire matrix.
`MatrixL1Reg(matrix)`
`L1Reg(matrix)`
#### Parameters
`matrix` – matrix to use in computation
#### Returns
the sum of the absolute value of the matrix elements, returned in a 1x1 matrix
### MatrixL2Reg, L2Reg
Compute the FrobeniusNorm of the matrix.
`MatrixL2Reg(matrix)`
`L2Reg(matrix)`
#### Parameters
`matrix` – matrix to compute the FrobeniusNorm on.
#### Returns
The FrobeniusNorm of the matrix, returned in a 1x1 matrix
### PerDimMeanVarNormalization, PerDimMVNorm
Compute the mean-variance normalized matrix.
`PerDimMeanVarNormalization(matrix, mean, invStdDev)`
`PerDimMVNorm(matrix, mean, invStdDev)`
#### Parameters
`matrix` – matrix that needs to be normalized
`mean` – the mean for each sample index (same row dimensions as “matrix”)
`invStdDev` – 1/stddev for each sample index. (same row dimensions as “matrix”)
#### Returns
The mean variance normalized matrix
#### Notes
This function requires the Mean and InvStdDev to be already computed. They can either be loaded from a dataset, or computed in a pre-pass, before normalization is required.
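A minimal sketch of the usual pattern (this assumes an `InvStdDev` pre-pass node analogous to `Mean`; names are illustrative):
```
featMean=Mean(features)
featInvStd=InvStdDev(features)
featNorm=PerDimMeanVarNormalization(features, featMean, featInvStd)
```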
### ErrorPrediction
Evaluate the accuracy of the current predictions made by the model. This is generally used to compute the training accuracy of a model. It finds the highest predicted probability from the model and compares it to the actual ground truth.
`ErrorPrediction(labels, matrix)`
#### Parameters
`labels` – the ground truth labels
`matrix` – matrix value.
#### Returns
The number of predicted values that do not match the labels in the current minibatch, returned in a 1x1 matrix.
#### Notes
This node will often be tagged as an “Eval” node to allow CNTK to print ongoing error statistics during training. To tag the appropriate node(s), the following optional parameter should be added to the call(s):
`tag=Eval`
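For example, a minimal sketch (node names illustrative):
```
ErrPredict=ErrorPrediction(labels, Plus2, tag=Eval)
```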
### Dropout
Compute a new matrix with *dropoutRate* percent of the elements set to zero. The values that are set to zero are chosen at random. This is commonly used to prevent overfitting during the training process.
`Dropout(matrix)`
#### Parameters
`matrix` – source matrix
#### Returns
a new matrix with *dropoutRate* percent of the elements set to zero (dropped out).
#### Optional Parameters
`dropoutRate` – the fraction (a decimal between 0.0 and 1.0) of values that will be dropped on each iteration.
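For example, a minimal sketch (names illustrative) that drops half of the activations of a hidden layer:
```
H1Drop=Dropout(H1, dropoutRate=0.5)
```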
### Mean
Compute the per-dimension mean matrix over the entire dataset.
`Mean(matrix)`
#### Parameters
`matrix` – source matrix
#### Returns
\(\text{mean}(i) = \frac{1}{n}\sum_{j=1}^{n} \text{matrix}(i,j)\)
where \(n\) is the size of the entire dataset.
#### Notes
This is a pre-pass function: it is only evaluated during a pre-pass through the entire dataset, before the first training pass. This allows the mean to be computed before it is required for mean-variance normalization.
### Convolution, Convolve
Compute the convolution of an image input
`Convolution(cvweight, features, kernelWidth, kernelHeight, outputChannels, horizontalSubsample, verticalSubsample, zeroPadding=false)`
#### Parameters
`cvweight` – convolution weight matrix; it has the dimensions of \[outputChannels, kernelWidth \* kernelHeight \* inputChannels\]
`features` – the image input to convolve (must originate from an `ImageInput()`; see Notes)
`kernelWidth` – width of the kernel
`kernelHeight` – height of the kernel
`outputChannels` – number of output channels
`horizontalSubsample` – subsamples in the horizontal direction
`verticalSubsample` – subsamples in the vertical direction
#### Optional Parameters
`zeroPadding` – \[default = false\] should the sides of the image be padded with zeros?
`maxTempMemSizeInSamples` – \[default=0\] maximum amount of memory (in samples) that should be reserved as temporary space
#### Returns
The convolved matrix according to the parameters passed
#### Notes
The input to this node must be an ImageInput(). This node automatically determines image size on input and output based on the size of the original input and which nodes the input has passed through. This function is often followed by another Convolution() or a MaxPooling() or AveragePooling() node.
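A minimal sketch of a convolutional layer (the dimensions and names are illustrative, loosely CIFAR-like):
```
# 5x5 kernels over 3 input channels producing 16 output channels, stride 1
cvW1=Parameter(16, 75, init=uniform)   # 75 = 5 * 5 * 3 = kernelWidth * kernelHeight * inputChannels
conv1=Convolution(cvW1, features, 5, 5, 16, 1, 1, zeroPadding=true)
pool1=MaxPooling(conv1, 3, 3, 2, 2)
```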
### MaxPooling
Computes a new matrix by selecting the maximum value in the pooling window. This is used to reduce the dimensions of a matrix.
`MaxPooling(matrix, windowWidth, windowHeight, stepW, stepH)`
#### Parameters
`matrix` – input matrix
`windowWidth` – width of the pooling window
`windowHeight` – height of the pooling window
`stepW` – step used in the width direction
`stepH` – step used in the height direction
#### Returns
The dimension-reduced matrix consisting of the maximum value within each pooling window.
#### Notes
This function is often associated with Convolution() operations.
### AveragePooling
Computes a new matrix by selecting the average value in the pooling window. This is used to reduce the dimensions of a matrix.
`AveragePooling(matrix, windowWidth, windowHeight, stepW, stepH)`
#### Parameters
`matrix` – input matrix
`windowWidth` – width of the pooling window
`windowHeight` – height of the pooling window
`stepW` – step used in the width direction
`stepH` – step used in the height direction
#### Returns
The dimension-reduced matrix consisting of the average value within each pooling window.
#### Notes
This function is often associated with Convolution() operations.
### Delay
Delay node used in recurrent networks; it allows creation of a loop in the computational network that will repeat a specified number of times.
`Delay(rows, [cols], delayNode, delayTime=1, needGradient=true, defaultHiddenActivity=0.1)`
#### Parameters
`rows` – row dimension of the delayed value
`cols` – \[optional\] column dimension of the delayed value
`delayNode` – the node whose value will be delayed, i.e. the source of the loop
#### Optional Parameters
`delayTime` – \[default = 1\] the amount of delay that will be introduced (number of times the loop will happen)
`needGradient` – \[default = true\] does the gradient need to be computed for this node
`defaultHiddenActivity` – \[default = 0.1\] the numerical amount for the defaultHiddenActivity
#### Returns
The results of the completed Delay loop
#### Notes
This node is used in recurrent networks, where a delay is introduced to examine values from a previous time step, such as the prior value (t-1). This has the effect of creating a loop in the computational network that will repeat delayTime iterations.
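A minimal sketch of a simple recurrent layer (names illustrative; it relies on Delay permitting a reference to a node defined later, which is how recurrent loops are expressed):
```
pastH=Delay(HDim, 1, H1, delayTime=1)
H1=Sigmoid(Plus(Times(W0, features), Times(WR, pastH)))
```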
```diff
@@ -7,9 +7,9 @@ import struct
 import numpy as np
 
 def loadData(src, cimg):
-    print 'Downloading ' + src
+    print ('Downloading ' + src)
     gzfname, h = urllib.urlretrieve(src, './delete.me')
-    print 'Done.'
+    print ('Done.')
     try:
         with gzip.open(gzfname) as gz:
             n = struct.unpack('I', gz.read(4))
```
```diff
@@ -38,33 +38,33 @@ def readBatch(src, outFmt):
     return np.hstack((np.reshape(d['labels'], (len(d['labels']), 1)), feat))
 
 def loadData(src, outFmt):
-    print 'Downloading ' + src
+    print ('Downloading ' + src)
     fname, h = urllib.urlretrieve(src, './delete.me')
-    print 'Done.'
+    print ('Done.')
     try:
-        print 'Extracting files...'
+        print ('Extracting files...')
         with tarfile.open(fname) as tar:
             tar.extractall()
-        print 'Done.'
-        print 'Preparing train set...'
+        print ('Done.')
+        print ('Preparing train set...')
         trn = np.empty((0, NumFeat + 1))
         for i in range(5):
             batchName = './cifar-10-batches-py/data_batch_{0}'.format(i + 1)
             trn = np.vstack((trn, readBatch(batchName, outFmt)))
-        print 'Done.'
-        print 'Preparing test set...'
+        print ('Done.')
+        print ('Preparing test set...')
         tst = readBatch('./cifar-10-batches-py/test_batch', outFmt)
-        print 'Done.'
+        print ('Done.')
     finally:
         os.remove(fname)
     return (trn, tst)
 
 def usage():
-    print 'Usage: CIFAR_convert.py [-f <format>] \n where format can be either cudnn or legacy. Default is cudnn.'
+    print ('Usage: CIFAR_convert.py [-f <format>] \n where format can be either cudnn or legacy. Default is cudnn.')
 
 def parseCmdOpt(argv):
     if len(argv) == 0:
-        print "Using cudnn output format."
+        print ("Using cudnn output format.")
         return "cudnn"
     try:
         opts, args = getopt.getopt(argv, 'hf:', ['help', 'outFormat='])
@@ -78,7 +78,7 @@ def parseCmdOpt(argv):
         elif opt in ('-f', '--outFormat'):
             fmt = arg
     if fmt != 'cudnn' and fmt != 'legacy':
-        print 'Invalid output format option.'
+        print ('Invalid output format option.')
         usage()
         sys.exit(1)
     return fmt
@@ -86,9 +86,9 @@ def parseCmdOpt(argv):
 if __name__ == "__main__":
     fmt = parseCmdOpt(sys.argv[1:])
     trn, tst = loadData('http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz', fmt)
-    print 'Writing train text file...'
+    print ('Writing train text file...')
     np.savetxt(r'./Train.txt', trn, fmt = '%u', delimiter='\t')
-    print 'Done.'
-    print 'Writing test text file...'
+    print ('Done.')
+    print ('Writing test text file...')
     np.savetxt(r'./Test.txt', tst, fmt = '%u', delimiter='\t')
-    print 'Done.'
+    print ('Done.')
```