CNTK configuration language redesign (ongoing work)
====================================================

F. Seide, August 2015
These are the original notes from before coding began. The basic ideas are still correct, but some details may be outdated.
- the config specifies all configurable runtime objects and their initialization parameters
- basic concepts: dictionaries and runtime-object definitions
- basic syntactic elements (a short combined example follows this list):
   - runtime-object definition             // new classname initargsdictionary
   - macro definition                      // M(x,y,z) = expression // expression uses x, y, and z
   - expressions
      - dictionaries                       // [ a=expr1 ; c=expr2 ]
      - math ops and parentheses as usual  // W*v+a, n==0
      - conditional expression             // if c then a else b
      - arrays                             // a:b:c ; array [1..N] (i => f(i))
- the syntax supports the usual math and boolean expressions
- functions are runtime objects defined through macros, e.g. Replace(s,what,withwhat) = String [ from=s ; replacing=what ; with=withwhat ]
- the config is parsed eagerly but evaluated lazily
- the CNTK command line "configFile=conf.bs a=b c=d" expands to "new CNTK { content of conf.bs } + [ a=b ; c=d ]"
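
A combined example of these elements (class and member names here are invented for illustration, not part of the spec):

    sqr(x) = x * x                                    // macro definition
    dims = 512 : 512 : 183                            // array built with the ':' operator
    useLargeModel = true
    hiddenDim = if useLargeModel then 1024 else 512   // conditional expression
    params = [ numiter = sqr(4) ; layerSizes = dims ] // dictionary
    opt = new SGDOptimizer params                     // runtime-object definition taking a dictionary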

current issues
--------------

- the syntax does not distinguish between dictionary members, intermediate variables, and actual parameter names
- dictionary editing needs to allow a.b.c syntax; and subtracting is not pretty, as it needs dummy values -> maybe use a delete symbol, as in a=delete? (see the sketch after this list)
- missing: optional parameters to macros; and how this whole thing would work with MEL
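
For illustration, removing an item currently requires a dummy value (per the dictionary semantics described below):

    trimmed = dict1 - [ numiter = 0 ]   // removes 'numiter' from dict1; the 0 is a dummy and is ignored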

grammar
-------

// --- the top level defines a runtime object of class 'CNTK'
// example: new CNTK [ actions=train ; train=TrainAction [ ... ] ]   // where "new CNTK [" is prepended by the command-line parser

$ = $dictitems   // a dictionary without enclosing [ ... ] that defines the instantiation args of the CNTK class

// --- defining a runtime object and its parameters
// example: new ComputeNode [ class="Plus" ; arg1=A ; arg2=B ]

$newinstance = 'new' $classname $expr
                     where $expr must be a dictionary expression
$classname = $identifier
                     where $identifier is one of the known pre-defined C++ class names

// --- dictionaries are groups of key-value pairs. Dictionaries are expressions.
// Multiple dictionaries can be edited (dict1 + dict2), where dict2 members override dict1 members of the same name.
// examples: [ arg1=A ; arg2=B ]
//           dict1 + (if (dpt && layer < totallayers) then [ numiter = 5 ] else [])   // overrides 'numiter' in 'dict1' if the condition is fulfilled

$dictdef = '[' $dictitems ']'
$dictitems = $itemdef*

$itemdef = $paramdef    // var=val
         | $macrodef    // macro(args)=expression

$paramdef = $identifier '=' $expr                            // e.g. numiter = 13
$macrodef = $identifier '(' $arg (',' $arg)* ')' '=' $expr   // e.g. sqr(x) = x*x

// --- expressions
// Expressions are what you'd expect. Infix operators are those of C, with the addition of '.*' '**' ':' '..'
// ML-style "let ... in" (expression-local variables) is possible but not super-pretty: [ a=13; b=42; res=a*b ].res
// There are infix ops for strings (concatenation) and dictionaries (editing).

$expr = $operand
      | $expr $infixop $operand
      | $expr '.' $memberref               // dict.member   TODO: fix this; memberrefs exist without '.'
                     where $expr is a dictionary
      | $expr '(' $expr (',' $expr)* ')'   // a(13); also: dict.a(13); note: partial application is possible, i.e. macros may be passed as args and curried (see the sketch below)
                     where $expr is a macro
      | $expr '[' $expr ']'                // h_fwd[t]
                     where the first $expr must be an array and the second $expr a number (which must have an integer value)
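
To illustrate the partial-application note: a hypothetical sketch (names invented; assuming currying means supplying fewer arguments than declared):

    Scale(a, x) = a * x
    Double = Scale(2)       // partial application: 'Double' behaves as a one-argument macro
    y = Double(21)          // 42
    Apply(f, v) = f(v)      // macros may also be passed as arguments
    z = Apply(Double, 21)   // also 42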

$infixop =        // highest precedence level
      '*'         // numbers; also magic short-hand for "Times" and "Scale" ComputeNodes
    | '/'         // numbers; Scale ComputeNode
    | '.*'        // ComputeNodes: component-wise product
    | '**'        // numbers (exponentiation, FORTRAN style!)
    | '%'         // numbers: remainder
                  // next lower precedence level
    | '+'         // numbers; ComputeNodes; strings; dictionary editing
    | '-'         // numbers; ComputeNodes; dictionary editing
                  // next lower precedence level
    | '==' '!=' '<' '>' '<=' '>='   // applies to config items only; objects other than boxed primitive values are compared by object identity, not content
                  // next lower precedence level
    | '&&'        // booleans
                  // next lower precedence level
    | '||' | '^'  // booleans
                  // next lower precedence level
    | ':'         // concatenates items and/or arrays --TODO: can arrays have nested arrays? Syntax?

$operand = $literal                      // "Hello World"
         | $memberref                    // a
         | $dictdef                      // [ a="Hello World" ]
         | $newinstance                  // new ComputeNode [ ... ]
         | ('-' | '+' | '!') $operand    // -X+Y
         | '(' $expr ')'                 // (a==b) || (c==d)
         | $arrayconstructor             // array [1..N] (i => i*i)

$literal = $number   // built-in literal types are numeric, string, and boolean
         | $string
         | $boolconst
$number =    // floating-point number; there is no separate 'int' type, 'int' args are checked at runtime to be non-fractional
$string =    // characters enclosed in "" or ''; no escape characters inside, use combinations of "", '', and + instead (TODO: do we need string interpolation?).
             // Strings may span multiple lines (i.e. contain newlines).
$boolconst = 'true' | 'false'

$memberref = $identifier   // will search parent scopes

$arrayconstructor = 'array' '[' $expr '..' $expr ']' '(' $identifier '=>' $expr ')'   // array [1..N] (i => i*i)
                     where the two bracketed $expr are the start and end indices (int), $identifier is the index variable, and the final $expr is a function of the index variable
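
Since the config is evaluated lazily, the element expression may reference other elements of the same array; this is the pattern the layered-network examples below rely on. A small sketch:

    factorials = array [1..10] (i => if i == 1 then 1 else i * factorials[i-1])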

// --- predefined functions
// *All* functions are defined as macros that instantiate a runtime object. (The same is actually true of the operators above, too.)

// functions that really are macros that instantiate ComputeNodes:
//  - Times(,), Plus(,), Sigmoid(), etc.
// numeric functions:
//  - Floor() (for int division), Ceil(), Round() (for rounding), Abs(), Sign(), ...
// string functions:
//  - Replace(s,what,withwhat), Str(number) (number to string), Chr(number) (convert a Unicode codepoint to a string), Format(fmt,val) (sprintf-like formatting with one arg)
// other:
//  - Fail("error description")   --will throw an exception when executed; use this like an assertion
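
For example, Fail() can act as an assertion inside an expression (helper name invented for this sketch):

    CheckedDim(d) = if d > 0 then d else Fail("dimension must be positive")
    layerDim = CheckedDim(512)   // 512; a non-positive argument would throw when evaluated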

dictionaries
------------

- dictionaries are key-value pairs; they are records or compound data structures for use inside the config file itself
- dictionaries are immutable and exist inside the parser; they are not serialized to disk with a model --TODO: it might be necessary to do that for MEL
- the argument to a runtime-object instantiation is also a dictionary
- the config file can access that dictionary's members directly from the runtime-object expression, for convenience
- intermediate variables that are only used to construct dictionary entries also become dictionary entries (no syntactic distinction) --TODO: should we distinguish them?
- macros are also dictionary members
- dictionary values are read out using dict.field syntax, where 'dict' is any expression that evaluates to a dictionary
- object instantiations will also traverse outer scopes to find values (e.g. precision, which is shared by many objects)
- runtime objects themselves are inputs to other runtime objects, but they cannot have data members that output values;
  instead, output arguments use a proxy class ComputeNodeRef that can be used as a ComputeNode input and gets filled in at runtime
- dictionaries can be "edited" by "adding" (+) a second dictionary; items from the second overwrite identically named items in the first.
  Subtracting a dictionary removes all items of the second dict from the first.
  This is used to allow overriding variables on the command line (see the sketch after this list). --TODO: not fully fleshed out how to access nested inner variables inside a dict
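
A minimal sketch of editing and outer-scope lookup (member names invented for illustration):

    precision = 'float'                     // instantiations inside 'run' would find this via scope traversal
    defaults = [ traceLevel = 0 ; numiter = 5 ]
    run = defaults + [ traceLevel = 1 ]     // run.traceLevel == 1 ; run.numiter == 5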

arrays
------

- another core data type is the array. Like dictionaries, arrays are immutable and exist inside the parser only.
- arrays are created in one of two ways
   - the 'array' expression:
        array [1..N] (i => f(i))   // fake lambda syntax that could be made a real lambda; could also be extended to multi-dim arrays
   - the ':' operator, which concatenates arrays and/or elements. Arrays are flattened (see the sketch after this list).
        1:2:3
- elements are read-accessed with the index operator
     X[i]
- example syntax of how one could define useful operators for arrays:
   - Append(seq,item) = seq : item
   - Repeat(item,N) = array [1..N] (i => item)
- arrays with repetition can be created like this:
     0.8 : array [1..3] (i => 0.2) : 0.05
  or
     0.8 : Repeat(0.2,3) : 0.05
- the array [] () argument looks like a C# lambda, but for now it is hard-coded syntax (with the potential to become a true lambda in the future)
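
Since ':' flattens, concatenations compose regardless of grouping; a small sketch using Repeat from above:

    a = 1 : 2
    b = a : 3 : Repeat(4,2)   // the flat array 1:2:3:4:4, not a nested array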

towards MEL
-----------

Model editing is now done in a functional manner, like this:

    TIMIT_AddLayer = new EditAction [

        currModelPath = "ExpDir\TrainWithPreTrain\dptmodel1\cntkSpeech.dnn"
        newModelPath = "ExpDir\TrainWithPreTrain\dptmodel2\cntkSpeech.dnn.0"

        model = LoadModel(currModelPath)
        newModel = EditModel(model, [
            // new items here
            outZ = SBFF(model.outZ.INPUT, LABELDIM, outZ.INPUT.OUTDIM)
        ])

        do = ( Dump(newModel, newModelPath + ".dump.txt")
             : SaveModel(newModel, newModelPath) )
    ]

sample
------

// This sample is a modification of the original TIMIT_TrainSimpleNetwork.config and TIMIT_TrainNDLNetwork.config.
// The changes compared to the original syntax are called out in comments.

stderr = ExpDir + "\TrainSimpleNetwork\log\log"   // before: $ExpDir$\TrainSimpleNetwork\log\log
actions = TIMIT_TrainSimple                       // before: command = ... ('command' is singular, but this can be a sequence of actions)

// these values are used by several runtime-object instantiations below
precision = 'float'       // before: precision = float
deviceId = DeviceNumber   // before: $DeviceNumber$

#######################################
# TRAINING CONFIG (Simple, Fixed LR)  #
#######################################

Repeat(val,count) = array [1..count] (i => val)   // new: array helper to repeat a value (result is an array); this would eventually be defined in a library

TIMIT_TrainSimple = new TrainAction [   // new: added TrainAction; this is the class name of the underlying runtime object
    // new: TrainAction takes three main parameters: 'source' -> 'model' -> 'optimizer' ('->' indicating logical dependency)
    //action = train   // removed (covered by the class name)
    traceLevel = 1

    // new: Model object; some parameters were moved into this
    model = new Model [   // this is an input to TrainAction
        modelPath = ExpDir + "\TrainSimpleNetwork\model\cntkSpeech.dnn"   // before: $ExpDir$\TrainSimpleNetwork\model\cntkSpeech.dnn

        // EXAMPLE 1: SimpleNetworkBuilder --TODO: do we even need a C++ class, or can we use a macro instead? Would make life easier re connecting inputs
        network = new SimpleNetworkBuilder [   // before: SimpleNetworkBuilder = [
            layerSizes = 792 : Repeat(512,3) : 183   // before: 792:512*3:183
            layerTypes = 'Sigmoid'                   // before: no quotes
            initValueScale = 1.0
            applyMeanVarNorm = true
            uniformInit = true
            needPrior = true
            // the following two belong into SGD, so they were removed here
            //trainingCriterion = CrossEntropyWithSoftmax
            //evalCriterion = ErrorPrediction
            // new: connect to the input stream from the source; and expose the output layer
            input = source.features.data   // these are also ComputeNodeRefs, exposed by the source
            output = new ComputeNodeRef [ dim = source.labels.dim ]   // SimpleNetworkBuilder will put the top layer's affine-transform output (the input to softmax) here
            // criteria are configurable here; these are ComputeNodes created here
            trainingCriterion = CrossEntropyWithSoftmax (source.labels.data, output)
            evalCriterion = ErrorPrediction (source.labels.data, output)
            // new (and half-baked): define Input nodes
            myFeatures = Input(featDim)   // the reader stream will reference this
            myLabels = Input(labelDim)
        ]

        // EXAMPLE 2: network from NDL (an actual config would contain only one of these two examples)
        network = new NDL [   // before: run=ndlCreateNetwork ; ndlCreateNetwork=[
            featDim = myFeatures.dim   // before: 792 hard-coded; note: myFeatures and myLabels are defined below
            labelDim = myLabels.dim    // before: 183 hard-coded
            hiddenDim = 512

            // input nodes
            myFeatures = Input(featDim)   // before: optional arg tag=feature
            myLabels = Input(labelDim)    // before: optional arg tag=label

            // old:
            //# define network
            //featNorm = MeanVarNorm(myFeatures)
            //L1 = SBFF(featNorm,hiddenDim,featDim)
            //L2 = SBFF(L1,hiddenDim,hiddenDim)
            //L3 = SBFF(L2,hiddenDim,hiddenDim)
            //CE = SMBFF(L3,labelDim,hiddenDim,myLabels,tag=Criteria)
            //Err = ErrorPrediction(myLabels,CE.BFF.FF.P,tag=Eval)
            //logPrior = LogPrior(myLabels)
            //ScaledLogLikelihood=Minus(CE.BFF.FF.P,logPrior,tag=Output)

            // new:
            // Let's have the macros declared here for illustration (in the end, these would live in a library)
            FF(X1, W1, B1) = W1 * X1 + B1      // before: T=Times(W1,X1) ; P=Plus(T, B1)
            BFF(in, rows, cols) = [            // before: BFF(in, rows, cols) { ... }
                B = Parameter(rows, init = fixedvalue, value = 0)
                W = Parameter(rows, cols)
                z = FF(in, W, B)               // before: FF = ...; illegal now, cannot use the same name again
            ]
            SBFF(in, rowCount, colCount) = [   // before: SBFF(in,rowCount,colCount) { ... }
                z = BFF(in, rowCount, colCount).z   // before: BFF = BFF(in, rowCount, colCount)
                Eh = Sigmoid(z)
            ]
            // Macros are expressions. FF returns a ComputeNode, while BFF and SBFF return a dictionary that contains multiple named ComputeNodes.

            // new: define the network in a loop. This allows parameterizing over the network depth.
            featNorm = MeanVarNorm(myFeatures)   // (carried over from the old definition above)
            numLayers = 7
            layers = array [0..numLayers] ( layer =>
                if layer == 0 then featNorm
                else if layer == 1 then SBFF(layers[layer-1], hiddenDim, featDim)   // layers[0] is featNorm itself, hence no .Eh
                else if layer < numLayers then SBFF(layers[layer-1].Eh, hiddenDim, hiddenDim)
                else BFF(layers[layer-1].Eh, labelDim, hiddenDim)
            )
            outZ = layers[numLayers].z   // new: to access the output value, the variable name (dictionary member) cannot be omitted

            // alternative to the above: define the network with recursion
            HiddenStack(layer) = if layer > 1 then SBFF(HiddenStack(layer-1).Eh, hiddenDim, hiddenDim) else SBFF(featNorm, hiddenDim, featDim)
            outZ = BFF(HiddenStack(numLayers).Eh, labelDim, hiddenDim).z

            // define criterion nodes
            CE = CrossEntropyWithSoftmax(myLabels, outZ)
            Err = ErrorPrediction(myLabels, outZ)

            // define the output node for decoding
            logPrior = LogPrior(myLabels)
            ScaledLogLikelihood = outZ - logPrior   // before: Minus(CE.BFF.FF.P,logPrior,tag=Output)
        ]
    ]

    // the SGD optimizer
    optimizer = new SGDOptimizer [   // before: SGD = [
        epochSize = 0
        minibatchSize = 256 : 1024
        learningRatesPerMB = 0.8 : Repeat(3.2,14) : 0.08   // (syntax change for repetition)
        momentumPerMB = 0.9
        dropoutRate = 0.0
        maxEpochs = 25
        // new: link to the criterion node
        trainingCriterion = model.network.CE   // (note: I would like to rename this to 'objective')
    ]

    // The RandomizingSource performs randomization and mini-batching, while driving low-level random-access readers.
    source = new RandomizingSource [   // before: reader = [
        //readerType = HTKMLFReader   // removed, since covered by the class name

        // new: define which utterances to get from which stream sources
        dataSetFile = ScpDir + "\TIMIT.train.scp.fbank.fullpath"   // (new) defines the set of utterances to train on; accepts HTK archives
        streams = ( [   // This passes the 'features' and 'labels' runtime objects to the source, and also connects them to the model's Input nodes.
                reader = features                  // random-access reader
                input = model.network.myFeatures   // Input node that this feeds into
            ]
            : [
                reader = labels
                input = model.network.myLabels
            ] )   // note: ':' is array syntax; the parentheses are only for readability

        readMethod = 'blockRandomize'   // before: no quotes
        miniBatchMode = 'Partial'       // before: no quotes
        randomize = 'Auto'              // before: no quotes
        verbosity = 1

        // change: The following two are not accessed directly by the source, but indirectly through the 'streams' argument.
        // They could also be defined outside of this dictionary. They come from the NDL, though.
        // The 'RandomizingSource' does not know about features and labels specifically.
        features = new HTKFeatReader [   // before: features = [
            //dim = 792   // (moved to the 'data' node)
            scpFile = dataSetFile   // the HTK reader can share the source's archive file that defines the data set
            data = new ComputeNodeRef [ dim = 792 ]   // an input node the model can connect to; the dimension is verified when the files are opened
        ]

        labels = new HTKMLFReader [   // before: labels = [
            mlfFile = MlfDir + "\TIMIT.train.align_cistate.mlf.cntk"   // before: $MlfDir$\TIMIT.train.align_cistate.mlf.cntk
            //labelDim = 183   // (moved to the 'data' node)
            labelMappingFile = MlfDir + "\TIMIT.statelist"   // before: $MlfDir$\TIMIT.statelist
            data = new ComputeNodeRef [ dim = 183 ]   // an input node the model can connect to; the dimension is verified when reading the statelist file
        ]
    ]
]

Example 2: truncated bidirectional RNN
--------------------------------------

network = new NDL [
    // network parameters
    hiddenDim = 512
    numHiddenLayers = 6   // 6 hidden layers
    T = 41                // total context window

    // data sources
    myFeatures = source.features.data
    myLabels = source.labels.data

    // derived dimensions
    augmentedFeatDim = myFeatures.dim   // feature arrays are context-window frames stacked into a single long array
    labelDim = myLabels.dim

    centerT = Floor(T/2)                    // center frame to predict
    featDim = Floor(augmentedFeatDim / T)

    // split the augmented input vector into individual frame vectors
    subframes = array [0..T-1] (t => RowSlice(t * featDim, featDim, myFeatures))

    // hidden layers
    // The hidden-state vectors for all frames are stored in an array object.
    layers = array [1..numHiddenLayers] (layer => [   // each layer stores a dictionary holding its output hidden fwd and bwd state vectors
        // model parameters
        W_fwd = if layer > 1 then Parameter(hiddenDim, hiddenDim) else Parameter(hiddenDim, featDim)   // Parameter(outdim, indim); the input dimension differs for the first layer
        W_bwd = if layer > 1 then Parameter(hiddenDim, hiddenDim) else Fail("no W_bwd")                // W denotes input-to-hidden connections
        H_fwd = Parameter(hiddenDim, hiddenDim)   // H denotes hidden-to-hidden lateral connections
        H_bwd = Parameter(hiddenDim, hiddenDim)
        b = Parameter(hiddenDim, 1)               // bias
        // shared part of the activations (input connections and bias)
        z_shared = array [0..T-1] (t => if layer > 1 then W_fwd * layers[layer-1].h_fwd[t] + W_bwd * layers[layer-1].h_bwd[t] + b   // intermediate layers get fed fwd and bwd hidden state
                                        else W_fwd * subframes[t] + b)                                                              // the input layer reads frames directly
        // recurrent part and non-linearity
        neededT = if layer < numHiddenLayers then T else centerT+1   // the last hidden layer does not require all frames
        step(H,h,dt,t) = Sigmoid(if (t+dt >= 0 && t+dt < T) then z_shared[t] + H * h[t+dt]
                                 else z_shared[t])
        h_fwd = array [0..neededT-1]   (t => step(H_fwd, h_fwd, -1, t))
        h_bwd = array [T-neededT..T-1] (t => step(H_bwd, h_bwd,  1, t))
    ])

    // output layer --linear only at this point; Softmax is applied later
    outZ = [
        // model parameters
        W_fwd = Parameter(labelDim, hiddenDim)
        W_bwd = Parameter(labelDim, hiddenDim)
        b = Parameter(labelDim, 1)
        // output
        topHiddenLayer = layers[numHiddenLayers]
        z = W_fwd * topHiddenLayer.h_fwd[centerT] + W_bwd * topHiddenLayer.h_bwd[centerT] + b
    ].z   // we only want this one value & don't care about the rest of this dictionary

    // define criterion nodes
    CE = CrossEntropyWithSoftmax(myLabels, outZ)
    Err = ErrorPrediction(myLabels, outZ)

    // define the output node for decoding
    logPrior = LogPrior(myLabels)
    ScaledLogLikelihood = outZ - logPrior   // before: Minus(CE.BFF.FF.P,logPrior,tag=Output)
]