Merge branch 'master' into ebarsoum/ImageHandsOn

To check the updated notebook.
2016-10-26 17:20:30 -07:00 · 2016-10-26 17:20:30 -07:00 · 2afe81948e
--- a/Examples/Image/Detection/FastRCNN/README.md
+++ b/Examples/Image/Detection/FastRCNN/README.md
@ -36,6 +36,10 @@ Currently, CNTK only supports `Python 3.4`. We recommend to install anaconda pyt
 conda create --name cntk python=3.4.3 numpy scipy
 activate cntk
 ```
+To run the code in this example, you need to install a few additional packages. Under Python 3.4 (64bit version assumed), go to the FastRCNN folder and run:
+```
+pip install -r requirements.txt
+```
 You will further need Scikit-Image and OpenCV to run these examples. You can download the corresponding wheel packages and install them manually. For Windows users, visit http://www.lfd.uci.edu/~gohlke/pythonlibs/, and download:

    scikit_image-0.12.3-cp34-cp34m-win_amd64.whl  
@ -48,7 +52,9 @@ Once you download the respective wheel binaries, install them with:

 This example code assumes you are using 64bit version of Python 3.4, as the Fast R-CNN DLL files under [utils_win64](./fastRCNN/utils3_win64) are prebuilt for this version. If your task requires the use of a different Python version, please recompile these DLL files yourself in the correct environment. 

-Last but not least, in `PARAMETERS.py`: Change 'rootdir' to the absolute path of the FastRCNN folder of your CNTK repository clone (only forward slashes, has to end with forward slash). Also, make sure datasetName is set to "grocery".
+The folder where cntk.exe resides needs to be in your PATH environment variable.
+
+Last but not least, in `PARAMETERS.py`: make sure datasetName is set to "grocery".

 ### Preprocess data

--- a/1
+++ b/1
@ -944,6 +944,7 @@ $(UNITTEST_READER): $(UNITTEST_READER_OBJ) | $(HTKMLFREADER) $(HTKDESERIALIZERS)
 	$(CXX) $(LDFLAGS) $(patsubst %,-L%, $(LIBDIR) $(BOOSTLIB_PATH)) $(patsubst %, $(RPATH)%, $(ORIGINLIBDIR) $(BOOSTLIB_PATH)) -o $@ $^ $(BOOSTLIBS) -l$(CNTKMATH) -ldl 

 UNITTEST_NETWORK_SRC = \
+	$(SOURCEDIR)/../Tests/UnitTests/NetworkTests/CropNodeTests.cpp \
 	$(SOURCEDIR)/../Tests/UnitTests/NetworkTests/OperatorEvaluation.cpp \
 	$(SOURCEDIR)/../Tests/UnitTests/NetworkTests/stdafx.cpp \
 	$(SOURCEDIR)/CNTK/ModelEditLanguage.cpp \
--- a/README.md
+++ b/README.md
@ -1,18 +1,24 @@
 # Latest news
-##*2016-10-25.* New CNTK Name, new Web Site and V 2.0 Beta 1 Release  
+*2016-10-25.* New CNTK Name, new Web Site and V 2.0 Beta 1 Release  

 CNTK becomes **The Microsoft Cognitive Toolkit**. See more at our [new Web Site](https://www.microsoft.com/en-us/research/product/cognitive-toolkit/).

+With today's Release we start delivering CNTK V2 - a major upgrade of Microsoft Cognitive Toolkit.
+
+Expect a set of Beta Releases in the Coming Weeks.
+
 Highlights of this Release:
 * CNTK can now be used as a library with [brand new C++ and Python APIs](https://github.com/microsoft/cntk/wiki/CNTK-Library-API)
 * New Python Examples and Tutorials
+* Support of Protocol Buffers serialization
 * Support of Fast R-CNN algorithm
+* New automated installation procedures
 * Improvements in CNTK Evaluation library including support of CNTK APIs

 See more in the [Release Notes](https://github.com/Microsoft/CNTK/wiki/CNTK_2_0_beta_1_Release_Notes). You will find there links to the materials about the new features.  
+Get the Release from the [CNTK Releases page](https://github.com/Microsoft/CNTK/releases)

-
-##*2016-10-03.* V 1.7.2 Binary release  
+*2016-10-03.* V 1.7.2 Binary release  
 **This is a Hot Fix Release. It affects all users of Model Evaluation Library**

 If you are NOT using Model Evaluation Library you may skip this release.  
@ -20,8 +26,7 @@ If you ARE using Model Evaluation Library we **strongly recommend** installing v

 See [Release Notes](https://github.com/Microsoft/CNTk/wiki/CNTK_1_7_2_Release_Notes) for details.

-
-##*2016-09-28.* V 1.7.1 Binary release  
+*2016-09-28.* V 1.7.1 Binary release  
 Highlights of this Release:
 * Two Breaking Changes related to Layers library default initialization and ```fsAdagrad``` gradient-normalization scheme
 * Improvements in BrainScript
@ -33,26 +38,22 @@ Highlights of this Release:
 See more in the [Release Notes](https://github.com/Microsoft/CNTK/wiki/CNTK_1_7_1_Release_Notes) (including the full list of bugs fixed)  
 Get the Release from the [CNTK Releases page](https://github.com/Microsoft/CNTK/releases)

-
-##*2016-08-31.* V 1.7 Binary release  
+*2016-08-31.* V 1.7 Binary release  
 Highlights of this Release:
-* Improvements in BrainScript (New library of predefined common layer types, Support of cuDNN5 RNN and Common random-initialization types, improved handling of GRUs)
-* Support of NVIDIA cuDNN 5.1
-* Improvements in Readers and Deserializers
-* Additions to Evaluator Library (Eval Client Sample, Strong Name for EvalWrapper)
-* New in Unit Tests (Linux support, Randomization engines)
+* Improvements in BrainScript (new library of predefined common layer types; common random-initialization types; GRUs)
+* Support of NVIDIA cuDNN 5.1 and cuDNN RNN
+* Improvements in readers and deserializers
+* Additions to Evaluator library (Eval Client Sample, strong name for EvalWrapper)
+* New unit tests incl. Linux support
 * Python API Preview (since V.1.5)
 * Multiple bug fixes

 See more in the [Release Notes](https://github.com/Microsoft/CNTK/wiki/CNTK_1_7_Release_Notes)  
 Get the Release from the [CNTK Releases page](https://github.com/Microsoft/CNTK/releases)

+*2016-08-29.* Two new Tutorials are available:  
+[Image recognition](https://github.com/Microsoft/CNTK/wiki/Hands-On-Labs-Image-Recognition) (CIFAR-10) and [Language understanding](https://github.com/Microsoft/CNTK/wiki/Hands-On-Labs-Language-Understanding) (ATIS).

-##*2016-08-29.* Two new Tutorials are available:  
-[Image recognition](https://github.com/Microsoft/CNTK/wiki/Hands-On-Labs-Image-Recognition) (CIFAR-10) and [Language understanding](https://github.com/Microsoft/CNTK/wiki/Hands-On-Labs-Image-Recognition) (ATIS).
-
-
-*2016-08-10.* We have significantly simplified handling of **Gated Recurrent Units (GRU)**. Read more in the [corresponding article](https://github.com/Microsoft/CNTK/wiki/GRUs-on-CNTK-with-BrainScript).

 See [all news](https://github.com/Microsoft/CNTK/wiki/News).

--- a/Scripts/linux/install-cntk.sh
+++ b/Scripts/linux/install-cntk.sh
@ -26,7 +26,7 @@ CNTK_DEP_LIB_PATH="$PWD/cntk/dependencies/lib"
 CNTK_EXAMPLES_PATH="$PWD/Examples"
 CNTK_BINARY="$CNTK_BIN_PATH/cntk"
 CNTK_PY34_ENV_FILE="$SCRIPT_DIR/conda-linux-cntk-py34-environment.yml"
-CNTK_WHEEL_PATH="cntk/python/cntk-2.0a4-cp34-cp34m-linux_x86_64.whl"
+CNTK_WHEEL_PATH="cntk/python/cntk-2.0.beta1.0-cp34-cp34m-linux_x86_64.whl"
 test -d "$CNTK_BIN_PATH" && test -d "$CNTK_LIB_PATH" && test -d "$CNTK_DEP_LIB_PATH" && 
 test -d "$CNTK_EXAMPLES_PATH" && test -x "$CNTK_BINARY" &&
 test -f "$CNTK_PY34_ENV_FILE" && test -f "$CNTK_WHEEL_PATH" || {
--- a/Source/ActionsLib/NDLNetworkBuilder.cpp
+++ b/Source/ActionsLib/NDLNetworkBuilder.cpp
@ -539,6 +539,55 @@ void NDLNodeEvaluatorImpl<ElemType>::Evaluate(NDLNode<ElemType>* node, const wst
            nodePtr = builder.BatchNormalization(nullptr, nullptr, nullptr, nullptr, nullptr, spatial, normTimeConst, blendTimeConst, epsilon, useCntkEngine, imageLayoutKind, name);
        }
    }
+    else if (cnNodeType == OperationNameOf(CropNode))
+    {
+        // We expect 2 or 4 inputs.
+        if (parameter.size() != 2 && parameter.size() != 4)
+        {
+            RuntimeError("%ls accepts inputs: [input1, input2, offsetX, offsetY] or "
+                         "[input1, input2] or "
+                         "[input1, input2, eqNode1, eqNode2].", cnNodeType.c_str());
+        }
+
+        if (pass == ndlPassInitial)
+        {
+            // In initial phase we just need to create node.
+            if (parameter.size() == 4)
+            {
+                // Here we need to determine if 3rd and 4th parameters are offsets or equivalence nodes.
+                vector<void*> params = EvaluateParameters(node, baseName, 0, parameter.size(), pass);
+                // TODO: Is there a better way to discriminate?
+                if (((NDLNode<ElemType>*) params[2])->GetType() == NDLType::ndlTypeConstant)
+                {
+                    // We have offsets given, take offsets from evaluated parameters.
+                    size_t offsetX = ((NDLNode<ElemType>*) params[2])->GetScalar();
+                    size_t offsetY = ((NDLNode<ElemType>*) params[3])->GetScalar();
+
+                    // Create crop node with offsets but without inputs (will be attached later in resolve phase).
+                    nodePtr = builder.Crop(nullptr, nullptr, offsetX, offsetY, name);
+                }
+                else
+                {
+                    // We have 4 node inputs (2 crop inputs and 2 equivalence node inputs).
+                    nodePtr = builder.Crop(nullptr, nullptr, nullptr, nullptr, name);
+                }
+            }
+            else
+            {
+                // Just two inputs, must be node inputs which will be attached in the resolve phase below.
+                nodePtr = builder.Crop(nullptr, nullptr, name);
+            }
+            // Done processing in this phase.
+            nodeParamStart = 0;
+            nodeParamCount = 0;
+        }
+        else
+        {
+            // In non-initial phase we just process node inputs below, here we just set inputs of interest.
+            nodeParamStart = 0;
+            nodeParamCount = nodePtr->GetNumInputs();
+        }
+    }
    else
    {

--- a/Source/ActionsLib/NetworkDescriptionLanguage.cpp
+++ b/Source/ActionsLib/NetworkDescriptionLanguage.cpp
@ -168,6 +168,7 @@ bool CheckFunction(std::string& p_nodeType, bool* allowUndeterminedVariable)
    else if (EqualInsensitive(nodeType, OperationNameOf(NotEqualNode))) ret = true;
    else if (EqualInsensitive(nodeType, OperationNameOf(ClipNode))) ret = true;
    else if (EqualInsensitive(nodeType, OperationNameOf(ConvolutionNode), L"Convolve")) ret = true;
+    else if (EqualInsensitive(nodeType, OperationNameOf(CropNode))) ret = true;
    else if (EqualInsensitive(nodeType, OperationNameOf(PoolingNode))) ret = true;
    else if (EqualInsensitive(nodeType, OperationNameOf(CosDistanceNode), L"CosDist")) ret = true;
    else if (EqualInsensitive(nodeType, OperationNameOf(CosDistanceWithNegativeSamplesNode), L"CosWithNegSamples")) ret = true;
--- a/Source/CNTKv2LibraryDll/dllmain.cpp
+++ b/Source/CNTKv2LibraryDll/dllmain.cpp
@ -12,6 +12,25 @@
 #include "Windows.h"
 #endif

+#if _DEBUG
+#include <cstdlib>
+#include <crtdbg.h>
+
+// In case of asserts in debug mode, print the message to stderr instead of showing a message box.
+int HandleDebugAssert(int,               // reportType  - ignored, the message is printed for all reportTypes
+    char *message,                       // message     - fully assembled debug user message
+    int * returnValue)                   // returnValue - retVal value of zero continues execution
+{
+    fprintf(stderr, "C-Runtime: %s\n", message);
+
+    if (returnValue)
+    {
+        *returnValue = 0;   // return value of 0 will continue operation and NOT start the debugger
+    }
+    return TRUE;            // returning TRUE will make sure no message box is displayed
+}
+#endif
+
 BOOL APIENTRY DllMain(HMODULE /*hModule*/,
                      DWORD ul_reason_for_call,
                      LPVOID /*lpReserved*/
@ -19,10 +38,25 @@ BOOL APIENTRY DllMain(HMODULE /*hModule*/,
 {
    switch (ul_reason_for_call)
    {
+#if _DEBUG
    case DLL_PROCESS_ATTACH:
+        // Disabling assertions in test environment.
+        // These functions should not lock anything, no deadlock expected.
+        if (std::getenv("V2_LIB_TESTING"))
+        {
+            _set_error_mode(_OUT_TO_STDERR);
+            _CrtSetReportHook2(_CRT_RPTHOOK_INSTALL, HandleDebugAssert);
+        }
+        break;
+    case DLL_PROCESS_DETACH:
+        _CrtSetReportHook2(_CRT_RPTHOOK_REMOVE, HandleDebugAssert);
+        break;
+#else
+    case DLL_PROCESS_ATTACH:
+    case DLL_PROCESS_DETACH:
+#endif
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
-    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;
--- a/Source/ComputationNetworkLib/ComputationNetworkBuilder.cpp
+++ b/Source/ComputationNetworkLib/ComputationNetworkBuilder.cpp
@ -46,6 +46,7 @@ static shared_ptr<ComputationNode<ElemType>> CreateStandardNode(const std::wstri
    else if (nodeType == OperationNameOf(CosDistanceNode))                      return New<CosDistanceNode<ElemType>>(forward<_Types>(_Args)...);
    else if (nodeType == OperationNameOf(CosDistanceWithNegativeSamplesNode))   return New<CosDistanceWithNegativeSamplesNode<ElemType>>(forward<_Types>(_Args)...);
    else if (nodeType == OperationNameOf(CosineNode))                           return New<CosineNode<ElemType>>(forward<_Types>(_Args)...);
+    else if (nodeType == OperationNameOf(CropNode))                             return New<CropNode<ElemType>>(forward<_Types>(_Args)...);
    else if (nodeType == OperationNameOf(CrossEntropyNode))                     return New<CrossEntropyNode<ElemType>>(forward<_Types>(_Args)...);
    else if (nodeType == OperationNameOf(CrossEntropyWithSoftmaxNode))          return New<CrossEntropyWithSoftmaxNode<ElemType>>(forward<_Types>(_Args)...);
    else if (nodeType == OperationNameOf(DiagonalNode))                         return New<DiagonalNode<ElemType>>(forward<_Types>(_Args)...);
@ -400,6 +401,24 @@ shared_ptr<ComputationNode<ElemType>> ComputationNetworkBuilder<ElemType>::Recon
    return net.AddNodeToNetAndAttachInputs(New<ReconcileDynamicAxisNode<ElemType>>(net.GetDeviceId(), nodeName), { dataInput, layoutInput });
 }

+template <class ElemType>
+shared_ptr<ComputationNode<ElemType>> ComputationNetworkBuilder<ElemType>::Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, const std::wstring nodeName)
+{
+    return net.AddNodeToNetAndAttachInputs(New<CropNode<ElemType>>(net.GetDeviceId(), nodeName), { input1, input2 });
+}
+
+template <class ElemType>
+shared_ptr<ComputationNode<ElemType>> ComputationNetworkBuilder<ElemType>::Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, size_t offsetX, size_t offsetY, const std::wstring nodeName)
+{
+    return net.AddNodeToNetAndAttachInputs(New<CropNode<ElemType>>(offsetX, offsetY, net.GetDeviceId(), nodeName), { input1, input2 });
+}
+
+template <class ElemType>
+shared_ptr<ComputationNode<ElemType>> ComputationNetworkBuilder<ElemType>::Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, const ComputationNodePtr eqNode1, const ComputationNodePtr eqNode2, const std::wstring nodeName)
+{
+    return net.AddNodeToNetAndAttachInputs(New<CropNode<ElemType>>(net.GetDeviceId(), nodeName), { input1, input2, eqNode1, eqNode2 });
+}
+
 template <class ElemType>
 shared_ptr<ComputationNode<ElemType>> ComputationNetworkBuilder<ElemType>::ClassificationError(const ComputationNodePtr a, const ComputationNodePtr b, const std::wstring nodeName)
 {
--- a/Source/ComputationNetworkLib/ComputationNetworkBuilder.h
+++ b/Source/ComputationNetworkLib/ComputationNetworkBuilder.h
@ -104,6 +104,11 @@ public:
                                      const std::wstring nodeName = L"");
    ComputationNodePtr ROIPooling(const ComputationNodePtr inputValues, const ComputationNodePtr inputROIs, const TensorShape& roiOutputShape, const std::wstring nodeName = L"");
    ComputationNodePtr ReconcileDynamicAxis(const ComputationNodePtr dataInput, const ComputationNodePtr layoutInput, const std::wstring nodeName = L"");
+
+    ComputationNodePtr Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, const std::wstring nodeName = L"");
+    ComputationNodePtr Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, size_t offsetX, size_t offsetY, const std::wstring nodeName = L"");
+    ComputationNodePtr Crop(const ComputationNodePtr input1, const ComputationNodePtr input2, const ComputationNodePtr eqNode1, const ComputationNodePtr eqNode2, const std::wstring nodeName = L"");
+
 #ifdef COMING_SOON
    ComputationNodePtr CRF(const ComputationNodePtr label, const ComputationNodePtr postDepScore, const ComputationNodePtr transition_score, const std::wstring nodeName = L"");
 #endif
--- a/Source/ComputationNetworkLib/ComputationNode.h
+++ b/Source/ComputationNetworkLib/ComputationNode.h
@ -912,6 +912,218 @@ struct NumInputs : public INumInputs // e.g. derive from NumInputs<2>
    }
 };

+// =======================================================================
+// AxisTransform -- Defines transformation along one axis. Currently, just
+// scale and translation are supported.
+// =======================================================================
+
+struct AxisTransform
+{
+public:
+    bool operator==(const AxisTransform& other) const
+    {
+        return (scale == other.scale) && (translate == other.translate);
+    }
+
+    bool operator!=(const AxisTransform& other) const
+    {
+        return !operator==(other);
+    }
+
+    // Scale along the axis (by default identity transform -> 1 scale).
+    double scale = 1.0;
+    // Translation along the axis (by default identity transform -> 0 translate).
+    double translate = 0.0;
+};
+
+// =======================================================================
+// SpaceTransform -- Combines several axis transforms into space transform.
+// =======================================================================
+
+struct SpaceTransform
+{
+public:
+    SpaceTransform() {}
+
+    // Returns all axis transforms.
+    std::vector<AxisTransform>* GetTransform()
+    {
+        return &m_axisTransforms;
+    }
+
+    bool operator==(const SpaceTransform& other) const
+    {
+        CheckCompatibility(other);
+        for (size_t i = 0; i < m_axisTransforms.size(); i++)
+        {
+            if (m_axisTransforms[i] != other.m_axisTransforms[i])
+                return false;
+        }
+        return true;
+    }
+
+    bool operator!=(const SpaceTransform& other) const
+    {
+        return !operator==(other);
+    }
+
+    // Returns identity transform with given number of dimensions.
+    static SpaceTransform Identity(int dimensions)
+    {
+        SpaceTransform result;
+        result.m_axisTransforms.resize(dimensions);
+        return result;
+    }
+
+    // Returns composition of this transform with given one (without modifying this one).
+    SpaceTransform Compose(const SpaceTransform& other) const
+    {
+        CheckCompatibility(other);
+        SpaceTransform result = SpaceTransform::Identity(m_axisTransforms.size());
+        for (size_t ia = 0; ia < m_axisTransforms.size(); ia++)
+        {
+            result.m_axisTransforms[ia].scale     = m_axisTransforms[ia].scale * other.m_axisTransforms[ia].scale;
+            result.m_axisTransforms[ia].translate = m_axisTransforms[ia].scale * other.m_axisTransforms[ia].translate + m_axisTransforms[ia].translate;
+        }
+        return result;
+    }
+
+    // Returns inverse of this transform without modifying it.
+    SpaceTransform Inverse() const
+    {
+        SpaceTransform result = SpaceTransform::Identity(m_axisTransforms.size());
+        for (size_t ia = 0; ia < m_axisTransforms.size(); ia++)
+        {
+            result.m_axisTransforms[ia].scale = 1 / m_axisTransforms[ia].scale;
+            result.m_axisTransforms[ia].translate = -m_axisTransforms[ia].translate / m_axisTransforms[ia].scale;
+        }
+        return result;
+    }
+
+    // Check if this transform is compatible with given one.
+    void CheckCompatibility(const SpaceTransform& other) const
+    {
+        // Transforms are compatible if they have same number of axis transforms.
+        if (m_axisTransforms.size() != other.m_axisTransforms.size())
+        {
+            RuntimeError("Incompatible space transforms.");
+        }
+    }
+
+    std::vector<AxisTransform> m_axisTransforms;
+};
+
+// =======================================================================
+// TransformerNode -- Base class for all nodes that implement input-output
+// transformation. Using individual node transformations one can calculate the cumulative
+// transformation between two nodes and establish spatial matching of their inputs or
+// outputs. This class does not know the number of inputs of the derived object; it must
+// be reported through SetNumberOfInputs before the first GetTransformForInput call.
+// Note: This interface assumes that the node also inherits from the NumInputs<> class.
+// =======================================================================
+
+struct TransformerNode
+{
+public:
+    TransformerNode() {}
+
+    virtual ~TransformerNode() {}
+
+    // Derived class needs to return if it supports transform computation between input at given index and output.
+    virtual bool SupportsTransformOnInput(size_t index) = 0;
+
+    // Derived class needs to compute transforms for all axes for all supported input-output paths
+    // (see SupportsTransformOnInput above) on this call.
+    virtual void ComputeTransforms() = 0;
+
+    // Derived classes need to inform us regarding number of inputs they have using this call before first
+    // GetTransformForInput call.
+    void SetNumberOfInputs(size_t inputsCount)
+    {
+        // Allocate appropriate number of transforms. Here transforms will be set to identity, node needs to compute
+        // them during ComputeTransforms.
+        m_transforms.resize(inputsCount);
+    }
+
+    // Handles transform access for all derived classes. Derived objects still need to
+    // implement the rest of the TransformerNode interface.
+    const SpaceTransform& GetTransformForInput(size_t inputIndex)
+    {
+        if (m_transforms.empty())
+            LogicError("No transforms present on GetTransformForInput call. Maybe SetNumberOfInputs has not been called?");
+
+        // Check that we are within range.
+        if (inputIndex >= m_transforms.size())
+            RuntimeError("Invalid transform index in TransformerNode.");
+
+        // Verify that derived object supports transform on given input.
+        if (!SupportsTransformOnInput(inputIndex))
+            RuntimeError("Space transform requested on unsupported input");
+
+        // All good, ask derived object to compute transforms.
+        ComputeTransforms();
+        // Return transform for requested input.
+        return m_transforms[inputIndex];
+    }
+
+protected:
+    // Transforms for all node inputs.
+    std::vector<SpaceTransform> m_transforms;
+};
+
+// =======================================================================
+// IdentityTransformerNode -- Helper class for nodes that have identity
+// transform for all inputs.
+// =======================================================================
+
+struct IdentityTransformerNode : public TransformerNode
+{
+private:
+    using TransformerNode::m_transforms;
+
+    // Set all transforms to identity.
+    virtual void ComputeTransforms() override
+    {
+        if (m_transforms[0].m_axisTransforms.empty())
+        {
+            for (size_t it = 0; it < m_transforms.size(); it++)
+            {
+                m_transforms[it].m_axisTransforms.resize(2);
+            }
+        }
+    }
+
+    // Support transforms for all inputs.
+    virtual bool SupportsTransformOnInput(size_t /*index*/) override { return true; }
+};
+
+// =======================================================================
+// IdentityTransformerNodeOnOneInput -- Helper class for nodes that support
+// identity transform for one input (defined with template argument).
+// =======================================================================
+
+template <size_t supportedInputIndex>
+struct IdentityTransformerNodeOnOneInput : public TransformerNode
+{
+private:
+    using TransformerNode::m_transforms;
+
+    virtual void ComputeTransforms() override
+    {
+        if (m_transforms[supportedInputIndex].m_axisTransforms.empty())
+        {
+            // m_axisTransforms defaults to identity.
+            m_transforms[supportedInputIndex].m_axisTransforms.resize(2);
+        }
+    }
+
+    // Support transform on just one input.
+    virtual bool SupportsTransformOnInput(size_t inputIndex) override
+    {
+        return (inputIndex == supportedInputIndex);
+    }
+};
+
 // =======================================================================
 // Nodes that can take a dynamic axis need to implement this.
 // =======================================================================
@ -1061,6 +1273,13 @@ public:
                m_inputs[i] = DownCast(inputs[i]); // (DownCast() checks the type; the assignment then downcasts it again)
            else
                m_inputs[i] = nullptr; // during network creation, nullptrs are possible
+
+        // If this object also implements the TransformerNode interface, we need to notify it about the number of inputs.
+        if (Is<TransformerNode>())
+        {
+            auto transformerNode = As<TransformerNode>();
+            transformerNode->SetNumberOfInputs(m_inputs.size());
+        }
    }

 protected:
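
A minimal standalone sketch of the per-axis arithmetic used by `SpaceTransform::Compose` and `SpaceTransform::Inverse` in the hunk above. The names `Axis`, `Apply`, `Compose`, `Inverse`, `NearlyEqual` and `SelfCheck` are hypothetical stand-ins for illustration and are not part of CNTK. The sketch assumes that one axis transform maps a coordinate `x` to `scale * x + translate`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for AxisTransform: maps x -> scale * x + translate.
struct Axis
{
    double scale = 1.0;
    double translate = 0.0;
};

// Applies the transform to a single coordinate.
inline double Apply(const Axis& t, double x) { return t.scale * x + t.translate; }

// Same formula as SpaceTransform::Compose for one axis: result(x) == a(b(x)).
inline Axis Compose(const Axis& a, const Axis& b)
{
    Axis r;
    r.scale = a.scale * b.scale;
    r.translate = a.scale * b.translate + a.translate;
    return r;
}

// Same formula as SpaceTransform::Inverse for one axis.
inline Axis Inverse(const Axis& t)
{
    Axis r;
    r.scale = 1 / t.scale;
    r.translate = -t.translate / t.scale;
    return r;
}

// Tolerant comparison, since the inverse involves a floating-point division.
inline bool NearlyEqual(double a, double b) { return std::fabs(a - b) < 1e-12; }

// Composing a transform with its inverse should give the identity transform.
inline void SelfCheck()
{
    Axis t;
    t.scale = 2.0;
    t.translate = 1.0;
    Axis id = Compose(t, Inverse(t));
    assert(NearlyEqual(id.scale, 1.0));
    assert(NearlyEqual(id.translate, 0.0));
}
```

Calling `SelfCheck()` exercises the round trip. Under the stated assumption, `Compose(a, b)` applies `b` first and `a` second, which matters when chaining the transforms of consecutive nodes.
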
--- a/Source/ComputationNetworkLib/ConvolutionalNodes.h
+++ b/Source/ComputationNetworkLib/ConvolutionalNodes.h
@ -187,6 +187,26 @@ protected:
        FixVectorShape(filterRank, inputShape.size(), m_sharing,     true);
    }

+    // Derived classes implement the transform calculation. Since all derived classes are filter based, we consolidate the
+    // common filter transform calculation here to be reused by derived classes. For example, convolution and de-convolution
+    // have the same transform but inverted, hence both of them may reuse this method and one will call Inverse in addition
+    // (the same holds for pooling nodes).
+    SpaceTransform ComputeFilterTransform()
+    {
+        std::shared_ptr<const ConvolveGeometry> geometry = m_convEng->Geometry();
+
+        SpaceTransform result;
+        result.m_axisTransforms.resize(2);
+
+        result.m_axisTransforms[0].scale = (float)(geometry->GetStride(0));
+        result.m_axisTransforms[0].translate = (float)((geometry->KernelShape()[0] - 1) / 2 - geometry->GetLowerPad(0));
+
+        result.m_axisTransforms[1].scale = (float)(geometry->GetStride(1));
+        result.m_axisTransforms[1].translate = (float)((geometry->KernelShape()[1] - 1) / 2 - geometry->GetLowerPad(1));
+
+        return result;
+    }
+
 protected:
    TensorShape m_kernelShape;
    TensorShape m_mapCount;
@ -229,7 +249,7 @@ public:
 // -----------------------------------------------------------------------

 template <class ElemType>
-class ConvolutionNode : public ConvolutionNodeBase<ElemType>, public NumInputs<2>
+class ConvolutionNode : public ConvolutionNodeBase<ElemType>, public NumInputs<2>, public TransformerNode
 {
    typedef ConvolutionNodeBase<ElemType> Base; UsingConvolutionNodeBaseMembers;
    static const std::wstring TypeName() { return L"Convolution"; }
@ -489,6 +509,31 @@ public:

    bool IsConvolution2D() const { return m_convolution2D; }

+private:
+    using TransformerNode::m_transforms;
+    using ConvolutionNodeBase<ElemType>::ComputeFilterTransform;
+
+    virtual void /*TransformerNode::*/ComputeTransforms() override
+    {
+        if (m_transforms[1].m_axisTransforms.empty())
+        {
+            m_transforms[1] = ComputeFilterTransform();
+            if (!m_transpose)
+            {
+                // Convolution, need to invert the transform.
+                m_transforms[1] = m_transforms[1].Inverse();
+            }
+            // else: Deconvolution, nothing to do.
+        }
+        // else: transform already computed, no need to do computation again.
+    }
+
+    virtual bool /*TransformerNode::*/SupportsTransformOnInput(size_t inputIndex) override
+    {
+        // We support transforms just on convolution input.
+        return (inputIndex == 1);
+    }
+
 protected:
    // Flag that indicates whether the node is created using 2D-syntax.
    bool m_convolution2D;
@ -674,7 +719,7 @@ protected:
 // -----------------------------------------------------------------------

 template <class ElemType>
-class PoolingNode : public ConvolutionNodeBase<ElemType>, public NumInputs<1>
+class PoolingNode : public ConvolutionNodeBase<ElemType>, public NumInputs<1>, public TransformerNode
 {
    typedef ConvolutionNodeBase<ElemType> Base; UsingConvolutionNodeBaseMembers;
    static const std::wstring TypeName() { return L"Pooling"; }
@ -756,6 +801,26 @@ public:
            }
        }
    }
+
+private:
+    using TransformerNode::m_transforms;
+    using ConvolutionNodeBase<ElemType>::ComputeFilterTransform;
+
+    virtual void /*TransformerNode::*/ComputeTransforms() override
+    {
+        if (m_transforms[0].m_axisTransforms.empty())
+        {
+            m_transforms[0] = ComputeFilterTransform();
+            m_transforms[0] = m_transforms[0].Inverse();
+        }
+        // else: transform already computed, no need to do it again.
+    }
+
+    virtual bool /*TransformerNode::*/SupportsTransformOnInput(size_t /*inputIndex*/) override
+    {
+        // We support transforms on all inputs (one here).
+        return true;
+    }
 };

 // -----------------------------------------------------------------------
@ -774,7 +839,7 @@ public:
 // -----------------------------------------------------------------------

 template <class ElemType>
-class MaxUnpoolingNode : public ConvolutionNodeBase<ElemType>, public NumInputs<2>
+class MaxUnpoolingNode : public ConvolutionNodeBase<ElemType>, public NumInputs<2>, public TransformerNode
 {
    typedef ConvolutionNodeBase<ElemType> Base;
    UsingConvolutionNodeBaseMembers;
@ -858,6 +923,25 @@ public:
            }
        }
    }
+
+private:
+    using TransformerNode::m_transforms;
+    using ConvolutionNodeBase<ElemType>::ComputeFilterTransform;
+
+    virtual void /*TransformerNode::*/ComputeTransforms() override
+    {
+        if (m_transforms[0].m_axisTransforms.empty())
+        {
+            m_transforms[0] = ComputeFilterTransform();
+        }
+        // else: transform already computed, no need to do it again.
+    }
+
+    virtual bool /*TransformerNode::*/SupportsTransformOnInput(size_t inputIndex) override
+    {
+        // We support transform for just unpool input.
+        return (inputIndex == 0);
+    }
 };

 // -----------------------------------------------------------------------
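
A small worked sketch of the arithmetic in `ComputeFilterTransform` above, for one axis. The names `AxisMap`, `FilterAxisMap` and `Invert` are hypothetical, and the plain `int` parameters are an assumption made for illustration (the real code reads these values from `ConvolveGeometry`). Signed integers are used here so that a lower pad larger than `(kernel - 1) / 2` yields a negative translation instead of wrapping around.

```cpp
#include <cassert>

// Hypothetical per-axis result: scale and translation of the filter transform.
struct AxisMap
{
    double scale;
    double translate;
};

// Mirrors the formula in ComputeFilterTransform for one axis:
//   scale     = stride
//   translate = (kernel - 1) / 2 - lowerPad   (integer division, as in the source)
inline AxisMap FilterAxisMap(int kernel, int stride, int lowerPad)
{
    AxisMap m;
    m.scale = static_cast<double>(stride);
    m.translate = static_cast<double>((kernel - 1) / 2 - lowerPad);
    return m;
}

// Mirrors SpaceTransform::Inverse for one axis.
inline AxisMap Invert(const AxisMap& m)
{
    return AxisMap{ 1.0 / m.scale, -m.translate / m.scale };
}

// Example: 3-wide kernel, stride 2, no padding.
inline void FilterExample()
{
    AxisMap direct = FilterAxisMap(3, 2, 0);  // scale 2, translate 1
    AxisMap inverse = Invert(direct);         // scale 0.5, translate -0.5
    assert(direct.scale == 2.0 && direct.translate == 1.0);
    assert(inverse.scale == 0.5 && inverse.translate == -0.5);
}
```

In the diff, the direct form is what the deconvolution and unpooling paths keep, while convolution and pooling store the inverted form. With a 3-wide kernel, stride 1 and a lower pad of 1, both forms reduce to the identity (scale 1, translate 0).
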
--- a/Source/ComputationNetworkLib/InputAndParamNodes.h
+++ b/Source/ComputationNetworkLib/InputAndParamNodes.h
@ -382,7 +382,7 @@ private:
 // -----------------------------------------------------------------------

 template <class ElemType>
-class InputValue : public InputValueBase<ElemType>
+class InputValue : public InputValueBase<ElemType>, public IdentityTransformerNode
 {
    typedef InputValueBase<ElemType> Base; UsingComputationNodeMembersBoilerplate;
    static const std::wstring TypeName() { return L"InputValue"; }
--- a/Source/ComputationNetworkLib/NonlinearityNodes.h
+++ b/Source/ComputationNetworkLib/NonlinearityNodes.h
@ -36,7 +36,7 @@ enum GradientOperationType
 };

 template <class ElemType, ElementWiseOperator opForward, ElementWiseOperator opBackward, GradientOperationType opType>
-class UnaryElementWiseWithOpCodeNodeBase : public ComputationNode<ElemType>, public NumInputs<1>
+class UnaryElementWiseWithOpCodeNodeBase : public ComputationNode<ElemType>, public NumInputs<1>, public IdentityTransformerNode
 {
    typedef ComputationNode<ElemType> Base;
    UsingComputationNodeMembers;
--- a/Source/ComputationNetworkLib/ReshapingNodes.cpp
+++ b/Source/ComputationNetworkLib/ReshapingNodes.cpp
@ -20,6 +20,8 @@
 #include <memory>
 #include <algorithm>
 #include <assert.h>
+#include <stack>
+#include <unordered_map>

 namespace Microsoft { namespace MSR { namespace CNTK {

@ -481,4 +483,334 @@ template <class ElemType>
 template class ScatterPackedNode<float>;
 template class ScatterPackedNode<double>;

+// -----------------------------------------------------------------------
+// CropNode -- crop operation, crops first input according to shape of second
+//             input at offsets which are directly given or automatically calculated.
+// -----------------------------------------------------------------------
+
+template <class ElemType>
+CropNode<ElemType>::CropNode(DEVICEID_TYPE deviceId, const wstring& name)
+    : Base(deviceId, name), m_xOffset(numeric_limits<double>::max()), m_yOffset(numeric_limits<double>::max())
+{
+}
+
+template <class ElemType>
+CropNode<ElemType>::CropNode(size_t offsetX, size_t offsetY, DEVICEID_TYPE deviceId, const wstring& name)
+    : CropNode(deviceId, name)
+{
+    m_xOffset = (double)(offsetX);
+    m_yOffset = (double)(offsetY);
+}
+
+template <class ElemType>
+CropNode<ElemType>::CropNode(const ScriptableObjects::IConfigRecordPtr configp)
+    : CropNode(configp->Get(L"deviceId"), L"<placeholder>")
+{
+    // We may have 2 or 4 node inputs, check that and attach them.
+    const auto inputs = GetInputsFromConfig(configp);
+    if (inputs.size() != 2 && inputs.size() != 4)
+        LogicError("Crop node must have 2 or 4 node inputs.");
+
+    AttachInputs(inputs);
+
+    // Here we have 3 possibilities:
+    // 1. 2 input nodes -> auto crop calculation without equivalence nodes
+    // 2. 2 input nodes + 2 parameters -> manual crop with given offsets
+    // 3. 4 inputs -> auto crop calculation with equivalence nodes
+
+    if (inputs.size() == 2)
+    {
+        // We have 2 input nodes related to cropping (no equivalence node inputs given). Check if we have offsets
+        // directly given.
+        if (configp->Exists(L"yOffset") && configp->Exists(L"xOffset"))
+        {
+            // We have manual crop with given offsets (option 2. above). Save given offsets.
+            m_xOffset = configp->Get(L"xOffset");
+            m_yOffset = configp->Get(L"yOffset");
+        }
+        // else: Offsets not given (option 1. above), we have automatic crop calculation without equivalence nodes.
+    }
+    // else: We have 4 node inputs (option 3. above), we have automatic crop calculation with equivalence nodes.
+}
+
+template <class ElemType>
+void CropNode<ElemType>::Validate(bool isFinalValidationPass)
+{
+    Base::Validate(isFinalValidationPass);
+    InferMBLayoutFromInputsForStandardCase(isFinalValidationPass);
+
+    // Here we need to determine output dimensions which are same as dimensions of second input.
+    TensorShape inputShape0 = Input(0)->GetSampleLayout();
+    TensorShape inputShape1 = Input(1)->GetSampleLayout();
+
+    SmallVector<size_t> inDims = inputShape0.GetDims();
+    SmallVector<size_t> outDims = inputShape1.GetDims();
+
+    // We assume we have at least two dimensions (first two are to be cropped).
+    if (outDims.size() < 2)
+        RuntimeError("Crop input samples must have at least two dimensions.");
+
+    // Set output dimensions.
+    SetDims(TensorShape(outDims), HasMBLayout());
+
+    if (isFinalValidationPass)
+    {
+        // In final validation pass we compute crop offsets if needed.
+        ComputeCropOffsets();
+
+        // Cropped input must be large enough to allow cropping at given offset.
+        if (inDims[0] < outDims[0] + m_xOffset)
+            RuntimeError("Input is too small to be cropped along x dimension in crop node.");
+
+        if (inDims[1] < outDims[1] + m_yOffset)
+            RuntimeError("Input is too small to be cropped along y dimension in crop node.");
+    }
+}
+
+template <class ElemType>
+void CropNode<ElemType>::ForwardProp(const FrameRange& /*fr*/)
+{
+    // Our offsets must be initialized here.
+    if (m_xOffset == numeric_limits<double>::max() || m_yOffset == numeric_limits<double>::max())
+        LogicError("Crop offsets not initialized in ForwardProp.");
+
+    // Retrieve input and output views for the values. Input and output views are tensor views
+    // that define parts of first input and output that we operate on (we copy input from input view
+    // to output).
+    CroppedIOViews ioViews = CreateIOViews(&ComputationNode<ElemType>::ValuePtr);
+
+    // Copy values from cropped input to output.
+    ioViews.outputView.AssignCopyOf(ioViews.inputViewCropped);
+}
+
+template <class ElemType>
+void CropNode<ElemType>::BackpropTo(const size_t inputIndex, const FrameRange& /*fr*/)
+{
+    // We propagate gradients just to the cropped input.
+    if (inputIndex == 0)
+    {
+        // Reset input gradients to ensure that non-cropped parts do not affect backprop.
+        Input(0)->Gradient().SetValue(0);
+
+        // Retrieve input and output views for the gradients. Input and output views are tensor views
+        // that define parts of first input and output that we operate on (we copy gradients from output view
+        // to input view).
+        CroppedIOViews ioViews = CreateIOViews(&ComputationNode<ElemType>::GradientPtr);
+
+        // Copy gradients from output to cropped input.
+        ioViews.inputViewCropped.AddCopyOf(ioViews.outputView);
+    }
+}
+
+template <class ElemType>
+void CropNode<ElemType>::Save(File& fstream) const
+{
+    Base::Save(fstream);
+
+    fstream << m_xOffset;
+    fstream << m_yOffset;
+}
+
+template <class ElemType>
+void CropNode<ElemType>::Load(File& fstream, size_t modelVersion)
+{
+    Base::Load(fstream, modelVersion);
+
+    fstream >> m_xOffset;
+    fstream >> m_yOffset;
+}
+
+template <class ElemType>
+void CropNode<ElemType>::CopyTo(ComputationNodeBasePtr nodeP, const wstring& newName, const CopyNodeFlags flags) const
+{
+    Base::CopyTo(nodeP, newName, flags);
+    if (flags & CopyNodeFlags::copyNodeValue)
+    {
+        auto node = dynamic_pointer_cast<CropNode<ElemType>>(nodeP);
+        node->m_xOffset = m_xOffset;
+        node->m_yOffset = m_yOffset;
+    }
+}
+
+template <class ElemType>
+typename CropNode<ElemType>::CroppedIOViews CropNode<ElemType>::CreateIOViews(MatrixGetter matrixGetter)
+{
+    // Get the shapes of the inputs.
+    TensorShape inputShape0 = Input(0)->GetTensorShape(Input(0)->GetSampleLayout().GetRank());
+    TensorShape inputShape1 = Input(1)->GetTensorShape(Input(1)->GetSampleLayout().GetRank());
+
+    // Calculate cropped shape of the input.
+    TensorShape inputShapeCropped = inputShape0;
+    inputShapeCropped.NarrowTo(0, (size_t)(m_xOffset), (size_t)(m_xOffset) + inputShape1.GetDim(0));
+    inputShapeCropped.NarrowTo(1, (size_t)(m_yOffset), (size_t)(m_yOffset) + inputShape1.GetDim(1));
+
+    // Get output shape.
+    TensorShape outputShape = GetTensorShape(GetSampleLayout().GetRank());
+    // Cropped input and output dimensions must be same.
+    if (inputShapeCropped.GetDims() != outputShape.GetDims())
+        LogicError("Cropped input and output must have same dimensions.");
+
+    // Create proper views using calculated shapes.
+    return CroppedIOViews(this, matrixGetter, inputShapeCropped, outputShape);
+}
+
+// ComputeCropOffsets computes offsets to be used for cropping if manual offsets are absent. The offsets are computed
+// by traversing the network graph and finding the common ancestor of crop node inputs. Once the ancestor is found, an affine
+// transform is computed along the paths from first and second input to the common ancestor. The complete transform from one
+// input to the other is finally calculated by composing these two transforms. Translate components of final transform define crop offsets.
+template <class ElemType>
+void CropNode<ElemType>::ComputeCropOffsets()
+{
+    // Helper method for traversing the tree and calculating node transforms.
+    auto ProcessInputs = [](ComputationNodeBase* currNode, stack<ComputationNodeBase*>& traversalStack, unordered_map<ComputationNodeBase*, SpaceTransform>& nodeToTransformMap)
+    {
+        if (!currNode->Is<TransformerNode>())
+            RuntimeError("Node does not support affine transform for cropping.");
+
+        auto transformerNode = currNode->As<TransformerNode>();
+        // Go over the node's inputs.
+        for (size_t i = 0; i < currNode->GetNumInputs(); i++)
+        {
+            // Check if input-output transform is supported on the node.
+            if (transformerNode->SupportsTransformOnInput(i))
+            {
+                // Transform is supported, take the input.
+                ComputationNodeBase* currInput = currNode->GetInputs()[i].get();
+                // Take node transform from input to output.
+                const SpaceTransform& nodeTransform = transformerNode->GetTransformForInput(i);
+                // Calculate composite transform from node input to crop node.
+                SpaceTransform nodeToCropTransform = nodeToTransformMap.find(currNode)->second.Compose(nodeTransform);
+
+                // Check if we already visited this input node.
+                auto it = nodeToTransformMap.find(currInput);
+                if (it == nodeToTransformMap.end())
+                {
+                    // We have not visited this node before. Add it to the transform map and to traversal stack to continue
+                    // traversing its children.
+                    nodeToTransformMap.insert(make_pair(currInput, nodeToCropTransform));
+                    traversalStack.push(currInput);
+                }
+                else
+                {
+                    // We have been here before, check that transforms along two different paths are same.
+                    if (it->second != nodeToCropTransform)
+                    {
+                        // Different transforms along two different paths, should never happen.
+                        RuntimeError("Different transforms along different paths in Crop node.");
+                    }
+                }
+            }
+        }
+    };
+
+    if (m_xOffset != numeric_limits<double>::max() && m_yOffset != numeric_limits<double>::max())
+    {
+        // Offsets are already available, skip compute.
+        return;
+    }
+
+    // Used to keep nodes while traversing the network graph.
+    stack<ComputationNodeBase*> traversalStack;
+    // Maps node to transform between its output and crop node.
+    unordered_map<ComputationNodeBase*, SpaceTransform> nodeToCropInput0TransformMap;
+    unordered_map<ComputationNodeBase*, SpaceTransform> nodeToCropInput1TransformMap;
+    // Take equivalence nodes if provided.
+    ComputationNodeBase* equivalenceNode1 = nullptr;
+    ComputationNodeBase* equivalenceNode2 = nullptr;
+    if (GetInputs().size() == 4)
+    {
+        equivalenceNode1 = GetInputs()[2].get();
+        equivalenceNode2 = GetInputs()[3].get();
+    }
+
+    // Push first input to traversal stack to start exploring paths starting from there.
+    traversalStack.push(GetInputs()[0].get());
+    // Push first input transform as identity to enable composing transforms.
+    nodeToCropInput0TransformMap.insert(make_pair(GetInputs()[0].get(), SpaceTransform::Identity(2)));
+    // Start traversing graph starting from the first input.
+    while (!traversalStack.empty())
+    {
+        ComputationNodeBase* currNode = traversalStack.top();
+        traversalStack.pop();
+        ProcessInputs(currNode, traversalStack, nodeToCropInput0TransformMap);
+    }
+
+    // Now traverse from second input.
+    traversalStack.push(GetInputs()[1].get());
+    // Push second input transform as identity to enable composing transforms.
+    nodeToCropInput1TransformMap.insert(make_pair(GetInputs()[1].get(), SpaceTransform::Identity(2)));
+    // Once we meet node that is in nodeToCropInput0TransformMap or equivalence node we will compute offsets.
+    double xOffset = numeric_limits<double>::max();
+    double yOffset = numeric_limits<double>::max();
+    while (!traversalStack.empty())
+    {
+        ComputationNodeBase* currNode = traversalStack.top();
+        traversalStack.pop();
+        // Check if node is in the map corresponding to the first input (path connected over common ancestor).
+        auto it = nodeToCropInput0TransformMap.find(currNode);
+        const SpaceTransform* firstInputTransform = nullptr;
+        if (it != nodeToCropInput0TransformMap.end())
+        {
+            // We have closed the path between nodes, save the first input transform.
+            firstInputTransform = &it->second;
+        }
+        // Check if node is equivalent to one from the first subtree (path connected over equivalence nodes).
+        else if (currNode == equivalenceNode2)
+        {
+            // We have closed the path between nodes using equivalence nodes, save the first equivalence node transform.
+            firstInputTransform = &nodeToCropInput0TransformMap.find(equivalenceNode1)->second;
+        }
+
+        if (firstInputTransform)
+        {
+            // Calculate final transform.
+            SpaceTransform finalTransform = nodeToCropInput1TransformMap.find(currNode)->second.Compose(firstInputTransform->Inverse());
+            for (size_t ia = 0; ia < finalTransform.m_axisTransforms.size(); ia++)
+            {
+                // In crop node we expect no scaling.
+                if (finalTransform.m_axisTransforms[ia].scale != 1.0f)
+                    RuntimeError("Composite transform has non 1 scale in crop node.");
+                if (finalTransform.m_axisTransforms[ia].translate > 0)
+                    RuntimeError("Composite transform has positive translate (negative offset) in crop node.");
+            }
+            // Crop offsets are defined with transform translations.
+            xOffset = -finalTransform.m_axisTransforms[0].translate;
+            yOffset = -finalTransform.m_axisTransforms[1].translate;
+            // Finished.
+            break;
+        }
+        // No connected path, keep searching.
+        ProcessInputs(currNode, traversalStack, nodeToCropInput1TransformMap);
+    }
+    if (xOffset == numeric_limits<double>::max() || yOffset == numeric_limits<double>::max())
+        LogicError("Connected path between crop inputs not found. Unable to compute crop offsets.");
+
+    // Save computed offsets.
+    m_xOffset = xOffset;
+    m_yOffset = yOffset;
+}
+
+template <class ElemType>
+void CropNode<ElemType>::ComputeTransforms()
+{
+    if (m_transforms[0].m_axisTransforms.empty())
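+    {
+        // Make room for the x and y axis transforms before writing to them.
+        m_transforms[0].m_axisTransforms.resize(2);
+        m_transforms[0].m_axisTransforms[0].scale = 1;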
+        m_transforms[0].m_axisTransforms[0].translate = -m_xOffset;
+        m_transforms[0].m_axisTransforms[1].scale = 1;
+        m_transforms[0].m_axisTransforms[1].translate = -m_yOffset;
+    }
+    // else: already computed.
+}
+
+template <class ElemType>
+bool CropNode<ElemType>::SupportsTransformOnInput(size_t inputIndex)
+{
+    // We support transform on cropped input.
+    return (inputIndex == 0);
+}
+
+template class CropNode<float>;
+template class CropNode<double>;
+
 }}}
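Note on `ComputeCropOffsets`: the offset derivation can be illustrated on a single axis in isolation. The following is a minimal sketch, not the CNTK implementation. `AxisTransform`, `Then` and `CropOffsetAlongAxis` are hypothetical names, and the sign convention (the offset is the translation of the map from input 1 coordinates to input 0 coordinates) is an assumption that may differ from `SpaceTransform`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical one-axis affine map: y = scale * x + translate.
struct AxisTransform
{
    double scale;
    double translate;

    // Returns the transform that applies *this first and `next` second.
    AxisTransform Then(const AxisTransform& next) const
    {
        return AxisTransform{next.scale * scale, next.scale * translate + next.translate};
    }

    // Returns the transform that undoes *this (scale must be non-zero).
    AxisTransform Inverse() const
    {
        return AxisTransform{1.0 / scale, -translate / scale};
    }
};

// Given the maps from the common ancestor to each crop input, returns the position
// in input 0 at which coordinate 0 of input 1 lands. As in the crop node, the net
// scale is expected to be 1 and the offset is expected to be non-negative.
double CropOffsetAlongAxis(const AxisTransform& ancestorToInput0, const AxisTransform& ancestorToInput1)
{
    const AxisTransform input1ToInput0 = ancestorToInput1.Inverse().Then(ancestorToInput0);
    assert(std::fabs(input1ToInput0.scale - 1.0) < 1e-12);
    assert(input1ToInput0.translate >= 0.0);
    return input1ToInput0.translate;
}
```

For instance, a stride-2 pooling `{0.5, 0}` followed by an upsampling `{2, 4}` composes to a net map `{1, 4}`; against a second path `{1, 1}`, the offset under these assumptions is 3.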
--- a/Source/ComputationNetworkLib/ReshapingNodes.h
+++ b/Source/ComputationNetworkLib/ReshapingNodes.h
@ -408,6 +408,111 @@ private:
 template class SliceNode<float>;
 template class SliceNode<double>;

+// -----------------------------------------------------------------------
+// CropNode
+//
+// Extracts portion of inputNode1 (input to be cropped) that corresponds to
+// inputNode2 (input that defines crop dimensions).
+//
+// Cropping offsets can be given directly (offsetX, offsetY parameters in BS/NDL).
+// These offsets must be given in absolute values (pixels).
+//
+// Alternatively, offsets can be calculated automatically using network graph
+// and node transforms. The offsets are computed by traversing the network graph
+// and finding common ancestor of crop node inputs. Once ancestor is found, affine
+// transform is computed along the paths from first and second input to common
+// ancestor. Complete transform from one input to other is finally calculated by
+// composing these two transforms. Translate components of final transform define
+// crop offsets.
+// Automatic crop calculation can also use equivalence nodes. Equivalence nodes
+// act as virtual common ancestors. For example, two inputs to network may be
+// equivalent in spatial sense (for example input and target in case of pixelwise
+// semantic labeling) but they are separate leaf nodes which cannot be common
+// ancestors for inputs to crop node. However, they can be declared as equivalent
+// using equivalence nodes option: when the traversals from the two crop inputs
+// reach the two equivalence nodes, the path between the crop inputs is considered
+// closed over them.
+//
+// Usage (Both NDL and BS):
+//  CropNode(input1, input2, offsetX, offsetY) or
+//  CropNode(input1, input2) or
+//  CropNode(input1, input2, eqNode1, eqNode2)
+// where:
+//  input1 - computation node to be cropped at given/calculated offsets with width and height taken from input2
+//  input2 - computation node that defines cropping shape (width and height to be used when cropping input1)
+//  offsetX - manually given absolute offset in pixels along x axis (must be used with type="manual")
+//  offsetY - manually given absolute offset in pixels along y axis (must be used with type="manual")
+//  eqNode1 - first equivalence node
+//  eqNode2 - second equivalence node
+// -----------------------------------------------------------------------
+
+template <class ElemType>
+class CropNode : public ComputationNode<ElemType>, public TransformerNode
+{
+    typedef ComputationNode<ElemType> Base;
+    UsingComputationNodeMembersBoilerplate;
+
+    static const std::wstring TypeName() { return L"Crop"; }
+
+public:
+    CropNode(DEVICEID_TYPE deviceId, const std::wstring& name);
+
+    CropNode(size_t offsetX, size_t offsetY, DEVICEID_TYPE deviceId, const std::wstring& name);
+
+    CropNode(const ScriptableObjects::IConfigRecordPtr configp);
+
+    void /*ComputationNodeBase::*/ Validate(bool isFinalValidationPass) override;
+
+    virtual void /*ComputationNode::*/ ForwardProp(const FrameRange& /*fr*/) override;
+
+    virtual void /*ComputationNode::*/ BackpropTo(const size_t inputIndex, const FrameRange& /*fr*/) override;
+
+    void Save(File& fstream) const override;
+
+    void Load(File& fstream, size_t modelVersion) override;
+
+    void CopyTo(ComputationNodeBasePtr nodeP, const std::wstring& newName, const CopyNodeFlags flags) const override;
+
+private:
+    using ComputationNodeBase::GetInputs;
+    using TransformerNode::m_transforms;
+
+    // Declaration of matrix getting method to unify accessing values and gradients.
+    typedef MatrixBasePtr(ComputationNode<ElemType>::*MatrixGetter)() const;
+
+    // Helper structure to store input/output views which define parts of input and output we work with.
+    struct CroppedIOViews
+    {
+        CroppedIOViews(CropNode* cropNode, MatrixGetter matrixGetter, TensorShape inputShapeCropped, TensorShape outputShape) :
+            // Input view is derived from first input.
+            inputViewCropped((cropNode->Input(0).get()->*matrixGetter)(), inputShapeCropped),
+            // Output view corresponds to single output.
+            outputView((cropNode->*matrixGetter)(), outputShape)
+        {}
+
+        TensorView<ElemType> inputViewCropped;
+        TensorView<ElemType> outputView;
+    };
+
+    // Creates input and output views (TensorViews that define parts of input and output we work with). MatrixGetter is
+    // the pointer to method that returns appropriate matrix (values in forward or gradients in backward). Using
+    // MatrixGetter we can reuse code without copy-pasting.
+    CroppedIOViews CreateIOViews(MatrixGetter matrixGetter);
+
+    // Performs offsets computation if necessary.
+    void ComputeCropOffsets();
+
+    virtual void /*TransformerNode::*/ComputeTransforms() override;
+
+    virtual bool /*TransformerNode::*/SupportsTransformOnInput(size_t inputIndex) override;
+
+protected:
+    // Offset along x axis. Offsets are stored as floating-point values to preserve precision when one crop node affects the offset computation of another.
+    double m_xOffset;
+    // Offset along y axis.
+    double m_yOffset;
+};
+
 // -----------------------------------------------------------------------
 // RowStack (input0, input1, ...)
 // stacks multiple inputs on top of each other
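Note on the `CropNode` forward and backward passes: the header describes copying a window of the first input to the output, and routing gradients back only into that window. A minimal sketch of that data movement on a plain 2D buffer follows. `Image`, `CropForward` and `CropBackward` are hypothetical helpers for illustration and do not use `TensorView`; the backward helper only accumulates, whereas the node in this change zeroes the input gradient first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 2D buffer, x is the fastest-changing dimension.
struct Image
{
    std::size_t width;
    std::size_t height;
    std::vector<float> data;

    Image(std::size_t w, std::size_t h) : width(w), height(h), data(w * h, 0.0f) {}

    float& At(std::size_t x, std::size_t y) { return data[x + y * width]; }
    float At(std::size_t x, std::size_t y) const { return data[x + y * width]; }
};

// Forward: copy the outW x outH window that starts at (xOffset, yOffset) in the input.
Image CropForward(const Image& input, std::size_t outW, std::size_t outH, std::size_t xOffset, std::size_t yOffset)
{
    // The input must be large enough to be cropped at the given offsets.
    assert(input.width >= outW + xOffset && input.height >= outH + yOffset);
    Image output(outW, outH);
    for (std::size_t y = 0; y < outH; y++)
        for (std::size_t x = 0; x < outW; x++)
            output.At(x, y) = input.At(x + xOffset, y + yOffset);
    return output;
}

// Backward: add the output gradient into the cropped window of the input gradient.
// Elements outside the window are left untouched.
void CropBackward(const Image& outputGrad, Image& inputGrad, std::size_t xOffset, std::size_t yOffset)
{
    assert(inputGrad.width >= outputGrad.width + xOffset && inputGrad.height >= outputGrad.height + yOffset);
    for (std::size_t y = 0; y < outputGrad.height; y++)
        for (std::size_t x = 0; x < outputGrad.width; x++)
            inputGrad.At(x + xOffset, y + yOffset) += outputGrad.At(x, y);
}
```

Cropping a 4 x 3 input to 2 x 2 at offsets (1, 1) reads input elements (1..2, 1..2); the backward call writes gradients to exactly those positions.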
--- a/Source/ComputationNetworkLib/TrainingNodes.cpp
+++ b/Source/ComputationNetworkLib/TrainingNodes.cpp
@ -33,9 +33,9 @@ void RandomSampleNodeBase<ElemType>::CopyTo(ComputationNodeBasePtr nodeP, const
    if (flags & CopyNodeFlags::copyNodeValue)
    {
        auto node = dynamic_pointer_cast<RandomSampleNodeBase<ElemType>>(nodeP);
-        node->m_allowDuplicates           = m_allowDuplicates;
-        node->m_sizeOfSampledSet          = m_sizeOfSampledSet;
-        node->m_randomSeed                = m_randomSeed;
+        node->m_allowDuplicates  = m_allowDuplicates;
+        node->m_sizeOfSampledSet = m_sizeOfSampledSet;
+        node->m_randomSeed       = m_randomSeed;
    }
 }

@ -75,14 +75,14 @@ void RandomSampleNodeBase<ElemType>::UpdateWeightsPrefixSum()
 // Runs the sampling returning a vector with the id's of the samples. The parameter nTries is used to return the number of draws that was needed
 // to get the expected number of samples.
 template<class ElemType>
-const std::vector<size_t> RandomSampleNodeBase<ElemType>::RunSampling(long& nTries)
+const std::vector<size_t> RandomSampleNodeBase<ElemType>::RunSampling(size_t& nTries)
 {
    std::uniform_real_distribution<double> r(0, m_samplingWeightsPrefixSum.back());
    std::unordered_set<int> alreadySampled;
    std::vector<size_t> samples;
-    CPURNGHandle* cpuRNGHandle = dynamic_cast<CPURNGHandle*>(&GetRNGHandle(CPUDEVICE)); 
-    // find random samples using the specified weight
+    CPURNGHandle* cpuRNGHandle = dynamic_cast<CPURNGHandle*>(&GetRNGHandle(CPUDEVICE));

+    // find random samples using the specified weight
    if (m_allowDuplicates)
        nTries = m_sizeOfSampledSet;
    else
@ -123,10 +123,12 @@ void RandomSampleNode<ElemType>::ForwardPropNonLooping()
 {
    Base::UpdateWeightsPrefixSum();
    Matrix<ElemType>& valueMatrix = ValueAsMatrix();
+    // TODO: Should we prepare the CSC data directly on the CPU and move it in one go?
+    // Currently the reader will place the data onto the GPU. It will then be pulled on-demand to the CPU once (and cached there).
    valueMatrix.TransferToDeviceIfNotThere(CPUDEVICE, /*ismoved =*/ true/*means: BOTH state not ok */, /*emptyTransfer =*/ true, /*updatePreferredDevice =*/ false);
    valueMatrix.SetDevice(CPUDEVICE);

-    //BUGBUG: matrix type should be configured during validation
+    // BUGBUG: matrix type should be configured during validation
    valueMatrix.SwitchToMatrixType(SPARSE, matrixFormatSparseCSC, false);
    valueMatrix.Reset();

@ -144,7 +146,7 @@ void RandomSampleNode<ElemType>::ForwardPropNonLooping()
 template<class ElemType>
 const std::vector<size_t> RandomSampleNode<ElemType>::GetWeightedSamples()
 {
-    long dummy;
+    size_t dummy;
    // Here we are not interested in the number of sampling tries needed, which is returned in the parameter.
    return Base::RunSampling(dummy);
 }
@ -157,29 +159,31 @@ void RandomSampleNode<ElemType>::Validate(bool isFinalValidationPass)

    let& shape = Input(0)->GetSampleLayout();
    let dims = shape.GetDims();
-    size_t nClasses = dims[0];
+    size_t numClasses = dims[0];

    // Output: a (sparse) matrix containing m_sizeOfSampledSet columns of 1-hot vectors specifiying the sampled classes.
-    SetDims(TensorShape(nClasses, Base::m_sizeOfSampledSet), false);
+    SetDims(TensorShape(numClasses, Base::m_sizeOfSampledSet), false);
 }

 template<class ElemType>
 bool RandomSampleNode<ElemType>::IsOutOfDateWrtInputs() const
 {
-    // If we are in the mode to generate random samples (i.e. m_estimateInSampleFrequency == false) 
-    // we need to recompute the result for each mini-batch even if the weight vector didn't change.
+    // We need to recompute the result for each mini-batch even if the weight vector didn't change.
    return true;
 }

+template class RandomSampleNode<float>;
+template class RandomSampleNode<double>;
+
 template<class ElemType>
 double RandomSampleInclusionFrequencyNode<ElemType>::EstimateNumberOfTries()
 {
    // We estimate the average numver of tries by repeating a fixed number of experiments
    const size_t numExperiments = 10; // We choose 10 without any deep justification.
    long totalTries = 0;
-    for (int iExperiment = 0; iExperiment < numExperiments; iExperiment++)
+    for (size_t i = 0; i < numExperiments; i++)
    {
-        long nTries;
+        size_t nTries;
        Base::RunSampling(nTries);
        totalTries += nTries;
    }
@ -210,7 +214,7 @@ void RandomSampleInclusionFrequencyNode<ElemType>::ForwardPropNonLooping()
    valueMatrix.TransferToDeviceIfNotThere(CPUDEVICE, /*ismoved =*/ true/*means: BOTH state not ok */, /*emptyTransfer =*/ true, /*updatePreferredDevice =*/ false);
    valueMatrix.SetDevice(CPUDEVICE);

-    //BUGBUG: matrix type should be configured during validation
+    // BUGBUG: matrix type should be configured during validation
    valueMatrix.SwitchToMatrixType(DENSE, matrixFormatDense, false);
    double sumOfWeights = Base::m_samplingWeightsPrefixSum.back();
    const Matrix<ElemType>& samplingWeights = Input(0)->ValueAsMatrix();
@ -240,8 +244,6 @@ void RandomSampleInclusionFrequencyNode<ElemType>::Validate(bool isFinalValidati
    SetDims(TensorShape(nClasses, 1), false);
 }

-template class RandomSampleNode<float>;
-template class RandomSampleNode<double>;
 template class RandomSampleInclusionFrequencyNode<float>;
 template class RandomSampleInclusionFrequencyNode<double>;
 }}}
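Note on `RunSampling`: the diff shows only part of the sampling loop, so here is a minimal standalone sketch of weighted sampling over a prefix sum that also reports the number of draws through `nTries`. `SampleClasses` is a hypothetical helper, `std::mt19937` stands in for the CNTK RNG handle, and the loop structure is an assumption rather than the node's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <unordered_set>
#include <vector>

// Hypothetical helper: draws sizeOfSampledSet class ids with probability proportional
// to weights, p(w_i) = w_i / sum_k(w_k). nTries returns the number of draws needed.
// Without duplicates, at least sizeOfSampledSet classes must have a positive weight,
// otherwise the loop cannot terminate.
std::vector<std::size_t> SampleClasses(const std::vector<double>& weights, std::size_t sizeOfSampledSet,
                                       bool allowDuplicates, std::mt19937& rng, std::size_t& nTries)
{
    assert(!weights.empty());
    std::vector<double> prefixSum(weights.size());
    std::partial_sum(weights.begin(), weights.end(), prefixSum.begin());
    std::uniform_real_distribution<double> r(0.0, prefixSum.back());

    std::unordered_set<std::size_t> alreadySampled;
    std::vector<std::size_t> samples;
    nTries = 0;
    while (samples.size() < sizeOfSampledSet)
    {
        nTries++;
        const double randomValue = r(rng);
        // The first prefix sum strictly greater than the drawn value selects the class,
        // so classes with zero weight are never selected.
        std::size_t id = static_cast<std::size_t>(
            std::upper_bound(prefixSum.begin(), prefixSum.end(), randomValue) - prefixSum.begin());
        // Guard against a draw that rounds up to the total weight.
        id = std::min(id, weights.size() - 1);
        if (allowDuplicates || alreadySampled.insert(id).second)
            samples.push_back(id);
    }
    return samples;
}
```

With duplicates allowed, `nTries` always equals `sizeOfSampledSet`; without duplicates it can be larger, which is the quantity `EstimateNumberOfTries` averages over several experiments.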
--- a/Source/ComputationNetworkLib/TrainingNodes.h
+++ b/Source/ComputationNetworkLib/TrainingNodes.h
@ -1185,7 +1185,7 @@ protected:
 // Provides random sampling functionality.
 //
 // Parameters:
-// * Input(0) Sampling weight vector: Matrix of shape (nClasses x 1) providing sampling weights >= 0.
+// * Input(0) Sampling weight vector: Matrix of shape [numClasses x 1] providing sampling weights >= 0.
 // * sizeOfSampledSet: Size of the sampled set.
 // * allowDuplicates: controls if sampled set is allowed to contain duplicates.
 // --------------------------------------------------------------------------------------------------------------------------------------------------
@ -1221,16 +1221,12 @@ protected:

    // Runs the sampling returning a vector with the id's of the samples. The parameter nTries is used to return the number of draws that was needed
    // to get the expected number of samples.
-    const std::vector<size_t> RunSampling(long& nTries);
+    const std::vector<size_t> RunSampling(size_t& nTries);

 public:
-    virtual void /*ComputationNode::*/ BackpropToNonLooping(size_t inputIndex) override {
-        // This node does not propagate gradients.
-    }
-
+    virtual void /*ComputationNode::*/ BackpropToNonLooping(size_t inputIndex) override {} // This node does not propagate gradients.
    virtual void /*ComputationNodeBase::*/ Validate(bool isFinalValidationPass) override;
    virtual bool OutputUsedInComputingInputNodesGradients() const override { return false; }
-
    virtual bool InputUsedInComputingInputNodesGradients(size_t /*childIndex*/) const override { return false;}
    virtual void /*ComputationNode::*/ ForwardPropNonLooping() override{}
    virtual bool GetAllowDuplicates() const { return m_allowDuplicates; }
@ -1244,14 +1240,17 @@ protected:

 // ------------------------------------------------------------------------------------------------------------------------------------------------
 // RandomSampleNode(samplingWeights, sizeOfSampledSet, allowDuplicates):
-// The node's value is a set of sizeOfSampledSet random samples represented as a (sparse) matrix of shape [nClasses x sizeOfSampledSet] where nClasses is the number of classes (categories) to choose from.
+// The node's value is a set of sizeOfSampledSet random samples represented as a (sparse) matrix 
+// of shape [numClasses x sizeOfSampledSet] where numClasses is the number of classes (categories) to choose from.
 // The output has no dynamic axis.
-// The samples are drawn according to the weight vector p(w_i) = w_i / sum_k(w_k)
+// The samples are drawn with a probability proportional to the weights w of the vector 'samplingWeights': p(w_i) = w_i / sum_k(w_k)
 // We get one set of samples for per minibatch.
+// Multiply a 'numClasses'-dimensional vector with this matrix to randomly sample 'sizeOfSampledSet' values from it.
+// The resulting vector has a dimension of 'sizeOfSampledSet'. Currently, only rank-1 tensors are supported.
 // Intended uses are e.g. sampled softmax, noise contrastive estimation etc.
 //
 // Parameters:
-// * Input(0): Sampling weight vector. Matrix of shape (nClasses x 1) providing sampling weights >= 0.
+// * Input(0): Sampling weight vector. Matrix of shape [numClasses x 1] providing sampling weights >= 0.
 // * sizeOfSampledSet: Size of the sampled set.
 // * allowDuplicates: controls if sampled set is allowed to contain duplicates.
 // --------------------------------------------------------------------------------------------------------------------------------------------------
@ -1280,13 +1279,13 @@ public:

 // ------------------------------------------------------------------------------------------------------------------------------------------------
 // RandomSampleInclusionFrequencyNode(samplingWeights, sizeOfSampledSet, allowDuplicates): 
-// Intended uses are e.g. sampled softmax, noise contrastive estimation etc where it is used together with RandomSampleNode.
+// Intended uses are e.g. sampled softmax, noise contrastive estimation etc. where it is used together with RandomSampleNode.
 // This node estimates how often each class will occur in a set sampled with RandomSampleNode(...) on the average. 
 // If the sampling mode 'allowDuplicates = true' is choosen this is trivial and exact. 
 // For allowDuplicates = false we get some estimate. The value is updated only when the input weights change.
 //
 // Parameters:
-// * Input(0): Sampling weight vector. Matrix of shape (nClasses x 1) providing sampling weights >= 0.
+// * Input(0): Sampling weight vector. Matrix of shape [numClasses x 1] providing sampling weights >= 0.
 // * sizeOfSampledSet: Size of the sampled set.
 // * allowDuplicates: controls if sampled set is allowed to contain duplicates.
 // --------------------------------------------------------------------------------------------------------------------------------------------------
@ -2217,7 +2216,8 @@ template class DropoutNode<double>;
 // * imageLayout is the image layout. Only cudnn is supported at present.
 // -----------------------------------------------------------------------
 template <class ElemType>
-class BatchNormalizationNode : public ComputationNodeNonLooping<ElemType>, public NumInputs<5>, public IFreezable
+class BatchNormalizationNode : public ComputationNodeNonLooping<ElemType>, public NumInputs<5>, public IFreezable,
+    public IdentityTransformerNodeOnOneInput<0>
 {
    typedef ComputationNodeNonLooping<ElemType> Base; UsingComputationNodeMembersBoilerplate;
    static const std::wstring TypeName() { return L"BatchNormalization"; }
--- a/Source/SGDLib/SGD.cpp
+++ b/Source/SGDLib/SGD.cpp
@ -479,7 +479,7 @@ void SGD<ElemType>::TrainOrAdaptModel(int startEpoch, ComputationNetworkPtr net,
        // the last minibatch size, or we use tuning to try and find a better one.
        if (m_autoAdjustMinibatch && i >= m_mbSize.size())
        {
-            size_t numFramesToUseInSearch = m_numMiniBatch4LRSearch[i] * m_mbSize[i];
+            size_t numFramesToUseInSearch = m_numSamples4Search[i];
            if (m_epochSize != requestDataSize)
            {
                // ensure the numFramesToUseInSearch does not exceed the total number of frames in the epoch
@ -820,7 +820,8 @@ size_t SGD<ElemType>::TrainOneEpoch(ComputationNetworkPtr net,
                                    std::list<Matrix<ElemType>>& smoothedGradients, vector<double>& smoothedCounts,
                                    /*out*/ EpochCriterion& epochCriterion,
                                    /*out*/ std::vector<EpochCriterion>& epochEvalErrors,
-                                    const std::string& prefixMsg)
+                                    const std::string& prefixMsg,
+                                    const size_t maxNumberOfSamples)
 {
    ScopedNetworkOperationMode modeGuard(net, NetworkOperationMode::training);

@ -943,6 +944,18 @@ size_t SGD<ElemType>::TrainOneEpoch(ComputationNetworkPtr net,
        }
    }

+    // In case adaptive minibatch/learning rates are used, the training can be limited by the maxNumberOfSamples.
+    bool maxNumSamplesExceeded = false;
+    size_t epochStartSample = 0;
+    bool shouldCheckEarlyExit = (maxNumberOfSamples != SIZE_MAX);
+    if (shouldCheckEarlyExit)
+    {
+        // The SparsePC, LibSCV and DSS readers do not implement GetCurrentSamplePosition().
+        // Adaptive minibatch size is therefore not supported for them, and requesting it
+        // causes an error message.
+        epochStartSample = trainSetDataReader->GetCurrentSamplePosition();
+    }
+
    bool noMoreSamplesToProcess = false;
    bool isFirstMinibatch = true;
    for (;;)
@ -960,6 +973,10 @@ size_t SGD<ElemType>::TrainOneEpoch(ComputationNetworkPtr net,
        size_t actualMBSize = 0;
        bool wasDataRead = DataReaderHelpers::GetMinibatchIntoNetwork<ElemType>(*trainSetDataReader, net, criterionNodes[0],
                                                                                useDistributedMBReading, useParallelTrain, *inputMatrices, actualMBSize, m_mpi);
+
+        if (maxNumSamplesExceeded) // Dropping data.
+            wasDataRead = false;
+
        if (!wasDataRead && (!useDistributedMBReading || noMoreSamplesToProcess)) // in case of distributed reading, we do a few more loops until all ranks have completed
            break;                                                                // end of epoch

@ -1058,6 +1075,14 @@ size_t SGD<ElemType>::TrainOneEpoch(ComputationNetworkPtr net,
        } // if (actualMBSize > 0)
        // WARNING: If actualMBSize == 0, then criterion nodes have NOT been updated, and contain garbage (last MB's) values.

+        // In case of mini epochs (used for adaptive minibatch size and learning rate),
+        // this worker should process no more data once maxNumberOfSamples has been exceeded.
+        if (shouldCheckEarlyExit)
+        {
+            if (epochStartSample + maxNumberOfSamples < trainSetDataReader->GetCurrentSamplePosition())
+                maxNumSamplesExceeded = true;
+        }
+
        if (m_perfTraceLevel > 0)
        {
            std::unique_ptr<MatrixComputeStreamEvent> mainStreamSyncEvent(MatrixComputeStreamEvent::Create(net->GetDeviceId()));
@ -1490,7 +1515,7 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
 {
    double bestLearnRatePerSample = curLearnRate;

-    size_t numFramesToUseInSearch = m_numMiniBatch4LRSearch[epochNumber] * m_mbSize[epochNumber];
+    size_t numFramesToUseInSearch = m_numSamples4Search[epochNumber];
    if (m_epochSize != requestDataSize)
    {
        // ensure the numFramesToUseInSearch does not exceed the total number of frames in the epoch
@ -1525,13 +1550,14 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
    EpochCriterion baseCriterion;
    vector<EpochCriterion> epochEvalErrors(evaluationNodes.size(), EpochCriterion::Infinity()); // these are ignored in this entire method
    TrainOneMiniEpochAndReloadModel(net, refNet, refNode, epochNumber,
-                                    numFramesToUseInSearch, trainSetDataReader, 0, m_mbSize[epochNumber],
+                                    m_epochSize, trainSetDataReader, 0, m_mbSize[epochNumber],
                                    featureNodes, labelNodes,
                                    criterionNodes, evaluationNodes,
                                    inputMatrices, learnableNodes,
                                    smoothedGradients, smoothedCounts,
                                    /*out*/ baseCriterion, /*out*/ epochEvalErrors,
-                                    "BaseAdaptiveLearnRateSearch:");
+                                    "BaseAdaptiveLearnRateSearch:",
+                                    numFramesToUseInSearch);

    if (m_autoLearnRateSearchType == LearningRateSearchAlgorithm::SearchBeforeEpoch)
    {
@ -1552,13 +1578,14 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
    {
        learnRatePerSample *= 0.618;
        TrainOneMiniEpochAndReloadModel(net, refNet, refNode, epochNumber,
-                                        numFramesToUseInSearch, trainSetDataReader,
+                                        m_epochSize, trainSetDataReader,
                                        learnRatePerSample, m_mbSize[epochNumber], featureNodes,
                                        labelNodes, criterionNodes,
                                        evaluationNodes, inputMatrices,
                                        learnableNodes, smoothedGradients, smoothedCounts,
                                        /*out*/ epochCriterion, /*out*/ epochEvalErrors,
-                                        "AdaptiveLearnRateSearch:");
+                                        "AdaptiveLearnRateSearch:",
+                                        numFramesToUseInSearch);
    } while (epochCriterion.IsNan() || (epochCriterion.Average() > baseCriterion.Average() && learnRatePerSample > minLearnRate));

    bestLearnRatePerSample = learnRatePerSample;
@ -1572,14 +1599,15 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
        EpochCriterion leftCriterion; // we compute this from the mini epoch

        TrainOneMiniEpochAndReloadModel(net, refNet, refNode, epochNumber,
-                                        numFramesToUseInSearch, trainSetDataReader,
+                                        m_epochSize, trainSetDataReader,
                                        leftLearnRatePerSample, m_mbSize[epochNumber],
                                        featureNodes, labelNodes,
                                        criterionNodes, evaluationNodes,
                                        inputMatrices, learnableNodes,
                                        smoothedGradients, smoothedCounts,
                                        /*out*/ leftCriterion, /*out*/ epochEvalErrors,
-                                        "DetailBaseAdaptiveLearnRateSearch:");
+                                        "DetailBaseAdaptiveLearnRateSearch:",
+                                        numFramesToUseInSearch);

        while (rightLearnRatePerSample > leftLearnRatePerSample * 1.2)
        {
@ -1588,7 +1616,8 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
                rightLearnRatePerSample *= 0.618;

                TrainOneMiniEpochAndReloadModel(net, refNet, refNode,
-                                                epochNumber, numFramesToUseInSearch,
+                                                epochNumber, 
+                                                m_epochSize,
                                                trainSetDataReader,
                                                rightLearnRatePerSample, m_mbSize[epochNumber],
                                                featureNodes, labelNodes,
@ -1599,14 +1628,16 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
                                                smoothedGradients, smoothedCounts,
                                                /*out*/ rightCriterion,
                                                /*out*/ epochEvalErrors,
-                                                "DetailRightAdaptiveLearnRateSearch:");
+                                                "DetailRightAdaptiveLearnRateSearch:",
+                                                numFramesToUseInSearch);
            }
            else
            {
                leftLearnRatePerSample /= 0.618;

                TrainOneMiniEpochAndReloadModel(net, refNet, refNode,
-                                                epochNumber, numFramesToUseInSearch,
+                                                epochNumber,
+                                                m_epochSize,
                                                trainSetDataReader,
                                                leftLearnRatePerSample, m_mbSize[epochNumber],
                                                featureNodes, labelNodes,
@ -1617,7 +1648,8 @@ double SGD<ElemType>::SearchForBestLearnRate(ComputationNetworkPtr net,
                                                smoothedGradients, smoothedCounts,
                                                /*out*/ leftCriterion,
                                                /*out*/ epochEvalErrors,
-                                                "DetailLeftAdaptiveLearnRateSearch:");
+                                                "DetailLeftAdaptiveLearnRateSearch:",
+                                                numFramesToUseInSearch);
            }
        }

@ -1790,13 +1822,14 @@ size_t SGD<ElemType>::SearchForBestMinibatchSize(ComputationNetworkPtr net,
        // Train on a few minibatches and so we can observe the epochCriterion as we try increasing
        // minibatches with iteration of this loop.
        TrainOneMiniEpochAndReloadModel(net, refNet, refNode, epochNumber,
-                                        numFramesToUseInSearch, trainSetDataReader,
+                                        m_epochSize, trainSetDataReader,
                                        learnRatePerSample, trialMinibatchSize, featureNodes,
                                        labelNodes, criterionNodes,
                                        evaluationNodes, inputMatrices,
                                        learnableNodes, smoothedGradients, smoothedCounts,
                                        /*out*/ epochCriterion, /*out*/ epochEvalErrors,
-                                        isFirstIteration ? "BaseAdaptiveMinibatchSearch:" : "AdaptiveMinibatchSearch:");
+                                        isFirstIteration ? "BaseAdaptiveMinibatchSearch:" : "AdaptiveMinibatchSearch:",
+                                        numFramesToUseInSearch);

        if (isFirstIteration)
        {
@ -1858,14 +1891,15 @@ void SGD<ElemType>::TrainOneMiniEpochAndReloadModel(ComputationNetworkPtr net,
                                                    std::list<Matrix<ElemType>>& smoothedGradients, vector<double> smoothedCounts,
                                                    /*out*/ EpochCriterion& epochCriterion,
                                                    /*out*/ std::vector<EpochCriterion>& epochEvalErrors,
-                                                    std::string prefixMsg)
+                                                    std::string prefixMsg,
+                                                    const size_t maxNumOfSamples)
 {
    TrainOneEpoch(net, refNet, refNode, epochNumber, epochSize,
                  trainSetDataReader, learnRatePerSample, minibatchSize, featureNodes,
                  labelNodes, criterionNodes, evaluationNodes,
                  inputMatrices, learnableNodes, smoothedGradients, smoothedCounts,
                  /*out*/ epochCriterion, /*out*/ epochEvalErrors,
-                  "  " + prefixMsg); // indent log msg by 2 (that is 1 more than the Finished message below)
+                  "  " + prefixMsg, maxNumOfSamples); // indent log msg by 2 (that is 1 more than the Finished message below)

    LOGPRINTF(stderr, " Finished Mini-Epoch[%d]: ", (int)epochNumber+1);
    epochCriterion.LogCriterion(criterionNodes[0]->NodeName());
@ -2489,11 +2523,6 @@ SGDParams::SGDParams(const ConfigRecordType& configSGD, size_t sizeofElemType)
    m_minibatchSizeTuningMax = configAALR(L"minibatchSizeTuningMax", (size_t) 1048576);
    m_minibatchSearchCriterionErrorMargin = configAALR(L"minibatchSearchCriterionErrorMargin", (size_t) 1);

-    // the number of minibatches used to search
-    // the learning rate. It's typically set to 10-20% of
-    // the total minibatches in an epoch.
-    m_numMiniBatch4LRSearch = configAALR(L"numMiniBatch4LRSearch", ConfigRecordType::Array(intargvector(vector<int>{500})));
-
    m_numPrevLearnRates = configAALR(L"numPrevLearnRates", (size_t) 5);
    m_numBestSearchEpoch = configAALR(L"numBestSearchEpoch", (size_t) 1);
    m_loadBestModel = configAALR(L"loadBestModel", true);
@ -2508,6 +2537,26 @@ SGDParams::SGDParams(const ConfigRecordType& configSGD, size_t sizeofElemType)
    m_maxSamplesInRAM = configSGD(L"maxSamplesInRAM", (size_t) SIZE_MAX);
    m_numSubminiBatches = configSGD(L"numSubminibatches", (size_t) 1);

+    if (configAALR.Exists(L"numMiniBatch4LRSearch"))
+    {
+        LOGPRINTF(stderr, "WARNING: 'numMiniBatch4LRSearch' is deprecated, please remove it and use 'numSamples4Search' instead.\n");
+        // the number of minibatches used to search
+        // the learning rate. It's typically set to 10-20% of
+        // the total minibatches in an epoch.
+        auto numMiniBatch4LRSearch = configAALR(L"numMiniBatch4LRSearch", ConfigRecordType::Array(intargvector(vector<int>{500})));
+        m_numSamples4Search.resize(numMiniBatch4LRSearch.size());
+        for (size_t i = 0; i < numMiniBatch4LRSearch.size(); ++i)
+            m_numSamples4Search[i] = numMiniBatch4LRSearch[i] * m_mbSize[i];
+    }
+    else
+    {
+        // Default is default mbSize * 500, same as above.
+        intargvector defaultValues;
+        defaultValues.resize(m_mbSize.size());
+        std::transform(m_mbSize.begin(), m_mbSize.end(), defaultValues.begin(), [](int v) { return v * 500; });
+        m_numSamples4Search = configAALR(L"numSamples4Search", ConfigRecordType::Array(defaultValues));
+    }
+
    // the number of samples in each epoch (0 means, use all the samples in each epoch).
    m_epochSize = configSGD(L"epochSize", (size_t) 0);
    // the number of samples in each epoch (0 means, use all the samples in each epoch).
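Review note on the `numSamples4Search` hunk above: the deprecated `numMiniBatch4LRSearch` value is a count of minibatches, and the new setting is a count of samples, so the backward-compatible path multiplies each entry by the minibatch size of the same epoch. The sketch below illustrates that conversion on plain vectors. `ToSamplesForSearch` is a hypothetical helper, not a CNTK function, and reusing the last minibatch size for epochs beyond the end of the schedule is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CNTK): converts a per-epoch number of
// search minibatches into a per-epoch number of search samples.
std::vector<std::size_t> ToSamplesForSearch(const std::vector<std::size_t>& numMiniBatches,
                                            const std::vector<std::size_t>& mbSizes)
{
    assert(!mbSizes.empty());
    std::vector<std::size_t> numSamples(numMiniBatches.size());
    for (std::size_t i = 0; i < numMiniBatches.size(); ++i)
    {
        // Assumption of this sketch: if the minibatch-size schedule is shorter
        // than the minibatch-count schedule, reuse its last entry.
        const std::size_t mbSize = mbSizes[i < mbSizes.size() ? i : mbSizes.size() - 1];
        numSamples[i] = numMiniBatches[i] * mbSize;
    }
    return numSamples;
}
```

With a minibatch size of 256, the old default of 500 minibatches corresponds to 128000 samples, which is the same amount of data the search used before the rename.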
--- a/Source/SGDLib/SGD.h
+++ b/Source/SGDLib/SGD.h
@ -196,7 +196,7 @@ protected:
    bool m_gradientClippingWithTruncation;
    double m_clippingThresholdPerSample;

-    intargvector m_numMiniBatch4LRSearch;
+    intargvector m_numSamples4Search;
    size_t m_numBestSearchEpoch;

    LearningRateSearchAlgorithm m_autoLearnRateSearchType;
@ -414,7 +414,8 @@ protected:
                                         std::list<Matrix<ElemType>>& smoothedGradients, std::vector<double> smoothedCounts,
                                         /*out*/ EpochCriterion& epochCriterion,
                                         /*out*/ std::vector<EpochCriterion>& epochEvalErrors,
-                                         std::string prefixMsg = "");
+                                         std::string prefixMsg,
+                                         const size_t maxNumOfSamples);

    size_t AdaptiveMinibatchSizing(ComputationNetworkPtr net,
                                   ComputationNetworkPtr refNet,
@ -478,7 +479,8 @@ protected:
                         std::list<Matrix<ElemType>>& smoothedGradients, std::vector<double>& smoothedCounts,
                         /*out*/ EpochCriterion& epochCriterion,
                         /*out*/ std::vector<EpochCriterion>& epochEvalErrors,
-                         const std::string& prefixMsg = "");
+                         const std::string& prefixMsg = "",
+                         const size_t maxNumberOfSamples = SIZE_MAX);

    void InitDistGradAgg(int numEvalNodes, int numGradientBits, int traceLevel);
    void InitModelAggregationHandler(int traceLevel, DEVICEID_TYPE devID);
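Review note on the `maxNumberOfSamples` parameter: the early-exit logic in `TrainOneEpoch` can be hard to follow from the hunks alone, so here is a minimal single-worker sketch of the same control flow. `CountingReader` and `TrainWithBudget` are hypothetical stand-ins, not CNTK types, and the distributed-reading case (where the loop keeps running until all ranks finish) is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a reader that tracks its position in samples.
struct CountingReader
{
    std::size_t position = 0;
    std::size_t totalSamples = 0;

    // Reads up to mbSize samples and returns how many were actually read.
    std::size_t ReadMinibatch(std::size_t mbSize)
    {
        const std::size_t remaining = totalSamples - position;
        const std::size_t n = remaining < mbSize ? remaining : mbSize;
        position += n;
        return n;
    }

    std::size_t GetCurrentSamplePosition() const { return position; }
};

// Returns the number of samples that were trained on.
std::size_t TrainWithBudget(CountingReader& reader, std::size_t mbSize,
                            std::size_t maxNumberOfSamples = SIZE_MAX)
{
    assert(mbSize > 0);
    const bool shouldCheckEarlyExit = (maxNumberOfSamples != SIZE_MAX);
    const std::size_t epochStartSample = shouldCheckEarlyExit ? reader.GetCurrentSamplePosition() : 0;
    bool maxNumSamplesExceeded = false;
    std::size_t trained = 0;

    for (;;)
    {
        const std::size_t actualMBSize = reader.ReadMinibatch(mbSize);
        // Once the budget is exceeded, the minibatch just read is dropped.
        if (maxNumSamplesExceeded || actualMBSize == 0)
            break;

        trained += actualMBSize; // stands in for the forward/backward pass

        // Same strict comparison as in the diff: the flag is set only after
        // the reader has moved past the budget.
        if (shouldCheckEarlyExit &&
            epochStartSample + maxNumberOfSamples < reader.GetCurrentSamplePosition())
            maxNumSamplesExceeded = true;
    }
    return trained;
}
```

Two consequences are worth noting in this sketch: the number of trained samples is rounded up to a minibatch boundary rather than cut at the budget, and one further minibatch is read from the reader and discarded before the loop ends.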
--- a/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.linux.cpu.txt
+++ b/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.linux.cpu.txt
--- a/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.linux.gpu.txt
+++ b/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.linux.gpu.txt
--- a/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.windows.cpu.txt
+++ b/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.windows.cpu.txt
--- a/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.windows.gpu.txt
+++ b/Tests/EndToEndTests/Speech/HTKDeserializers/DNN/ParallelBMWithAdjustLR/baseline.windows.gpu.txt
--- a/Tests/EndToEndTests/UnitTests/NetworkTests/baseline.txt
+++ b/Tests/EndToEndTests/UnitTests/NetworkTests/baseline.txt
@ -3,55 +3,42 @@ CPU info:
    Hardware threads: 24
    Total Memory: 264172964 kB
 -------------------------------------------------------------------
+
+About to throw exception 'Input is small to be cropped along x dimension in crop node.'
+
+About to throw exception 'Input is small to be cropped along y dimension in crop node.'
+
+About to throw exception 'Input is small to be cropped along x dimension in crop node.'
+
+About to throw exception 'Input is small to be cropped along y dimension in crop node.'
 Current working directory: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/Tests/EndToEndTests/UnitTests/NetworkTests
 Executable path: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/build/gpu/release/bin
 Test path: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/Tests/UnitTests/NetworkTests
 Set current path to: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/Tests/UnitTests/NetworkTests/Data
 Current working directory is now: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/Tests/UnitTests/NetworkTests/Data
 NDLBuilder Using CPU
-Node 'v1' (LearnableParameter operation): Initializing Parameter[1 x 1] <- 0.000000.
-Node 'v1' (LearnableParameter operation): Initializing Parameter[1 x 1] <- 1.000000.
-Node 'v1' (LearnableParameter operation): Initializing Parameter[1 x 1] <- 1.000000.
-Node 'v1' (LearnableParameter operation): Initializing Parameter[1 x 1] <- 1.000000.
-
-Post-processing network...
-
-1 roots:
-	v2 = Plus()
-
-Validating network. 3 nodes to process in pass 1.
-
-Validating --> features = InputValue() :  -> [1 x 1 x *]
-Validating --> v1 = LearnableParameter() :  -> [1 x 1]
-Validating --> v2 = Plus (features, v1) : [1 x 1 x *], [1 x 1] -> [1 x 1 x *]
-
-Validating network. 1 nodes to process in pass 2.
-
-
-Validating network, final pass.
-
-
-
-1 out of 3 nodes do not share the minibatch layout with the input data.
-
-Post-processing network complete.
-
-
-
-Allocating matrices for forward and/or backward propagation.
-
-Memory Sharing: Out of 3 matrices, 0 are shared as 0, and 3 are not shared.
-
-
 Minibatch[0]: ActualMBSize = 1
 Written to ../Output/out.txt*
 Total Samples Evaluated = 1
 Set current path to: /home/philly/jenkins/workspace/CNTK-Test-Linux-SlaveTest/Tests/EndToEndTests/UnitTests/NetworkTests
-Running 1 test case...
+Running 4 test cases...

 Test module "NetworkTests" has passed with:
-  1 test case out of 1 passed
-  1 assertion out of 1 passed
+  4 test cases out of 4 passed
+  55 assertions out of 55 passed
+
+  Test suite "CropNodeTestSuite" has passed with:
+    3 test cases out of 3 passed
+    54 assertions out of 54 passed
+
+    Test case "CropNodeTestSuite/CropNodeValidateTest" has passed with:
+      6 assertions out of 6 passed
+
+    Test case "CropNodeTestSuite/CropNodeForwardTest" has passed with:
+      8 assertions out of 8 passed
+
+    Test case "CropNodeTestSuite/CropNodeBackwardTest" has passed with:
+      40 assertions out of 40 passed

  Test suite "NetworkTestSuite" has passed with:
    1 test case out of 1 passed
--- a/Tests/EndToEndTests/run-test-common
+++ b/Tests/EndToEndTests/run-test-common
@ -7,6 +7,8 @@

 BinaryPath=$TEST_CNTK_BINARY

+export V2_LIB_TESTING=1
+
 if [ "$TEST_DEVICE" == "cpu" ]; then
  CNTKDeviceId=-1
 elif [ "$TEST_DEVICE" == "gpu" ]; then
--- a/Tests/UnitTests/NetworkTests/CropNodeTests.cpp
+++ b/Tests/UnitTests/NetworkTests/CropNodeTests.cpp
@ -0,0 +1,230 @@
+#include "stdafx.h"
+
+#include <memory>
+#include "../../../Source/ComputationNetworkLib/ReshapingNodes.h"
+
+using namespace Microsoft::MSR::CNTK;
+using namespace std;
+
+namespace Microsoft { namespace MSR { namespace CNTK { namespace Test {
+
+// We run the tests on the CPU since there is nothing device-specific in the crop node.
+const DEVICEID_TYPE c_deviceId = CPUDEVICE;
+
+// Helper dummy node to be used as input to crop node.
+template <class ElemType>
+class DummyNodeTest : public ComputationNode<ElemType>
+{
+public:
+    typedef ComputationNode<ElemType> Base; UsingComputationNodeMembersBoilerplate;
+    static const std::wstring TypeName() { return L"DummyTest"; }
+
+    DummyNodeTest(SmallVector<size_t> shapeVec) : Base(c_deviceId, L"Dummy")
+    {
+        // Set given shape and allocate matrices.
+        TensorShape shape(shapeVec);
+        this->SetDims(shape, false);
+        this->CreateValueMatrixIfNull();
+        this->Value().Resize(1, shape.GetNumElements());
+        this->CreateGradientMatrixIfNull();
+        this->Gradient().Resize(1, shape.GetNumElements());
+    }
+    DummyNodeTest(DEVICEID_TYPE deviceId, const wstring& name) : Base(deviceId, name) {}
+
+    virtual void /*ComputationNode::*/ ForwardProp(const FrameRange& /*fr*/) override {}
+
+    virtual void /*ComputationNode::*/ BackpropTo(const size_t /*inputIndex*/, const FrameRange& /*fr*/) override {}
+
+    Matrix<ElemType>& GetGradient() { return this->Gradient(); }
+};
+
+// Extends crop node to provide access to protected members.
+template <class ElemType>
+class CropNodeTest : public CropNode<ElemType>
+{
+public:
+    CropNodeTest() : CropNode<ElemType>(0, L"CropNodeTest"){}
+
+    int OffsetX() { return this->m_xOffset; }
+    int OffsetY() { return this->m_yOffset; }
+    SmallVector<size_t> GetOutputDims() { return this->GetSampleLayout().GetDims(); }
+    void AllocMatrices()
+    {
+        this->CreateValueMatrixIfNull();
+        this->CreateGradientMatrixIfNull();
+        this->Value().Resize(1, this->GetSampleLayout().GetNumElements());
+        this->Gradient().Resize(1, this->GetSampleLayout().GetNumElements());
+    }
+    Matrix<ElemType>& GetGradient()
+    {
+        return this->Gradient();
+    }
+};
+
+template<class ElemType>
+void CropNodeValidateTestImpl()
+{
+    {
+        // Test that validation fails if cropping cannot be done in x direction.
+        auto cropNode = make_shared<CropNode<ElemType>>(6, 3, c_deviceId, L"CropNode");
+        auto cropNodeTest = make_shared<CropNodeTest<ElemType>>();
+        cropNode->CopyTo(cropNodeTest, cropNodeTest->GetName(), CopyNodeFlags::copyNodeValue);
+
+        // 6 + 5 > 10 (offset + crop > input) -> cropping not possible in x direction.
+        SmallVector<size_t> firstInputDims = { 10, 10 };
+        SmallVector<size_t> secondInputDims = { 5, 5 };
+        auto firstInput = make_shared<DummyNodeTest<ElemType>>(firstInputDims);
+        auto secondInput = make_shared<DummyNodeTest<ElemType>>(secondInputDims);
+        vector<ComputationNodeBasePtr> inputs = { firstInput, secondInput };
+        cropNodeTest->AttachInputs(inputs);
+        BOOST_REQUIRE_EXCEPTION(
+            cropNodeTest->Validate(true),
+            std::runtime_error,
+            [](std::runtime_error const& ex) { return string("Input is small to be cropped along x dimension in crop node.") == ex.what(); }
+        );
+    }
+    {
+        // Test that validation fails if cropping cannot be done in y direction.
+        auto cropNode = make_shared<CropNode<ElemType>>(3, 7, c_deviceId, L"CropNode");
+        auto cropNodeTest = make_shared<CropNodeTest<ElemType>>();
+        cropNode->CopyTo(cropNodeTest, cropNodeTest->GetName(), CopyNodeFlags::copyNodeValue);
+
+        // 7 + 5 > 10 (offset + crop > input) -> cropping not possible in y direction.
+        SmallVector<size_t> firstInputDims = { 10, 10 };
+        SmallVector<size_t> secondInputDims = { 5, 5 };
+        auto firstInput = make_shared<DummyNodeTest<ElemType>>(firstInputDims);
+        auto secondInput = make_shared<DummyNodeTest<ElemType>>(secondInputDims);
+        vector<ComputationNodeBasePtr> inputs = { firstInput, secondInput };
+        cropNodeTest->AttachInputs(inputs);
+        BOOST_REQUIRE_EXCEPTION(
+            cropNodeTest->Validate(true),
+            std::runtime_error,
+            [](std::runtime_error const& ex) { return string("Input is small to be cropped along y dimension in crop node.") == ex.what(); }
+        );
+    }
+
+    {
+        // Test that crop node output is same size as second input after validation.
+        auto cropNode = make_shared<CropNode<ElemType>>(3, 3, c_deviceId, L"CropNode");
+        auto cropNodeTest = make_shared<CropNodeTest<ElemType>>();
+        cropNode->CopyTo(cropNodeTest, cropNodeTest->GetName(), CopyNodeFlags::copyNodeValue);
+
+        SmallVector<size_t> firstInputDims = { 10, 10 };
+        SmallVector<size_t> secondInputDims = { 5, 5 };
+        auto firstInput = make_shared<DummyNodeTest<ElemType>>(firstInputDims);
+        auto secondInput = make_shared<DummyNodeTest<ElemType>>(secondInputDims);
+        vector<ComputationNodeBasePtr> inputs = { firstInput, secondInput };
+        cropNodeTest->AttachInputs(inputs);
+        cropNodeTest->Validate(true);
+        SmallVector<size_t> outputDims = cropNodeTest->GetOutputDims();
+
+        BOOST_REQUIRE_MESSAGE(outputDims == secondInputDims, "Crop node output differs from its second input");
+    }
+}
+
+template<class ElemType>
+void CropNodeForwardTestImpl()
+{
+    // Test that input is correctly cropped.
+    auto cropNode = make_shared<CropNode<ElemType>>(1, 1, c_deviceId, L"CropNode");
+    auto cropNodeTest = make_shared<CropNodeTest<ElemType>>();
+    cropNode->CopyTo(cropNodeTest, cropNodeTest->GetName(), CopyNodeFlags::copyNodeValue);
+
+    SmallVector<size_t> firstInputDims = { 4, 4 };
+    SmallVector<size_t> secondInputDims = { 2, 2 };
+    auto firstInput = make_shared<DummyNodeTest<ElemType>>(firstInputDims);
+    auto secondInput = make_shared<DummyNodeTest<ElemType>>(secondInputDims);
+
+    Matrix<ElemType>& input = firstInput->Value();
+    ElemType inputVals[16] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
+    input.SetValue(4, 4, c_deviceId, inputVals);
+
+    vector<ComputationNodeBasePtr> inputs = { firstInput, secondInput };
+    cropNodeTest->AttachInputs(inputs);
+    cropNodeTest->Validate(true);
+    cropNodeTest->AllocMatrices();
+
+    FrameRange fr;
+    cropNodeTest->ForwardProp(fr);
+    ElemType* outputData = cropNodeTest->Value().Data();
+    BOOST_REQUIRE_MESSAGE(outputData[0] == inputVals[5], "Cropping output is invalid");
+    BOOST_REQUIRE_MESSAGE(outputData[1] == inputVals[6], "Cropping output is invalid");
+    BOOST_REQUIRE_MESSAGE(outputData[2] == inputVals[9], "Cropping output is invalid");
+    BOOST_REQUIRE_MESSAGE(outputData[3] == inputVals[10], "Cropping output is invalid");
+}
+
+template<class ElemType>
+void CropNodeBackwardTestImpl()
+{
+    // Test that gradients are correctly propagated.
+    auto cropNode = make_shared<CropNode<ElemType>>(1, 1, c_deviceId, L"CropNode");
+    auto cropNodeTest = make_shared<CropNodeTest<ElemType>>();
+    cropNode->CopyTo(cropNodeTest, cropNodeTest->GetName(), CopyNodeFlags::copyNodeValue);
+
+    SmallVector<size_t> firstInputDims = { 4, 4 };
+    SmallVector<size_t> secondInputDims = { 2, 2 };
+    auto firstInput = make_shared<DummyNodeTest<ElemType>>(firstInputDims);
+    auto secondInput = make_shared<DummyNodeTest<ElemType>>(secondInputDims);
+
+    vector<ComputationNodeBasePtr> inputs = { firstInput, secondInput };
+    cropNodeTest->AttachInputs(inputs);
+    cropNodeTest->Validate(true);
+    cropNodeTest->AllocMatrices();
+    Matrix<ElemType>& outputGrad = cropNodeTest->GetGradient();
+    ElemType outputGradVals[4] = { 0, 1, 2, 3 };
+    outputGrad.SetValue(2, 2, c_deviceId, outputGradVals);
+
+    FrameRange fr;
+    cropNodeTest->BackpropTo(0, fr);
+    ElemType* input0GradVals = firstInput->GetGradient().Data();
+    BOOST_REQUIRE_MESSAGE(input0GradVals[0] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[1] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[2] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[3] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[4] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[5] == outputGradVals[0], "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[6] == outputGradVals[1], "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[7] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[8] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[9] == outputGradVals[2], "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[10] == outputGradVals[3], "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[11] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[12] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[13] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[14] == 0, "Cropping gradient is invalid");
+    BOOST_REQUIRE_MESSAGE(input0GradVals[15] == 0, "Cropping gradient is invalid");
+
+    // Test that gradients are not propagated to second input.
+    ElemType secondInputGradValue = 10;
+    secondInput->GetGradient().SetValue(secondInputGradValue);
+    cropNodeTest->BackpropTo(1, fr);
+    ElemType* input1GradVals = secondInput->GetGradient().Data();
+    for (int i = 0; i < 4; i++)
+    {
+        BOOST_REQUIRE_MESSAGE(input1GradVals[i] == secondInputGradValue, "Cropping gradient is invalid");
+    }
+}
+
+BOOST_AUTO_TEST_SUITE(CropNodeTestSuite)
+
+BOOST_AUTO_TEST_CASE(CropNodeValidateTest)
+{
+    CropNodeValidateTestImpl<float>();
+    CropNodeValidateTestImpl<double>();
+}
+
+BOOST_AUTO_TEST_CASE(CropNodeForwardTest)
+{
+    CropNodeForwardTestImpl<float>();
+    CropNodeForwardTestImpl<double>();
+}
+
+BOOST_AUTO_TEST_CASE(CropNodeBackwardTest)
+{
+    CropNodeBackwardTestImpl<float>();
+    CropNodeBackwardTestImpl<double>();
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+} } } }
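Review note on the expected indices in `CropNodeForwardTest` and `CropNodeBackwardTest`: the values 5, 6, 9 and 10 follow from cropping a 2 x 2 window at offset (1, 1) out of a 4 x 4 matrix stored in column-major order. The sketch below reproduces that index arithmetic on flat vectors. `Crop2D`, `ValidateCrop`, `CropForward` and `CropBackward` are hypothetical names, not the CNTK implementation, and the column-major layout is an assumption that happens to be consistent with the test data.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical description of a 2D crop; element (row, col) of a column-major
// matrix with `rows` rows lives at flat index col * rows + row.
struct Crop2D
{
    std::size_t inRows, inCols;   // shape of the input being cropped
    std::size_t offRow, offCol;   // crop offsets
    std::size_t outRows, outCols; // shape of the cropped output
};

// Mirrors the idea of the validation test: offset + crop must fit in the input.
void ValidateCrop(const Crop2D& c)
{
    if (c.offRow + c.outRows > c.inRows || c.offCol + c.outCols > c.inCols)
        throw std::runtime_error("Input is too small to be cropped.");
}

// Forward pass: copy the cropped window out of the input.
std::vector<double> CropForward(const std::vector<double>& input, const Crop2D& c)
{
    ValidateCrop(c);
    assert(input.size() == c.inRows * c.inCols);
    std::vector<double> output(c.outRows * c.outCols);
    for (std::size_t col = 0; col < c.outCols; ++col)
        for (std::size_t row = 0; row < c.outRows; ++row)
            output[col * c.outRows + row] = input[(col + c.offCol) * c.inRows + (row + c.offRow)];
    return output;
}

// Backward pass: scatter the output gradient into the window, zero elsewhere.
std::vector<double> CropBackward(const std::vector<double>& outputGrad, const Crop2D& c)
{
    ValidateCrop(c);
    assert(outputGrad.size() == c.outRows * c.outCols);
    std::vector<double> inputGrad(c.inRows * c.inCols, 0.0);
    for (std::size_t col = 0; col < c.outCols; ++col)
        for (std::size_t row = 0; row < c.outRows; ++row)
            inputGrad[(col + c.offCol) * c.inRows + (row + c.offRow)] += outputGrad[col * c.outRows + row];
    return inputGrad;
}
```

The backward pass is the transpose of the forward pass: every input element outside the window receives a zero gradient, which is what the sixteen assertions in the backward test check one by one.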
--- a/Tests/UnitTests/NetworkTests/NetworkTests.vcxproj
+++ b/Tests/UnitTests/NetworkTests/NetworkTests.vcxproj
@ -109,6 +109,7 @@
  <ItemGroup>
    <ClCompile Include="..\..\..\Source\CNTK\BrainScript\BrainScriptEvaluator.cpp" />
    <ClCompile Include="..\..\..\Source\CNTK\BrainScript\BrainScriptParser.cpp" />
+    <ClCompile Include="CropNodeTests.cpp" />
    <ClCompile Include="OperatorEvaluation.cpp" />
    <ClCompile Include="stdafx.cpp">
      <PrecompiledHeader>Create</PrecompiledHeader>
--- a/Tests/UnitTests/NetworkTests/NetworkTests.vcxproj.filters
+++ b/Tests/UnitTests/NetworkTests/NetworkTests.vcxproj.filters
@ -16,6 +16,7 @@
    <ClCompile Include="..\..\..\Source\CNTK\BrainScript\BrainScriptEvaluator.cpp">
      <Filter>From BrainScript</Filter>
    </ClCompile>
+    <ClCompile Include="CropNodeTests.cpp" />
  </ItemGroup>
  <ItemGroup>
    <Filter Include="Config">
--- a/bindings/python/tutorials/CNTK_202_Language_Understanding.ipynb
+++ b/bindings/python/tutorials/CNTK_202_Language_Understanding.ipynb
@ -0,0 +1,882 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hands-On Lab: Language Understanding with Recurrent Networks\n",
+    "\n",
+    "This hands-on lab shows how to implement a recurrent network to process text,\n",
+    "for the Air Travel Information Services (ATIS) tasks of slot tagging and intent classification.\n",
+    "We will start with a straightforward embedding followed by a recurrent LSTM.\n",
+    "We will then extend it to include neighbor words and run bidirectionally.\n",
+    "Lastly, we will turn this system into an intent classifier.  \n",
+    "\n",
+    "The techniques you will practice include:\n",
+    "\n",
+    "* model description by composing layer blocks instead of writing formulas\n",
+    "* creating your own layer block\n",
+    "* variables with different sequence lengths in the same network\n",
+    "* parallel training\n",
+    "\n",
+    "We assume that you are familiar with basics of deep learning, and these specific concepts:\n",
+    "\n",
+    "* recurrent networks ([Wikipedia page](https://en.wikipedia.org/wiki/Recurrent_neural_network))\n",
+    "* text embedding ([Wikipedia page](https://en.wikipedia.org/wiki/Word_embedding))\n",
+    "\n",
+    "### Prerequisites\n",
+    "\n",
+    "We assume that you have already [installed CNTK](https://www.cntk.ai/pythondocs/setup.html).\n",
+    "This tutorial requires CNTK V2. We strongly recommend running this tutorial on a machine with \n",
+    "a capable CUDA-compatible GPU. Deep learning without GPUs is not fun.\n",
+    "\n",
+    "Finally, you need to download the training and test sets. The following piece of code does that for you. If you get an error, please follow the manual instructions below it.\n",
+    "\n",
+    "We also list the imports we will need for this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import math\n",
+    "import numpy as np  # used for the one-hot example further below\n",
+    "from cntk.blocks import *  # non-layer like building blocks such as LSTM()\n",
+    "from cntk.layers import *  # layer-like stuff such as Linear()\n",
+    "from cntk.models import *  # higher abstraction level, e.g. entire standard models and also operators like Sequential()\n",
+    "from cntk.utils import *\n",
+    "from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs, INFINITELY_REPEAT, FULL_DATA_SWEEP\n",
+    "from cntk import Trainer\n",
+    "from cntk.ops import cross_entropy_with_softmax, classification_error, splice\n",
+    "from cntk.learner import adam_sgd, learning_rate_schedule, momentum_schedule\n",
+    "from cntk.persist import load_model, save_model\n",
+    "\n",
+    "from _cntk_py import set_fixed_random_seed\n",
+    "set_fixed_random_seed(1) # to become invariant to initialization order\n",
+    "\n",
+    "try:\n",
+    "    from tqdm import tqdm\n",
+    "except ImportError:\n",
+    "    tqdm = lambda x: x  # fall back to a no-op wrapper if tqdm is not installed\n",
+    "import requests\n",
+    "\n",
+    "def download(data):\n",
+    "    url = \"https://github.com/Microsoft/CNTK/blob/master/Examples/Tutorials/SLUHandsOn/atis.%s.ctf?raw=true\"\n",
+    "    response = requests.get(url%data, stream=True)\n",
+    "\n",
+    "    with open(\"atis.%s.ctf\"%data, \"wb\") as handle:\n",
+    "        for chunk in tqdm(response.iter_content()):\n",
+    "            handle.write(chunk)\n",
+    "\n",
+    "# download each file only if it is not already present\n",
+    "for t in \"train\",\"test\":\n",
+    "    if not os.path.exists(\"atis.%s.ctf\"%t):\n",
+    "        download(t)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Fallback manual instructions\n",
+    "Please download the ATIS [training](https://github.com/Microsoft/CNTK/blob/master/Tutorials/SLUHandsOn/atis.train.ctf) \n",
+    "and [test](https://github.com/Microsoft/CNTK/blob/master/Tutorials/SLUHandsOn/atis.test.ctf) \n",
+    "files and put them in the same folder as this notebook.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# load dictionaries\n",
+    "query_wl = [line.rstrip('\\n') for line in open('query.wl')]\n",
+    "slots_wl = [line.rstrip('\\n') for line in open('slots.wl')]\n",
+    "query_dict = {query_wl[i]:i for i in range(len(query_wl))}\n",
+    "slots_dict = {slots_wl[i]:i for i in range(len(slots_wl))}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task and Model Structure\n",
+    "\n",
+    "The task we want to approach in this tutorial is slot tagging.\n",
+    "We use the [ATIS corpus](https://catalog.ldc.upenn.edu/LDC95S26).\n",
+    "ATIS contains human-computer queries from the domain of Air Travel Information Services,\n",
+    "and our task will be to annotate (tag) each word of a query with whether it belongs to a\n",
+    "specific item of information (slot), and if so, which one.\n",
+    "\n",
+    "The data in your working folder has already been converted into the \"CNTK Text Format.\"\n",
+    "Let's look at an example from the test-set file `atis.test.ctf`:\n",
+    "\n",
+    "    19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O\n",
+    "    19  |S0 770:1 |# show                         |S2 128:1 |# O\n",
+    "    19  |S0 429:1 |# flights                      |S2 128:1 |# O\n",
+    "    19  |S0 444:1 |# from                         |S2 128:1 |# O\n",
+    "    19  |S0 272:1 |# burbank                      |S2 48:1  |# B-fromloc.city_name\n",
+    "    19  |S0 851:1 |# to                           |S2 128:1 |# O\n",
+    "    19  |S0 789:1 |# st.                          |S2 78:1  |# B-toloc.city_name\n",
+    "    19  |S0 564:1 |# louis                        |S2 125:1 |# I-toloc.city_name\n",
+    "    19  |S0 654:1 |# on                           |S2 128:1 |# O\n",
+    "    19  |S0 601:1 |# monday                       |S2 26:1  |# B-depart_date.day_name\n",
+    "    19  |S0 179:1 |# EOS                          |S2 128:1 |# O\n",
+    "\n",
+    "This file has 7 columns:\n",
+    "\n",
+    "* a sequence id (19). There are 11 entries with this sequence id. This means that sequence 19 consists\n",
+    "of 11 tokens;\n",
+    "* column `S0`, which contains numeric word indices;\n",
+    "* a comment column denoted by `#`, to allow a human reader to know what the numeric word index stands for.\n",
+    "Comment columns are ignored by the system. `BOS` and `EOS` are special words\n",
+    "to denote beginning and end of sentence, respectively;\n",
+    "* column `S1` is an intent label, which we will only use in the last part of the tutorial;\n",
+    "* another comment column that shows the human-readable label of the numeric intent index;\n",
+    "* column `S2` is the slot label, represented as a numeric index; and\n",
+    "* another comment column that shows the human-readable label of the numeric label index.\n",
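+    "\n",
+    "To make the column layout concrete, here is a minimal sketch of how one such line could be split into its parts.\n",
+    "It is for illustration only and is not CNTK's reader: `parse_ctf_line` is a hypothetical helper, and it assumes\n",
+    "that every stream carries exactly one sparse `index:value` entry, as in the excerpt above.\n",
+    "\n",
+    "```python\n",
+    "def parse_ctf_line(line):\n",
+    "    # hypothetical helper, not part of CNTK\n",
+    "    fields = line.split('|')\n",
+    "    seq_id = int(fields[0])                 # leading column: the sequence id\n",
+    "    streams = {}\n",
+    "    for field in fields[1:]:\n",
+    "        name, _, value = field.strip().partition(' ')\n",
+    "        if name == '#':                     # comment columns are skipped\n",
+    "            continue\n",
+    "        index, _, _ = value.partition(':')  # keep the index of the 'index:value' pair\n",
+    "        streams[name] = int(index)\n",
+    "    return seq_id, streams\n",
+    "\n",
+    "seq_id, streams = parse_ctf_line('19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O')\n",
+    "print(seq_id, sorted(streams.items()))\n",
+    "# 19 [('S0', 178), ('S1', 14), ('S2', 128)]\n",
+    "```\n",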
+    "\n",
+    "The task of the neural network is to look at the query (column `S0`) and predict the\n",
+    "slot label (column `S2`).\n",
+    "As you can see, each word in the input gets assigned either an empty label `O`\n",
+    "or a slot label that begins with `B-` for the first word, and with `I-` for any\n",
+    "additional consecutive word that belongs to the same slot.\n",
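+    "\n",
+    "To illustrate the `B-`/`I-`/`O` scheme, the following sketch groups tagged words back into slots.\n",
+    "It is plain Python and independent of CNTK; `extract_slots` is a hypothetical helper introduced only for this illustration.\n",
+    "\n",
+    "```python\n",
+    "def extract_slots(words, tags):\n",
+    "    # hypothetical helper: collect (slot name, text) pairs from B-/I-/O tags\n",
+    "    slots, current = [], None\n",
+    "    for word, tag in zip(words, tags):\n",
+    "        if tag.startswith('B-'):                      # first word of a new slot\n",
+    "            current = (tag[2:], [word])\n",
+    "            slots.append(current)\n",
+    "        elif tag.startswith('I-') and current is not None and current[0] == tag[2:]:\n",
+    "            current[1].append(word)                   # continuation of the open slot\n",
+    "        else:                                         # 'O' or a stray 'I-' closes the slot\n",
+    "            current = None\n",
+    "    return [(name, ' '.join(ws)) for name, ws in slots]\n",
+    "\n",
+    "words = 'BOS show flights from burbank to st. louis on monday EOS'.split()\n",
+    "tags  = ('O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name '\n",
+    "         'O B-depart_date.day_name O').split()\n",
+    "print(extract_slots(words, tags))\n",
+    "# [('fromloc.city_name', 'burbank'), ('toloc.city_name', 'st. louis'), ('depart_date.day_name', 'monday')]\n",
+    "```\n",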
+    "\n",
+    "The model we will use is a recurrent model consisting of an embedding layer,\n",
+    "a recurrent LSTM cell, and a dense layer to compute the posterior probabilities:\n",
+    "\n",
+    "\n",
+    "    slot label   \"O\"        \"O\"        \"O\"        \"O\"  \"B-fromloc.city_name\"\n",
+    "                  ^          ^          ^          ^          ^\n",
+    "                  |          |          |          |          |\n",
+    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
+    "              | Dense |  | Dense |  | Dense |  | Dense |  | Dense |  ...\n",
+    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
+    "                  ^          ^          ^          ^          ^\n",
+    "                  |          |          |          |          |\n",
+    "              +------+   +------+   +------+   +------+   +------+   \n",
+    "         0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...\n",
+    "              +------+   +------+   +------+   +------+   +------+   \n",
+    "                  ^          ^          ^          ^          ^\n",
+    "                  |          |          |          |          |\n",
+    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
+    "              | Embed |  | Embed |  | Embed |  | Embed |  | Embed |  ...\n",
+    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
+    "                  ^          ^          ^          ^          ^\n",
+    "                  |          |          |          |          |\n",
+    "    w      ------>+--------->+--------->+--------->+--------->+------... \n",
+    "                 BOS      \"show\"    \"flights\"    \"from\"   \"burbank\"\n",
+    "\n",
+    "Or, as a CNTK network description. Please have a quick look and match it with the description above\n",
+    "(descriptions of these functions can be found in [the layers reference](http://cntk.ai/pythondocs/layerref.html)):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "vocab_size = 943 ; num_labels = 129 ; num_intents = 26    # number of words in vocab, slot labels, and intent labels\n",
+    "\n",
+    "model_dir = \"./Models\"\n",
+    "data_dir  = \".\"\n",
+    "# model dimensions\n",
+    "input_dim  = vocab_size\n",
+    "label_dim  = num_labels\n",
+    "emb_dim    = 150\n",
+    "hidden_dim = 300\n",
+    "\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            Dense(num_labels)\n",
+    "        ])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# peek\n",
+    "model = create_model()\n",
+    "print(len(model.layers))\n",
+    "print(model.layers[0].E.shape)\n",
+    "print(model.layers[2].b.value)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## CNTK Configuration\n",
+    "\n",
+    "To train and test a model in CNTK, we need to create a model and specify how to read data and perform training and testing. \n",
+    "\n",
+    "In order to train we need to specify:\n",
+    "\n",
+    "* how to read the data \n",
+    "* the model function and its inputs and outputs\n",
+    "* hyper-parameters for the learner\n",
+    "\n",
+    "[comment]: <> (For testing ...)\n",
+    "\n",
+    "### A Brief Look at Data and Data Reading\n",
+    "\n",
+    "We already looked at the data.\n",
+    "But how do you generate this format?\n",
+    "For reading text, this tutorial uses the `CNTKTextFormatReader`. It expects the input data to be\n",
+    "of a specific format, which is described [here](https://github.com/Microsoft/CNTK/wiki/CNTKTextFormat-Reader).\n",
+    "\n",
+    "For this tutorial, we created the corpora in two steps:\n",
+    "* convert the raw data into a plain text file that consists of TAB-separated columns of space-separated text. For example:\n",
+    "\n",
+    "  ```\n",
+    "  BOS show flights from burbank to st. louis on monday EOS (TAB) flight (TAB) O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name O B-depart_date.day_name O\n",
+    "  ```\n",
+    "\n",
+    "  This is meant to be compatible with the output of the `paste` command.\n",
+    "* convert it to CNTK Text Format (CTF) with the following command:\n",
+    "\n",
+    "  ```\n",
+    "  python Scripts/txt2ctf.py --map query.wl intent.wl slots.wl --annotated True --input atis.test.txt --output atis.test.ctf\n",
+    "  ```\n",
+    "\n",
+    "  where the three `.wl` files give the vocabulary as plain text files, one line per word.\n",
+    "\n",
+    "In these CTF files, our columns are labeled `S0`, `S1`, and `S2`.\n",
+    "These are connected to the actual network inputs by the corresponding lines in the reader definition:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def create_reader(path, is_training):\n",
+    "    return MinibatchSource(CTFDeserializer(path, StreamDefs(\n",
+    "         query         = StreamDef(field='S0', shape=vocab_size,  is_sparse=True),\n",
+    "         intent_unused = StreamDef(field='S1', shape=num_intents, is_sparse=True),  \n",
+    "         slot_labels   = StreamDef(field='S2', shape=num_labels,  is_sparse=True)\n",
+    "     )), randomize=is_training, epoch_size = INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# peek\n",
+    "reader = create_reader(data_dir + \"/atis.train.ctf\", is_training=True)\n",
+    "reader.streams"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Trainer\n",
+    "\n",
+    "We must also define the training criterion (loss function) and an error metric to track."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def create_criterion_function(model):\n",
+    "    labels = Placeholder()\n",
+    "    ce   = cross_entropy_with_softmax(model, labels)\n",
+    "    errs = classification_error      (model, labels)\n",
+    "    return combine ([ce, errs]) # (features, labels) -> (loss, metric)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def train(reader, model, max_epochs=16):\n",
+    "    # criterion: (model args, labels) -> (loss, metric)\n",
+    "    #   here  (query, slot_labels) -> (ce, errs)\n",
+    "    criterion = create_criterion_function(model)\n",
+    "\n",
+    "    # declare argument types\n",
+    "    #criterion.set_signature(vocab_size, num_labels)\n",
+    "    criterion.replace_placeholders({criterion.placeholders[0]: Input(vocab_size),\n",
+    "                                    criterion.placeholders[1]: Input(num_labels)})\n",
+    "\n",
+    "    # training config\n",
+    "    epoch_size = 18000\n",
+    "    minibatch_size = 70\n",
+    "\n",
+    "    # learner\n",
+    "    momentum_as_time_constant = minibatch_size / -math.log(0.9)  # TODO: Change to round number. This is 664.39. 700?\n",
+    "    lr_per_sample = [0.003]*4+[0.0015]*24+[0.0003] # LR schedule over epochs (we don't run that many epochs, but if we did, these are good values)\n",
+    "    lr_schedule = learning_rate_schedule(lr_per_sample, units=epoch_size)\n",
+    "    learner = adam_sgd(criterion.parameters,\n",
+    "                       lr_per_sample=lr_schedule, momentum_time_constant=momentum_as_time_constant,\n",
+    "                       low_memory=True,\n",
+    "                       gradient_clipping_threshold_per_sample=15, gradient_clipping_with_truncation=True)\n",
+    "\n",
+    "    # trainer\n",
+    "    trainer = Trainer(model, criterion.outputs[0], criterion.outputs[1], learner)\n",
+    "\n",
+    "    # process minibatches and perform model training\n",
+    "    log_number_of_parameters(model)\n",
+    "    #progress_printer = ProgressPrinter(freq=100, first=10, tag='Training') # more detailed logging\n",
+    "    progress_printer = ProgressPrinter(tag='Training')\n",
+    "\n",
+    "    t = 0\n",
+    "    for epoch in range(max_epochs):         # loop over epochs\n",
+    "        epoch_end = (epoch+1) * epoch_size\n",
+    "        while t < epoch_end:                # loop over minibatches in the epoch\n",
+    "            data = reader.next_minibatch(minibatch_size, input_map={  # fetch minibatch\n",
+    "                criterion.arguments[0]: reader.streams.query,\n",
+    "                criterion.arguments[1]: reader.streams.slot_labels\n",
+    "            })\n",
+    "            trainer.train_minibatch(data)                                     # update model with it\n",
+    "            t += data[criterion.arguments[1]].num_samples                                # count samples processed so far\n",
+    "            progress_printer.update_with_trainer(trainer, with_metric=True)   # log progress\n",
+    "        loss, metric, actual_samples = progress_printer.epoch_summary(with_metric=True)\n",
+    "\n",
+    "    return loss, metric"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Running it\n",
+    "\n",
+    "You can find the complete recipe below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "def do_train():\n",
+    "    global model\n",
+    "    model = create_model()\n",
+    "    reader = create_reader(data_dir + \"/atis.train.ctf\", is_training=True)\n",
+    "    train(reader, model)\n",
+    "do_train()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This shows how learning proceeds over epochs (passes through the data).\n",
+    "For example, after four epochs, the loss, which is the cross-entropy criterion, has reached 0.22 as measured on the ~18000 samples of this epoch,\n",
+    "and the error rate is 5.0% on those same 18000 training samples.\n",
+    "\n",
+    "The epoch size is the number of samples--counted as *word tokens*, not sentences--to\n",
+    "process between model checkpoints.\n",
+    "\n",
+    "Once the training has completed (a little less than 2 minutes on a Titan-X or a Surface Book),\n",
+    "you will see an output like this:\n",
+    "```\n",
+    "(0.06193035719939996, 0.014038397514149373)\n",
+    "```\n",
+    "which is a tuple containing the loss (cross entropy) and the metric (classification error) averaged over the final epoch.\n",
+    "\n",
+    "On a CPU-only machine, it can be 4 or more times slower."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Evaluating the model\n",
+    "\n",
+    "Just as we defined the `train()` function, we also define a function to measure accuracy on a test set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def evaluate(reader, model):\n",
+    "    criterion = create_criterion_function(model)\n",
+    "    #criterion.set_signature(None, Input(num_labels))\n",
+    "    criterion.replace_placeholders({criterion.placeholders[0]: Input(num_labels)})\n",
+    "\n",
+    "    # process minibatches and perform evaluation\n",
+    "    dummy_learner = adam_sgd(criterion.parameters, lr_per_sample=1, momentum_time_constant=0, low_memory=True)\n",
+    "    evaluator = Trainer(model, criterion.outputs[0], criterion.outputs[1], dummy_learner)\n",
+    "    progress_printer = ProgressPrinter(tag='Evaluation')\n",
+    "\n",
+    "    while True:\n",
+    "        minibatch_size = 1000\n",
+    "        data = reader.next_minibatch(minibatch_size, input_map={  # fetch minibatch\n",
+    "            criterion.arguments[0]: reader.streams.query,\n",
+    "            criterion.arguments[1]: reader.streams.slot_labels\n",
+    "        })\n",
+    "        #data = reader.next_minibatch(minibatch_size) # fetch minibatch\n",
+    "        if not data:                                 # until we hit the end\n",
+    "            break\n",
+    "        metric = evaluator.test_minibatch(data)\n",
+    "        progress_printer.update(0, data[criterion.arguments[1]].num_samples, metric) # log progress\n",
+    "    loss, metric, actual_samples = progress_printer.epoch_summary(with_metric=True)\n",
+    "\n",
+    "    return loss, metric"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can measure the model accuracy."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def do_test():\n",
+    "    reader = create_reader(data_dir + \"/atis.test.ctf\", is_training=False)\n",
+    "    evaluate(reader, model)\n",
+    "do_test()\n",
+    "model.layers[2].b.value"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# let's run a sequence through\n",
+    "w = [query_dict[w] for w in 'BOS flights from new york to seattle EOS'.split()] # convert to word indices\n",
+    "print(w)\n",
+    "onehot = np.zeros([len(w),len(query_dict)], np.float32)\n",
+    "for t in range(len(w)):\n",
+    "    onehot[t,w[t]] = 1\n",
+    "pred = model.eval({model.arguments[0]:onehot})\n",
+    "print(pred.shape)\n",
+    "best = np.argmax(pred,axis=2)\n",
+    "print(best[0])\n",
+    "[slots_wl[s] for s in best[0]]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Modifying the Model\n",
+    "\n",
+    "In the following, you will be given tasks to practice modifying CNTK configurations.\n",
+    "The solutions are given at the end of this document... but please try without!\n",
+    "\n",
+    "### A Word About [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential)\n",
+    "\n",
+    "Before jumping to the tasks, let's have a look again at the model we just ran.\n",
+    "The model is described in what we call *function-composition style*.\n",
+    "```python\n",
+    "        Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "```\n",
+    "You may be familiar with the \"sequential\" notation from other neural-network toolkits.\n",
+    "If not, [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential) is a powerful operation that,\n",
+    "in a nutshell, allows you to compactly express a very common situation in neural networks\n",
+    "where an input is processed by propagating it through a progression of layers.\n",
+    "`Sequential()` takes a list of functions as its argument,\n",
+    "and returns a *new* function that invokes these functions in order,\n",
+    "each time passing the output of one to the next.\n",
+    "For example,\n",
+    "```python\n",
+    "    FGH = Sequential ([F,G,H])\n",
+    "    y = FGH (x)\n",
+    "```\n",
+    "means the same as\n",
+    "```\n",
+    "    y = H(G(F(x))) \n",
+    "```\n",
+    "This is known as [\"function composition\"](https://en.wikipedia.org/wiki/Function_composition),\n",
+    "and is especially convenient for expressing neural networks, which often have this form:\n",
+    "\n",
+    "         +-------+   +-------+   +-------+\n",
+    "    x -->|   F   |-->|   G   |-->|   H   |--> y\n",
+    "         +-------+   +-------+   +-------+\n",
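+    "\n",
+    "The same idea can be sketched in plain Python, without CNTK. The helper `sequential` below is a hypothetical\n",
+    "stand-in for `Sequential()` that only handles ordinary Python callables:\n",
+    "\n",
+    "```python\n",
+    "from functools import reduce\n",
+    "\n",
+    "def sequential(functions):\n",
+    "    # hypothetical stand-in for Sequential(): feed x through the functions in order\n",
+    "    return lambda x: reduce(lambda value, f: f(value), functions, x)\n",
+    "\n",
+    "F = lambda x: x + 1\n",
+    "G = lambda x: x * 2\n",
+    "H = lambda x: x - 3\n",
+    "\n",
+    "FGH = sequential([F, G, H])\n",
+    "print(FGH(5), H(G(F(5))))  # 9 9\n",
+    "```\n",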
+    "\n",
+    "Coming back to our model at hand, the `Sequential` expression simply\n",
+    "says that our model has this form:\n",
+    "\n",
+    "         +-----------+   +----------------+   +------------+\n",
+    "    x -->| Embedding |-->| Recurrent LSTM |-->| DenseLayer |--> y\n",
+    "         +-----------+   +----------------+   +------------+"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Task 1: Add Batch Normalization\n",
+    "\n",
+    "We now want to add new layers to the model, specifically batch normalization.\n",
+    "\n",
+    "Batch normalization is a popular technique for speeding up convergence.\n",
+    "It is often used for image-processing setups, for example our other [hands-on lab on image\n",
+    "recognition](./Hands-On-Labs-Image-Recognition).\n",
+    "But could it work for recurrent models, too?\n",
+    "  \n",
+    "So your task will be to insert batch-normalization layers before and after the recurrent LSTM layer.\n",
+    "If you have completed the [hands-on labs on image processing](https://github.com/Microsoft/CNTK/blob/master/bindings/python/tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb),\n",
+    "you may remember that the [batch-normalization layer](https://www.cntk.ai/pythondocs/layerref.html#batchnormalization-layernormalization-stabilizer) has this form:\n",
+    "```\n",
+    "    BatchNormalization()\n",
+    "```\n",
+    "So please go ahead and modify the configuration and see what happens.\n",
+    "\n",
+    "If everything went right, you will notice improved convergence speed (`loss` and `metric`)\n",
+    "compared to the previous configuration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# TODO: Add batch normalization\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "do_train()\n",
+    "do_test()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Task 2: Add a Lookahead \n",
+    "\n",
+    "Our recurrent model suffers from a structural deficit:\n",
+    "Since the recurrence runs from left to right, the decision for a slot label\n",
+    "has no information about upcoming words. The model is a bit lopsided.\n",
+    "Your task will be to modify the model such that\n",
+    "the input to the recurrence consists not only of the current word, but also of the next one\n",
+    "(lookahead).\n",
+    "\n",
+    "Your solution should be in function-composition style.\n",
+    "Hence, you will need to write a Python function that does the following:\n",
+    "\n",
+    "* takes no input arguments\n",
+    "* creates a placeholder sequence variable\n",
+    "* computes the \"next value\" in this sequence using the `Delay()` layer (use this specific form: `Delay(T=-1)`); and\n",
+    "* concatenates the current and the next value into a vector of twice the embedding dimension using `splice()`\n",
+    "\n",
+    "and then insert this function into `Sequential()`'s list between the embedding and the recurrent layer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# TODO: Add lookahead\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            BatchNormalization(),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            BatchNormalization(),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "do_train()\n",
+    "do_test()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Task 3: Bidirectional Recurrent Model\n",
+    "\n",
+    "Aha, knowledge of future words helps. So instead of a one-word lookahead,\n",
+    "why not look ahead all the way to the end of the sentence, through a backward recurrence?\n",
+    "Let us create a bidirectional model!\n",
+    "\n",
+    "Your task is to implement a new layer that\n",
+    "performs both a forward and a backward recursion over the data, and\n",
+    "concatenates the output vectors.\n",
+    "\n",
+    "Note, however, that this differs from the previous task in that\n",
+    "the bidirectional layer contains learnable model parameters.\n",
+    "In function-composition style,\n",
+    "the pattern to implement a layer with model parameters is to write a *factory function*\n",
+    "that creates a *function object*.\n",
+    "\n",
+    "A function object, also known as a *functor*, is an object that is both a function and an object.\n",
+    "This simply means that it contains data yet can still be invoked as if it were a function.\n",
+    "\n",
+    "For example, `Dense(outDim)` is a factory function that returns a function object that contains\n",
+    "a weight matrix `W`, a bias `b`, and another function to compute `input @ W + b`.\n",
+    "E.g. saying `Dense(1024)` will create this function object, which can then be used\n",
+    "like any other function, also immediately: `Dense(1024)(x)`. \n",
+    "\n",
+    "Confused? Let's take an example: Let us implement a new layer that combines\n",
+    "a linear layer with a subsequent batch normalization. \n",
+    "To allow function composition, the layer needs to be realized as a factory function,\n",
+    "which could look like this:\n",
+    "\n",
+    "```python\n",
+    "def DenseLayerWithBN(dim):\n",
+    "    F = Dense(dim)\n",
+    "    G = BatchNormalization()\n",
+    "    x = Placeholder()\n",
+    "    apply_x = G(F(x))\n",
+    "    return apply_x\n",
+    "```\n",
+    "\n",
+    "Invoking this factory function will create `F`, `G`, `x`, and `apply_x`. In this example, `F` and `G` are function objects themselves, and `apply_x` is the function to be applied to the data.\n",
+    "Thus, e.g. calling `DenseLayerWithBN(1024)` will\n",
+    "create an object containing a linear-layer function object called `F`, a batch-normalization function object `G`,\n",
+    "and `apply_x` which is the function that implements the actual operation of this layer\n",
+    "using `F` and `G`. It will then return `apply_x`. To the outside, `apply_x` looks and behaves\n",
+    "like a function. Under the hood, however, `apply_x` retains access to its specific instances of `F` and `G`.\n",
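+    "\n",
+    "The way `apply_x` keeps hold of its own `F` and `G` can be seen in a plain-Python sketch that does not involve CNTK.\n",
+    "Here `ScaleAndShift` is a hypothetical factory whose inner function retains the `scale` and `shift` it was created with:\n",
+    "\n",
+    "```python\n",
+    "def ScaleAndShift(scale, shift):\n",
+    "    # hypothetical factory function; apply_x closes over scale and shift\n",
+    "    def apply_x(x):\n",
+    "        return x * scale + shift\n",
+    "    return apply_x\n",
+    "\n",
+    "double_plus_one = ScaleAndShift(2, 1)   # one instance with its own data\n",
+    "triple          = ScaleAndShift(3, 0)   # another, independent instance\n",
+    "print(double_plus_one(10), triple(10))  # 21 30\n",
+    "```\n",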
+    "\n",
+    "Now back to our task at hand. You will now need to create a factory function,\n",
+    "very much like the example above.\n",
+    "You shall create a factory function\n",
+    "that creates two recurrent layer instances (one forward, one backward), and then defines an `apply_x` function\n",
+    "which applies both layer instances to the same `x` and concatenates the two results.\n",
+    "\n",
+    "All right, give it a try! To know how to realize a backward recursion in CNTK,\n",
+    "please take a hint from how the forward recursion is done.\n",
+    "Please also do the following:\n",
+    "* remove the one-word lookahead you added in the previous task, which we aim to replace; and\n",
+    "* change the `hidden_dim` parameter from 300 to 150, to keep the total number of model parameters limited."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# TODO: Add bidirectional recurrence\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            BatchNormalization(),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            BatchNormalization(),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "do_train()\n",
+    "do_test()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Works like a charm! This model achieves 1.83%, a tiny bit better than the lookahead model above.\n",
+    "The bidirectional model has 40% fewer parameters than the lookahead one. However, if you go back and look closely\n",
+    "at the complete log output (not shown on this web page), you may find that the lookahead one trained\n",
+    "about 30% faster.\n",
+    "This is because the lookahead model has both fewer horizontal dependencies (one instead of two\n",
+    "recurrences) and larger matrix products, and can thus achieve higher parallelism."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Solution 1: Adding Batch Normalization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            BatchNormalization(),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            BatchNormalization(),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "\n",
+    "reader = create_reader(data_dir + \"/atis.train.ctf\", is_training=True)\n",
+    "model = create_model()\n",
+    "train(reader, model, max_epochs=8)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Solution 2: Add a Lookahead"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def OneWordLookahead():\n",
+    "    x = Placeholder()\n",
+    "    apply_x = splice([x, future_value(x)])\n",
+    "    return apply_x\n",
+    "\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            OneWordLookahead(),\n",
+    "            BatchNormalization(),\n",
+    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
+    "            BatchNormalization(),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "\n",
+    "reader = create_reader(data_dir + \"/atis.train.ctf\", is_training=True)\n",
+    "model = create_model()\n",
+    "train(reader, model, max_epochs=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Solution 3: Bidirectional Recurrent Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def BiRecurrence(fwd, bwd):\n",
+    "    F = Recurrence(fwd)\n",
+    "    G = Recurrence(bwd, go_backwards=True)\n",
+    "    x = Placeholder()\n",
+    "    apply_x = splice([F(x), G(x)])\n",
+    "    return apply_x\n",
+    "\n",
+    "def create_model():\n",
+    "    with default_options(initial_state=0.1):  # inject an option to mimic the BrainScript version identically; remove some day\n",
+    "        return Sequential([\n",
+    "            Embedding(emb_dim),\n",
+    "            BatchNormalization(),\n",
+    "            BiRecurrence(LSTM(hidden_dim), LSTM(hidden_dim)),\n",
+    "            BatchNormalization(),\n",
+    "            Dense(num_labels)\n",
+    "        ])\n",
+    "\n",
+    "reader = create_reader(data_dir + \"/atis.train.ctf\", is_training=True)\n",
+    "model = create_model()\n",
+    "train(reader, model, max_epochs=8)"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python [conda env:cntk-py34]",
+   "language": "python",
+   "name": "conda-env-cntk-py34"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.4.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}