Integrate yuqtang/StageManualUseLearners2 into master

This commit is contained in:
Project Philly 2017-09-14 23:08:38 -07:00
Parents c525d7179b 26fa020ed2
Commit ef49c980f9
1 changed file with 469 additions and 146 deletions

View file

@@ -31,89 +31,90 @@
"\n",
"features = C.input_variable(3)\n",
"label = C.input_variable(2)\n",
"z = C.layers.Sequential([C.layers.Dense(4, activation=C.relu), C.layers.Dense(2)])(features)\n",
"\n",
"lr_schedule_m = C.learning_rate_schedule(0.5, C.UnitType.minibatch)\n",
"lr_schedule_s = C.learning_rate_schedule(0.5, C.UnitType.sample)\n",
"\n",
"sgd_learner_m = C.sgd(z.parameters, lr_schedule_m)\n",
"sgd_learner_s = C.sgd(z.parameters, lr_schedule_s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have created two learners here. When creating a learner we have to specify a learning rate schedule, which can be as simple as specifying a single number (0.5 in this example) or it can be a list of learning rates that specify what the learning rate should be at different points in time. \n",
"\n",
"Currently, the best results with deep learning are obtained by having a small number of *phases* where inside each phase the learning rate is fixed and the learning rate decays by a constant factor when moving between phases. We will come back to this point later.\n",
"\n",
"The second parameter in the learning rate schedule can be one of two different values:\n",
"- Per minibatch\n",
"- Per sample\n",
"\n",
"To understand the difference and get familiar with the learner properties and methods, let's write a small function that inspects the effect of a learner on the parameters assuming the parameters are all 0 and the gradients are all 1."
"z = C.layers.Sequential([C.layers.Dense(4, activation=C.relu), C.layers.Dense(2)])(features)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"unit = minibatch\n",
" [array([[-0.25, -0.25],\n",
" [-0.25, -0.25],\n",
" [-0.25, -0.25],\n",
" [-0.25, -0.25]], dtype=float32), array([-0.25, -0.25], dtype=float32), array([[-0.25, -0.25, -0.25, -0.25],\n",
" [-0.25, -0.25, -0.25, -0.25],\n",
" [-0.25, -0.25, -0.25, -0.25]], dtype=float32), array([-0.25, -0.25, -0.25, -0.25], dtype=float32)]\n"
]
}
],
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def inspect_update(learner, mbsize, count=1):\n",
" # Save current parameter values\n",
" old_values = [p.value for p in learner.parameters]\n",
" # Set current parameter values to all 0\n",
" for p in learner.parameters:\n",
" p.value = 0 * p.value\n",
" # create all-ones gradients and associate them with the parameters\n",
" updates = {p: p.value + 1 for p in learner.parameters} \n",
" # do 'count' many updates\n",
" for i in range(count):\n",
" learner.update(updates, mbsize)\n",
" ret_values = [p.value for p in learner.parameters]\n",
" # Restore values\n",
" for p, o in zip(learner.parameters, old_values):\n",
" p.value = o\n",
" return ret_values\n",
"\n",
"print('\\nunit = minibatch\\n', inspect_update(sgd_learner_m, mbsize=2))"
"sgd_learner_m = C.sgd(z.parameters, lr = 0.5, minibatch_size = C.learners.IGNORE)\n",
"sgd_learner_s2 = C.sgd(z.parameters, lr = 0.5, minibatch_size = 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the knowledge that SGD is the update `parameter = old_parameter - learning_rate * gradient`, we can conclude that when the learning rate schedule is per minibatch, the learning rate is divided by the minibatch size. Let's see what happens when the learning rate schedule is per sample."
"We have created two learners here. `sgd_learner_m` uses a learning rate that is applied to any minibatch regardless of its size; `sgd_learner_s2` uses a learning rate that is intended for a minibatch of 2 samples.\n",
"\n",
"When creating a learner we have to specify a learning rate schedule, which can be as simple as specifying a single number (0.5 in this example) or it can be a list of learning rates that specify what the learning rate should be at different points in time. \n",
"\n",
"Currently, the best results with deep learning are obtained by having a small number of *phases* where inside each phase the learning rate is fixed and the learning rate decays by a constant factor when moving between phases. We will come back to this point later.\n",
"\n",
"The `minibatch_size` parameter is the size of the minibatch that the learning rates are intended to apply to:\n",
"\n",
"- minibatch_size = N: the learning rate is intended to be applied to N samples; if the actual minibatch size $M$ is different from $N$, CNTK scales the learning rate $r$ by $\\frac{M}{N}$ so that the step size of the updates to the model parameters is the constant $r$ for every $N$ samples. \n",
"\n",
"- minibatch_size = C.learners.IGNORE: the learning rate is applied the same way to minibatches of any size (ignoring the actual minibatch sizes).\n",
"\n",
"The examples below illustrate the difference."
]
},
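The scaling rule above can be sketched in plain Python. This is an illustration only, not CNTK's implementation; `effective_lr` and `IGNORE` are hypothetical stand-ins:

```python
# Illustrative sketch only (plain Python, not CNTK): how a learner could
# scale the learning rate when the specified minibatch_size N differs from
# the actual minibatch size M. IGNORE stands in for C.learners.IGNORE.
IGNORE = object()

def effective_lr(lr, minibatch_size, actual_size):
    """Rate applied to the minibatch's mean gradient."""
    if minibatch_size is IGNORE:
        return lr                                 # same rate for any size
    return lr * actual_size / minibatch_size      # scaled by M/N

# lr = 0.5 specified per 2 samples, applied to an actual minibatch of 10:
print(effective_lr(0.5, 2, 10))       # -> 2.5
print(effective_lr(0.5, IGNORE, 10))  # -> 0.5
```

With `minibatch_size=IGNORE` the rate never changes; with `minibatch_size=N` it grows or shrinks with the actual minibatch size.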
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand the difference and get familiar with the learner properties and methods, let's write a small function that inspects the effect of a learner on the parameters assuming the parameters are all 0 and the gradients are all 1."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def inspect_update(learner, actual_minibatch_size, count=1):\n",
" # Save current parameter values\n",
" old_values = [p.value for p in learner.parameters]\n",
" # Set current parameter values to all 0\n",
" for p in learner.parameters:\n",
" p.value = 0 * p.value\n",
" # create all-ones per-sample gradients; the learner's update consumes\n",
" # the *sum* of these gradients over the samples in the minibatch\n",
" gradients_sum = {p: np.zeros_like(p.value) \n",
" + 1.0 * actual_minibatch_size for p in learner.parameters} \n",
" # do 'count' many updates\n",
" for i in range(count):\n",
" # note that CNTK learner's update function consumes \n",
" # sum of gradients over the samples in a minibatch\n",
" learner.update(gradients_sum, actual_minibatch_size)\n",
" ret_values = [p.value for p in learner.parameters]\n",
" # Restore values\n",
" for p, o in zip(learner.parameters, old_values):\n",
" p.value = o\n",
" return ret_values"
]
},
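As a hedged analogue of `inspect_update`, here is a toy stand-in learner in plain Python (no CNTK; `ToySgd` is a hypothetical class for illustration) whose `update` consumes the gradient sum over the minibatch, as the comments above describe:

```python
import numpy as np

# Toy stand-in for a CNTK learner (hypothetical, for illustration only):
# update() consumes the *sum* of gradients over the minibatch's samples.
class ToySgd:
    def __init__(self, parameters, lr):
        self.parameters = parameters   # list of numpy arrays
        self.lr = lr

    def update(self, gradients_sum, minibatch_size):
        for i, p in enumerate(self.parameters):
            # divide the sum by the minibatch size to recover the mean gradient
            self.parameters[i] = p - self.lr * gradients_sum[i] / minibatch_size

params = [np.zeros(2)]
learner = ToySgd(params, lr=0.5)
mbsize = 2
grads_sum = [np.ones(2) * mbsize]  # all-ones gradients summed over 2 samples
learner.update(grads_sum, mbsize)
print(learner.parameters[0])       # -> [-0.5 -0.5]
```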
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"unit = sample\n",
"per minibatch:\n",
" [array([[-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
@@ -124,40 +125,147 @@
}
],
"source": [
"print('\\nunit = sample\\n', inspect_update(sgd_learner_s, mbsize=2))"
"print('\\nper minibatch:\\n', inspect_update(sgd_learner_m, actual_minibatch_size=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the knowledge that SGD is the update `parameter = old_parameter - learning_rate * gradient`, we can conclude that when the learning rate schedule is per minibatch, the learning rate $0.5$ is applied to the mean gradient (which is $1$ here by construction) of the whole minibatch. Let's see what happens when the learning rate schedule is per $2$ samples."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"per 2 samples: \n",
" [array([[-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5]], dtype=float32), array([-0.5, -0.5], dtype=float32), array([[-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5]], dtype=float32), array([-0.5, -0.5, -0.5, -0.5], dtype=float32)]\n"
]
}
],
"source": [
"print('\\nper 2 samples: \\n', inspect_update(sgd_learner_s2, actual_minibatch_size=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can conclude that when the learning rate schedule is per $2$ samples, the learning rate $0.5$ is also applied to the mean gradient of the whole minibatch. Now let's see what happens when the actual minibatch size is set to $10$, which is different from $2$."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"per minibatch:\n",
" [array([[-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5]], dtype=float32), array([-0.5, -0.5], dtype=float32), array([[-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5]], dtype=float32), array([-0.5, -0.5, -0.5, -0.5], dtype=float32)]\n"
]
}
],
"source": [
"print('\\nper minibatch:\\n', inspect_update(sgd_learner_m, actual_minibatch_size=10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again we can conclude that when the learning rate schedule is per minibatch, the learning rate $0.5$ is applied to the mean gradient of the whole minibatch. Let's see what happens when the learning rate schedule is per 2 samples."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"per 2 samples: \n",
" [array([[-2.5, -2.5],\n",
" [-2.5, -2.5],\n",
" [-2.5, -2.5],\n",
" [-2.5, -2.5]], dtype=float32), array([-2.5, -2.5], dtype=float32), array([[-2.5, -2.5, -2.5, -2.5],\n",
" [-2.5, -2.5, -2.5, -2.5],\n",
" [-2.5, -2.5, -2.5, -2.5]], dtype=float32), array([-2.5, -2.5, -2.5, -2.5], dtype=float32)]\n"
]
}
],
"source": [
"print('\\nper 2 samples: \\n', inspect_update(sgd_learner_s2, actual_minibatch_size=10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the interesting update pattern of how a per-$2$-sample learning rate of $0.5$ is applied: the model is updated with $10 / 2 = 5$ times the mean gradient (which is $1$ by construction), resulting in an update of magnitude $5 \\times 0.5 = 2.5$. This property is useful when the loss function, evaluated at the current parameters, is locally linear within a ball of diameter $r \\frac{M}{N}$ (where $N$ is the specified minibatch size and $M$ is the actual minibatch size). If the actual minibatch size $M > N$, this property lets us speed up the updates aggressively; if $M < N$, it requires us to scale down the learning rate to increase the chance of convergence."
]
},
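The four cases demonstrated above can be recapped with a small plain-Python sketch (not CNTK; `step` and the `IGNORE` sentinel are hypothetical stand-ins), using all-ones gradients so the mean gradient is 1:

```python
# Hedged recap (plain Python, not CNTK): per-update magnitudes for the
# four cases demonstrated above, with mean gradient fixed at 1.
IGNORE = None  # stand-in for C.learners.IGNORE

def step(lr, specified_n, actual_m, mean_grad=1.0):
    scale = 1.0 if specified_n is IGNORE else actual_m / specified_n
    return lr * scale * mean_grad

for name, n in [("per minibatch", IGNORE), ("per 2 samples", 2)]:
    for m in (2, 10):
        print(f"{name}, actual size {m}: step {step(0.5, n, m)}")
# per minibatch -> 0.5 for both sizes; per 2 samples -> 0.5 and 2.5
```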
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note that calling update manually on the learner (as `inspect_update` does) is very tedious and not recommended. Besides, you need to compute the gradients separately and pass them to the learner. Instead, using a [**`Trainer`**](https://www.cntk.ai/pythondocs/cntk.train.trainer.html#module-cntk.train.trainer), you don't have to do any of that. The manual update used here is for educational purposes and for the vast majority of use cases CNTK users should avoid performing manual updates."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the per sample specification, the learning rate is not divided by the minibatch size. CNTK offers both options because in some setups it is more convenient to work with per sample learning rates than per minibatch learning rates and vice versa. \n",
"\n",
"**Key concept**: It is important to understand the ramifications of choosing learning rates per minibatch vs per sample. For example, per minibatch learning rate schedules typically don't require retuning when you want to change the minibatch size, but per sample schedules do. On the other hand, with distributed training it is more accurate to specify the learning rate schedule per sample rather than per minibatch.\n",
"\n",
"Calling update manually on the learner (as `inspect_update` does) is very tedious and not recommended. Besides, you need to compute the gradients separately and pass them to the learner. Instead, using a [**`Trainer`**](https://www.cntk.ai/pythondocs/cntk.train.trainer.html#module-cntk.train.trainer), you don't have to do any of that. The manual update used here is for educational purposes and for the vast majority of use cases CNTK users should avoid performing manual updates.\n",
"\n",
"## Trainers and Learners\n",
"\n",
"A closely related class to the `Learner` is the `Trainer`. In CNTK, a `Trainer` brings together all the ingredients necessary for training models:\n",
"- the model itself\n",
"- the loss function (a differentiable function) and the actual metric we care about which is not necessarily differentiable (such as error rate)\n",
"- the learners\n",
"- optionally progress writers that log the training progress\n",
"\n",
"While in the most typical case a `Trainer` has a single learner that handles all the parameters, it is possible to have **multiple learners** each working on a different subset of the parameters. Parameters that are not covered by any learner **will not** be updated. Here is an example that illustrates typical use."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"lr_schedule = C.learning_rate_schedule([0.05]*3 + [0.025]*2 + [0.0125], C.UnitType.minibatch, epoch_size=100)\n",
"sgd_learner = C.sgd(z.parameters, lr_schedule)\n",
"sgd_learner = C.sgd(z.parameters, [0.05]*3 + [0.025]*2 + [0.0125], epoch_size=100)\n",
"loss = C.cross_entropy_with_softmax(z, label)\n",
"trainer = C.Trainer(z, loss, sgd_learner)\n",
"# use the trainer with a minibatch source as in the trainer howto"
@@ -172,8 +280,13 @@
"\n",
"What is happening in this paper is that the learning rate gets reduced by a factor of 0.1 after 150000 and 300000 updates (cf. section 3.4 of the paper). In the example above the learning drops by a factor of 0.5 between each phase. Right now there is no good guidance on how to choose this factor, but it's typically between 0.1 and 0.9.\n",
"\n",
"Apart from specifying a `Trainer` yourself, it is also possible to use the `cntk.Function.train` convenience method. This allows you to specify the learner and the data and it internally creates a trainer that drives the training loop."
]
},
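The phased schedule `[0.05]*3 + [0.025]*2 + [0.0125]` with `epoch_size=100` used above can be sketched in plain Python. This assumes the count is in samples and that the last rate repeats forever, mirroring the description here; `schedule_lookup` is a hypothetical helper, not CNTK code:

```python
# Hedged sketch of a piecewise-constant learning rate schedule:
# each entry of the list holds for epoch_size samples, and the last
# entry repeats indefinitely.
def schedule_lookup(schedule, epoch_size, sample_count):
    index = sample_count // epoch_size
    return schedule[min(index, len(schedule) - 1)]

sched = [0.05]*3 + [0.025]*2 + [0.0125]
for n in (0, 250, 350, 450, 10_000):
    print(n, schedule_lookup(sched, 100, n))
# -> 0.05 for the first 300 samples, then 0.025, then 0.0125 thereafter
```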
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other built-in learners\n",
"\n",
"Apart from SGD, other built-in learners include \n",
@@ -183,46 +296,35 @@
"- [RMSProp](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.rmsprop) (`rmsprop`) a correction to adagrad that prevents the learning rate from decaying too fast.\n",
"- [FSAdagrad](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.fsadagrad) (`fsadagrad`) adds momentum and bias correction to RMSprop\n",
"- [Adam / Adamax](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adam) (`adam(..., adamax=False/True)`) see [this paper](https://arxiv.org/abs/1412.6980)\n",
"- [Adadelta](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adadelta) (`adadelta`) see [this paper](https://arxiv.org/abs/1212.5701)\n",
"\n",
"### Momentum\n",
"\n",
"Among these learners, `momentum_sgd`, `nesterov`, `fsadagrad`, and `adam` take an additional momentum schedule. \n",
"\n",
"When using momentum, instead of updating the parameter using the current gradient, we update the parameter using all previous gradients exponentially decayed. If there is a consistent direction that the gradients are pointing to, the parameter updates will develop momentum in that direction. [This page](http://distill.pub/2017/momentum/) has a good explanation of momentum.\n",
"\n",
"Like the learning rate schedule, the momentum schedule can be specified in two equivalent ways:\n",
"- `momentum_schedule(float or list of floats, epoch_size)`\n",
"- `momentum_as_time_constant(float or list of floats, epoch_size)`\n",
"\n",
"As with `learning_rate_schedule`, the arguments are interpreted in the same way, i.e. there's flexibility in specifying different momentum for the first few minibatches and for later minibatches. \n",
"\n",
"The difference between the two calls is just a simple transformation as explained in the following. Since momentum is creating a sort of exponential moving average it is fair to ask \"when does the contribution of an old gradient diminish by a certain constant factor?\". If we choose the constant factor to be $0.5$ we call this the [half-life](https://en.wikipedia.org/wiki/Half-life) and if we choose the constant to be $e^{-1}\\approx 0.368$ we call this the [time constant](https://en.wikipedia.org/wiki/Time_constant). So `momentum_as_time_constant_schedule` specifies the number of samples it would take for the gradient of each minibatch to decay to $0.368$ of its original contribution on the momentum term. Specifying a `momentum_as_time_constant_schedule(300)` and a minibatch size of 10 is a little bit more meaningful than specifying `momentum_schedule(.967...)` even though both lead to the same updates. The way to convert between the two schedules is\n",
"- $\\textrm{momentum} = \\exp(-\\frac{\\textrm{minibatch_size}}{\\textrm{time_constant}})$\n",
"- $\\textrm{time_constant} = \\frac{\\textrm{minibatch_size}}{\\log(1/\\textrm{momentum})}$"
"- [Adadelta](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adadelta) (`adadelta`) see [this paper](https://arxiv.org/abs/1212.5701)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"cell_type": "markdown",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"time constant for momentum of 0.967... = 300.00000000000006\n",
"momentum for time constant of 300 = 0.9672161004820059\n"
]
}
],
"source": [
"mb_size = 10\n",
"time_constant = 300\n",
"momentum = math.exp(-mb_size/time_constant)\n",
"### Momentum\n",
"\n",
"print('time constant for momentum of 0.967... = ', mb_size/math.log(1/momentum))\n",
"print('momentum for time constant of 300 = ', math.exp(-mb_size/time_constant))"
"Among these learners, [FSAdaGrad](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.fsadagrad),\n",
"[Adam](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adam),\n",
"[MomentumSGD](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.momentum_sgd), and\n",
"[Nesterov](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.nesterov) take an additional momentum schedule. \n",
"\n",
"When using momentum, instead of updating the parameter using the current gradient, we update the parameter using all previous gradients exponentially decayed. If there is a consistent direction that the gradients are pointing to, the parameter updates will develop momentum in that direction. [This page](http://distill.pub/2017/momentum/) has a good explanation of momentum.\n",
"\n",
"Like the learning rate schedule, the momentum schedule can be specified in two equivalent ways. Let us use `momentum_sgd` as an example:\n",
"\n",
"- `momentum_sgd(parameters, momentum=float or list of floats, minibatch_size=C.learners.IGNORE, epoch_size=epoch_size)`\n",
"\n",
" * or simply: `momentum_sgd(parameters, momentum=float or list of floats, epoch_size=epoch_size)` where `minibatch_size=C.learners.IGNORE` is the default\n",
" \n",
"- `momentum_sgd(parameters, momentum=float or list of floats, minibatch_size=minibatch_size, epoch_size=epoch_size)`\n",
"\n",
"As with `learning_rate_schedule`, the arguments are interpreted in the same way, i.e. there's flexibility in specifying different momentum for the first few minibatches and for later minibatches:\n",
"\n",
"- With minibatch_size=N, the decay momentum=$\\beta$ is applied to the mean gradient of every $N$ samples. For example, minibatches of sizes $N$, $2N$, $3N$ and $k\\cdot N$ will have decays of $\\beta$, $\\beta^2$, $\\beta^3$ and $\\beta^k$ respectively. The decay is exponential in the ratio of the actual minibatch size to the specified minibatch size. \n",
"\n",
"- With minibatch_size=C.learners.IGNORE, the decay momentum=$\\beta$ is applied to the mean gradient of the whole minibatch regardless of its size. For example, regardless of the minibatch size being either $N$ or $2N$ (or any size), the mean gradient of such a minibatch will have the same decay factor $\\beta$."
]
},
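The decay rule above can be sketched in plain Python (an illustration, not CNTK's implementation; `minibatch_decay` and the `IGNORE` sentinel are hypothetical stand-ins):

```python
import math

# Hedged sketch: the decay factor applied to the momentum term for one
# minibatch, given momentum beta specified per N samples.
IGNORE = None  # stand-in for C.learners.IGNORE

def minibatch_decay(beta, specified_n, actual_m):
    if specified_n is IGNORE:
        return beta                          # same decay for any size
    return beta ** (actual_m / specified_n)  # exponential in M/N

beta, n = 0.9, 32
print(minibatch_decay(beta, n, 32))       # -> 0.9 (M = N)
print(minibatch_decay(beta, n, 64))       # -> beta**2, about 0.81 (M = 2N)
print(minibatch_decay(beta, IGNORE, 64))  # -> 0.9 regardless of size
```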
{
@@ -230,43 +332,49 @@
"metadata": {},
"source": [
"Apart from the momentum schedule, the momentum learners can also take a boolean \"unit_gain\" argument that determines the form of the momentum update:\n",
"\n",
"* `unit_gain=True`: $\\textrm{momentum_direction} = \\textrm{momentum} \\cdot \\textrm{old_momentum_direction} + (1 - \\textrm{momentum}) \\cdot \\textrm{gradient}$\n",
"\n",
"* `unit_gain=False`: $\\textrm{momentum_direction} = \\textrm{momentum} \\cdot \\textrm{old_momentum_direction} + \\textrm{gradient}$\n",
"\n",
"The idea behind the non-conventional `unit_gain=True` is that when momentum and/or learning rate changes, this way of updating does not lead to divergence. In general, users should exercise great caution when switching learning rate and/or momentum with `unit_gain=False`. One piece of relevant advice is Remark 2 in [this paper](https://arxiv.org/abs/1706.02677) which shows how to adjust your momentum when the learning rate changes in the `unit_gain=False` case.\n",
"\n",
"The following code illustrates that, for the case of `unit_gain=False`, the two ways of specifying momentum are equivalent. It also shows that when `unit_gain=True` you need to scale your learning rate by $1/(1-\\textrm{momentum})$ to match the `unit_gain=False` case"
]
},
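A minimal plain-Python simulation of the two update forms above (assuming the parameter update is `param -= lr * momentum_direction`; `run` is a hypothetical helper, not CNTK) shows the `1/(1-momentum)` equivalence:

```python
# Hedged simulation of the two momentum forms: unit_gain=True with
# lr/(1-momentum) should match unit_gain=False with lr.
def run(lr, momentum, unit_gain, steps, grad=1.0):
    velocity, param = 0.0, 0.0
    for _ in range(steps):
        gain = (1.0 - momentum) if unit_gain else 1.0
        velocity = momentum * velocity + gain * grad
        param -= lr * velocity
    return param

m = 0.9
a = run(lr=1.0, momentum=m, unit_gain=False, steps=5)
b = run(lr=1.0 / (1.0 - m), momentum=m, unit_gain=True, steps=5)
print(a, b)   # the two runs agree (about -13.14)
```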
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-1.436 -1.436]\n",
"[-1.436 -1.436]\n",
"[-1.436 -1.436]\n"
"[-14.3602 -14.3602]\n",
"[-14.3602 -14.3602]\n",
"[-14.3602 -14.3602]\n"
]
}
],
"source": [
"lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)\n",
"ug_schedule = C.learning_rate_schedule(1/(1-momentum), C.UnitType.minibatch)\n",
"momentum = 0.9672161004820059\n",
"minibatch_size = 10\n",
"lr_schedule = C.learning_parameter_schedule(1, minibatch_size=C.learners.IGNORE)\n",
"ug_schedule = C.learning_parameter_schedule(1/(1-momentum), minibatch_size=C.learners.IGNORE)\n",
"\n",
"m_schedule = C.momentum_schedule(momentum)\n",
"t_schedule = C.momentum_as_time_constant_schedule(time_constant)\n",
"m_schedule = C.momentum_schedule(momentum) #minibatch_size=C.learners.IGNORE\n",
"t_schedule = C.momentum_schedule(momentum, minibatch_size=minibatch_size)\n",
"#t_schedule is equivalent to the legacy API: (see the legacy interfaces section below for details)\n",
"# t_schedule = C.momentum_as_time_constant_schedule(300)\n",
"\n",
"msgd = C.momentum_sgd(z.parameters, lr_schedule, m_schedule, unit_gain=False)\n",
"tsgd = C.momentum_sgd(z.parameters, lr_schedule, t_schedule, unit_gain=False)\n",
"usgd = C.momentum_sgd(z.parameters, ug_schedule, m_schedule, unit_gain=True)\n",
"\n",
"print(inspect_update(msgd, mb_size, 5)[0][0])\n",
"print(inspect_update(tsgd, mb_size, 5)[0][0])\n",
"print(inspect_update(usgd, mb_size, 5)[0][0])"
"print(inspect_update(msgd, minibatch_size, 5)[0][0])\n",
"print(inspect_update(tsgd, minibatch_size, 5)[0][0])\n",
"print(inspect_update(usgd, minibatch_size, 5)[0][0])"
]
},
{
@@ -280,34 +388,33 @@
"These methods are typically easier to tune, but there is some new evidence that [they overfit more easily](https://arxiv.org/abs/1705.08292) than SGD with momentum.\n",
"\n",
"Below, we show how these learners can be configured and how their updates affect the model parameters. \n",
"The main take-away is that **if you switch learners, you need to retune the learning rate**. In this example the initial points and gradients are the same yet different learners arrive at different parameter values after 10 minibatches. Since the gradients are always 1, it is fair to say that in this case the learner with the most negative parameter value is the best. However, if we retune the learning rates, the learner with the least negative parameter value (adadelta), we can drive its parameters to similar values as the one with the most negative parameter value (adamax). Also, this is an artificial example where gradients are consistently equal to 1, so the methods that have momemtum built-in (`adam`/`adamax`/`fsadagrad`) should be better than the methods that don't have built-in momentum (for the same value of the learning rate)."
"The main take-away is that **if you switch learners, you need to retune the learning rate**. In this example the initial points and gradients are the same, yet different learners arrive at different parameter values after the same number of minibatches. Since the mean gradients are always 1, it is fair to say that in this case the learner with the most negative parameter value is the best. However, if we retune the learning rate of the learner with the least negative parameter value (adadelta), we can drive its parameters to values similar to those of the learners with the most negative parameter values (e.g. fsadagrad, adam and adamax)."
]
},
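As a hedged illustration of why such adaptive methods need retuned learning rates, here is a bare-bones AdaGrad-style accumulator in plain Python (a simplification for intuition, not CNTK's `adagrad`; `adagrad_updates` is a hypothetical helper):

```python
import math

# Hedged AdaGrad-style sketch: each step divides the gradient by the root
# of the accumulated squared gradients, so with a constant gradient of 1
# the step sizes shrink like 1/sqrt(t).
def adagrad_updates(lr, grad, steps, eps=1e-8):
    accum, param = 0.0, 0.0
    for _ in range(steps):
        accum += grad * grad
        param -= lr * grad / (math.sqrt(accum) + eps)
    return param

print(adagrad_updates(lr=1.0, grad=1.0, steps=10))
# roughly -sum_{t=1..10} 1/sqrt(t), about -5.02, versus plain SGD's -10
```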
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adadelta : [-0.0099 -0.0099]\n",
"adagrad : [-0.3125 -0.3125]\n",
"adam : [-9.9203 -9.9203]\n",
"adamax : [-9.9227 -9.9227]\n",
"fsadagrad: [-8.8573 -8.8573]\n",
"rmsprop : [-0.3125 -0.3125]\n",
"adadelta2: [-9.9228 -9.9228]\n"
"adadelta : [-0.0297 -0.0297]\n",
"adagrad : [-30. -30.]\n",
"adam : [-32.3197 -32.3197]\n",
"adamax : [-32.3276 -32.3276]\n",
"fsadagrad: [-63.1707 -63.1707]\n",
"rmsprop : [-30. -30.]\n",
"adadelta2: [-29.7969 -29.7969]\n"
]
}
],
"source": [
"mb_size = 32\n",
"time_constant = 1000\n",
"\n",
"lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)\n",
"t_schedule = C.momentum_as_time_constant_schedule(time_constant)\n",
"lr_schedule = C.learning_parameter_schedule(1, minibatch_size=C.learners.IGNORE)\n",
"t_schedule = C.momentum_schedule(0.971, minibatch_size=C.learners.IGNORE) \n",
"\n",
"tsgd = C.momentum_sgd(z.parameters, lr_schedule, t_schedule, unit_gain=False)\n",
"\n",
@@ -318,16 +425,17 @@
"fsadagrad = C.fsadagrad(z.parameters, lr_schedule, t_schedule, unit_gain=False)\n",
"rmsprop = C.rmsprop(z.parameters, lr_schedule, gamma=0.999, inc=1.0+1e-9, dec=1.0-1e-9, max=np.inf, min=1e-30)\n",
"\n",
"print('adadelta :', inspect_update(adadelta, mb_size, 10)[0][0])\n",
"print('adagrad :', inspect_update(adagrad, mb_size, 10)[0][0])\n",
"print('adam :', inspect_update(adam, mb_size, 10)[0][0])\n",
"print('adamax :', inspect_update(adamax, mb_size, 10)[0][0])\n",
"print('fsadagrad:', inspect_update(fsadagrad, mb_size, 10)[0][0])\n",
"print('rmsprop :', inspect_update(rmsprop, mb_size, 10)[0][0])\n",
"num_steps = 30\n",
"print('adadelta :', inspect_update(adadelta, mb_size, num_steps)[0][0])\n",
"print('adagrad :', inspect_update(adagrad, mb_size, num_steps)[0][0])\n",
"print('adam :', inspect_update(adam, mb_size, num_steps)[0][0])\n",
"print('adamax :', inspect_update(adamax, mb_size, num_steps)[0][0])\n",
"print('fsadagrad:', inspect_update(fsadagrad, mb_size, num_steps)[0][0])\n",
"print('rmsprop :', inspect_update(rmsprop, mb_size, num_steps)[0][0])\n",
"\n",
"adadelta_schedule = C.learning_rate_schedule(1004, C.UnitType.minibatch)\n",
"adadelta_schedule = C.learning_parameter_schedule(1004, minibatch_size=C.learners.IGNORE)\n",
"adadelta_tuned = C.adadelta(z.parameters, adadelta_schedule, 0.999, 1e-6)\n",
"print('adadelta2:', inspect_update(adadelta_tuned, mb_size, 10)[0][0])"
"print('adadelta2:', inspect_update(adadelta_tuned, mb_size, num_steps)[0][0])"
]
},
{
@@ -345,14 +453,14 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-0.5397 -0.5397]\n"
"[-0.5399 -0.5399]\n"
]
}
],
@@ -392,7 +500,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 13,
"metadata": {},
"outputs": [
{
@@ -443,7 +551,7 @@
" return True\n",
" \n",
"mb_size = 64\n",
"lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)\n",
"lr_schedule = C.learning_parameter_schedule(1, minibatch_size=C.learners.IGNORE)\n",
"my_sgd = MySgd(z.parameters, lr_schedule)\n",
"\n",
"def inspect_user_learner_update(learner, mbsize, count):\n",
@@ -473,6 +581,221 @@
"And that's all there is to learners! They are at the heart of neural network training, but by themselves they are not very useful, and they are typically driven by a trainer. So a good next step for you would be to take a look at our [Trainer howto](How_to_train_using_declarative_and_imperative_API.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Legacy learner interfaces"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we explain how to use the learners with the legacy APIs prior to CNTK release 2.2. The APIs discussed below will be **deprecated in a future release**. Please use the CNTK 2.2 APIs explained above from now on.\n",
"\n",
"Firstly, the learning rate schedule can be specified in two ways in the legacy APIs: "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"lr_schedule_m = C.learning_rate_schedule(0.5, C.UnitType.minibatch)\n",
"lr_schedule_s = C.learning_rate_schedule(0.5, C.UnitType.sample)\n",
"\n",
"sgd_learner_m = C.sgd(z.parameters, lr_schedule_m)\n",
"sgd_learner_s = C.sgd(z.parameters, lr_schedule_s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have created two learners here. When creating a learner we have to specify a learning rate schedule, which can be as simple as specifying a single number (0.5 in this example) or it can be a list of learning rates that specify what the learning rate should be at different points in time. \n",
"\n",
"Currently, the best results with deep learning are obtained by having a small number of *phases* where inside each phase the learning rate is fixed and the learning rate decays by a constant factor when moving between phases. We will come back to this point later.\n",
"\n",
"The second parameter in the learning rate schedule can be one of two different values:\n",
"- Per minibatch\n",
"- Per sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand the difference and get familiar with the learner properties and methods, let's inspect the effect of a learner on the parameters assuming the parameters are all 0 and the gradients are all 1."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"unit = minibatch\n",
" [array([[-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5],\n",
" [-0.5, -0.5]], dtype=float32), array([-0.5, -0.5], dtype=float32), array([[-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5],\n",
" [-0.5, -0.5, -0.5, -0.5]], dtype=float32), array([-0.5, -0.5, -0.5, -0.5], dtype=float32)]\n"
]
}
],
"source": [
"print('\\nunit = minibatch\\n', inspect_update(sgd_learner_m, actual_minibatch_size=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that that SGD is the update `parameter = old_parameter - learning_rate * gradient`, we can conclude that when the learning rate schedule is per minibatch, the learning rate is aplied to the mean gradient of the whole minibatch. Let's see what happens when the learning rate schedule is per sample."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"unit = sample\n",
" [array([[-1., -1.],\n",
" [-1., -1.],\n",
" [-1., -1.],\n",
" [-1., -1.]], dtype=float32), array([-1., -1.], dtype=float32), array([[-1., -1., -1., -1.],\n",
" [-1., -1., -1., -1.],\n",
" [-1., -1., -1., -1.]], dtype=float32), array([-1., -1., -1., -1.], dtype=float32)]\n"
]
}
],
"source": [
"print('\\nunit = sample\\n', inspect_update(sgd_learner_s, actual_minibatch_size=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the per sample specification, the learning rate is applied to the gradient of individual sample instead of the mean gradient of the whole minibatch. CNTK offers both options because in some setups it is more convenient to work with per sample learning rates than per minibatch learning rates and vice versa. \n",
"\n",
"**Key concept**: It is important to understand the ramifications of choosing learning rates per minibatch vs per sample. For example, per minibatch learning rate schedules, typically don't require retuning when you want to change the minibatch size, but per sample schedules do. On the other hand with distributed training it is more accurate to specify the learning rate schedule as per sample rather than per minibatch.\n",
"\n",
"Calling update manually on the learner (as `inspect_update` does) is very tedious and not recommended. Besides, you need to compute the gradients separately and pass them to the learner. Instead, using a [Trainer](https://www.cntk.ai/pythondocs/cntk.train.trainer.html#module-cntk.train.trainer), you don't have to do any of that. The manual update used here is for educational purposes and for the vast majority of use cases CNTK users should avoid performing manual updates."
]
},
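{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scaling concrete, here is a small pure-Python sketch (independent of CNTK; it just mimics the setup of `inspect_update` above, where parameters start at 0 and all gradients are 1) of how the two schedules apply a learning rate of 0.5 to a minibatch of 2 samples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pure-Python sketch of the two scalings (not CNTK code)\n",
"lr = 0.5\n",
"grads = [1.0, 1.0]  # per-sample gradients for a minibatch of 2\n",
"\n",
"# per minibatch: the learning rate multiplies the *mean* gradient\n",
"update_m = -lr * sum(grads) / len(grads)\n",
"\n",
"# per sample: the learning rate multiplies *each* sample's gradient\n",
"update_s = -lr * sum(grads)\n",
"\n",
"print(update_m, update_s)  # -0.5 -1.0, matching the outputs above"
]
},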
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Momentum\n",
"\n",
"Among these learners, `momentum_sgd`, `nesterov`, `fsadagrad`, and `adam` take an additional momentum schedule. \n",
"\n",
"When using momentum, instead of updating the parameter using the current gradient, we update the parameter using all previous gradients exponentially decayed. If there is a consistent direction that the gradients are pointing to, the parameter updates will develop momentum in that direction. [This page](http://distill.pub/2017/momentum/) has a good explanation of momentum.\n",
"\n",
"Like the learning rate schedule, the momentum schedule can be specified in two equivalent ways:\n",
"\n",
"- `momentum_schedule(float or list of floats, epoch_size)`\n",
"\n",
"- `momentum_as_time_constant(float or list of floats, epoch_size)`\n",
"\n",
"As with `learning_rate_schedule`, the arguments are interpreted in the same way, i.e. there's flexibility in specifying different momentum for the first few minibatches and for later minibatches. \n",
"\n",
"The difference between the two calls is just a simple transformation as explained in the following. Since momentum is creating a sort of exponential moving average it is fair to ask \"when does the contribution of an old gradient diminish by a certain constant factor?\". If we choose the constant factor to be $0.5$ we call this the [half-life](https://en.wikipedia.org/wiki/Half-life) and if we choose the constant to be $e^{-1}\\approx 0.368$ we call this the [time constant](https://en.wikipedia.org/wiki/Time_constant). So `momentum_as_time_constant_schedule` specifies the number of samples it would take for the gradient of each minibatch to decay to $0.368$ of its original contribution on the momentum term. Specifying a `momentum_as_time_constant_schedule(300)` and a minibatch size of 10 is a little bit more meaningful than specifying `momentum_schedule(.967...)` even though both lead to the same updates. The way to convert between the two schedules is\n",
"- $\\textrm{momentum} = \\exp(-\\frac{\\textrm{minibatch_size}}{\\textrm{time_constant}})$\n",
"- $\\textrm{time_constant} = \\frac{\\textrm{minibatch_size}}{\\log(1/\\textrm{momentum})}$"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"time constant for momentum of 0.967... = 300.00000000000006\n",
"momentum for time constant of 300 = 0.9672161004820059\n"
]
}
],
"source": [
"mb_size = 10\n",
"time_constant = 300\n",
"momentum = math.exp(-mb_size/time_constant)\n",
"\n",
"print('time constant for momentum of 0.967... = ', mb_size/math.log(1/momentum))\n",
"print('momentum for time constant of 300 = ', math.exp(-mb_size/time_constant))"
]
},
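{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the time-constant interpretation (a pure-Python sketch, not CNTK code): after consuming `time_constant` samples, i.e. `time_constant/mb_size` minibatch updates, the weight of an old gradient in the exponential moving average has decayed to $e^{-1}\\approx 0.368$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"mb_size = 10\n",
"time_constant = 300\n",
"momentum = math.exp(-mb_size/time_constant)\n",
"\n",
"# weight of a gradient after time_constant samples (= 30 minibatch updates)\n",
"decay = momentum ** (time_constant / mb_size)\n",
"print(decay)  # ~0.368, i.e. exp(-1)"
]
},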
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart from the momentum schedule, the momentum learners can also take a boolean \"unit_gain\" argument that determines the form of the momentum update:\n",
"\n",
"- `unit_gain=True`: $\\textrm{momentum_direction} = \\textrm{momentum} \\cdot \\textrm{old_momentum_direction} + (1 - \\textrm{momentum}) \\cdot \\textrm{gradient}$\n",
"\n",
"- `unit_gain=False`: $\\textrm{momentum_direction} = \\textrm{momentum} \\cdot \\textrm{old_momentum_direction} + \\textrm{gradient}$\n",
"\n",
"The idea behind the non-conventional `unit_gain=True` is that when momentum and/or learning rate changes, this way of updating does not lead to divergence. In general, users should exercise great caution when switching learning rate and/or momentum with `unit_gain=False`. One piece of relevant advice is Remark 2 in [this paper](https://arxiv.org/abs/1706.02677) which shows how to adjust your momentum when the learning rate changes in the `unit_gain=False` case.\n",
"\n",
"The following code illustrates that, for the case of `unit_gain=False`, the two ways of specifying momentum (as time constant or not) are equivalent. It also shows that when `unit_gain=True` you need to scale your learning rate by $1/(1-\\textrm{momentum})$ to match the `unit_gain=False` case"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-14.3602 -14.3602]\n",
"[-14.3602 -14.3602]\n",
"[-14.3602 -14.3602]\n"
]
}
],
"source": [
"lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)\n",
"ug_schedule = C.learning_rate_schedule(1/(1-momentum), C.UnitType.minibatch)\n",
"\n",
"m_schedule = C.momentum_schedule(momentum)\n",
"t_schedule = C.momentum_as_time_constant_schedule(time_constant)\n",
"\n",
"msgd = C.momentum_sgd(z.parameters, lr_schedule, m_schedule, unit_gain=False)\n",
"tsgd = C.momentum_sgd(z.parameters, lr_schedule, t_schedule, unit_gain=False)\n",
"usgd = C.momentum_sgd(z.parameters, ug_schedule, m_schedule, unit_gain=True)\n",
"\n",
"print(inspect_update(msgd, mb_size, 5)[0][0])\n",
"print(inspect_update(tsgd, mb_size, 5)[0][0])\n",
"print(inspect_update(usgd, mb_size, 5)[0][0])"
]
},
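{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same equivalence can be verified with plain Python (a sketch of the two update rules above, independent of CNTK), running 5 steps from a parameter of 0 with a constant gradient of 1, as `inspect_update` does:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def run_momentum_sgd(lr, momentum, unit_gain, steps=5):\n",
"    p, v = 0.0, 0.0  # parameter and momentum direction\n",
"    for _ in range(steps):\n",
"        g = 1.0  # constant gradient, as in inspect_update\n",
"        if unit_gain:\n",
"            v = momentum * v + (1 - momentum) * g\n",
"        else:\n",
"            v = momentum * v + g\n",
"        p -= lr * v\n",
"    return p\n",
"\n",
"p_plain = run_momentum_sgd(1.0, momentum, unit_gain=False)\n",
"p_unit = run_momentum_sgd(1.0/(1 - momentum), momentum, unit_gain=True)\n",
"print(p_plain, p_unit)  # both ~-14.3602, matching the CNTK learners above"
]
},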
{
"cell_type": "code",
"execution_count": null,
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
"version": "3.5.4"
}
},
"nbformat": 4,