Fix some typos, tune
This commit is contained in:
Parent
d321bdb60a
Commit
e8c9865977
|
@ -10,7 +10,7 @@
|
|||
"\n",
|
||||
"This tutorial is targeted to individuals who are new to CNTK and to machine learning. In this tutorial, you will train a simple yet powerful machine learning model that is widely used in industry for a variety of applications. The model trained below scales to massive data sets in the most expeditious manner by harnessing computational scalability leveraging the computational resources you may have (one or more CPU cores, one or more GPUs, a cluster of CPUs or a cluster of GPUs), transparently via the CNTK library.\n",
|
||||
"\n",
|
||||
"The following notebook users Python APIs. If you are looking for this example in Brainscript, please look [here](https://github.com/Microsoft/CNTK/tree/v2.0.beta3.0/Tutorials/HelloWorld-LogisticRegression). \n",
|
||||
"The following notebook users Python APIs. If you are looking for this example in BrainScript, please look [here](https://github.com/Microsoft/CNTK/tree/v2.0.beta3.0/Tutorials/HelloWorld-LogisticRegression). \n",
|
||||
"\n",
|
||||
"## Introduction\n",
|
||||
"\n",
|
||||
|
@ -27,7 +27,7 @@
|
|||
"In the figure above, the green line represents the learnt model from the data and separates the blue dots from the red dots. In this tutorial, we will walk you through the steps to learn the green line. Note: this classifier does make mistakes where couple of blue dots are on the wrong side of the green line. However, there are ways to fix this and we will look into some of the techniques in later tutorials. \n",
|
||||
"\n",
|
||||
"**Approach**: \n",
|
||||
"Any learning algorithm has typically 5 stages namely, Data reading, Data preprocessing, Creating a model, Learning the model parameters and Evaluating (a.k.a. testing/prediction) the model. \n",
|
||||
"Any learning algorithm has typically five stages. These are Data reading, Data preprocessing, Creating a model, Learning the model parameters, and Evaluating (a.k.a. testing/prediction) the model. \n",
|
||||
"\n",
|
||||
">1. Data reading: We generate simulated data sets with each sample having two features (plotted below) indicative of the age and tumor size.\n",
|
||||
">2. Data preprocessing: Often the individual features such as size or age needs to be scaled. Typically one would scale the data between 0 and 1. To keep things simple, we are not doing any scaling in this tutorial (for details look here: [feature scaling][]).\n",
|
||||
|
@ -36,7 +36,7 @@
|
|||
">5. Evaluation: This is also known as testing where one takes data sets with known labels (a.k.a ground-truth) that was not ever used for training. This allows us to assess how a model would perform in real world (previously unseen) observations.\n",
|
||||
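As a side note on the scaling mentioned in step 2 above, a minimal min-max scaling sketch with NumPy might look like the following; the `features` array and its values are purely illustrative, and the tutorial itself deliberately skips this step.

```python
import numpy as np

# Hypothetical feature matrix (rows = observations; columns = age, tumor size)
features = np.array([[65.0, 4.2], [30.0, 1.1], [52.0, 2.8]], dtype=np.float32)

# Min-max scale each column into the [0, 1] range
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)
```

Scaling keeps features with very different ranges (years of age vs. centimeters of tumor size) from dominating the learnt weights.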
"\n",
|
||||
"## Logistic Regression\n",
|
||||
"[Logistic regression][] is fundamental machine learning technique that uses a linear weighted combination of features and generates the probability of predicting different classes. In our case the classifer will generate a probability in [0,1] which can then be compared with a threshold (such as 0.5) to produce a binary label (0 or 1). However, the method shown can be extended to multiple classes easily. \n",
|
||||
"[Logistic regression][] is fundamental machine learning technique that uses a linear weighted combination of features and generates the probability of predicting different classes. In our case the classifier will generate a probability in [0,1] which can then be compared with a threshold (such as 0.5) to produce a binary label (0 or 1). However, the method shown can be extended to multiple classes easily. \n",
|
||||
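As a quick illustration of that linear weighted combination, here is a minimal NumPy sketch; the weights, bias, and observation are made-up numbers, not the parameters the tutorial goes on to learn.

```python
import numpy as np

x = np.array([50.0, 3.0])      # one observation: age, tumor size
w = np.array([0.01, 0.8])      # illustrative weights (not learnt values)
b = -2.0                       # illustrative bias term

z = np.dot(w, x) + b           # linear weighted combination of features
p = 1.0 / (1.0 + np.exp(-z))   # sigmoid maps z to a probability in [0, 1]
label = 1 if p > 0.5 else 0    # compare with a threshold to produce a binary label
```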
"\n",
|
||||
"<img src=\"https://www.cntk.ai/jup/logistic_neuron.jpg\", width=300, height=200>\n",
|
||||
"\n",
|
||||
|
@ -83,7 +83,7 @@
|
|||
"## Data Generation\n",
|
||||
"Let us generate some synthetic data emulating the cancer example using `numpy` library. We have two features (represented in two-dimensions) each either being to one of the two classes (benign:blue dot or malignant:red dot). \n",
|
||||
"\n",
|
||||
"In our example, each observation in the training data has a label (blue or red) corresponding to each observation (set of features - age and size). In this example, we have two classes represened by labels 0 or 1, thus a binary classification task. "
|
||||
"In our example, each observation in the training data has a label (blue or red) corresponding to each observation (set of features - age and size). In this example, we have two classes represented by labels 0 or 1, thus a binary classification task. "
|
||||
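A minimal NumPy sketch of such synthetic data follows; the exact recipe used by the notebook's `generate_random_data_sample` helper may differ, and the offsets below are assumptions chosen only to keep the two classes separable.

```python
import numpy as np

np.random.seed(0)
sample_size, feature_dim, num_classes = 32, 2, 2

# Random class labels (0 or 1), one per observation
Y = np.random.randint(size=(sample_size, 1), low=0, high=num_classes)

# Gaussian features shifted by the class so the two point clouds are separable
X = ((np.random.randn(sample_size, feature_dim) + 3) * (Y + 1)).astype(np.float32)
```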
]
|
||||
},
|
||||
{
|
||||
|
@ -188,7 +188,7 @@
|
|||
"import matplotlib.pyplot as plt\n",
|
||||
"%matplotlib inline\n",
|
||||
"\n",
|
||||
"#given this is a 2 class () \n",
|
||||
"# given this is a 2 class () \n",
|
||||
"colors = ['r' if l == 0 else 'b' for l in labels[:,0]]\n",
|
||||
"\n",
|
||||
"plt.scatter(features[:,0], features[:,1], c=colors)\n",
|
||||
|
@ -218,7 +218,7 @@
|
|||
"\n",
|
||||
"Network input and output: \n",
|
||||
"- **input** variable (a key CNTK concept): \n",
|
||||
">An **input** variable is a user-code facing container where user-provided code fills in different observations (data point or sample, equivalent to a blue/red dot in our example) as inputs to the model function during model learning (a.k.a.training) and model evaluation (a.k.a testing). Thus, the shape of the `input_variable` must match the shape of the data that will be provided. For example, when data are images each of height 10 pixels and width 5 pixels, the input feature dimension will be 2 (representing image height and width). Similarly, in our example the dimensions are age and tumor size, thus `input_dim` = 2). More on data and their dimensions to appear in separate tutorials. \n",
|
||||
">An **input** variable is a user-code facing container where user-provided code fills in different observations (data point or sample, equivalent to a blue/red dot in our example) as inputs to the model function during model learning (a.k.a.training) and model evaluation (a.k.a testing). Thus, the shape of the `input_variable` must match the shape of the data that will be provided. For example, when data are images each of height 10 pixels and width 5 pixels, the input feature dimension will be 2 (representing image height and width). Similarly, in our example the dimensions are age and tumor size, thus `input_dim` = 2. More on data and their dimensions to appear in separate tutorials. \n",
|
||||
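A minimal sketch of declaring such containers, assuming the CNTK 2 Python API (module layout shifted slightly between the beta releases, so treat the import style and variable names as assumptions):

```python
import cntk as C

input_dim = 2            # age and tumor size
num_output_classes = 2   # benign vs. malignant

# Containers that training/evaluation code later fills with observations
feature = C.input_variable(input_dim)
label = C.input_variable(num_output_classes)
```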
"\n",
|
||||
"[bias]: https://www.quora.com/What-does-the-bias-term-represent-in-logistic-regression\n",
|
||||
"\n",
|
||||
|
@ -407,8 +407,8 @@
|
|||
"source": [
|
||||
"from cntk.utils import get_train_eval_criterion, get_train_loss\n",
|
||||
"\n",
|
||||
"# Define a utiltiy function to compute moving average sum (\n",
|
||||
"# More efficient implementation is possible with np.cumsum() function\n",
|
||||
"# Define a utility function to compute moving average sum.\n",
|
||||
"# A more efficient implementation is possible with np.cumsum() function\n",
|
||||
"def moving_average(a, w=10) :\n",
|
||||
" if len(a) < w: \n",
|
||||
" return a[:] \n",
|
||||
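As the comment above notes, a more efficient variant is possible with `np.cumsum()`. A hedged sketch of that idea, assuming the same window semantics as the list-comprehension version (the first `w` values are passed through unchanged, and each later entry averages the `w` values that precede it):

```python
import numpy as np

def moving_average_cumsum(a, w=10):
    a = np.asarray(a, dtype=float)
    if len(a) < w:
        return a.copy()
    c = np.cumsum(np.insert(a, 0, 0.0))   # c[i] = sum of the first i elements
    out = a.copy()
    idx = np.arange(w, len(a))
    out[idx] = (c[idx] - c[idx - w]) / w  # mean of a[idx-w:idx]
    return out
```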
|
@ -486,7 +486,7 @@
|
|||
}
|
||||
],
|
||||
"source": [
|
||||
"# Run the trainer on and perform model training\n",
|
||||
"# Run the trainer and perform model training\n",
|
||||
"training_progress_output_freq = 50\n",
|
||||
"\n",
|
||||
"plotdata = {\"batchsize\":[], \"loss\":[], \"error\":[]}\n",
|
||||
|
@ -535,12 +535,11 @@
|
|||
}
|
||||
],
|
||||
"source": [
|
||||
"# Compute the moving average loss to smooth out the noise in SGD \n",
|
||||
"\n",
|
||||
"# Compute the moving average loss to smooth out the noise in SGD\n",
|
||||
"plotdata[\"avgloss\"] = moving_average(plotdata[\"loss\"])\n",
|
||||
"plotdata[\"avgerror\"] = moving_average(plotdata[\"error\"])\n",
|
||||
"\n",
|
||||
"#Plot the training loss and the training error\n",
|
||||
"# Plot the training loss and the training error\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"plt.figure(1)\n",
|
||||
|
@ -548,7 +547,7 @@
|
|||
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgloss\"], 'b--')\n",
|
||||
"plt.xlabel('Minibatch number')\n",
|
||||
"plt.ylabel('Loss')\n",
|
||||
"plt.title('Minibatch run vs. Training loss ')\n",
|
||||
"plt.title('Minibatch run vs. Training loss')\n",
|
||||
"\n",
|
||||
"plt.show()\n",
|
||||
"\n",
|
||||
|
@ -556,7 +555,7 @@
|
|||
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgerror\"], 'r--')\n",
|
||||
"plt.xlabel('Minibatch number')\n",
|
||||
"plt.ylabel('Label Prediction Error')\n",
|
||||
"plt.title('Minibatch run vs. Label Prediction Error ')\n",
|
||||
"plt.title('Minibatch run vs. Label Prediction Error')\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
|
@ -566,7 +565,7 @@
|
|||
"source": [
|
||||
"## Evaluation / Testing \n",
|
||||
"\n",
|
||||
"Now that we have trained the network. Let us evaluate the trained network on data that hasn't been used for training. This is called **testing**. Let us create some new data and evaluate the average error & loss on this set. This is done using `trainer.test_minibatch`. Note the error on this previously unseen data is comparable to training error. This is a **key** check. Should the error be larger than the training error by a large margin, it indicates that the train model will not perform well on data that it has not seen during training. This is known as [overfitting][]. There are several ways to address overfitting that is beyond the scope of this tutorial but CNTK toolkit provide the necessary components to address overfitting.\n",
|
||||
"Now that we have trained the network. Let us evaluate the trained network on data that hasn't been used for training. This is called **testing**. Let us create some new data and evaluate the average error and loss on this set. This is done using `trainer.test_minibatch`. Note the error on this previously unseen data is comparable to training error. This is a **key** check. Should the error be larger than the training error by a large margin, it indicates that the trained model will not perform well on data that it has not seen during training. This is known as [overfitting][]. There are several ways to address overfitting that is beyond the scope of this tutorial but the Cognitive Toolkit provides the necessary components to address overfitting.\n",
|
||||
"\n",
|
||||
"Note: We are testing on a single minibatch for illustrative purposes. In practice one runs several minibatches of test data and reports the average. \n",
|
||||
"\n",
|
||||
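For orientation, a hedged sketch of what such a test pass might look like; the names `trainer`, `feature`, `label`, and `generate_random_data_sample` are assumed to be the objects defined in the earlier cells of this notebook.

```python
# Generate a fresh batch the model has never seen during training
test_minibatch_size = 25
X_test, Y_test = generate_random_data_sample(test_minibatch_size, input_dim, num_output_classes)

# test_minibatch returns the average evaluation criterion (error) on this batch
avg_error = trainer.test_minibatch({feature: X_test, label: Y_test})
print("Average test error: {:.2f}".format(avg_error))
```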
|
@ -595,7 +594,6 @@
|
|||
],
|
||||
"source": [
|
||||
"# Run the trained model on newly generated dataset\n",
|
||||
"# \n",
|
||||
"test_minibatch_size = 25\n",
|
||||
"features, labels = generate_random_data_sample(test_minibatch_size, input_dim, num_output_classes)\n",
|
||||
"\n",
|
||||
|
@ -626,7 +624,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Lets compare the ground-truth label with the predictions. They should be in agreement.\n",
|
||||
"Let us compare the ground-truth label with the predictions. They should be in agreement.\n",
|
||||
"\n",
|
||||
"**Question:** \n",
|
||||
"- How many predictions were mislabeled? Can you change the code below to identify which observations were misclassified? "
|
||||
|
@ -660,7 +658,7 @@
|
|||
},
|
||||
"source": [
|
||||
"### Visualization\n",
|
||||
"It is desirable to visualize the results. In this example, the data is conveniently in two dimensions and can be plotted. For data with higher dimensions, visualtion can be challenging. There are advanced dimensionality reduction techniques that allow for such visualisations [t-sne][].\n",
|
||||
"It is desirable to visualize the results. In this example, the data is conveniently in two dimensions and can be plotted. For data with higher dimensions, visualization can be challenging. There are advanced dimensionality reduction techniques that allow for such visualizations [t-sne][].\n",
|
||||
"\n",
|
||||
"[t-sne]: https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding"
|
||||
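For completeness, a hedged sketch of such a projection using scikit-learn's `TSNE`; scikit-learn is not otherwise used in this tutorial, and `high_dim_features` is a placeholder array.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder: 200 observations with 50 features each
high_dim_features = np.random.randn(200, 50)

# Project to 2 dimensions for plotting
embedding = TSNE(n_components=2).fit_transform(high_dim_features)
print(embedding.shape)   # (200, 2)
```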
]
|
||||
|
@ -700,7 +698,7 @@
|
|||
"# Plot the data \n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"#given this is a 2 class \n",
|
||||
"# given this is a 2 class \n",
|
||||
"colors = ['r' if l == 0 else 'b' for l in labels[:,0]]\n",
|
||||
"plt.scatter(features[:,0], features[:,1], c=colors)\n",
|
||||
"plt.plot([0, bias_vector[0]/weight_matrix[0][1]], \n",
|
||||
|
@ -716,7 +714,7 @@
|
|||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"**Exploration Suggestion** \n",
|
||||
"**Exploration Suggestions** \n",
|
||||
"- Try exploring how the classifier behaves with different data distributions - suggest changing the `minibatch_size` parameter from 25 to say 64. Why is the error increasing?\n",
|
||||
"- Try exploring different activation functions\n",
|
||||
"- Try exploring different learners \n",
|
||||
|
|
|
@ -90,7 +90,7 @@
|
|||
"\n",
|
||||
"Let us generate some synthetic data emulating the cancer example using `numpy` library. We have two features (represented in two-dimensions) each either being to one of the two classes (benign:blue dot or malignant:red dot). \n",
|
||||
"\n",
|
||||
"In our example, each observation in the training data has a label (blue or red) corresponding to each observation (set of features - age and size). In this example, we have two classes represened by labels 0 or 1, thus a binary classification task."
|
||||
"In our example, each observation in the training data has a label (blue or red) corresponding to each observation (set of features - age and size). In this example, we have two classes represented by labels 0 or 1, thus a binary classification task."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -101,7 +101,7 @@
|
|||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Ensure we always get the same amount of randomness\n",
|
||||
"# Ensure we always get the same amount of randomness\n",
|
||||
"np.random.seed(0)\n",
|
||||
"\n",
|
||||
"# Define the data dimensions\n",
|
||||
|
@ -115,7 +115,7 @@
|
|||
"source": [
|
||||
"### Input and Labels\n",
|
||||
"\n",
|
||||
"In this tutorial we are generating synthetic data using `numpy` library. In real world problems, one would use a reader, that would read feature values (`features`: *age* and *tumor size*) corresponding to each obeservation (patient). Note, each observation can reside in a higher dimension space (when more features are available) and will be represented as a tensor in CNTK. More advanced tutorials shall introduce the handling of high dimensional data."
|
||||
"In this tutorial we are generating synthetic data using `numpy` library. In real world problems, one would use a reader, that would read feature values (`features`: *age* and *tumor size*) corresponding to each observation (patient). Note, each observation can reside in a higher dimension space (when more features are available) and will be represented as a tensor in CNTK. More advanced tutorials shall introduce the handling of high dimensional data."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -126,7 +126,7 @@
|
|||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Helper function to generate a random data sample\n",
|
||||
"# Helper function to generate a random data sample\n",
|
||||
"def generate_random_data_sample(sample_size, feature_dim, num_classes):\n",
|
||||
" # Create synthetic data using NumPy. \n",
|
||||
" Y = np.random.randint(size=(sample_size, 1), low=0, high=num_classes)\n",
|
||||
|
@ -188,7 +188,7 @@
|
|||
"import matplotlib.pyplot as plt\n",
|
||||
"%matplotlib inline\n",
|
||||
"\n",
|
||||
"#given this is a 2 class \n",
|
||||
"# given this is a 2 class \n",
|
||||
"colors = ['r' if l == 0 else 'b' for l in labels[:,0]]\n",
|
||||
"\n",
|
||||
"plt.scatter(features[:,0], features[:,1], c=colors)\n",
|
||||
|
@ -456,8 +456,8 @@
|
|||
"\n",
|
||||
"With minibatches we often sample observation from the larger training dataset. We repeat the process of model parameters update using different combination of training samples and over a period of time minimize the `loss` (and the error). When the incremental error rates are no longer changing significantly or after a preset number of maximum minibatches to train, we claim that our model is trained.\n",
|
||||
"\n",
|
||||
"One of the key parameter for optimization is called the `learning_rate`. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration. We will be covering more details in later tutorial. \n",
|
||||
"With this information, we are ready to create our trainer. \n",
|
||||
"One of the key parameter for optimization is called the learning rate. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration. We will be covering more details in later tutorial. \n",
|
||||
"With this information, we are ready to create our trainer.\n",
|
||||
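A hedged sketch of assembling such a trainer with the CNTK 2 Python API; exact module paths moved around during the betas, and `z`, `loss`, and `eval_error` stand for the model function and criteria defined earlier in the notebook.

```python
import cntk as C

# Learning rate stated per minibatch; a per-sample schedule is also possible
lr_schedule = C.learning_rate_schedule(0.5, C.UnitType.minibatch)

# Plain stochastic gradient descent over the model parameters
learner = C.sgd(z.parameters, lr_schedule)

# The trainer ties together the model, the loss, the evaluation criterion and the learner
trainer = C.Trainer(z, (loss, eval_error), [learner])
```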
"\n",
|
||||
"[optimization]: https://en.wikipedia.org/wiki/Category:Convex_optimization\n",
|
||||
"[Stochastic Gradient Descent]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent\n",
|
||||
|
@ -500,7 +500,7 @@
|
|||
"def moving_average(a, w=10) :\n",
|
||||
" \n",
|
||||
" if len(a) < w: \n",
|
||||
" return a[:] #Need to send a copy of the array\n",
|
||||
" return a[:] # Need to send a copy of the array\n",
|
||||
" return [val if idx < w else sum(a[(idx-w):idx])/w for idx, val in enumerate(a)]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
|
@ -541,7 +541,7 @@
|
|||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Initialize the parameters for the trainer\n",
|
||||
"# Initialize the parameters for the trainer\n",
|
||||
"minibatch_size = 25\n",
|
||||
"num_samples = 20000\n",
|
||||
"num_minibatches_to_train = num_samples / minibatch_size"
|
||||
|
@ -555,7 +555,7 @@
|
|||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Run the trainer on and perform model training\n",
|
||||
"# Run the trainer and perform model training\n",
|
||||
"training_progress_output_freq = 20\n",
|
||||
"\n",
|
||||
"plotdata = {\"batchsize\":[], \"loss\":[], \"error\":[]}\n",
|
||||
|
@ -614,12 +614,12 @@
|
|||
}
|
||||
],
|
||||
"source": [
|
||||
"#Compute the moving average loss to smooth out the noise in SGD \n",
|
||||
"# Compute the moving average loss to smooth out the noise in SGD\n",
|
||||
"\n",
|
||||
"plotdata[\"avgloss\"] = moving_average(plotdata[\"loss\"])\n",
|
||||
"plotdata[\"avgerror\"] = moving_average(plotdata[\"error\"])\n",
|
||||
"\n",
|
||||
"#Plot the training loss and the training error\n",
|
||||
"# Plot the training loss and the training error\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"plt.figure(1)\n",
|
||||
|
@ -627,7 +627,7 @@
|
|||
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgloss\"], 'b--')\n",
|
||||
"plt.xlabel('Minibatch number')\n",
|
||||
"plt.ylabel('Loss')\n",
|
||||
"plt.title('Minibatch run vs. Training loss ')\n",
|
||||
"plt.title('Minibatch run vs. Training loss')\n",
|
||||
"\n",
|
||||
"plt.show()\n",
|
||||
"\n",
|
||||
|
@ -635,7 +635,7 @@
|
|||
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgerror\"], 'r--')\n",
|
||||
"plt.xlabel('Minibatch number')\n",
|
||||
"plt.ylabel('Label Prediction Error')\n",
|
||||
"plt.title('Minibatch run vs. Label Prediction Error ')\n",
|
||||
"plt.title('Minibatch run vs. Label Prediction Error')\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
|
@ -667,7 +667,7 @@
|
|||
}
|
||||
],
|
||||
"source": [
|
||||
"#Generate new data\n",
|
||||
"# Generate new data\n",
|
||||
"test_minibatch_size = 25\n",
|
||||
"features, labels = generate_random_data_sample(test_minibatch_size, input_dim, num_output_classes)\n",
|
||||
"\n",
|
||||
|
@ -709,7 +709,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Lets test on previously unseen data."
|
||||
"Let us test on previously unseen data."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -12,7 +12,7 @@
|
|||
"\n",
|
||||
"We assume that you have successfully completed CNTK 103 Part A.\n",
|
||||
"\n",
|
||||
"In this tutorial we will train a fully connected network on MNIST data. This notebook provides the recipe using Python APIs. If you are looking for this example in Brainscript, please look [here](https://github.com/Microsoft/CNTK/tree/v2.0.beta3.0/Examples/Image/GettingStarted)\n",
|
||||
"In this tutorial we will train a fully connected network on MNIST data. This notebook provides the recipe using Python APIs. If you are looking for this example in BrainScript, please look [here](https://github.com/Microsoft/CNTK/tree/v2.0.beta3.0/Examples/Image/GettingStarted)\n",
|
||||
"\n",
|
||||
"## Introduction\n",
|
||||
"\n",
|
||||
|
@ -246,7 +246,7 @@
|
|||
"\n",
|
||||
"** Suggested Activity **\n",
|
||||
"- Record the training error you get with `sigmoid` as the activation function\n",
|
||||
"- Now change to `relu` as the activation fucntion and see if you can improve your training error\n",
|
||||
"- Now change to `relu` as the activation function and see if you can improve your training error\n",
|
||||
"\n",
|
||||
"*Quiz*: Different supported activation functions can be [found here][]. Which activation function gives the least training error?\n",
|
||||
"\n",
|
||||
|
@ -630,7 +630,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We have so far been dealing with aggregate measures of error. Lets now get the probabilities associated with individual data points. For each observation, the `eval` function returns the probability distribution across all the classes. The classifer is trained to recognize digits, hence has 10 classes. First let us route the network output through a `softmax` function. This maps the aggregated activations across the netowrk to probabilities across the 10 classes."
|
||||
"We have so far been dealing with aggregate measures of error. Lets now get the probabilities associated with individual data points. For each observation, the `eval` function returns the probability distribution across all the classes. The classifier is trained to recognize digits, hence has 10 classes. First let us route the network output through a `softmax` function. This maps the aggregated activations across the network to probabilities across the 10 classes."
|
||||
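For intuition, a minimal NumPy version of what a softmax does to a vector of 10 activations (the notebook itself routes the network output through CNTK's `softmax` op):

```python
import numpy as np

activations = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 0.0, 2.5, 1.0, 0.2, 0.8])

# Subtract the max for numerical stability, exponentiate, then normalize
shifted = activations - activations.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()
print(probabilities.sum())   # 1.0 - a proper distribution over the 10 classes
```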
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -763,7 +763,7 @@
|
|||
"that creates two recurrent layer instances (one forward, one backward), and then defines an `apply_x` function\n",
|
||||
"which applies both layer instances to the same `x` and concatenate the two results.\n",
|
||||
"\n",
|
||||
"Allright, give it a try! To know how to realize a backward recursion in CNTK,\n",
|
||||
"Alright, give it a try! To know how to realize a backward recursion in CNTK,\n",
|
||||
"please take a hint from how the forward recursion is done.\n",
|
||||
"Please also do the following:\n",
|
||||
"* remove the one-word lookahead you added in the previous task, which we aim to replace; and\n",
|
||||
|
|
|
@ -11,7 +11,7 @@
|
|||
"\n",
|
||||
"In some machine learning settings, we do not have immediate access to labels, so we cannot rely on supervised learning techniques. If, however, there is something we can interact with and thereby get some feedback that tells us occasionally, whether our previous behavior was good or not, we can use RL to learn how to improve our behavior.\n",
|
||||
"\n",
|
||||
"Unlike in supervised learning, in RL, labeled correct input/output pairs are never presented and sub-optimal actions are never explicitly corrected. This mimics many of the online learning paradigms which involves finding a balance between exploration (of conditions or actions never learnt before) and exploitation (of already learnt conditions or actions from pervious encounters). Multi-arm bandit problems is one of the category of RL algorithms where exploration vs. exploitation trade-off have been thoroughly studied. See figure below for [reference](http://www.simongrant.org/pubs/thesis/3/2.html).\n",
|
||||
"Unlike in supervised learning, in RL, labeled correct input/output pairs are never presented and sub-optimal actions are never explicitly corrected. This mimics many of the online learning paradigms which involves finding a balance between exploration (of conditions or actions never learnt before) and exploitation (of already learnt conditions or actions from previous encounters). Multi-arm bandit problems is one of the category of RL algorithms where exploration vs. exploitation trade-off have been thoroughly studied. See figure below for [reference](http://www.simongrant.org/pubs/thesis/3/2.html).\n",
|
||||
"\n",
|
||||
"<img src=\"https://cntk.ai/jup/polecart.gif\", width=300, height=300>\n",
|
||||
"\n",
|
||||
|
@ -130,7 +130,7 @@
|
|||
"\n",
|
||||
"DQNs\n",
|
||||
" * learn the _Q-function_ that maps observation (state, action) to a `score`\n",
|
||||
" * use memory replay (previously recorded $Q$ values corresponding to differnt $(s,a)$ to decorrelate experiences (sequence state transitions)\n",
|
||||
" * use memory replay (previously recorded $Q$ values corresponding to different $(s,a)$ to decorrelate experiences (sequence state transitions)\n",
|
||||
" * use a second network to stabilize learning (*not* part of this tutorial)"
|
||||
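A minimal replay-buffer sketch of the memory-replay idea mentioned above; this is plain Python, independent of the notebook's actual agent implementation.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off the end

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```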
]
|
||||
},
|
||||
|
@ -207,7 +207,7 @@
|
|||
"\n",
|
||||
"Additionally, you will note that, CNTK model doesn't need to be compiled explicitly and is implicitly done when data is processed during training.\n",
|
||||
"\n",
|
||||
"CNTK effectively uses available memory on the system between minibatch execution. Thus the leanring rates are stated as **rates per sample** instead of **rates per minibatch** (as with other toolkits)."
|
||||
"CNTK effectively uses available memory on the system between minibatch execution. Thus the learning rates are stated as **rates per sample** instead of **rates per minibatch** (as with other toolkits)."
|
||||
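A hedged sketch of the distinction; the numeric values are made up, and `learning_rate_schedule`/`UnitType` are from the CNTK 2 Python API.

```python
import cntk as C

minibatch_size = 64
lr_per_minibatch = 0.2

# A per-minibatch rate from another toolkit roughly corresponds to this per-sample rate
lr_per_sample = lr_per_minibatch / minibatch_size

lr_schedule = C.learning_rate_schedule(lr_per_sample, C.UnitType.sample)
```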
]
|
||||
},
|
||||
{
|
||||
|
@ -420,7 +420,7 @@
|
|||
"source": [
|
||||
"### Exploration - exploitation trade-off\n",
|
||||
"\n",
|
||||
"Note initiall $\\epsilon$ is set to 1 which implies we are enitrely exploraing but as steps increase we reduce exploration and start leveraging the learnt space to collect rewards (a.k.a exploitation) as well."
|
||||
"Note that the initial $\\epsilon$ is set to 1 which implies we are entirely explorating but as steps increase we reduce exploration and start leveraging the learnt space to collect rewards (a.k.a exploitation) as well."
|
||||
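A small sketch of one common annealing schedule for $\epsilon$; the exact schedule the notebook uses may differ, and the constants here are assumptions.

```python
import numpy as np

EPSILON_MAX, EPSILON_MIN, DECAY = 1.0, 0.05, 0.0001

def epsilon(step):
    # Start fully exploring (epsilon = 1) and decay towards mostly exploiting
    return EPSILON_MIN + (EPSILON_MAX - EPSILON_MIN) * np.exp(-DECAY * step)

print(epsilon(0), epsilon(50000))   # ~1.0 at the start, close to EPSILON_MIN later
```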
]
|
||||
},
|
||||
{
|
||||
|
|