{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "29b9bd1d-766f-4422-ad96-de0accc1ce58"
}
},
"source": [
"# CNTK 103: Part B - Feed Forward Network with MNIST\n",
"\n",
"We assume that you have successfully completed CNTK 103 Part A.\n",
"\n",
"In this tutorial we will train a fully connected network on MNIST data. This notebook provides the recipe using Python APIs. If you are looking for this example in BrainScript, please look [here](https://github.com/Microsoft/CNTK/tree/v2.0.beta9.0/Examples/Image/GettingStarted)\n",
"\n",
"## Introduction\n",
"\n",
"**Problem** (recap from the CNTK 101):\n",
"\n",
"The MNIST data comprises of hand-written digits with little background noise.\n",
"\n",
"<img src=\"http://3.bp.blogspot.com/_UpN7DfJA0j4/TJtUBWPk0SI/AAAAAAAAABY/oWPMtmqJn3k/s1600/mnist_originals.png\", width=200, height=200>\n",
"\n",
"\n",
"**Goal**:\n",
"Our goal is to train a classifier that will identify the digits in the MNIST dataset. \n",
"\n",
"**Approach**:\n",
"The same 5 stages we have used in the previous tutorial are applicable: Data reading, Data preprocessing, Creating a model, Learning the model parameters and Evaluating (a.k.a. testing/prediction) the model. \n",
"- Data reading: We will use the CNTK Text reader \n",
"- Data preprocessing: Covered in part A (suggested extension section). \n",
"\n",
"Rest of the steps are kept identical to CNTK 102. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"nbpresent": {
"id": "138d1a78-02e2-4bd6-a20e-07b83f303563"
}
},
"outputs": [],
"source": [
"# Import the relevant components\n",
"from __future__ import print_function\n",
"import matplotlib.image as mpimg\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import sys\n",
"import os\n",
"\n",
"import cntk as C\n",
"from cntk import Trainer, learning_rate_schedule, UnitType\n",
"from cntk.blocks import default_options, Input\n",
"from cntk.io import CTFDeserializer, MinibatchSource, StreamDef, StreamDefs\n",
"from cntk.io import INFINITELY_REPEAT, FULL_DATA_SWEEP\n",
"from cntk.initializer import glorot_uniform\n",
"from cntk.layers import Dense\n",
"from cntk.learner import sgd\n",
"\n",
"# Select the right target device when this notebook is being tested:\n",
"if 'TEST_DEVICE' in os.environ:\n",
" import cntk\n",
" if os.environ['TEST_DEVICE'] == 'cpu':\n",
" cntk.device.set_default_device(cntk.device.cpu())\n",
" else:\n",
" cntk.device.set_default_device(cntk.device.gpu(0))\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data reading\n",
"\n",
"In this section, we will read the data generated in CNTK 103 Part B."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Ensure we always get the same amount of randomness\n",
"np.random.seed(0)\n",
"\n",
"# Define the data dimensions\n",
"input_dim = 784\n",
"num_output_classes = 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data reading\n",
"\n",
"In this tutorial we are using the MNIST data you have downloaded using CNTK_103A_MNIST_DataLoader notebook. The dataset has 60,000 training images and 10,000 test images with each image being 28 x 28 pixels. Thus the number of features is equal to 784 (= 28 x 28 pixels), 1 per pixel. The variable `num_output_classes` is set to 10 corresponding to the number of digits (0-9) in the dataset.\n",
"\n",
"The data is in the following format:\n",
"\n",
" |labels 0 0 0 0 0 0 0 1 0 0 |features 0 0 0 0 ... \n",
" (784 integers each representing a pixel)\n",
" \n",
"In this tutorial we are going to use the image pixels corresponding the integer stream named \"features\". We define a `create_reader` function to read the training and test data using the [CTF deserializer](https://cntk.ai/pythondocs/cntk.io.html?highlight=ctfdeserializer#cntk.io.CTFDeserializer). The labels are [1-hot encoded](https://en.wikipedia.org/wiki/One-hot). \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read a CTF formatted text (as mentioned above) using the CTF deserializer from a file\n",
"def create_reader(path, is_training, input_dim, num_label_classes):\n",
" return MinibatchSource(CTFDeserializer(path, StreamDefs(\n",
" labels = StreamDef(field='labels', shape=num_label_classes, is_sparse=False),\n",
" features = StreamDef(field='features', shape=input_dim, is_sparse=False)\n",
" )), randomize = is_training, epoch_size = INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)"
]
},
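{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the CTF layout above concrete, here is a minimal sketch that parses one such line with plain Python and NumPy. It is only an illustration and not how CNTK reads the file: `parse_ctf_line` is a hypothetical helper (not a CNTK function), and the sample line is shortened to 4 pixel values instead of 784.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def parse_ctf_line(line):\n",
"    # Hypothetical helper: turn '|name v1 v2 ...' fields into NumPy arrays.\n",
"    fields = {}\n",
"    for chunk in line.strip().split('|')[1:]:\n",
"        parts = chunk.split()\n",
"        name, values = parts[0], parts[1:]\n",
"        fields[name] = np.array([float(v) for v in values], dtype=np.float32)\n",
"    return fields\n",
"\n",
"# Shortened sample: a real line carries 784 pixel values.\n",
"sample = '|labels 0 0 0 0 0 0 0 1 0 0 |features 0 128 255 0'\n",
"parsed = parse_ctf_line(sample)\n",
"\n",
"# The position of the 1 in the one-hot label vector is the digit.\n",
"digit = int(np.argmax(parsed['labels']))\n",
"print(digit, parsed['features'].shape)\n",
"```\n",
"\n",
"In the notebook itself, `create_reader` with the `CTFDeserializer` does this work and also handles batching and randomization."
]
},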
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Ensure the training and test data is generated and available for this tutorial.\n",
"# We search in two locations in the toolkit for the cached MNIST data set.\n",
"data_found = False\n",
"for data_dir in [os.path.join(\"..\", \"Examples\", \"Image\", \"DataSets\", \"MNIST\"),\n",
" os.path.join(\"data\", \"MNIST\")]:\n",
" train_file = os.path.join(data_dir, \"Train-28x28_cntk_text.txt\")\n",
" test_file = os.path.join(data_dir, \"Test-28x28_cntk_text.txt\")\n",
" if os.path.isfile(train_file) and os.path.isfile(test_file):\n",
" data_found = True\n",
" break\n",
"if not data_found:\n",
" raise ValueError(\"Please generate the data by completing CNTK 103 Part A\")\n",
"print(\"Data directory is {0}\".format(data_dir))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='#Model Creation'></a>\n",
"## Model Creation\n",
"\n",
"Our feed forward network will be relatively simple with 2 hidden layers (`num_hidden_layers`) with each layer having 200 hidden nodes (`hidden_layers_dim`). \n",
"\n",
"<img src=\"http://cntk.ai/jup/feedforward_network.jpg\",width=200, height=200>\n",
"\n",
"If you are not familiar with the terms *hidden_layer* and *number of hidden layers*, please refer back to CNTK 102 tutorial.\n",
"\n",
"For this tutorial: The number of green nodes (refer to picture above) in each hidden layer is set to 200 and the number of hidden layers (refer to the number of layers of green nodes) is 2. Fill in the following values:\n",
"- num_hidden_layers\n",
"- hidden_layers_dim\n",
"\n",
"Note: In this illustration, we have not shown the bias node (introduced in the logistic regression tutorial). Each hidden layer would have a bias node."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"num_hidden_layers = 2\n",
"hidden_layers_dim = 400"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Network input and output: \n",
"- **input** variable (a key CNTK concept): \n",
">An **input** variable is a container in which we fill different observations in this case image pixels during model learning (a.k.a.training) and model evaluation (a.k.a. testing). Thus, the shape of the `input_variable` must match the shape of the data that will be provided. For example, when data are images each of height 10 pixels and width 5 pixels, the input feature dimension will be 50 (representing the total number of image pixels). More on data and their dimensions to appear in separate tutorials.\n",
"\n",
"\n",
"**Question** What is the input dimension of your chosen model? This is fundamental to our understanding of variables in a network or model representation in CNTK.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"input = Input(input_dim)\n",
"label = Input(num_output_classes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feed forward network setup\n",
"\n",
"If you are not familiar with the feedforward network, please refer to CNTK 102. In this tutorial we are using the same network. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def create_model(features):\n",
" with default_options(init = glorot_uniform(), activation = C.ops.relu):\n",
" h = features\n",
" for _ in range(num_hidden_layers):\n",
" h = Dense(hidden_layers_dim)(h)\n",
" r = Dense(num_output_classes, activation = None)(h)\n",
" return r\n",
" \n",
"z = create_model(input)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`z` will be used to represent the output of a network.\n",
"\n",
"We introduced sigmoid function in CNTK 102, in this tutorial you should try different activation functions. You may choose to do this right away and take a peek into the performance later in the tutorial or run the preset tutorial and then choose to perform the suggested activity.\n",
"\n",
"\n",
"** Suggested Activity **\n",
"- Record the training error you get with `sigmoid` as the activation function\n",
"- Now change to `relu` as the activation function and see if you can improve your training error\n",
"\n",
"*Quiz*: Different supported activation functions can be [found here][]. Which activation function gives the least training error?\n",
"\n",
"[found here]: https://github.com/Microsoft/CNTK/wiki/Activation-Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Scale the input to 0-1 range by dividing each pixel by 256.\n",
"z = create_model(input/256.0)"
]
},
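{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the computation that `create_model` sets up can be sketched in plain NumPy: two ReLU hidden layers of 400 nodes followed by a linear output layer of 10 nodes, applied to pixels scaled by 1/256. This is a sketch under stated assumptions, not CNTK code: `dense_params` and `forward` are hypothetical helpers, the weights are random stand-ins rather than trained values, the images are fake, and the Glorot-style limit uses the common textbook form, which may differ from the scaling CNTK applies.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"\n",
"def dense_params(fan_in, fan_out):\n",
"    # Glorot-style uniform weights (textbook form, an assumption) and zero bias.\n",
"    limit = np.sqrt(6.0 / (fan_in + fan_out))\n",
"    return rng.uniform(-limit, limit, (fan_in, fan_out)), np.zeros(fan_out)\n",
"\n",
"def forward(x, layers):\n",
"    # ReLU after every hidden layer; the output layer stays linear.\n",
"    h = x\n",
"    for W, b in layers[:-1]:\n",
"        h = np.maximum(0.0, h.dot(W) + b)\n",
"    W, b = layers[-1]\n",
"    return h.dot(W) + b\n",
"\n",
"dims = [784, 400, 400, 10]  # input, two hidden layers, output\n",
"layers = [dense_params(m, n) for m, n in zip(dims[:-1], dims[1:])]\n",
"\n",
"# Three fake images with integer pixel values, scaled into [0, 1).\n",
"x = rng.randint(0, 256, size=(3, 784)) / 256.0\n",
"evidence = forward(x, layers)\n",
"print(evidence.shape)\n",
"```\n",
"\n",
"Each row of `evidence` plays the role of `z` for one image: 10 unnormalized scores, one per digit."
]
},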
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Learning model parameters\n",
"\n",
"Same as the previous tutorial, we use the `softmax` function to map the accumulated evidences or activations to a probability distribution over the classes (Details of the [softmax function][] and other [activation][] functions).\n",
"\n",
"[softmax function]: http://cntk.ai/pythondocs/cntk.ops.html#cntk.ops.softmax\n",
"\n",
"[activation]: https://github.com/Microsoft/CNTK/wiki/Activation-Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training\n",
"\n",
"Similar to CNTK 102, we use minimize the cross-entropy between the label and predicted probability by the network. If this terminology sounds strange to you, please refer to the CNTK 102 for a refresher. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"loss = C.ops.cross_entropy_with_softmax(z, label)"
]
},
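{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what a softmax followed by cross-entropy computes, the NumPy sketch below evaluates both on two made-up evidence vectors. The function names and the toy numbers are assumptions chosen for the example; the notebook itself relies on `C.ops.cross_entropy_with_softmax`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def softmax(evidence):\n",
"    # Subtract the row maximum for numerical stability.\n",
"    shifted = evidence - evidence.max(axis=1, keepdims=True)\n",
"    exp = np.exp(shifted)\n",
"    return exp / exp.sum(axis=1, keepdims=True)\n",
"\n",
"def cross_entropy(evidence, one_hot):\n",
"    # Negative log-probability of the true class, one value per row.\n",
"    probs = softmax(evidence)\n",
"    return -np.sum(one_hot * np.log(probs), axis=1)\n",
"\n",
"# Two toy observations with 3 classes each.\n",
"evidence = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])\n",
"one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])\n",
"\n",
"losses = cross_entropy(evidence, one_hot)\n",
"print(losses)\n",
"```\n",
"\n",
"The second row has equal evidence for all 3 classes, so its loss is log(3): the network is no better than guessing. The loss shrinks as the evidence for the true class grows relative to the others."
]
},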
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluation\n",
"\n",
"In order to evaluate the classification, one can compare the output of the network which for each observation emits a vector of evidences (can be converted into probabilities using `softmax` functions) with dimension equal to number of classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"label_error = C.ops.classification_error(z, label)"
]
},
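{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind the error metric can be sketched in NumPy as follows. This is an illustrative stand-in with made-up numbers, not the CNTK implementation: it picks the highest-scoring class per observation and counts how often that differs from the one-hot label.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def classification_error(evidence, one_hot):\n",
"    # Fraction of rows whose highest-scoring class differs from the label.\n",
"    predicted = np.argmax(evidence, axis=1)\n",
"    actual = np.argmax(one_hot, axis=1)\n",
"    return float(np.mean(predicted != actual))\n",
"\n",
"# Four toy observations with 3 classes each.\n",
"evidence = np.array([[0.1, 2.0, 0.3],\n",
"                     [1.5, 0.2, 0.1],\n",
"                     [0.0, 0.4, 0.9],\n",
"                     [0.3, 0.2, 0.1]])\n",
"one_hot = np.eye(3)[[1, 0, 2, 1]]  # true classes: 1, 0, 2, 1\n",
"\n",
"error = classification_error(evidence, one_hot)\n",
"print(error)\n",
"```\n",
"\n",
"Only the last observation is misclassified (predicted class 0, true class 1), so the error is 1 out of 4."
]
},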
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure training\n",
"\n",
"The trainer strives to reduce the `loss` function by different optimization approaches, [Stochastic Gradient Descent][] (`sgd`) being one of the most popular one. Typically, one would start with random initialization of the model parameters. The `sgd` optimizer would calculate the `loss` or error between the predicted label against the corresponding ground-truth label and using [gradient-decent][] generate a new set model parameters in a single iteration. \n",
"\n",
"The aforementioned model parameter update using a single observation at a time is attractive since it does not require the entire data set (all observation) to be loaded in memory and also requires gradient computation over fewer datapoints, thus allowing for training on large data sets. However, the updates generated using a single observation sample at a time can vary wildly between iterations. An intermediate ground is to load a small set of observations and use an average of the `loss` or error from that set to update the model parameters. This subset is called a *minibatch*.\n",
"\n",
"With minibatches we often sample observation from the larger training dataset. We repeat the process of model parameters update using different combination of training samples and over a period of time minimize the `loss` (and the error). When the incremental error rates are no longer changing significantly or after a preset number of maximum minibatches to train, we claim that our model is trained.\n",
"\n",
"One of the key parameter for optimization is called the `learning_rate`. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration. We will be covering more details in later tutorial. \n",
"With this information, we are ready to create our trainer. \n",
"\n",
"[optimization]: https://en.wikipedia.org/wiki/Category:Convex_optimization\n",
"[Stochastic Gradient Descent]: https://en.wikipedia.org/wiki/Stochastic_gradient_descent\n",
"[gradient-decent]: http://www.statisticsviews.com/details/feature/5722691/Getting-to-the-Bottom-of-Regression-with-Gradient-Descent.html"
]
},
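{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minibatch update described above can be sketched on a toy problem. The example below is an assumption-laden illustration, not the CNTK learner: it fits a 3-parameter least-squares model on synthetic, noise-free data (`X`, `y` and `true_w` are made up), and it borrows the learning rate of 0.2 and minibatch size of 64 used in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"\n",
"# Synthetic data: targets are an exact linear function of the inputs.\n",
"X = rng.randn(1000, 3)\n",
"true_w = np.array([1.0, -2.0, 0.5])\n",
"y = X.dot(true_w)\n",
"\n",
"w = np.zeros(3)\n",
"learning_rate = 0.2\n",
"minibatch_size = 64\n",
"\n",
"for step in range(200):\n",
"    # Sample a minibatch and average the squared-error gradient over it.\n",
"    idx = rng.choice(len(X), minibatch_size, replace=False)\n",
"    residual = X[idx].dot(w) - y[idx]\n",
"    gradient = X[idx].T.dot(residual) / minibatch_size\n",
"    # Move the parameters a small step against the gradient.\n",
"    w -= learning_rate * gradient\n",
"\n",
"print(np.round(w, 3))\n",
"```\n",
"\n",
"Each step touches only 64 of the 1000 observations, yet the parameters approach `true_w`. Averaging over the minibatch is what keeps the individual updates from varying as wildly as single-observation updates would."
]
},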
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Instantiate the trainer object to drive the model training\n",
"learning_rate = 0.2\n",
"lr_schedule = learning_rate_schedule(learning_rate, UnitType.minibatch)\n",
"learner = sgd(z.parameters, lr_schedule)\n",
"trainer = Trainer(z, loss, label_error, [learner])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First let us create some helper functions that will be needed to visualize different functions associated with training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from cntk.utils import get_train_eval_criterion, get_train_loss\n",
"\n",
"# Define a utility function to compute the moving average sum.\n",
"# A more efficient implementation is possible with np.cumsum() function\n",
"def moving_average(a, w=5):\n",
" if len(a) < w:\n",
" return a[:] # Need to send a copy of the array\n",
" return [val if idx < w else sum(a[(idx-w):idx])/w for idx, val in enumerate(a)]\n",
"\n",
"\n",
"# Defines a utility that prints the training progress\n",
"def print_training_progress(trainer, mb, frequency, verbose=1):\n",
" training_loss = \"NA\"\n",
" eval_error = \"NA\"\n",
"\n",
" if mb%frequency == 0:\n",
" training_loss = get_train_loss(trainer)\n",
" eval_error = get_train_eval_criterion(trainer)\n",
" if verbose: \n",
" print (\"Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%\".format(mb, training_loss, eval_error*100))\n",
" \n",
" return mb, training_loss, eval_error"
]
},
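{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comment in `moving_average` mentions that `np.cumsum()` allows a more efficient implementation. One possible version is sketched below. `moving_average_cumsum` is a hypothetical name, and the sketch is meant to mirror the behaviour of the helper above: the first `w` values pass through unchanged, and every later value is replaced by the mean of the `w` entries before it.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def moving_average_cumsum(a, w=5):\n",
"    a = np.asarray(a, dtype=float)\n",
"    if len(a) < w:\n",
"        return a.copy()\n",
"    # c[i] holds the sum of a[:i], so c[i] - c[i - w] is the sum of a[i-w:i].\n",
"    c = np.concatenate(([0.0], np.cumsum(a)))\n",
"    out = a.copy()\n",
"    idx = np.arange(w, len(a))\n",
"    out[w:] = (c[idx] - c[idx - w]) / w\n",
"    return out\n",
"\n",
"values = [1, 2, 3, 4, 5, 6, 7, 8]\n",
"smoothed = moving_average_cumsum(values, w=5)\n",
"print(smoothed)\n",
"```\n",
"\n",
"The cumulative sum is computed once, so each window costs a single subtraction instead of a fresh `sum` over `w` elements."
]
},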
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='#Run the trainer'></a>\n",
"### Run the trainer\n",
"\n",
"We are now ready to train our fully connected neural net. We want to decide what data we need to feed into the training engine.\n",
"\n",
"In this example, each iteration of the optimizer will work on `minibatch_size` sized samples. We would like to train on all 60000 observations. Additionally we will make multiple passes through the data specified by the variable `num_sweeps_to_train_with`. With these parameters we can proceed with training our simple feed forward network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Initialize the parameters for the trainer\n",
"minibatch_size = 64\n",
"num_samples_per_sweep = 60000\n",
"num_sweeps_to_train_with = 10\n",
"num_minibatches_to_train = (num_samples_per_sweep * num_sweeps_to_train_with) / minibatch_size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Create the reader to training data set\n",
"reader_train = create_reader(train_file, True, input_dim, num_output_classes)\n",
"\n",
"# Map the data streams to the input and labels.\n",
"input_map = {\n",
" label : reader_train.streams.labels,\n",
" input : reader_train.streams.features\n",
"} \n",
"\n",
"# Run the trainer on and perform model training\n",
"training_progress_output_freq = 500\n",
"\n",
"plotdata = {\"batchsize\":[], \"loss\":[], \"error\":[]}\n",
"\n",
"for i in range(0, int(num_minibatches_to_train)):\n",
" \n",
" # Read a mini batch from the training data file\n",
" data = reader_train.next_minibatch(minibatch_size, input_map = input_map)\n",
" \n",
" trainer.train_minibatch(data)\n",
" batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq, verbose=1)\n",
" \n",
" if not (loss == \"NA\" or error ==\"NA\"):\n",
" plotdata[\"batchsize\"].append(batchsize)\n",
" plotdata[\"loss\"].append(loss)\n",
" plotdata[\"error\"].append(error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us plot the errors over the different training minibatches. Note that as we iterate the training loss decreases though we do see some intermediate bumps. \n",
"\n",
"Hence, we use smaller minibatches and using `sgd` enables us to have a great scalability while being performant for large data sets. There are advanced variants of the optimizer unique to CNTK that enable harnessing computational efficiency for real world data sets and will be introduced in advanced tutorials. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Compute the moving average loss to smooth out the noise in SGD\n",
"plotdata[\"avgloss\"] = moving_average(plotdata[\"loss\"])\n",
"plotdata[\"avgerror\"] = moving_average(plotdata[\"error\"])\n",
"\n",
"# Plot the training loss and the training error\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.figure(1)\n",
"plt.subplot(211)\n",
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgloss\"], 'b--')\n",
"plt.xlabel('Minibatch number')\n",
"plt.ylabel('Loss')\n",
"plt.title('Minibatch run vs. Training loss')\n",
"\n",
"plt.show()\n",
"\n",
"plt.subplot(212)\n",
"plt.plot(plotdata[\"batchsize\"], plotdata[\"avgerror\"], 'r--')\n",
"plt.xlabel('Minibatch number')\n",
"plt.ylabel('Label Prediction Error')\n",
"plt.title('Minibatch run vs. Label Prediction Error')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation / Testing \n",
"\n",
"Now that we have trained the network, let us evaluate the trained network on the test data. This is done using `trainer.test_minibatch`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Read the training data\n",
"reader_test = create_reader(test_file, False, input_dim, num_output_classes)\n",
"\n",
"test_input_map = {\n",
" label : reader_test.streams.labels,\n",
" input : reader_test.streams.features,\n",
"}\n",
"\n",
"# Test data for trained model\n",
"test_minibatch_size = 512\n",
"num_samples = 10000\n",
"num_minibatches_to_test = num_samples // test_minibatch_size\n",
"test_result = 0.0\n",
"\n",
"for i in range(num_minibatches_to_test):\n",
" \n",
" # We are loading test data in batches specified by test_minibatch_size\n",
" # Each data point in the minibatch is a MNIST digit image of 784 dimensions \n",
" # with one pixel per dimension that we will encode / decode with the \n",
" # trained model.\n",
" data = reader_test.next_minibatch(test_minibatch_size,\n",
" input_map = test_input_map)\n",
"\n",
" eval_error = trainer.test_minibatch(data)\n",
" test_result = test_result + eval_error\n",
"\n",
"# Average of evaluation errors of all test minibatches\n",
"print(\"Average test error: {0:.2f}%\".format(test_result*100 / num_minibatches_to_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note, this error is very comparable to our training error indicating that our model has good \"out of sample\" error a.k.a. generalization error. This implies that our model can very effectively deal with previously unseen observations (during the training process). This is key to avoid the phenomenon of overfitting."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have so far been dealing with aggregate measures of error. Let us now get the probabilities associated with individual data points. For each observation, the `eval` function returns the probability distribution across all the classes. The classifier is trained to recognize digits, hence has 10 classes. First let us route the network output through a `softmax` function. This maps the aggregated activations across the network to probabilities across the 10 classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"out = C.ops.softmax(z)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us a small minibatch sample from the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Read the data for evaluation\n",
"reader_eval = create_reader(test_file, False, input_dim, num_output_classes)\n",
"\n",
"eval_minibatch_size = 25\n",
"eval_input_map = { input : reader_eval.streams.features } \n",
"\n",
"data = reader_test.next_minibatch(eval_minibatch_size, input_map = test_input_map)\n",
"\n",
"img_label = data[label].value\n",
"img_data = data[input].value\n",
"predicted_label_prob = [out.eval(img_data[i,:,:]) for i in range(img_data.shape[0])]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Find the index with the maximum value for both predicted as well as the ground truth\n",
"pred = [np.argmax(predicted_label_prob[i]) for i in range(len(predicted_label_prob))]\n",
"gtlabel = [np.argmax(img_label[i,:,:]) for i in range(img_label.shape[0])]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"Label :\", gtlabel[:25])\n",
"print(\"Predicted:\", pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us visualize some of the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Plot a random image\n",
"sample_number = 5\n",
"plt.imshow(img_data[sample_number].reshape(28,28), cmap=\"gray_r\")\n",
"plt.axis('off')\n",
"\n",
"img_gt, img_pred = gtlabel[sample_number], pred[sample_number]\n",
"print(\"Image Label: \", img_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"**Exploration Suggestion**\n",
"- Try exploring how the classifier behaves with different parameters - suggest changing the `minibatch_size` parameter from 25 to say 64 or 128. What happens to the error rate? How does the error compare to the logistic regression classifier?\n",
"- Suggest trying to increase the number of sweeps\n",
"- Can you change the network to reduce the training error rate? When do you see *overfitting* happening?"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### Code link\n",
"\n",
"If you want to try running the tutorial from Python command prompt please run the [SimpleMNIST.py](https://github.com/Microsoft/CNTK/tree/v2.0.beta9.0/Examples/Image/Classification/MLP/Python) example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:cntk-py34]",
"language": "python",
"name": "conda-env-cntk-py34-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}