"# CNTK 101: Logistic Regression and ML Primer\n",
"\n",
"This tutorial is targeted to individuals who are new to CNTK and to machine learning. In this tutorial, you will train a simple yet powerful machine learning model that is widely used in industry for a variety of applications. The model trained below scales to massive data sets in the most expeditious manner by harnessing computational scalability leveraging the computational resources you may have (one or more CPU cores, one or more GPUs, a cluster of CPUs or a cluster of GPUs), transparently via the CNTK library.\n",
"The following notebook uses Python APIs. If you are looking for this example in BrainScript, please look [here](https://github.com/Microsoft/CNTK/tree/release/2.1/Tutorials/HelloWorld-LogisticRegression). \n",
"A cancer hospital has provided data and wants us to determine if a patient has a fatal [malignant](https://en.wikipedia.org/wiki/Malignancy) cancer vs. a benign growth. This is known as a classification problem. To help classify each patient, we are given their age and the size of the tumor. Intuitively, one can imagine that younger patients and/or patients with small tumors are less likely to have a malignant cancer. The data set simulates this application: each observation is a patient represented as a dot (in the plot below), where red indicates malignant and blue indicates benign. Note: This is a toy example for learning; in real life many features from different tests/examination sources and the expertise of doctors would play into the diagnosis/treatment decision for a patient. "
"Our goal is to learn a classifier that can automatically label any patient into either the benign or malignant categories given two features (age and tumor size). In this tutorial, we will create a linear classifier, a fundamental building-block in deep networks."
"In the figure above, the green line represents the model learned from the data and separates the blue dots from the red dots. In this tutorial, we will walk you through the steps to learn the green line. Note: this classifier does make mistakes, where a couple of blue dots are on the wrong side of the green line. However, there are ways to fix this and we will look into some of the techniques in later tutorials. \n",
"Any learning algorithm typically has five stages. These are Data reading, Data preprocessing, Creating a model, Learning the model parameters, and Evaluating the model (a.k.a. testing/prediction). \n",
">2. Data preprocessing: Often, the individual features such as size or age need to be scaled. Typically, one would scale the data between 0 and 1. To keep things simple, we are not doing any scaling in this tutorial (for details look here: [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling).\n",
">4. Learning the model: This is also known as training. While fitting a linear model can be done in a variety of ways ([linear regression](https://en.wikipedia.org/wiki/Linear_regression), in CNTK we use Stochastic Gradient Descent a.k.a. [SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).\n",
">5. Evaluation: This is also known as testing, where one evaluates the model on data sets with known labels (a.k.a. ground-truth) that were never used for training. This allows us to assess how a model would perform in real-world (previously unseen) observations.\n",
"[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a fundamental machine learning technique that uses a linear weighted combination of features and generates the probability of predicting different classes. In our case, the classifier will generate a probability in [0,1] which can then be compared to a threshold (such as 0.5) to produce a binary label (0 or 1). However, the method shown can easily be extended to multiple classes. \n",
"In the above figure, contributions from different input features are linearly weighted and aggregated. The resulting sum is mapped to a (0, 1) range via a [sigmoid]( https://en.wikipedia.org/wiki/Sigmoid_function) function. For classifiers with more than two output labels, one can use a [softmax](https://en.wikipedia.org/wiki/Softmax_function) function."
"Let us generate some synthetic data emulating the cancer example using the `numpy` library. We have two input features (represented in two-dimensions) and two output classes (benign/blue or malignant/red). \n",
"In our example, each observation (a single 2-tuple of features - age and size) in the training data has a label (blue or red). Because we have two output labels, we call this a binary classification task. "
"In this tutorial we are generating synthetic data using the `numpy` library. In real-world problems, one would use a [reader](https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-and-python---understanding-and-extending-readers), that would read feature values (`features`: *age* and *tumor size*) corresponding to each observation (patient). The simulated *age* variable is scaled down to have a similar range to that of the other variable. This is a key aspect of data pre-processing that we will learn more about in later tutorials. Note: in general, observations and labels can reside in higher dimensional spaces (when more features or classifications are available) and are then represented as [tensors](https://en.wikipedia.org/wiki/Tensor) in CNTK. More advanced tutorials introduce the handling of high dimensional data. "
"**Note**: If the import of `matplotlib.pyplot` fails, please run `conda install matplotlib`, which will fix the `pyplot` version dependencies. If you are on a python environment different from Anaconda, then use `pip install matplotlib`."
"A logistic regression (a.k.a. LR) network is a simple building block, but has powered many ML \n",
"applications in the past decade. LR is a simple linear model that takes as input a vector of numbers describing the properties of what we are classifying (also known as a feature vector, $\\bf{x}$, the blue nodes in the figure below) and emits the *evidence* ($z$) (output of the green node, also known as \"activation\"). Each feature in the input layer is connected to an output node by a corresponding weight $w$ (indicated by the black lines of varying thickness). "
"where $\\bf{w}$ is the weight vector of length $n$ and $b$ is known as the [bias](https://www.quora.com/What-does-the-bias-term-represent-in-logistic-regression) term. Note: we use **bold** notation to denote vectors. \n",
"The computed evidence is mapped to a (0, 1) range using a `sigmoid` (when the outcome can be in one of two possible classes) or a `softmax` function (when the outcome can be in one of more than two possible classes).\n",
">An **input** variable is a user-code-facing container where user-provided code fills in different observations (a data point or sample of data points, equivalent to (age, size) tuples in our example) as inputs to the model function during model learning (a.k.a.training) and model evaluation (a.k.a. testing). Thus, the shape of the `input` must match the shape of the data that will be provided. For example, if each data point was a grayscale image of height 10 pixels and width 5 pixels, the input feature would be a vector of 50 floating-point values representing the intensity of each of the 50 pixels, and could be written as `C.input_variable(10*5, np.float32)`. Similarly, in our example the dimensions are age and tumor size, thus `input_dim` = 2. More on data and their dimensions to appear in separate tutorials. "
"Now that the network is set up, we would like to learn the parameters $\\bf w$ and $b$ for our simple linear layer. To do so we convert, the computed evidence ($z$) into a set of predicted probabilities ($\\textbf p$) using a `softmax` function.\n",
"The `softmax` is an activation function that normalizes the accumulated evidence into a probability distribution over the classes (Details of [softmax](https://www.cntk.ai/pythondocs/cntk.ops.html#cntk.ops.softmax)). Other choices of activation function can be [here](https://cntk.ai/pythondocs/cntk.layers.layers.html#cntk.layers.layers.Activation)."
"The output of the `softmax` is the probabilities of an observation belonging each of the respective classes. For training the classifier, we need to determine what behavior the model needs to mimic. In other words, we want the generated probabilities to be as close as possible to the observed labels. We can accomplish this by minimizing the difference between our output and the ground-truth labels. This difference is calculated by the *cost* or *loss* function.\n",
"where $p$ is our predicted probability from `softmax` function and $y$ is the ground-truth label, provided with the training data. In the two-class example, the `label` variable has two dimensions (equal to the `num_output_classes` or $| \\textbf y |$). Generally speaking, the label variable will have $| \\textbf y |$ elements with 0 everywhere except at the index of the true class of the data point, where it will be 1. Understanding the [details](http://colah.github.io/posts/2015-09-Visual-Information/) of the cross-entropy function is highly recommended."
"In order to evaluate the classification, we can compute the [classification_error](https://www.cntk.ai/pythondocs/cntk.metrics.html#cntk.metrics.classification_error), which is 0 if our model was correct (it assigned the true label the most probability), otherwise 1."
"The trainer strives to minimize the `loss` function using an optimization technique. In this tutorial, we will use [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (`sgd`), one of the most popular techniques. Typically, one starts with random initialization of the model parameters (the weights and biases, in our case). For each observation, the `sgd` optimizer can calculate the `loss` or error between the predicted label and the corresponding ground-truth label, and apply [gradient descent](http://www.statisticsviews.com/details/feature/5722691/Getting-to-the-Bottom-of-Regression-with-Gradient-Descent.html) to generate a new set of model parameters after each observation. \n",
"The aforementioned process of updating all parameters after each observation is attractive because it does not require the entire data set (all observations) to be loaded in memory and also computes the gradient over fewer datapoints, thus allowing for training on large data sets. However, the updates generated using a single observation at a time can vary wildly between iterations. An intermediate ground is to load a small set of observations into the model and use an average of the `loss` or error from that set to update the model parameters. This subset is called a *minibatch*.\n",
"With minibatches we often sample observations from the larger training dataset. We repeat the process of updating the model parameters using different combinations of training samples, and over a period of time minimize the `loss` (and the error). When the incremental error rates are no longer changing significantly, or after a preset maximum number of minibatches have been processed, we claim that our model is trained.\n",
"One of the key parameters of [optimization](https://en.wikipedia.org/wiki/Category:Convex_optimization) is the `learning_rate`. For now, we can think of it as a scaling factor that modulates how much we change the parameters in any iteration. We will cover more details in later tutorials. \n",
"With this information, we are ready to create our trainer. "
"First, let us create some helper functions that will be needed to visualize different functions associated with training. Note: these convenience functions are for understanding what goes on under the hood."
"In this example, each iteration of the optimizer will work on 25 samples (25 dots w.r.t. the plot above) a.k.a. `minibatch_size`. We would like to train on 20000 observations. If the number of samples in the data is only 10000, the trainer will make 2 passes through the data. This is represented by `num_minibatches_to_train`. Note: in a real world scenario, we would be given a certain amount of labeled data (in the context of this example, (age, size) observations and their labels (benign / malignant)). We would use a large number of observations for training, say 70%, and set aside the remainder for the evaluation of the trained model.\n",
"Now that we have trained the network, let us evaluate the trained network on data that hasn't been used for training. This is called **testing**. Let us create some new data and evaluate the average error and loss on this set. This is done using `trainer.test_minibatch`. Note the error on this previously unseen data is comparable to the training error. This is a **key** check. Should the error be larger than the training error by a large margin, it indicates that the trained model will not perform well on data that it has not seen during training. This is known as [overfitting](https://en.wikipedia.org/wiki/Overfitting). There are several ways to address overfitting that are beyond the scope of this tutorial, but the Cognitive Toolkit provides the necessary components to address overfitting.\n",
"Note: we are testing on a single minibatch for illustrative purposes. In practice, one runs several minibatches of test data and reports the average. \n",
"**Question** Why is this suggested? Try plotting the test error over several set of generated data sample and plot using plotting functions used for training. Do you see a pattern?"
"For evaluation, we softmax the output of the network into a probability distribution over the two classes, the probability of each observation being malignant or benign. "
"It is desirable to visualize the results. In this example, the data can be conveniently plotted using two spatial dimensions for the input (patient age on the x-axis and tumor size on the y-axis), and a color dimension for the output (red for malignant and blue for benign). For data with higher dimensions, visualization can be challenging. There are advanced dimensionality reduction techniques, such as [t-sne](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) that allow for such visualizations."
"- Try exploring how the classifier behaves with different data distributions - e.g., changing the `minibatch_size` parameter from 25 to 64. Why is the error increasing?\n",
"Many of the terminologies in machine learning literature can be difficult to map to. Each toolkit / book / paper have their way of referring to the synonymous terminologies. Here are a few CNTK terminologies and their equivalent terms in literature:\n",
"\n",
"- a sequence (in CNTK) is also referred to as an instance\n",
"- a sample (in CNTK) is also referred to as a feature\n",
"- input stream(s) (in CNTK) is also referred to as feature column or a feature\n",
"- the criterion (in CNTK) is also referred to as the loss\n",
"- the evalutaion error (in CNTK) is also referred to as a metric"