EdgeML/cpp/docs/README_PROTONN_OSS.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ProtoNN: Compressed and accurate KNN for resource-constrained devices ([paper](../../docs/publications/ProtoNN.pdf))\n",
    "ProtoNN is an algorithm developed for binary, multiclass and multilabel supervised learning. ProtoNN models are time and memory efficient and are thus ideal for resource-constrained scenarios like Internet of Things (IoT). \n",
    "\n",
    "## Overview of algorithm\n",
    "Suppose a single data-point has **dimension** $D$. Suppose also that the total number of **classes** is $L$. For the most basic version of ProtoNN, there are 2 more user-defined hyper-parameters: the **projection dimension** $d$ and the **number of prototypes** $m$. \n",
    "\n",
    "- ProtoNN learns 3 parameter matrices:\n",
    "    - A **projection matrix** $W$ of dimension $(d,\\space D)$ that projects the datapoints to a small dimension $d$.\n",
    "    - A **prototypes matrix** $B$ that learns $m$ prototypes in the projected space, each $d$-dimensional. $B = [B_1,\\space B_2, ... \\space B_m]$.\n",
    "    - A **prototype labels matrix** $Z$ that learns $m$ label vectors for each of the prototypes to allow a single prototype to represent multiple labels. Each prototype label is $L$-dimensional. $Z = [Z_1,\\space Z_2, ... \\space Z_m]$.\n",
    "\n",
    "- By default, these matrices are dense. However, for high model-size compression, we need to learn sparse versions of the above matrices. The user can restrict the **sparsity of these matrices using the parameters**: $\\lambda_W$, $\\lambda_B$ and $\\lambda_Z$.\n",
    "    - $||W||_0 < \\lambda_W \\cdot size(W) = \\lambda_W \\cdot d \\cdot D$\n",
    "    - $||B||_0 < \\lambda_B \\cdot size(B) = \\lambda_B \\cdot d \\cdot m$\n",
    "    - $||Z||_0 < \\lambda_Z \\cdot size(Z) = \\lambda_Z \\cdot L \\cdot m$ \n",
    "\n",
    "- ProtoNN also assumes an **RBF-kernel parametrized by a single parameter:** $\\gamma$, which can be inferred heuristically from data, or be specified by the user.\n",
    "\n",
    "More details about the ProtoNN prediction function, the training algorithm, and pointers on how to tune hyper-parameters are suspended to the end of this Readme for better readability. \n",
    "\n",
    "\n",
    "## Training ProtoNN\n",
    "Follow the instructions on the main Readme to compile and create an executable `ProtoNN` \n",
    "##### A sample execution with 10-class USPS\n",
    "After creating the executable, we download a sample dataset: **USPS10**. Instructions for this can be found on the main README. To execute ProtoNN on this dataset, run the following script in bash:\n",
    "```bash\n",
    "sh run_ProtoNN_usps10.sh\n",
    "```\n",
    "This should give you output on screen as described in the output section. The final test accuracy will be about 93.4 with the specified parameters. \n",
    "\n",
    "##### Format of data files\n",
    "Data files can exist in one of the following two formats: \n",
    "- **Tab-separated (tsv)**: This is only supported for multiclass and binary datasets, not multilabel ones. The file should have $N$ rows and $D+1$ columns, where $N$ is the number of data-points and $D$ is the dimensionality of each point. Columns should be separated by _tabs_. The first column contains the label, which must be a natural number between $1$ and $L$. The rest of the $D$ columns contain the data which are real numbers.\n",
    "- **Libsvm format**: See https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The labels should be between $1$ and $L$, and the indices should be between $1$ and $D$. The sample **USPS-10** dataset uses this format. \n",
    "\n",
    "The following flag in `config.mk` changes this behavior: \n",
    "    \n",
    "    ZERO_BASED_IO: The expected label range becomes 0 ... (L-1), and the expected feature range becomes 0 ... (D-1). \n",
    "The number of lines in train and validation/test data files, the dimension of the data, and the number of labels will _not_ be inferred automatically. They must be specified as described below. \n",
    "\n",
    "##### Specifying parameters and executing\n",
    "To specify hyper-parameters for ProtoNN as well as metadata such as the location of the dataset, input format, etc., one has to write a bash script akin to the sample script at `run_ProtoNN_usps10.sh`. \n",
    "\n",
    "Once ProtoNN is compiled, we execute it via this script: \n",
    "```bash\n",
    "sh run_ProtoNN_usps10.sh\n",
    "```\n",
    "This bash script is a config file as well as an execution script. There are a number of hyper-parameters to specify, so we split them into categories as described below. Consult `run_ProtoNN_usps10.sh` for a sample. The value in the bracket indicates the command line flag used to set the given hyperparameter. \n",
    "\n",
    "##### Input-output parameters\n",
    "- **-C**: Problem format. Specify one of:\n",
    "    - 0 (binary)\n",
    "    - 1 (multiclass)\n",
    "    - 2 (multilabel)\n",
    "- **-I**: File that contains training data. \n",
    "- **-V**: [Optional] File that contains validation/test data. \n",
    "- **-O**: Folder to store output (see output section below). \n",
    "- **-F**: Input format for data (described above):\n",
    "    - 0 (libsvm format)\n",
    "    - 1 (tab-separated format)\n",
    "- **-P**: Option to load a predefined model. [**Default:** 0]. Specify as 1 if pre-loading initial values of matrices $W$, $B$, $Z$. One can use this option to initialize with the output of a previous run, or with SLEEC, LMNN, etc. All three matrices, should be present in a single directory in tsv format. The directory is specified with the `-M` flag (see next). The values of the parameters $d$, $D$, $L$ will _not_ be inferred, and must be specified correctly in the rest of the fields. The filenames and dimensions of the matrices should be as follows: \n",
    "    - $W$: Filename: \"W\". Dimension: ($d$, $D$). \n",
    "    - $B$: Filename: \"B\". Dimension: ($d$, $m$). \n",
    "    - $Z$: Filename: \"Z\". Dimension: ($L$, $m$).\n",
    "    - $\\gamma$: Filename: \"gamma\". A single number representing the RBF kernel parameter.\n",
    "- **-M**: Folder that contains the predefined model files.\n",
    "\t\n",
    "\n",
    "##### Data-dependent parameters\n",
    "- **-r**: Number of training points.\n",
    "- **-v**: Number of validation/test points.\n",
    "- **-D**: The original dimension of the data.\n",
    "- **-l**: Number of classes.\n",
    "\n",
    "##### ProtoNN hyper-parameters (required)\n",
    "- **-d**: Projection dimension (the dimension into which the data is projected). [**Default:** $15$]\n",
    "- **Specify only one of the -m and the -k flags:**\n",
    "    - **-m**: Number of Prototypes. On specifying this parameter, the initialization of prototypes is done by clustering the training data in projected space. [**Default:** $20$]\n",
    "    - **-k**: Number of Prototypes Per Class. On specifying this parameter, initialization of prototypes is done by performing k-means clustering separately for each class, to identify $k$ different prototypes for each class. Thus, $m$ is automatically set to $L\\cdot k$.\n",
    "\n",
    "##### ProtoNN hyper-parameters (optional)\n",
    "- **-W**: Projection sparsity ($\\lambda_W$). [**Default:** $1.0$]\n",
    "- **-B**: Prototype sparsity. [**Default:** $1.0$]\n",
    "- **-Z**: Label Sparsity. [**Default:** $1.0$]\n",
    "- **-g**: GammaNumerator.\n",
    "    - On setting GammaNumerator, the RBF kernel parameter $\\gamma$ is set as;\n",
    "    - $\\gamma = (2.5 \\cdot GammaNumerator)/(median(||B_j,W - X_i||_2^2))$\n",
    "    - **Default:** $1.0$\n",
    "- **-N**: Normalization. Specify one of: \n",
    "    - **Default**: 0  (no normalization)\n",
    "    - 1 (min-max normalization wherein each feature is linearly scaled to lie with 0 and 1)\n",
    "    - 2 (l2-normalization wherein each data-point is normalized to unit l2-norm)\n",
    "- **-R**: A random number seed which can be used to re-generate previously obtained experimental results. [**Default:** $42$]\n",
    "\n",
    "##### ProtoNN optimization hyper-parameters (optional)\n",
    "- **-b**: Batch size for mini-batch stochastic gradient descent. [**Default:** $1024$]\n",
    "- **-T**: Total number of optimization iterations. [**Default:** $20$]\n",
    "- **-E**: Number of epochs (complete see-through's) of the data for each iteration, and each parameter. [**Default:** $20$] \n",
    "\n",
    "##### Final Execution\n",
    "The script in this section combines all the specified hyper-parameters to create an execution command. This command is printed to stdout, and then executed.\n",
    "Most users should copy this section directly to all their ProtoNN execution scripts without change. We provide a single option here that is commented out by default: \n",
    "- **gdb --args**: Run ProtoNN with given hyper-parameters in debug mode using gdb. \n",
    "\n",
    "## Testing a trained model\n",
    "Once a ProtoNN model has been trained, one can test it on a new dataset. \n",
    "##### A sample execution with 10-class USPS\n",
    "The model trained using the sample script mentioned before can be tested with the following script: \n",
    "```bash\n",
    "sh run_ProtoNNPredict_usps10.sh\n",
    "```\n",
    "\n",
    "##### Explanation of parameters: \n",
    "- **-I**: File that contains test data. \n",
    "- **-M**: The model file in non-human readable format that is output on running ProtoNN. \n",
    "- **-n**: Normalization files if data was normalized when ProtoNN was trained. \n",
    "- **-O**: Folder to store output (see output section below). \n",
    "- **-F**: Input format for data (same as described in training section).\n",
    "- **-e**: Number of test points.\n",
    "- **-b**: [Optional] If unspecified, testing happens for each data-point separately (to simulate a real-world scenario). For faster prediction when prototyping, use the parameter to specify a batch on which prediction happens in one go. \n",
    "\t\n",
    "\n",
    "\n",
    "## Disclaimers\n",
    "- The training data is not automatically shuffled in the code. If possible, **pre-shuffle** the data before passing to ProtoNN. For instance, all examples of a single class should not occur consecutively.\n",
    "- **Normalization**: Ideally, the user should provide **standardized** (Mean-Variance normalized) data. If this is not possible, use one of the normalization options that we provide. The code may be unstable in the absence of normalization.\n",
    "- The results on various datasets as reported in the ProtoNN paper were using **Gradient Descent** as the optimization algorithm, whereas this repository uses **Stochastic Gradient Descent**. It is possible that the results don't match exactly. We will publish an update to this repository with Gradient Descent implemented. \n",
    "- We do not provide support for **Cross-Validation**, only **Train-Test** style runs. The user can write a bash wrapper to perform Cross-Validation. \n",
    "\n",
    "## Interpreting the output\n",
    "##### Output of ProtoNNTrainer:\n",
    "- The following information is printed to **std::cout**: \n",
    "    - The chosen value of $\\gamma$.\n",
    "    - **Training, testing accuracy, and training objective value**, thrice for each iteration, once after optimizing each parameter. For multilabel problems, **prec@1** is output instead.  \n",
    "    - On enabling the `VERBOSE` flag in `config.mk`, additional informative output is printed to stdout. \n",
    "\n",
    "- **Errors and warnings** are printed to **std::cerr**. \n",
    "\n",
    "- Additional **parameter dumps**, **timer logs** and other **debugging logs** will be placed in the output folder specified with the `-O` flag above. The user should have read-write permissions on the folder. \n",
    "    -  On execution, a folder will be created in the output directory that will indicate to the user the list of parameters with which the run was instantiated. In this folder, upto **7 files** and **2 folders** will be created depending on which flags are set in `config.mk`: \n",
    "    - **runInfo**: This file contains the hyperparameters and meta-information for the respective instantiation of ProtoNN. It also shows you the exact bash script call that was made, which is helpful for reproducing results purposes. Additionally, the training, testing accuracy and objective value at the end of each iteration is printed in a readable format. **This file is created at the end of the ProtoNN optimization.**\n",
    "    - **W, B, Z**: These files contain the learnt parameter matrices $W$, $B$ and $Z$ in human-readable tsv format. The dimensions of storage are $(d, D)$, $(d, m)$ and $(L, m)$ respectively. **These files are created at the end of the ProtoNN optimization.**\n",
    "    - **gamma**: This file contains a single number, the chosen value of $\\gamma$, the RBF kernel parameter. **This file is created at the end of the ProtoNN optimization.**\n",
    "    - **model**: This file contains the final trained model with all the parameters and hyperparameters in a non-human readable format. This is to facilitate the prediction code. **This file is created at the end of the ProtoNN optimization.**\n",
    "    - **diagnosticLog**: Created on using the `LOGGER` or `LIGHT_LOGGER` flags. This file stores logging information such as the call flow of ProtoNN and the min, max, norms of various matrices. This is mainly for debugging/optimization purposes and requires a more detailed understanding of the code to interpret. It may contain useful information if your code did not run as expected. **The diagnosticLog file is populated synchronously while the ProtoNN optimization is executing.** \n",
    "    - **timerLog**: Created on using the `TIMER` flag. This file stores proc time and wall time taken to execute various function calls in the code. Indicates the degree of parallelization and is useful for identifying bottlenecks to optimize the code. On specifying the `CONCISE` flag, timing information will only be printed if running time is higher than a threshold specified in `src/common/timer.cpp`. **The timerLog file is populated synchronously while the ProtoNN optimization is executing.**  \n",
    "    - **dump**: A folder that is created on using the `DUMP` flag. The parameter matrices are outputted after each iteration in this folder. **This folder is populated synchronously while the ProtoNN optimization is executing.**\n",
    "    - **verify**: A folder that is created on using the `VERIFY` flag. Code for backward verification with legacy Matlab code. **This folder is populated synchronously while the ProtoNN optimization is executing.**\n",
    "\n",
    "The files **W, B, Z**, and **gamma** can be used to continue training of ProtoNN by initializing with these previously learned matrices. Use the **-P** option for this (see above). On doing so, the starting train/test accuracies should match the final accuracy as specified in the runInfo file. \n",
    "\n",
    "##### Output of ProtoNNPredictor:\n",
    "On execution, the test accuracy, or precision@1,3,5 will be output to stdout. Additionally, a folder will be created in the output directory that will indicate to the user the list of parameters with which the model model to be tested was trained. In this folder, there will be one file detailedPrediction. This file contains for each test point the true labels of that point as well as the scores of the top 5 predicted labels. \n",
    "\n",
    "## Choosing hyperparameters\n",
    "##### Model size as a function of hyperparameters\n",
    "The user presented with a model-size budget has to make a decision regarding the following 5 hyper-parameters: \n",
    "- The projection dimension $d$\n",
    "- The number of prototypes $m$\n",
    "- The 3 sparsity parameters: $\\lambda_W$, $\\lambda_B$, $\\lambda_Z$\n",
    " \n",
    "Each parameter requires the following number of non-zero values for storage:\n",
    "- $S_W: min(1, 2\\lambda_W) \\cdot d \\cdot D$\n",
    "- $S_B: min(1, 2\\lambda_B) \\cdot d \\cdot m$\n",
    "- $S_Z: min(1, 2\\lambda_Z) \\cdot L \\cdot m$\n",
    "\n",
    "The factor of 2 is for storing the index of a sparse matrix, apart from the value at that index. Clearly, if a matrix is more than 50% dense ($\\lambda > 0.5$), it is better to store the matrix as dense instead of incurring the overhead of storing indices along with the values. Hence the minimum operator. \n",
    "Suppose each value is a single-precision floating point (4 bytes), then the total space required by ProtoNN is $4\\cdot(S_W + S_B + S_Z)$. This value is computed and output to screen on running ProtoNN. \n",
    "\n",
    "##### Pointers on choosing hyperparameters\n",
    "Choosing the right hyperparameters may seem to be a daunting task in the beginning but becomes much easier with a little bit of thought. To get an idea of default parameters on some sample datasets, see the ([paper](../../docs/publications/ProtoNN.pdf)). Few rules of thumb:\n",
    "- $S_B$ is typically small, and hence $\\lambda_B \\approx 1.0$. \n",
    "- One can set $m$ to $min(10\\cdot L, 0.01\\cdot numTrainingPoints)$, and $d$ to $15$ for an initial experiment. Typically, you want to cross-validate for $m$ and $d$. \n",
    "- Depending on $L$ and $D$, $S_W$ or $S_Z$ is the biggest contributors to model size. $\\lambda_W$ and $\\lambda_Z$ can be adjusted accordingly or cross-validated for. \n",
    "\n",
    "## Additional ProtoNN flags \n",
    "#### [Beta] Alternative optimization routine\n",
    "One can use the `BTLS` flag in `src/ProtoNN/Makefile` (variable PROTONN_FLAGS) to enable optimization with [Back Tracking Line Search]. This is a faster and more stable optimization route. \n",
    "\n",
    "#### [Beta] ProtoNN for Extreme Multilabel Learning (XML)\n",
    "[XML](http://manikvarma.org/downloads/XC/XMLRepository.html) refers to a difficult class of multi-label learning problems where the number of labels is large (ranging from a few thousands to a few millions). ProtoNN has been written to be compatible with these datasets. This mode can be enabled by the `XML` flag. \n",
    "\n",
    "## Formal details\n",
    "##### Prediction function\n",
    "ProtoNN predicts on a new test-point in the following manner. For a test-point $X$, ProtoNN computes the following $L$ dimensional score vector:\n",
    "$Y_{score}=\\sum_{j=0}^{m}\\space \\left(RBF_\\gamma(W\\cdot X,B_j)\\cdot Z_j\\right)$, where\n",
    "$RBF_\\gamma (U, V) = exp\\left[-\\gamma^2||U - V||_2^2\\right]$\n",
    "The prediction label is then $\\space max(Y_{score})$. \n",
    "\n",
    "##### Training \n",
    "While training, we are presented with training examples $X_1, X_2, ... X_n$ along with their label vectors $Y_1, Y_2, ... Y_n$ respectively. $Y_i$ is an L-dimensional vector that is $0$ everywhere, except the component to which the training point belongs, where it is $1$.  For example, for a $3$ class problem, for a data-point that belongs to class $2$, $Y=[0, 1, 0]$. \n",
    "We optimize the $l_2$-square loss over all training points as follows:  $\\sum_{i=0}^{n} = ||Y_i-\\sum_{j=0}^{m}\\space \\left(exp\\left[-\\gamma^2||W\\cdot X_i - B_j||^2\\right]\\cdot Z_j\\right)||_2^2$. \n",
    "While performing stochastic gradient descent, we hard threshold after each gradient update step to ensure that the three memory constraints (one each for $\\lambda_W, \\lambda_B, \\lambda_Z$) are satisfied by the matrices $W$, $B$ and $Z$. \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}