Mirror of https://github.com/microsoft/EdgeML.git

Commit 03306088a3 (parent 4d13b87e18): Adding and updating markdown files

Changed file: README.md

## Edge Machine Learning

This repository provides code for machine learning algorithms for edge devices developed at the [Microsoft Research India Lab](https://www.microsoft.com/en-us/research/project/resource-efficient-ml-for-the-edge-and-endpoint-iot-devices/).

Machine learning models need to have a small footprint in terms of battery, storage, and latency to be deployed on edge devices. One example of a ubiquitous real-world application where such models are desirable is resource-scarce devices and sensors in the Internet of Things (IoT) setting. To make real-time predictions locally on IoT devices without connecting to the cloud, we need models that fit in a few kilobytes.

This repository contains two such algorithms, **Bonsai** and **ProtoNN**, that shine in this setting. These algorithms can train models for classical supervised learning problems with memory requirements that are orders of magnitude lower than other modern ML algorithms. The trained models can be loaded onto edge and IoT devices/sensors and used to make fast, precise, and accurate predictions completely offline.

For technical details, please see the ICML'17 publications on the [Bonsai](publications/Bonsai.pdf) and [ProtoNN](publications/ProtoNN.pdf) algorithms.

Contributors: Initial contributions were written by Chirag Gupta, [Aditya Kusupati](https://adityakusupati.github.io/), Ashish Kumar, and [Harsha Vardhan Simhadri](http://harsha-simhadri.org).

We welcome contributions, comments, and criticism. For questions, please send an [email](mailto:harshasi@microsoft.com).

### Requirements

- Linux. We developed the code on Ubuntu 16.04 LTS. The code can also be compiled on Windows with Visual Studio, but this release does not include the necessary makefiles yet. For Windows 10 Anniversary Update or later, one can also use the Windows Subsystem for Linux.
- gcc version 5.4. Other gcc versions above 5.0 could also work.
- An implementation of BLAS, sparse BLAS, and vector math calls. We link with the implementation provided by the [Intel(R) Math Kernel Library](https://software.intel.com/en-us/mkl). Please use a recent version (2017v3+) of MKL as far as possible. The code can be made to work with other math libraries with a few modifications.

### Building

After cloning this repository, set the compiler and flags appropriately in `config.mk` and do:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<MKL_PATH>:<EDGEML_ROOT>
make -j
```
Typically, MKL_PATH = /opt/intel/mkl/lib/intel64_lin/, and EDGEML_ROOT is '.'.

This will build two executables, _Bonsai_ and _ProtoNN_. Sample data to try these executables is not included in this repository.

### Download a sample dataset

```bash
mv usps train.txt
mv usps.t test.txt
cd <EDGEML_ROOT>
```
This will create a sample train and test dataset, on which you can train and test the Bonsai and ProtoNN algorithms. For detailed instructions, see the [Bonsai Readme](README_BONSAI_OSS.md) and [ProtoNN Readme](README_PROTONN_OSS.md).

### Makefile flags

Available flags include LIGHT_LOGGER, VERBOSE, and MKL_PAR/SEQ.

Currently, MKL_SEQ_LDFLAGS is the default for _Bonsai_; one can enable the parallel MKL flag using MKL_PAR_LDFLAGS in the main Makefile. Also, floating point defaults to SINGLE precision, but it can be changed to DOUBLE in `config.mk`.

### Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

---

# Bonsai

[Bonsai](publications/Bonsai.pdf) is a novel tree-based algorithm for efficient prediction on IoT devices – such as those based on the Arduino Uno board, which has an 8-bit ATmega328P microcontroller operating at 16 MHz with no native floating-point support, 2 KB RAM, and 32 KB read-only flash.

Bonsai maintains prediction accuracy while minimizing model size and prediction costs by:

(a) developing a tree model which learns a single, shallow, sparse tree with powerful nodes;
(b) sparsely projecting all data into a low-dimensional space in which the tree is learnt;
(c) jointly learning all tree and projection parameters.

Experimental results on multiple benchmark datasets demonstrate that Bonsai can make predictions in milliseconds even on slow microcontrollers, can fit in a few KB of memory, and has lower battery consumption than all other algorithms, while achieving prediction accuracies that can be as much as 30% higher than state-of-the-art methods for resource-efficient machine learning. Bonsai also generalizes to resource-constrained settings beyond IoT: it produces significantly better search results than Bing's L3 ranker when the model size is restricted to 300 bytes.

## Algorithm

Bonsai learns a balanced tree of user-specified height `h`.

The parameters that need to be learnt include:

(a) Z: the sparse projection matrix;
(b) θ = [θ_1, ..., θ_{2^h − 1}]: the parameters of the branching function at each internal node;
(c) W = [W_1, ..., W_{2^{h+1} − 1}] and V = [V_1, ..., V_{2^{h+1} − 1}]: the predictor parameters at each node.

We formulate a joint optimization problem to train all the parameters using the following three-phase training routine:

(a) Unconstrained gradient descent: train all the parameters without any budget constraint.
(b) Iterative Hard Thresholding (IHT): apply hard thresholding to the parameters repeatedly while training (see the sketch below).
(c) Training with constant support: after the IHT phase, the support (budget) for the parameters is fixed and the surviving entries are trained further.

We use simple batch gradient descent as the solver, with the Armijo rule as the step-size selector.

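The hard-thresholding step used in phases (b) and (c) keeps only the largest-magnitude entries of each parameter matrix, up to its sparsity budget, and zeroes the rest. Below is a minimal NumPy sketch of this projection step; the function name and budget convention are illustrative, not the actual C++ implementation.

```python
import numpy as np

def hard_threshold(mat, sparsity):
    """Keep the ceil(sparsity * mat.size) largest-magnitude entries of mat
    and zero out everything else (illustrative IHT projection step)."""
    k = int(np.ceil(sparsity * mat.size))
    if k >= mat.size:
        return mat
    # Threshold at the k-th largest magnitude.
    thresh = np.partition(np.abs(mat).ravel(), -k)[-k]
    return np.where(np.abs(mat) >= thresh, mat, 0.0)

# Example: keep roughly 20% of the entries of a random 10x10 matrix.
Z = np.random.randn(10, 10)
Z_sparse = hard_threshold(Z, 0.2)
print(np.count_nonzero(Z_sparse))  # about 20 entries survive
```
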
## Prediction

Given an input feature vector x, Bonsai makes a prediction as follows:

(a) The data is projected onto a low-dimensional space by computing x̂ = Zx.
(b) The final Bonsai prediction score is the sum of the nonlinear scores ( W_k x̂ * tanh(σ · V_k x̂) ) predicted by each of the individual nodes k along the path traversed through the Bonsai tree (see the sketch below).

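A minimal NumPy sketch of this prediction rule for the binary case, assuming the tree nodes are stored as flat arrays in breadth-first order; the storage layout and names are illustrative, not those of the _Bonsai_ executable.

```python
import numpy as np

def bonsai_predict_score(x, Z, W, V, Theta, sigma, height):
    """Score a single example with a Bonsai tree of the given height.

    Z     : (d, D) sparse projection matrix
    W, V  : (num_nodes, d) per-node predictor parameters (binary case)
    Theta : (num_internal, d) per-internal-node branching parameters
    Nodes are indexed in breadth-first order: children of k are 2k+1, 2k+2.
    """
    x_hat = Z @ x                      # (a) project into d dimensions
    num_internal = 2 ** height - 1
    score, k = 0.0, 0
    while True:
        # (b) every node on the path contributes W_k x_hat * tanh(sigma * V_k x_hat)
        score += (W[k] @ x_hat) * np.tanh(sigma * (V[k] @ x_hat))
        if k >= num_internal:          # reached a leaf node
            break
        # branch left or right on the sign of Theta_k x_hat
        k = 2 * k + 1 if (Theta[k] @ x_hat) > 0 else 2 * k + 2
    return score                       # sign(score) gives the binary label
```
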
## Usage

./Bonsai [Options] DataFolder

Options:

-F   : [Required] Number of features in the data.
-C   : [Required] Number of classification classes/labels.
-nT  : [Required] Number of training examples.
-nE  : [Required] Number of examples in the test file.
-O   : [Optional] Flag to indicate if labels are one-indexed (default: 0).
-f   : [Optional] Input format. Takes two values [0 and 1]: 0 for libsvm format (default), 1 for tab/space-separated input.

-P   : [Optional] Projection dimension (default: 10; try [5, 20, 30, 50]).
-D   : [Optional] Depth of the Bonsai tree (default: 3; try [2, 4, 5]).
-S   : [Optional] sigma = parameter for sigmoid sharpness (default: 1.0; try [3.0, 0.05, 0.005]).

-lW  : [Optional] lambda_W = regularizer for classifier parameter W (default: 0.0001; try [0.01, 0.001, 0.00001]).
-lT  : [Optional] lambda_Theta = regularizer for kernel parameter Theta (default: 0.0001; try [0.01, 0.001, 0.00001]).
-lV  : [Optional] lambda_V = regularizer for kernel parameter V (default: 0.0001; try [0.01, 0.001, 0.00001]).
-lZ  : [Optional] lambda_Z = regularizer for kernel parameter Z (default: 0.00001; try [0.001, 0.0001, 0.000001]).

Use the sparsity parameters to vary the model size for a given tree depth and projection dimension:

-sW  : [Optional] sparsity_W = sparsity of classifier parameter W (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
-sT  : [Optional] sparsity_Theta = sparsity of kernel parameter Theta (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
-sV  : [Optional] sparsity_V = sparsity of kernel parameter V (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
-sZ  : [Optional] sparsity_Z = sparsity of kernel parameter Z (default: 0.2; try [0.1, 0.3, 0.4, 0.5]).

-I   : [Optional] Number of passes through the dataset (default: 40; try [100, 30, 60]).
-B   : [Optional] Batch factor (default: 1; try [2.5, 10, 100]). Float factor multiplied with sqrt(nT) to set batch_size = min(max(100, B*sqrt(nT)), nT).

DataFolder : [Required] Path to the folder containing the data, with filenames 'train.txt' and 'test.txt'.

Note: libsvm format can be either zero- or one-indexed in labels. The space/tab-separated format has to be zero-indexed in labels by design.

## Data Format

(a) "train.txt" is the train data file with the label followed by features on each line; "test.txt" is the test data file in the same format.
(b) They can be either in libsvm format or in a simple tab/space-separated format.
(c) Shuffle the "train.txt" file before feeding it in, and ensure that all instances of a single class are not grouped together (see the sketch below).

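A minimal NumPy sketch of loading and shuffling a tab/space-separated file in this layout (zero-indexed label first, features after); the file name and the use of NumPy are illustrative.

```python
import numpy as np

# Load a space/tab-separated file: first column is the (zero-indexed) label,
# the remaining columns are the features.
data = np.loadtxt("train.txt")
labels, features = data[:, 0].astype(int), data[:, 1:]
print(labels.shape, features.shape)

# Shuffle once so that instances of a single class are not grouped together,
# then write the file back in the same label-followed-by-features layout.
rng = np.random.default_rng(0)
perm = rng.permutation(len(data))
np.savetxt("train.txt", data[perm], fmt="%g", delimiter="\t")
```
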
## Running on USPS-10

Following the instructions in the [common readme](README.md) will give you a binary for Bonsai and a folder called usps10 with train and test datasets. Now run the script:

```bash
sh run_Bonsai_usps10.sh
```

This should give you output as described in the next section. Test accuracy will be about 94.07% with the specified parameters.

## Output

The DataFolder will have a new folder named Results with the following files in it:

(a) A directory for each run, named with the signature hrs_min_sec_day_month, containing:
  (1) loadableModel - a char file which can be loaded directly using the built-in load-model functions
  (2) loadableMeanVar - a char file which can be loaded directly using the built-in load mean-var functions
  (3) predClassAndScore - a file with the prediction score and predicted class for each data point in the test set
  (4) runInfo - a file with the hyperparameters for that run of Bonsai, along with the test accuracy and total non-zeros in the model
  (5) Params - a directory with human-readable files for Z, W, V, Theta, Mean, and Variance
(b) A file resultDump which has consolidated results and a map to the respective run directories

## Notes

(a) As of now, there is no support for multi-label classification, ranking, or regression in Bonsai.
(b) Model size = 8 * totalNonZeros bytes: 4 bytes to store the index and 4 bytes to store the value of each non-zero entry of the sparse model.
(c) We do not provide support for cross-validation; support exists only for train-test splits. The user can write a bash wrapper to perform cross-validation.

---

# Bonsai

[Bonsai](http://proceedings.mlr.press/v70/kumar17a/kumar17a.pdf) is a novel tree-based algorithm for efficient prediction on IoT devices – such as those based on the Arduino Uno board, which has an 8-bit ATmega328P microcontroller operating at 16 MHz with no native floating-point support, 2 KB RAM, and 32 KB read-only flash.

Bonsai maintains prediction accuracy while minimizing model size and prediction costs by:
(a) developing a tree model which learns a single, shallow, sparse tree with powerful nodes;
(b) sparsely projecting all data into a low-dimensional space in which the tree is learnt;
(c) jointly learning all tree and projection parameters.

Experimental results on multiple benchmark datasets demonstrate that Bonsai can make predictions in milliseconds even on slow microcontrollers, can fit in a few KB of memory, and has lower battery consumption than all other algorithms, while achieving prediction accuracies that can be as much as 30% higher than state-of-the-art methods for resource-efficient machine learning.

Bonsai also generalizes to other resource-constrained settings beyond IoT: it produces significantly better search results than Bing's L3 ranker when the model size is restricted to 300 bytes.

## Algorithm

Bonsai learns a balanced tree of user-specified height `h`.

The parameters that need to be learnt include:
(a) Z: the sparse projection matrix;
(b) θ = [θ_1, ..., θ_{2^h − 1}]: the parameters of the branching function at each internal node;
(c) W = [W_1, ..., W_{2^{h+1} − 1}] and V = [V_1, ..., V_{2^{h+1} − 1}]: the predictor parameters at each node.

We formulate a joint optimization problem to train all the parameters using a three-phase training routine:
(a) Unconstrained gradient descent: train all the parameters without any budget constraint.
(b) Iterative Hard Thresholding (IHT): apply hard thresholding to the parameters repeatedly while training.
(c) Training with constant support: after the IHT phase, the support (budget) for the parameters is fixed and the surviving entries are trained further.

We use simple batch gradient descent as the solver, with the Armijo rule as the step-size selector.

## Prediction

Given an input feature vector x, Bonsai makes a prediction as follows:

(a) The data is projected onto a low-dimensional space by computing x̂ = Zx.
(b) The final Bonsai prediction score is the sum of the nonlinear scores ( W_k x̂ * tanh(σ · V_k x̂) ) predicted by each of the individual nodes k along the path traversed through the Bonsai tree.

## Parameters and HyperParameters

pd     : Projection dimension (default: 10; try [5, 20, 30, 50]).
td     : Depth of the Bonsai tree (default: 3; try [2, 4, 5]).
s      : sigma = parameter for sigmoid sharpness (default: 1.0; try [3.0, 0.05, 0.005]).

rw     : lambda_W = regularizer for classifier parameter W (default: 0.0001; try [0.01, 0.001, 0.00001]).
rTheta : lambda_Theta = regularizer for kernel parameter Theta (default: 0.0001; try [0.01, 0.001, 0.00001]).
rv     : lambda_V = regularizer for kernel parameter V (default: 0.0001; try [0.01, 0.001, 0.00001]).
rz     : lambda_Z = regularizer for kernel parameter Z (default: 0.00001; try [0.001, 0.0001, 0.000001]).

Use the sparsity parameters to vary the model size:
sw     : sparsity_W = sparsity of classifier parameter W (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
sTheta : sparsity_Theta = sparsity of kernel parameter Theta (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
sv     : sparsity_V = sparsity of kernel parameter V (default: 1.0 for binary, else 0.2; try [0.1, 0.3, 0.4, 0.5]).
sz     : sparsity_Z = sparsity of kernel parameter Z (default: 0.2; try [0.1, 0.3, 0.4, 0.5]).

iter   : Number of passes through the dataset (default: 40; try [100, 30, 60]).

---

# ProtoNN: Compressed and accurate KNN for resource-constrained devices ([paper](publications/ProtoNN.pdf))

Suppose a single data-point has **dimension** $$D$$. Suppose also that the total number of **classes** is $$L$$. For the most basic version of ProtoNN, there are 2 more user-defined hyper-parameters: the **projection dimension** $$d$$ and the **number of prototypes** $$m$$.

- ProtoNN learns 3 parameter matrices:
  - A **projection matrix** $$W$$ of dimension $$(d,\space D)$$ that projects the datapoints to a small dimension $$d$$.
  - A **prototypes matrix** $$B$$ that learns $$m$$ prototypes in the projected space, each $$d$$-dimensional. $$B = [B_1,\space B_2, ... \space B_m]$$.
  - A **prototype labels matrix** $$Z$$ that learns a label vector for each of the $$m$$ prototypes, allowing a single prototype to represent multiple labels. Each prototype label is $$L$$-dimensional. $$Z = [Z_1,\space Z_2, ... \space Z_m]$$.

- By default, these matrices are dense. However, for high model-size compression, we need to learn sparse versions of the above matrices. The user can restrict the **sparsity of these matrices using the parameters** $$\lambda_W$$, $$\lambda_B$$ and $$\lambda_Z$$:
  - $$||W||_0 < \lambda_W \cdot size(W) = \lambda_W \cdot d \cdot D$$
  - $$||B||_0 < \lambda_B \cdot size(B) = \lambda_B \cdot d \cdot m$$
  - $$||Z||_0 < \lambda_Z \cdot size(Z) = \lambda_Z \cdot L \cdot m$$

- ProtoNN also assumes an **RBF kernel parametrized by a single parameter** $$\gamma$$, which can be inferred heuristically from the data or specified by the user.

More details about the ProtoNN prediction function, the training algorithm, and pointers on how to tune hyper-parameters are deferred to the end of this Readme for better readability.

## Running

Follow the instructions in the main Readme to compile and create an executable _ProtoNN_.

##### A sample execution with 10-class USPS

Follow the instructions in the main Readme to download the **USPS10 dataset**. To execute ProtoNN on this dataset, go to EDGEML_ROOT and type the following in bash:

```bash
sh run_ProtoNN_usps10.sh
```

This should give you output on screen as described in the output section. The final test accuracy will be about 93.4% with the specified parameters.

##### Loading a new dataset

A folder (say **foo**) is required to hold the dataset. **foo** must contain two files, train.txt and test.txt, that hold the training and testing data respectively. The dataset should be in one of the following two formats:
- **Tab-separated (tsv)**: This is only supported for multiclass and binary datasets, not multilabel ones. The file should have $$N$$ rows and $$D+1$$ columns, where $$N$$ is the number of data-points and $$D$$ is the dimensionality of each point. Columns should be separated by _tabs_. The first column contains the label, which must be a natural number between $$1$$ and $$L$$. The remaining $$D$$ columns contain the data, which are real numbers.
- **Libsvm format**: See https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The labels should be between $$1$$ and $$L$$, and the indices should be between $$1$$ and $$D$$. The sample **USPS-10** dataset uses this format.

The number of lines in the train and test data files, the dimension of the data, and the number of labels will _not_ be inferred automatically. They must be specified as described below.

##### Specifying parameters and executing

To specify hyper-parameters for ProtoNN, as well as metadata such as the location of the dataset, input format, etc., one has to write a bash script akin to the sample script **run_ProtoNN_usps10.sh**.

Once ProtoNN is compiled, we execute it via this script:

```bash
sh run_ProtoNN_usps10.sh
```

This bash script is a config file as well as an execution script. There are a number of hyper-parameters to specify, so we split them into the categories described below. The format of run_ProtoNN_usps10.sh mirrors this Readme exactly, to help the user follow along. The value in brackets indicates the command-line flag used to set the given hyperparameter.

##### Input-output parameters

- Predefined model (**-P**): default is 0. Specify 1 if pre-loading initial values of the matrices $$W$$, $$B$$, $$Z$$. One can use this option to initialize with the output of a previous run, or with SLEEC, LMNN, etc. All three matrices should be present in the data input directory **foo** in tsv format. The values of the parameters $$d$$, $$D$$, $$L$$ will _not_ be inferred, and must be specified correctly in the rest of the fields. The filenames and dimensions of the matrices should be as follows:
  - $$W$$: Filename: "W". Dimension: ($$d$$, $$D$$).
  - $$B$$: Filename: "B". Dimension: ($$d$$, $$m$$).
  - $$Z$$: Filename: "Z". Dimension: ($$L$$, $$m$$).
  - $$\gamma$$: Filename: "gamma". A single number representing the RBF kernel parameter.

- Problem format (**-C**): specify one of:
  - 0 (binary)
  - 1 (multiclass)
  - 2 (multilabel)
- Input directory (**-I**): the input directory for the data, referred to above as **foo**.
- Input format (**-F**): specify one of (formats described above):
  - 0 (libsvm format)
  - 1 (tab-separated format)

##### Data-dependent parameters

- Number of training points (**-r**)
- Number of testing points (**-e**)
- Ambient dimension (**-D**): the original dimension of the data
- Number of classes (**-l**)

##### ProtoNN hyper-parameters (required)

- Projection dimension (**-d**): the dimension into which the data is projected.
- Number of prototypes (**-m**): the total number of prototypes. Use this parameter if you want to cluster the entire training data to assign prototypes. **Specify only one of the -m and -k flags.**
- Number of prototypes per class (**-k**): the number of prototypes per class. Use this parameter if you want $$k$$ prototypes to be assigned to each class, initialized using k-means clustering on the data-points belonging to that class; $$m$$ then becomes $$L\cdot k$$, where $$L$$ is the number of classes (see the sketch after this list). **Specify only one of the -m and -k flags.**

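As a rough illustration of the -k style initialization, the sketch below builds per-class k-means prototypes with scikit-learn; the one-hot choice for the prototype-label columns of $$Z$$ is an assumption made for illustration, not necessarily what the executable does.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_class_prototypes(X_proj, y, L, k, seed=42):
    """k-means on the projected points of each class; returns B (d, L*k)
    and a label matrix Z (L, L*k) with one-hot columns (illustrative choice)."""
    d = X_proj.shape[1]
    B = np.zeros((d, L * k))
    Z = np.zeros((L, L * k))
    for c in range(L):
        km = KMeans(n_clusters=k, random_state=seed).fit(X_proj[y == c])
        B[:, c * k:(c + 1) * k] = km.cluster_centers_.T  # k centers for class c
        Z[c, c * k:(c + 1) * k] = 1.0                    # mark their class label
    return B, Z
```
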
##### ProtoNN hyper-parameters (optional)

- Sparsity parameters (described in detail above): projection sparsity (**-W**), prototype sparsity (**-B**), label sparsity (**-Z**). [**Default:** $$1.0$$]
- GammaNumerator (**-g**):
  - On setting GammaNumerator, the RBF kernel parameter $$\gamma$$ is set as
  - $$\gamma = (2.5 \cdot GammaNumerator)/(median_{i,j}(||W\cdot X_i - B_j||_2^2))$$ (see the sketch after this list)
  - **Default:** $$1.0$$
- Normalization (**-N**): specify one of:
  - 0 (no normalization) (**default**)
  - 1 (min-max normalization, wherein each feature is linearly scaled to lie between 0 and 1)
  - 2 (l2-normalization, wherein each data-point is normalized to unit l2-norm)
- Seed (**-R**): a random number seed which can be used to regenerate previously obtained experimental results. [**Default:** $$42$$]

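A NumPy sketch of the $$\gamma$$ heuristic above, assuming the median is taken over the squared distances between all projected training points $$W\cdot X_i$$ and all prototypes $$B_j$$ (treat that reading, and the names below, as illustrative):

```python
import numpy as np

def rbf_gamma(X, W, B, gamma_numerator=1.0):
    """gamma = 2.5 * gamma_numerator / median_{i,j} ||W x_i - B_j||_2^2.
    X: (n, D) training points, W: (d, D) projection, B: (d, m) prototypes."""
    X_proj = X @ W.T                                                   # (n, d)
    # squared distances between every projected point and every prototype
    d2 = ((X_proj[:, None, :] - B.T[None, :, :]) ** 2).sum(axis=-1)    # (n, m)
    return 2.5 * gamma_numerator / np.median(d2)
```
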
##### ProtoNN optimization hyper-parameters (optional)

- Batch size (**-b**): batch size for mini-batch stochastic gradient descent. [**Default:** $$1024$$]
- Number of iterations (**-T**): total number of optimization iterations. [**Default:** $$20$$]
- Epochs (**-E**): number of passes over the data for each iteration and each parameter. [**Default:** $$20$$]

##### Executable

The script in this section combines all the specified hyper-parameters to create an execution command. This command is printed to stdout and then executed. Most users should copy this section directly into all their ProtoNN execution scripts without change. We provide a single option here that is commented out by default:
- **gdb --args**: run ProtoNN with the given hyper-parameters in debug mode using gdb.

## Disclaimers

- The training data is not shuffled in the code, so it is a good idea to **pre-shuffle** it once before passing it to ProtoNN; for example, all examples of a single class should not occur consecutively. A simple bash command can accomplish this.
- **Normalization**: Ideally, the user should provide **standardized** (mean-variance normalized) data. If this is not possible, use one of the normalization options that we provide. The code may be unstable in the absence of normalization.
- The results on various datasets reported in the ProtoNN paper were obtained using **gradient descent** as the optimization algorithm, whereas this repository uses **stochastic gradient descent**, so the results may not match exactly. We will publish an update to this repository with gradient descent implemented.
- We do _not_ provide support for **cross-validation**, only **train-test** style runs. The user can write a bash wrapper to perform cross-validation.

## Interpreting the output

- The following information is printed to **std::cout**:
  - The chosen value of $$\gamma$$
  - **Training accuracy, testing accuracy, and training objective value**, thrice for each iteration, once after optimizing each parameter

- **Errors and warnings** are printed to **std::cerr**.

- Additional **parameter dumps**, **timer logs** and other **debugging logs** are placed in the input folder **foo**. Hence, the user should have read-write permissions on **foo** (use chmod if necessary).
- On execution, a folder called **results** is created in **foo**. The results folder will contain another folder whose name indicates the list of parameters with which the run was instantiated. In this folder, **6 files** are created:
  - **log**: This file stores logging information such as the time taken to run various parts of the code, the norms of the matrices, etc. This is mainly for debugging/optimization purposes and requires a more detailed understanding of the code to interpret. It may contain useful information if your code did not run as expected. **The log file is populated synchronously while the ProtoNN optimization is executing.**
  - **runInfo**: This file contains the hyperparameters and meta-information for the respective instantiation of ProtoNN. It also shows the exact bash script call that was made, which is helpful for reproducing results. Additionally, the training accuracy, testing accuracy, and objective value at the end of each iteration are printed in a readable format. **This file is created at the end of the ProtoNN optimization.**
  - **W, B, Z**: These files contain the learnt parameter matrices $$W$$, $$B$$ and $$Z$$ in human-readable tsv format. The storage dimensions are $$(d, D)$$, $$(d, m)$$ and $$(L, m)$$ respectively. **These files are created at the end of the ProtoNN optimization.**
  - **gamma**: This file contains a single number, the chosen value of $$\gamma$$, the RBF kernel parameter.

The files **W, B, Z, and gamma** can be copied to **foo** to continue training ProtoNN by initializing with these previously learned matrices. Use the **-P** option for this (see above). On doing so, the starting train/test accuracies should match the final accuracies specified in the runInfo file.

## Choosing hyperparameters

##### Model size as a function of hyperparameters

A user presented with a model-size budget has to make a decision regarding the following 5 hyper-parameters:
- The projection dimension $$d$$
- The number of prototypes $$m$$
- The 3 sparsity parameters: $$\lambda_W$$, $$\lambda_B$$, $$\lambda_Z$$

Each parameter matrix requires the following number of stored values:
- $$S_W = min(1, 2\lambda_W) \cdot d \cdot D$$
- $$S_B = min(1, 2\lambda_B) \cdot d \cdot m$$
- $$S_Z = min(1, 2\lambda_Z) \cdot L \cdot m$$

The factor of 2 accounts for storing the index of each entry of a sparse matrix in addition to the value at that index. If a matrix is more than 50% dense ($$\lambda > 0.5$$), it is better to store it as dense instead of incurring the overhead of storing indices along with values; hence the minimum operator.
If each value is a single-precision float (4 bytes), the total space required by ProtoNN is $$4\cdot(S_W + S_B + S_Z)$$. This value is computed and printed to screen when running ProtoNN.

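A small Python helper that evaluates these formulas, handy for checking a configuration against a byte budget before training; it assumes the 4-byte single-precision values described above.

```python
def protonn_model_bytes(d, D, m, L, lw, lb, lz, bytes_per_value=4):
    """Model size in bytes: bytes_per_value * (S_W + S_B + S_Z), with the
    factor-of-2 index overhead applied only while a matrix is < 50% dense."""
    s_w = min(1.0, 2 * lw) * d * D
    s_b = min(1.0, 2 * lb) * d * m
    s_z = min(1.0, 2 * lz) * L * m
    return bytes_per_value * (s_w + s_b + s_z)

# Example: d=15, D=256, m=100, L=10 with lambda_W=0.1, lambda_B=1.0, lambda_Z=1.0
print(protonn_model_bytes(15, 256, 100, 10, 0.1, 1.0, 1.0))  # 13072.0 bytes
```
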
##### Pointers on choosing hyperparameters

Choosing the right hyperparameters may seem daunting in the beginning, but becomes much easier with a little thought. To get an idea of reasonable parameters on some sample datasets, see the [paper](publications/ProtoNN.pdf). A few rules of thumb:
- $$S_B$$ is typically small, and hence $$\lambda_B \approx 1.0$$.
- One can set $$m$$ to $$min(10\cdot L, 0.01\cdot numTrainingPoints)$$ and $$d$$ to $$15$$ for an initial experiment. Typically, you will want to cross-validate for $$m$$ and $$d$$.
- Depending on $$L$$ and $$D$$, $$S_W$$ or $$S_Z$$ is the biggest contributor to the model size; $$\lambda_W$$ and $$\lambda_Z$$ can be adjusted accordingly or cross-validated for.

## Formal details

##### Prediction function

ProtoNN predicts on a new test-point in the following manner. For a test-point $$X$$, ProtoNN computes the following $$L$$-dimensional score vector:
$$Y_{score}=\sum_{j=1}^{m}\space \left(RBF_\gamma(W\cdot X,B_j)\cdot Z_j\right)$$, where
$$RBF_\gamma (U, V) = exp\left[-\gamma^2||U - V||_2^2\right]$$
The predicted label is the index of the maximum entry of $$Y_{score}$$.

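A minimal NumPy sketch of this scoring rule; variable names are illustrative and the actual executable implements this in C++.

```python
import numpy as np

def protonn_predict(X_test, W, B, Z, gamma):
    """Y_score = sum_j RBF_gamma(W x, B_j) * Z_j for every test point.
    X_test: (n, D), W: (d, D), B: (d, m), Z: (L, m). Returns (n,) labels."""
    X_proj = X_test @ W.T                                        # (n, d)
    d2 = ((X_proj[:, None, :] - B.T[None, :, :]) ** 2).sum(-1)   # (n, m)
    similarities = np.exp(-(gamma ** 2) * d2)                    # RBF kernel values
    scores = similarities @ Z.T                                  # (n, L) score vectors
    return scores.argmax(axis=1)                                 # index of max entry
```
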
##### Training

For training, we are given examples $$X_1, X_2, ... X_n$$ along with their label vectors $$Y_1, Y_2, ... Y_n$$. $$Y_i$$ is an $$L$$-dimensional vector that is $$0$$ everywhere, except in the component corresponding to the class of the training point, where it is $$1$$. For example, for a $$3$$-class problem, a data-point that belongs to class $$2$$ has $$Y=[0, 1, 0]$$.
We optimize the squared-$$l_2$$ loss over all training points: $$\sum_{i=1}^{n} ||Y_i-\sum_{j=1}^{m}\space exp\left[-\gamma^2||W\cdot X_i - B_j||_2^2\right]\cdot Z_j||_2^2$$.
While performing stochastic gradient descent, we hard-threshold after each gradient update step to ensure that the three memory constraints (one each for $$\lambda_W, \lambda_B, \lambda_Z$$) are satisfied by the matrices $$W$$, $$B$$ and $$Z$$.
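
A schematic NumPy sketch of one such projected-SGD step, updating only $$Z$$ for brevity (the analogous $$W$$ and $$B$$ updates are omitted); the names and structure are illustrative, not the C++ implementation.

```python
import numpy as np

def hard_threshold(mat, frac):
    """Zero all but the ceil(frac * size) largest-magnitude entries."""
    k = int(np.ceil(frac * mat.size))
    if k >= mat.size:
        return mat
    thresh = np.partition(np.abs(mat).ravel(), -k)[-k]
    return np.where(np.abs(mat) >= thresh, mat, 0.0)

def sgd_step_Z(X_batch, Y_batch, W, B, Z, gamma, lr, lambda_Z):
    """One projected-SGD update of Z for the squared loss above.
    X: (n, D), Y: (n, L) one-hot labels, W: (d, D), B: (d, m), Z: (L, m)."""
    X_proj = X_batch @ W.T                                       # (n, d)
    d2 = ((X_proj[:, None, :] - B.T[None, :, :]) ** 2).sum(-1)   # (n, m)
    S = np.exp(-(gamma ** 2) * d2)                               # kernel values
    residual = Y_batch - S @ Z.T                                 # (n, L)
    grad_Z = -2.0 * residual.T @ S                               # (L, m) gradient
    Z_new = Z - lr * grad_Z                                      # gradient step
    return hard_threshold(Z_new, lambda_Z)                       # re-project onto budget
```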

---

# ProtoNN: Compressed and accurate KNN for resource-constrained devices

ProtoNN ([paper](http://manikvarma.org/pubs/gupta17.pdf)) has been developed for machine learning applications where the intended footprint of the ML model is small. ProtoNN models have memory requirements that are several orders of magnitude lower than other modern ML algorithms. At prediction time, ProtoNN is fast, precise, and accurate.

One example of a ubiquitous real-world application where such a model is desirable is resource-scarce devices such as Internet of Things (IoT) sensors. To make real-time predictions locally on IoT devices, without connecting to the cloud, we need models that are just a few kilobytes large. ProtoNN shines in this setting, beating all other algorithms by a significant margin.

## The model

Suppose a single data-point is $$D$$-dimensional. Suppose also that there are a total of $$L$$ labels to predict.

ProtoNN learns 3 parameters:
- A projection matrix $$W$$ of dimension $$(d,\space D)$$ that projects the datapoints to a small dimension $$d$$
- $$m$$ prototypes in the projected space, each $$d$$-dimensional: $$B = [B_1,\space B_2, ... \space B_m]$$
- $$m$$ label vectors, one for each of the prototypes, allowing a single prototype to store information for multiple labels, each $$L$$-dimensional: $$Z = [Z_1,\space Z_2, ... \space Z_m]$$

ProtoNN also assumes an RBF kernel parametrized by a single parameter $$\gamma$$. Each of the three matrices is trained to be sparse. The user can specify the maximum proportion of entries that can be non-zero in each of these matrices using the parameters $$\lambda_W$$, $$\lambda_B$$ and $$\lambda_Z$$:
- $$||W||_0 < \lambda_W \cdot size(W)$$
- $$||B||_0 < \lambda_B \cdot size(B)$$
- $$||Z||_0 < \lambda_Z \cdot size(Z)$$

## Effect of various parameters

A user presented with a model-size budget has to make a decision regarding the following 5 parameters:
- The projection dimension $$d$$
- The number of prototypes $$m$$
- The 3 sparsity parameters: $$\lambda_W$$, $$\lambda_B$$, $$\lambda_Z$$

Each parameter matrix requires the following number of stored values:
- $$S_W = min(1, 2\lambda_W) \cdot d \cdot D$$
- $$S_B = min(1, 2\lambda_B) \cdot d \cdot m$$
- $$S_Z = min(1, 2\lambda_Z) \cdot L \cdot m$$

The factor of 2 accounts for storing the index of each entry of a sparse matrix in addition to the value at that index. If a matrix is more than 50% dense ($$\lambda > 0.5$$), it is better to store it as dense instead of incurring the overhead of storing indices along with values; hence the minimum operator.
If each value is a single-precision float (4 bytes), the total space required by ProtoNN is $$4\cdot(S_W + S_B + S_Z)$$.

## Prediction

Given these parameters, ProtoNN predicts on a new test-point in the following manner. For a test-point $$X$$, ProtoNN computes the following $$L$$-dimensional score vector:
$$Y_{score}=\sum_{j=1}^{m}\space \left(RBF_\gamma(W\cdot X,B_j)\cdot Z_j\right)$$, where
$$RBF_\gamma (U, V) = exp\left[-\gamma^2||U - V||_2^2\right]$$
The predicted label is the index of the maximum entry of $$Y_{score}$$.

## Training

For training, we are given examples $$X_1, X_2, ... X_n$$ along with their label vectors $$Y_1, Y_2, ... Y_n$$. $$Y_i$$ is an $$L$$-dimensional vector that is $$0$$ everywhere, except in the component corresponding to the class of the training point, where it is $$1$$. For example, for a $$3$$-class problem, a data-point that belongs to class $$2$$ has $$Y=[0, 1, 0]$$.

We optimize the squared-$$l_2$$ loss over all training points: $$\sum_{i=1}^{n} ||Y_i-\sum_{j=1}^{m}\space exp\left[-\gamma^2||W\cdot X_i - B_j||_2^2\right]\cdot Z_j||_2^2$$.
While performing stochastic gradient descent, we hard-threshold after each gradient update step to ensure that the three memory constraints (one each for $$\lambda_W, \lambda_B, \lambda_Z$$) are satisfied by the matrices $$W$$, $$B$$ and $$Z$$.

## Output

TODO

## Parameters

- Projection dimension ($$d$$): the dimension into which the data is projected.
- Clustering init: specifies whether the initialization of the prototypes is performed by clustering the entire training data (OverallKmeans), or by clustering the data-points belonging to each class separately (PerClassKmeans).
- Num prototypes ($$m$$): the number of prototypes. This parameter is only used if clustering init is specified as OverallKmeans.
- Num prototypes per class ($$k$$): the number of prototypes per class. This parameter is only used if clustering init is specified as PerClassKmeans. On using it, $$m$$ becomes $$L\cdot k$$, where $$L$$ is the number of classes.
- gammaNumerator:
  - On setting gammaNumerator, the RBF kernel parameter $$\gamma$$ is set as
  - $$\gamma = (2.5 \cdot gammaNumerator)/(median_{i,j}(||W\cdot X_i - B_j||_2^2))$$
- Sparsity parameters (described in detail above): projection sparsity ($$\lambda_W$$), prototype sparsity ($$\lambda_B$$), label sparsity ($$\lambda_Z$$).
- Batch size: batch size for mini-batch stochastic gradient descent.
- Number of iterations: total number of optimization iterations.
- Epochs: number of passes over the data for each iteration and each parameter.
- Seed: a random number seed which can be used to regenerate previously obtained experimental results.