initial adjustments to readme, setup, tests markdown files

Scott Graham 2018-12-04 14:57:32 -05:00
Parent e132a5bb00
Commit fad27af27a
4 changed files with 28 additions and 235 deletions

README.md (216 lines changed)

@@ -1,15 +1,17 @@
# Recommenders
This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learning to illustrate four key tasks:
1. Preparing and loading data for each recommender algorithm.
2. Using different algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Square (ALS), etc., for building recommender models.
3. Evaluating algorithms with offline metrics.
4. Operationalizing models in a production environment on Azure.
This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on four key tasks:
1. [Data Prep](notebooks/01_data/README.md): Preparing and loading data for each recommender algorithm
2. [Model](notebooks/02_modeling/README.md): Building models using various recommender algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Squares (ALS), etc.
3. [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
4. [Operationalize](notebooks/04_operationalize/README.md): Operationalizing models in a production environment on Azure
Several utilities are provided in [reco_utils](reco_utils) to do common tasks such as loading datasets in the manner expected by different algorithms, evaluate model outputs, and split training data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting train/test data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
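As a rough illustration of how these utilities fit together (the module paths and function names below are assumptions and may differ from the actual reco_utils API):

```python
# Hypothetical sketch only; module paths and function signatures are assumed,
# not taken from the repository.
from reco_utils.dataset import movielens                             # assumed data loader
from reco_utils.dataset.python_splitters import python_random_split  # assumed splitter
from reco_utils.evaluation.python_evaluation import rmse             # assumed evaluator

# Load a dataset in the (user, item, rating) format the algorithms expect.
data = movielens.load_pandas_df(size="100k")

# Split into train/test sets.
train, test = python_random_split(data, ratio=0.75)

# ...train a model on `train` and score `test` to obtain `predictions`...
# score = rmse(test, predictions)
```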
## Getting Started
Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependencies or to [setup on Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks). To setup on your local machine:
Please see the [setup guide](SETUP.md) to set up your machine locally, on Spark, or on [Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks).
To set up on your local machine:
1. Install [Anaconda Python 3.6](https://conda.io/miniconda.html)
2. Run the generate conda file script and create a conda environment:
```
@@ -19,8 +21,8 @@ Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependenc
```
3. Activate the conda environment and register it with Jupyter:
```
source activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
conda activate reco
python -m ipykernel install --user --name reco --display-name "Python (reco)"
```
4. Run the [ALS Movielens Quickstart](notebooks/00_quick_start/als_pyspark_movielens.ipynb) notebook.
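For context, a minimal Spark ML sketch of what the ALS quickstart covers (this is not the notebook's exact code; file paths, column names, and hyperparameters are illustrative):

```python
# Minimal, illustrative ALS example with Spark ML; not the quickstart notebook's exact code.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Ratings data with (userId, movieId, rating) columns, e.g. from Movielens.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.75, 0.25], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=15, regParam=0.05, coldStartStrategy="drop")
model = als.fit(train)

predictions = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print(f"RMSE: {rmse:.3f}")
```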
@@ -58,26 +60,15 @@ The [Operationalize Notebooks](notebooks/04_operationalize) discuss how to deplo
| --- | --- |
| [als_movie_o16n](notebooks/04_operationalize/als_movie_o16n.ipynb) | End-to-end example demonstrating how to build, evaluate, and deploy a Spark ALS-based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/). |
## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.
![workflow](/reco_workflow.png)
A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).
![architecture](/reco-arch.png)
## Benchmarks
Here we benchmark the algorithms available in this repository. A notebook for reproducing the benchmarking results can be found [here](notebooks/00_quick_start/benchmark.ipynb).
We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is split into train/test sets at at 75/25 ratio and splitting is random. A recommendation model is trained using each of the below collaborative filtering algorithms. We utilize empirical parameter values reported in literature that generated optimal results as reported [here](http://mymedialite.net/examples/datasets.html). We benchmark on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Algorithms that do not apply with GPU accelerations are run on CPU instead. Spark ALS is run in local standalone mode.
We benchmark on the Movielens 1M dataset. Data is randomly split into train/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We use the empirical parameter values reported in the literature as yielding optimal results, listed [here](http://mymedialite.net/examples/datasets.html). Benchmarks are run on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory, and 1 K80 GPU). Spark ALS is run in local standalone mode.
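The random 75/25 split can be pictured with a small pandas sketch (illustrative only; the benchmark notebook's own splitter may differ):

```python
# Illustrative random 75/25 train/test split for an interactions DataFrame;
# not the benchmark notebook's exact splitting code.
import numpy as np
import pandas as pd

def random_split(df: pd.DataFrame, ratio: float = 0.75, seed: int = 42):
    """Randomly assign each row of `df` to the train or test set."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < ratio
    return df[mask], df[~mask]

# train, test = random_split(ratings, ratio=0.75)
```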
**Benchmark results**
<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Precision</th>
<th>Recall</th>
@@ -89,41 +80,6 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<th>R squared</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">0.096</td>
<td align="right">0.079</td>
<td align="right">0.026</td>
<td align="right">0.100</td>
<td align="right">1.110</td>
<td align="right">0.860</td>
<td align="right">0.025</td>
<td align="right">0.023</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.958</td>
<td align="right">0.755</td>
<td align="right">0.287</td>
<td align="right">0.287</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.327</td>
<td align="right">0.176</td>
<td align="right">0.106</td>
<td align="right">0.373</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">0.120</td>
<td align="right">0.062</td>
@@ -156,154 +112,8 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">0.090</td>
<td align="right">0.057</td>
<td align="right">0.015</td>
<td align="right">0.084</td>
<td align="right">0.850</td>
<td align="right">0.647</td>
<td align="right">0.359</td>
<td align="right">0.359</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.804</td>
<td align="right">0.616</td>
<td align="right">0.424</td>
<td align="right">0.424</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.276</td>
<td align="right">0.156</td>
<td align="right">0.101</td>
<td align="right">0.321</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">0.081</td>
<td align="right">0.052</td>
<td align="right">0.014</td>
<td align="right">0.076</td>
<td align="right">0.830</td>
<td align="right">0.633</td>
<td align="right">0.372</td>
<td align="right">0.371</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.790</td>
<td align="right">0.601</td>
<td align="right">0.436</td>
<td align="right">0.436</td>
</tr>
<tr >
<td>SAR Single Node</td>
<td align="right">0.247</td>
<td align="right">0.135</td>
<td align="right">0.085</td>
<td align="right">0.287</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
</table>
**Run-time benchmark**
To benchmark run-time performance, the algorithms are run on the same dataset and the elapsed times for training and testing are collected in the table below (a rough sketch of the timing measurement follows the table).
<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Training time (s)</th>
<th>Testing time (s)</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">5.7</td>
<td align="right">0.3</td>
</tr>
<tr >
<td >Surprise SVD</td>
<td align="right">13.3</td>
<td align="right">3.4</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.7</td>
<td align="right">0.1</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">18.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">129.0</td>
<td align="right">35.7</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">5.8</td>
<td align="right">0.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">92.0</td>
<td align="right">0.2</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">1285.0</td>
<td align="right">253.0</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">111.0</td>
<td align="right">12.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">142.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">2562.0</td>
<td align="right">506.0</td>
</tr>
<tr >
<td>SAR Single Node</td>
<td align="right">559.0</td>
<td align="right">47.3</td>
</tr>
</table>
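For reference, elapsed training and testing times can be collected with simple wall-clock measurements like the sketch below (the variable names reuse the illustrative ALS example above and are assumptions, not the benchmark's exact code):

```python
# Rough timing sketch; `als`, `train`, and `test` are assumed to exist
# (see the illustrative ALS example above); this is not the benchmark's exact code.
import time

start = time.time()
model = als.fit(train)                 # training step of the algorithm under test
train_time = time.time() - start

start = time.time()
predictions = model.transform(test)    # scoring / recommendation step
test_time = time.time() - start

print(f"Training time (s): {train_time:.1f}  Testing time (s): {test_time:.1f}")
```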
## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
@@ -316,5 +126,5 @@ This project welcomes contributions and suggestions. Before contributing, please
| **Linux GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu?branchName=master)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4997) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging?branchName=staging)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4998)|
| **Linux Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4804) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_staging?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4805)|
*NOTE: the tests are executed every night, we use pytest for testing python [utilities]((reco_utils)) and papermill for testing [notebooks](notebooks)*.
**NOTE** - the tests are executed every night; we use pytest for testing the Python [utilities](reco_utils) and papermill for testing the [notebooks](notebooks).

View file

@@ -10,7 +10,6 @@ In this guide we show how to setup all the dependencies to run the notebooks of
* [Setup Requirements](#setup-requirements)
* [Dependencies setup](#dependencies-setup)
* [Register the conda environment in Jupyter notebook](#register-the-conda-environment-in-jupyter-notebook)
* [Tests](#tests)
* [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
* [Requirements of Azure Databricks](#requirements-of-azure-databricks)
@@ -25,7 +24,6 @@ We have different compute environments, depending on the kind of machine
Environments supported to run the notebooks on the DSVM:
* Python CPU
* Python GPU
* PySpark
Environments supported to run the notebooks on Azure Databricks:
@@ -35,16 +33,15 @@ Environments supported to run the notebooks on Azure Databricks:
### Setup Requirements
- [Anaconda Python 3.6](https://conda.io/miniconda.html)
- [Anaconda Python 3](https://conda.io/miniconda.html)
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- Machine with Spark (optional for Python environment but mandatory for PySpark environment).
- Machine with GPU (optional but desirable for computing acceleration).
### Dependencies setup
We install the dependencies with Conda. As a prerequisite, we may want to make sure that Conda is up to date:
conda update conda
conda update anaconda
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use.
@@ -61,17 +58,6 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th
</details>
<details>
<summary><strong><em>Python GPU environment</em></strong></summary>
Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:
cd Recommenders
./scripts/generate_conda_file.sh --gpu
conda env create -n reco_gpu -f conda_gpu.yaml
</details>
<details>
<summary><strong><em>PySpark environment</em></strong></summary>
@@ -81,9 +67,9 @@ To install the PySpark environment, which by default installs the CPU environmen
./scripts/generate_conda_file.sh --pyspark
conda env create -n reco_pyspark -f conda_pyspark.yaml
**NOTE** for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
For setting these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/activate.d/env_vars.sh` and add:
To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
```bash
#!/bin/sh
@@ -91,7 +77,7 @@ export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```
This will export the variables every time we do `source activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/deactivate.d/env_vars.sh` and add:
This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:
```bash
#!/bin/sh
@@ -100,23 +86,12 @@ unset PYSPARK_DRIVER_PYTHON
```
</details>
<details>
<summary><strong><em>All environments</em></strong></summary>
To install all three environments:
cd Recommenders
./scripts/generate_conda_file.sh --gpu --pyspark
conda env create -n reco_full -f conda_full.yaml
</details>
### Register the conda environment in Jupyter notebook
We can register the conda environment we created so that it appears as a kernel in the Jupyter notebooks.
source activate my_env_name
conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

View file

@@ -1,6 +1,6 @@
# Tests
This project use unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
This project uses unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
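As a rough illustration of the two mechanisms (the test paths, markers, and notebook parameters below are assumptions), pytest can also be invoked from Python, and papermill executes a notebook end to end:

```python
# Illustrative only; test paths, markers, and notebook parameters are assumptions.
import pytest
import papermill as pm

# Run the Python unit tests (roughly equivalent to `pytest tests/unit` on the command line).
pytest.main(["tests/unit", "-m", "not spark and not gpu"])

# Execute a notebook end to end, injecting parameters, to check that it runs without errors.
pm.execute_notebook(
    "notebooks/00_quick_start/als_pyspark_movielens.ipynb",
    "output.ipynb",
    parameters={"MOVIELENS_DATA_SIZE": "100k"},
)
```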
<details>
<summary><strong><em>Unit tests</em></strong></summary>

View file

@@ -3,3 +3,11 @@
In this directory, notebooks are provided to demonstrate how recommendation systems developed in a heterogeneous environment (e.g., Spark, GPU, etc.) can be operationalized.
## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.
![workflow](/reco_workflow.png)
A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).
![architecture](/reco-arch.png)
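As a generic, minimal illustration of what serving a trained recommender as a web service can look like (this is not the Azure deployment used in the notebooks; `load_model` and `recommend_top_k` are hypothetical placeholders):

```python
# Generic scoring-service sketch; NOT the Azure/AKS deployment from the notebooks.
# `load_model` and `recommend_top_k` are hypothetical placeholders for a trained recommender.
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # e.g. model = load_model("model.pkl") in a real service

@app.route("/recommend", methods=["POST"])
def recommend():
    payload = request.get_json()
    user_id = payload["user_id"]
    k = payload.get("k", 10)
    items = []  # e.g. items = recommend_top_k(model, user_id, k)
    return jsonify({"user_id": user_id, "items": items})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```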