initial adjustments to readme, setup, tests markdown files

Scott Graham 2018-12-04 14:57:32 -05:00
Parent e132a5bb00
Commit fad27af27a
4 changed files with 28 additions and 235 deletions

README.md (216 lines changed)

@@ -1,15 +1,17 @@
# Recommenders
This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learning to illustrate four key tasks:
1. Preparing and loading data for each recommender algorithm.
2. Using different algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Square (ALS), etc., for building recommender models.
3. Evaluating algorithms with offline metrics.
4. Operationalizing models in a production environment on Azure.
This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on four key tasks:
1. [Data Prep](notebooks/01_data/README.md): Preparing and loading data for each recommender algorithm
2. [Model](notebooks/02_modeling/README.md): Building models using various recommender algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Squares (ALS), etc.
3. [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
4. [Operationalize](notebooks/04_operationalize/README.md): Operationalizing models in a production environment on Azure
Several utilities are provided in [reco_utils](reco_utils) to do common tasks such as loading datasets in the manner expected by different algorithms, evaluate model outputs, and split training data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting train/test data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
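As a rough illustration of how these utilities fit together (the module paths and function names below are assumptions and may differ from the actual reco_utils API):

```python
# Hypothetical sketch only; module paths and function signatures are assumed,
# not taken from the repository.
from reco_utils.dataset import movielens                             # assumed data loader
from reco_utils.dataset.python_splitters import python_random_split  # assumed splitter
from reco_utils.evaluation.python_evaluation import rmse             # assumed evaluator

# Load a dataset in the (user, item, rating) format the algorithms expect.
data = movielens.load_pandas_df(size="100k")

# Split into train/test sets.
train, test = python_random_split(data, ratio=0.75)

# ...train a model on `train` and score `test` to obtain `predictions`...
# score = rmse(test, predictions)
```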
## Getting Started
Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependencies or to [setup on Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks). To setup on your local machine:
Please see the [setup guide](SETUP.md) to set up your machine locally, on Spark, or on [Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks).
To set up on your local machine:
1. Install [Anaconda Python 3.6](https://conda.io/miniconda.html)
2. Run the generate conda file script and create a conda environment:
```
@@ -19,8 +21,8 @@ Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependenc
```
3. Activate the conda environment and register it with Jupyter:
```
source activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
conda activate reco
python -m ipykernel install --user --name reco --display-name "Python (reco)"
```
4. Run the [ALS Movielens Quickstart](notebooks/00_quick_start/als_pyspark_movielens.ipynb) notebook.
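For context, a minimal Spark ML sketch of what the ALS quickstart covers (this is not the notebook's exact code; file paths, column names, and hyperparameters are illustrative):

```python
# Minimal, illustrative ALS example with Spark ML; not the quickstart notebook's exact code.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Ratings data with (userId, movieId, rating) columns, e.g. from Movielens.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.75, 0.25], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=15, regParam=0.05, coldStartStrategy="drop")
model = als.fit(train)

predictions = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print(f"RMSE: {rmse:.3f}")
```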
@@ -58,26 +60,15 @@ The [Operationalize Notebooks](notebooks/04_operationalize) discuss how to deplo
| --- | --- |
| [als_movie_o16n](notebooks/04_operationalize/als_movie_o16n.ipynb) | End-to-end example demonstrating how to build, evaluate, and deploy a Spark ALS-based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/). |
## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.
![workflow](/reco_workflow.png)
A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).
![architecture](/reco-arch.png)
## Benchmarks
Here we benchmark the algorithms available in this repository. A notebook for reproducing the benchmarking results can be found [here](notebooks/00_quick_start/benchmark.ipynb).
We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is split into train/test sets at at 75/25 ratio and splitting is random. A recommendation model is trained using each of the below collaborative filtering algorithms. We utilize empirical parameter values reported in literature that generated optimal results as reported [here](http://mymedialite.net/examples/datasets.html). We benchmark on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Algorithms that do not apply with GPU accelerations are run on CPU instead. Spark ALS is run in local standalone mode.
We benchmark on the Movielens 1M dataset. Data is randomly split into train/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We use the empirical parameter values reported in the literature as yielding optimal results, listed [here](http://mymedialite.net/examples/datasets.html). Benchmarks are run on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory, and 1 K80 GPU). Spark ALS is run in local standalone mode.
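The random 75/25 split can be pictured with a small pandas sketch (illustrative only; the benchmark notebook's own splitter may differ):

```python
# Illustrative random 75/25 train/test split for an interactions DataFrame;
# not the benchmark notebook's exact splitting code.
import numpy as np
import pandas as pd

def random_split(df: pd.DataFrame, ratio: float = 0.75, seed: int = 42):
    """Randomly assign each row of `df` to the train or test set."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < ratio
    return df[mask], df[~mask]

# train, test = random_split(ratings, ratio=0.75)
```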
**Benchmark results**
<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Precision</th>
<th>Recall</th>
@@ -89,41 +80,6 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<th>R squared</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">0.096</td>
<td align="right">0.079</td>
<td align="right">0.026</td>
<td align="right">0.100</td>
<td align="right">1.110</td>
<td align="right">0.860</td>
<td align="right">0.025</td>
<td align="right">0.023</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.958</td>
<td align="right">0.755</td>
<td align="right">0.287</td>
<td align="right">0.287</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.327</td>
<td align="right">0.176</td>
<td align="right">0.106</td>
<td align="right">0.373</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">0.120</td>
<td align="right">0.062</td>
@@ -156,154 +112,8 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">0.090</td>
<td align="right">0.057</td>
<td align="right">0.015</td>
<td align="right">0.084</td>
<td align="right">0.850</td>
<td align="right">0.647</td>
<td align="right">0.359</td>
<td align="right">0.359</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.804</td>
<td align="right">0.616</td>
<td align="right">0.424</td>
<td align="right">0.424</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.276</td>
<td align="right">0.156</td>
<td align="right">0.101</td>
<td align="right">0.321</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">0.081</td>
<td align="right">0.052</td>
<td align="right">0.014</td>
<td align="right">0.076</td>
<td align="right">0.830</td>
<td align="right">0.633</td>
<td align="right">0.372</td>
<td align="right">0.371</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.790</td>
<td align="right">0.601</td>
<td align="right">0.436</td>
<td align="right">0.436</td>
</tr>
<tr >
<td>SAR Single Node</td>
<td align="right">0.247</td>
<td align="right">0.135</td>
<td align="right">0.085</td>
<td align="right">0.287</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
</table>
**Run-time benchmark**
To benchmark run-time performance, the algorithms are run on the same dataset and the elapsed times for training and testing are collected in the table below (a rough sketch of the timing measurement follows the table).
<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Training time (s)</th>
<th>Testing time (s)</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">5.7</td>
<td align="right">0.3</td>
</tr>
<tr >
<td >Surprise SVD</td>
<td align="right">13.3</td>
<td align="right">3.4</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.7</td>
<td align="right">0.1</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">18.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">129.0</td>
<td align="right">35.7</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">5.8</td>
<td align="right">0.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">92.0</td>
<td align="right">0.2</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">1285.0</td>
<td align="right">253.0</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">111.0</td>
<td align="right">12.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">142.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">2562.0</td>
<td align="right">506.0</td>
</tr>
<tr >
<td>SAR Single Node</td>
<td align="right">559.0</td>
<td align="right">47.3</td>
</tr>
</table>
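For reference, elapsed training and testing times can be collected with simple wall-clock measurements like the sketch below (the variable names reuse the illustrative ALS example above and are assumptions, not the benchmark's exact code):

```python
# Rough timing sketch; `als`, `train`, and `test` are assumed to exist
# (see the illustrative ALS example above); this is not the benchmark's exact code.
import time

start = time.time()
model = als.fit(train)                 # training step of the algorithm under test
train_time = time.time() - start

start = time.time()
predictions = model.transform(test)    # scoring / recommendation step
test_time = time.time() - start

print(f"Training time (s): {train_time:.1f}  Testing time (s): {test_time:.1f}")
```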
## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
@@ -316,5 +126,5 @@ This project welcomes contributions and suggestions. Before contributing, please
| **Linux GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu?branchName=master)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4997) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging?branchName=staging)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4998)|
| **Linux Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4804) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_staging?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4805)|
*NOTE: the tests are executed every night, we use pytest for testing python [utilities]((reco_utils)) and papermill for testing [notebooks](notebooks)*.
**NOTE** - the tests are executed every night; we use pytest for testing the Python [utilities](reco_utils) and papermill for testing the [notebooks](notebooks).

View file

@@ -10,7 +10,6 @@ In this guide we show how to setup all the dependencies to run the notebooks of
* [Setup Requirements](#setup-requirements)
* [Dependencies setup](#dependencies-setup)
* [Register the conda environment in Jupyter notebook](#register-the-conda-environment-in-jupyter-notebook)
* [Tests](#tests)
* [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
* [Requirements of Azure Databricks](#requirements-of-azure-databricks)
@@ -25,7 +24,6 @@ We have different compute environments, depending on the kind of machine
Environments supported to run the notebooks on the DSVM:
* Python CPU
* Python GPU
* PySpark
Environments supported to run the notebooks on Azure Databricks:
@@ -35,16 +33,15 @@ Environments supported to run the notebooks on Azure Databricks:
### Setup Requirements
- [Anaconda Python 3.6](https://conda.io/miniconda.html)
- [Anaconda Python 3](https://conda.io/miniconda.html)
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- Machine with Spark (optional for Python environment but mandatory for PySpark environment).
- Machine with GPU (optional but desirable for computing acceleration).
### Dependencies setup
We install the dependencies with Conda. As a prerequisite, we may want to make sure that Conda is up to date:
conda update conda
conda update anaconda
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use.
@@ -61,17 +58,6 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th
</details>
<details>
<summary><strong><em>Python GPU environment</em></strong></summary>
Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:
cd Recommenders
./scripts/generate_conda_file.sh --gpu
conda env create -n reco_gpu -f conda_gpu.yaml
</details>
<details>
<summary><strong><em>PySpark environment</em></strong></summary>
@@ -81,9 +67,9 @@ To install the PySpark environment, which by default installs the CPU environmen
./scripts/generate_conda_file.sh --pyspark
conda env create -n reco_pyspark -f conda_pyspark.yaml
**NOTE** for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
For setting these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/activate.d/env_vars.sh` and add:
To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
```bash
#!/bin/sh
@@ -91,7 +77,7 @@ export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```
This will export the variables every time we do `source activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/deactivate.d/env_vars.sh` and add:
This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:
```bash
#!/bin/sh
@@ -100,23 +86,12 @@ unset PYSPARK_DRIVER_PYTHON
```
</details>
<details>
<summary><strong><em>All environments</em></strong></summary>
To install all three environments:
cd Recommenders
./scripts/generate_conda_file.sh --gpu --pyspark
conda env create -n reco_full -f conda_full.yaml
</details>
### Register the conda environment in Jupyter notebook
We can register the conda environment we created so that it appears as a kernel in the Jupyter notebooks.
source activate my_env_name
conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

View file

@@ -1,6 +1,6 @@
# Tests
This project use unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
This project uses unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
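As a rough illustration of the two mechanisms (the test paths, markers, and notebook parameters below are assumptions), pytest can also be invoked from Python, and papermill executes a notebook end to end:

```python
# Illustrative only; test paths, markers, and notebook parameters are assumptions.
import pytest
import papermill as pm

# Run the Python unit tests (roughly equivalent to `pytest tests/unit` on the command line).
pytest.main(["tests/unit", "-m", "not spark and not gpu"])

# Execute a notebook end to end, injecting parameters, to check that it runs without errors.
pm.execute_notebook(
    "notebooks/00_quick_start/als_pyspark_movielens.ipynb",
    "output.ipynb",
    parameters={"MOVIELENS_DATA_SIZE": "100k"},
)
```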
<details>
<summary><strong><em>Unit tests</em></strong></summary>

View file

@@ -3,3 +3,11 @@
In this directory, notebooks are provided to demonstrate how recommendation systems developed in a heterogeneous environment (e.g., Spark, GPU, etc.) can be operationalized.
## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.
![workflow](/reco_workflow.png)
A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).
![architecture](/reco-arch.png)
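As a generic, minimal illustration of what serving a trained recommender as a web service can look like (this is not the Azure deployment used in the notebooks; `load_model` and `recommend_top_k` are hypothetical placeholders):

```python
# Generic scoring-service sketch; NOT the Azure/AKS deployment from the notebooks.
# `load_model` and `recommend_top_k` are hypothetical placeholders for a trained recommender.
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # e.g. model = load_model("model.pkl") in a real service

@app.route("/recommend", methods=["POST"])
def recommend():
    payload = request.get_json()
    user_id = payload["user_id"]
    k = payload.get("k", 10)
    items = []  # e.g. items = recommend_top_k(model, user_id, k)
    return jsonify({"user_id": user_id, "items": items})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```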