initial adjustments to README, SETUP, and TESTS markdown files
Parent: e132a5bb00
Commit: fad27af27a
README.md (216 changed lines)

@@ -1,15 +1,17 @@
# Recommenders

This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learning to illustrate four key tasks:
1. Preparing and loading data for each recommender algorithm.
2. Using different algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Square (ALS), etc., for building recommender models.
3. Evaluating algorithms with offline metrics.
4. Operationalizing models in a production environment on Azure.

This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on four key tasks:
1. [Data Prep](notebooks/01_data/README.md): Preparing and loading data for each recommender algorithm
2. [Model](notebooks/02_modeling/README.md): Building models using various recommender algorithms such as Smart Adaptive Recommendation (SAR), Alternating Least Square (ALS), etc.
3. [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
4. [Operationalize](notebooks/04_operationalize/README.md): Operationalizing models in a production environment on Azure

Several utilities are provided in [reco_utils](reco_utils) to do common tasks such as loading datasets in the manner expected by different algorithms, evaluate model outputs, and split training data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting train/test data. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
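For orientation, below is a minimal sketch of how such utilities are typically called. The module paths and function names (`movielens.load_pandas_df`, `python_random_split`) are assumptions based on later layouts of `reco_utils` and may not match this commit exactly.

```python
# Hedged sketch only: helper names below are assumptions, not guaranteed by this commit.
from reco_utils.dataset import movielens                             # assumed module path
from reco_utils.dataset.python_splitters import python_random_split  # assumed helper

# Load MovieLens 100k ratings into a pandas DataFrame.
data = movielens.load_pandas_df(size="100k")

# Random 75/25 train/test split, matching the benchmark setup described below.
train, test = python_random_split(data, ratio=0.75)
print(len(train), len(test))
```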
## Getting Started
Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependencies or to [setup on Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks). To setup on your local machine:
Please see the [setup guide](SETUP.md) to set up your machine locally, on Spark, or on [Azure Databricks](/SETUP.md#setup-guide-for-azure-databricks).

To set up on your local machine:
1. Install [Anaconda Python 3.6](https://conda.io/miniconda.html)
2. Run the generate conda file script and create a conda environment:
```
@@ -19,8 +21,8 @@ Please see the [setup guide](SETUP.md) to setup including GPU or Spark dependenc
```
3. Activate the conda environment and register it with Jupyter:
```
source activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
conda activate reco
python -m ipykernel install --user --name reco --display-name "Python (reco)"
```
4. Run the [ALS Movielens Quickstart](notebooks/00_quick_start/als_pyspark_movielens.ipynb) notebook.
@@ -58,26 +60,15 @@ The [Operationalize Notebooks](notebooks/04_operationalize) discuss how to deplo
| --- | --- |
| [als_movie_o16n](notebooks/04_operationalize/als_movie_o16n.ipynb) | End-to-end example demonstrating how to build, evaluate, and deploy a Spark ALS-based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/).

## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.

![workflow](/reco_workflow.png)

A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).

![architecture](/reco-arch.png)

## Benchmarks

Here we benchmark the algorithms available in this repository. A notebook for reproducing the benchmarking results can be found [here](notebooks/00_quick_start/benchmark.ipynb).

We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is split into train/test sets at a 75/25 ratio and splitting is random. A recommendation model is trained using each of the below collaborative filtering algorithms. We utilize empirical parameter values that are reported in the literature to produce optimal results, as documented [here](http://mymedialite.net/examples/datasets.html). We benchmark on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Algorithms that do not support GPU acceleration are run on CPU instead. Spark ALS is run in local standalone mode.
We benchmark on the Movielens 1M dataset. Data is split into train/test sets at a 75/25 ratio and splitting is random. A recommendation model is trained using each of the below collaborative filtering algorithms. We utilize empirical parameter values that are reported in the literature to produce optimal results, as documented [here](http://mymedialite.net/examples/datasets.html). We benchmark on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Spark ALS is run in local standalone mode.
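For reference, the rating metrics reported in the table below (RMSE, MAE, R squared) can be computed from held-out and predicted ratings as in this small sketch; it uses plain scikit-learn rather than the repository's own evaluation utilities.

```python
# Minimal sketch of the rating metrics in the benchmark table, using scikit-learn directly.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([4.0, 3.0, 5.0, 2.0])  # ratings held out in the 25% test split (toy values)
y_pred = np.array([3.8, 2.5, 4.6, 2.9])  # ratings predicted by a trained model (toy values)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```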

**Benchmark results**
<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Precision</th>
<th>Recall</th>
@@ -89,41 +80,6 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<th>R squared</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">0.096</td>
<td align="right">0.079</td>
<td align="right">0.026</td>
<td align="right">0.100</td>
<td align="right">1.110</td>
<td align="right">0.860</td>
<td align="right">0.025</td>
<td align="right">0.023</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.958</td>
<td align="right">0.755</td>
<td align="right">0.287</td>
<td align="right">0.287</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.327</td>
<td align="right">0.176</td>
<td align="right">0.106</td>
<td align="right">0.373</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">0.120</td>
<td align="right">0.062</td>
@@ -156,154 +112,8 @@ We benchmark on the Movielens dataset at 100K, 1M, 10M, and 20M sizes. Data is s
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">0.090</td>
<td align="right">0.057</td>
<td align="right">0.015</td>
<td align="right">0.084</td>
<td align="right">0.850</td>
<td align="right">0.647</td>
<td align="right">0.359</td>
<td align="right">0.359</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.804</td>
<td align="right">0.616</td>
<td align="right">0.424</td>
<td align="right">0.424</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.276</td>
<td align="right">0.156</td>
<td align="right">0.101</td>
<td align="right">0.321</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">0.081</td>
<td align="right">0.052</td>
<td align="right">0.014</td>
<td align="right">0.076</td>
<td align="right">0.830</td>
<td align="right">0.633</td>
<td align="right">0.372</td>
<td align="right">0.371</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">0.790</td>
<td align="right">0.601</td>
<td align="right">0.436</td>
<td align="right">0.436</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.247</td>
<td align="right">0.135</td>
<td align="right">0.085</td>
<td align="right">0.287</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
<td align="right">N/A</td>
</tr>

</table>

**Benchmark comparing time metrics**

In order to benchmark the run-time performance, the algorithms are run on the same data set and the elapsed times for training and testing are collected as follows.
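As a rough illustration (not the benchmark notebook itself), the elapsed times can be captured with a simple timer around the fit and test calls; the sketch below uses Surprise SVD, one of the benchmarked algorithms, on a toy in-memory DataFrame.

```python
# Sketch of how training/testing times can be measured, using Surprise SVD on toy data.
import time
import pandas as pd
from surprise import SVD, Dataset, Reader

ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0],
})
data = Dataset.load_from_df(ratings, Reader(rating_scale=(1, 5)))
trainset = data.build_full_trainset()

start = time.perf_counter()
model = SVD()
model.fit(trainset)                        # training time (s)
train_time = time.perf_counter() - start

start = time.perf_counter()
model.test(trainset.build_testset())       # testing time (s)
test_time = time.perf_counter() - start

print(f"train: {train_time:.2f}s  test: {test_time:.2f}s")
```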

<table>
<tr>
<th>Dataset</th>
<th>Algorithm</th>
<th>Training time (s)</th>
<th>Testing time (s)</th>
</tr>
<tr>
<td rowspan=3>Movielens 100k</td>
<td>ALS</td>
<td align="right">5.7</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">13.3</td>
<td align="right">3.4</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">0.7</td>
<td align="right">0.1</td>
</tr>
<tr>
<td rowspan=3>Movielens 1M</td>
<td>ALS</td>
<td align="right">18.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">129.0</td>
<td align="right">35.7</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">5.8</td>
<td align="right">0.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 10M</td>
<td>ALS</td>
<td align="right">92.0</td>
<td align="right">0.2</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">1285.0</td>
<td align="right">253.0</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">111.0</td>
<td align="right">12.6</td>
</tr>
<tr>
<td rowspan=3>Movielens 20M</td>
<td>ALS</td>
<td align="right">142.0</td>
<td align="right">0.3</td>
</tr>
<tr>
<td>Surprise SVD</td>
<td align="right">2562.0</td>
<td align="right">506.0</td>
</tr>
<tr>
<td>SAR Single Node</td>
<td align="right">559.0</td>
<td align="right">47.3</td>
</tr>

</table>

## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
@@ -316,5 +126,5 @@ This project welcomes contributions and suggestions. Before contributing, please
| **Linux GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu?branchName=master)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4997) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging?branchName=staging)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4998)|
| **Linux Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4804) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_staging?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4805)|

*NOTE: the tests are executed every night, we use pytest for testing python [utilities]((reco_utils)) and papermill for testing [notebooks](notebooks)*.
**NOTE** - the tests are executed every night; we use pytest for testing Python [utilities](reco_utils) and papermill for testing [notebooks](notebooks).

SETUP.md (37 changed lines)

@@ -10,7 +10,6 @@ In this guide we show how to setup all the dependencies to run the notebooks of
* [Setup Requirements](#setup-requirements)
* [Dependencies setup](#dependencies-setup)
* [Register the conda environment in Jupyter notebook](#register-the-conda-environment-in-jupyter-notebook)
* [Tests](#tests)
* [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
* [Requirements of Azure Databricks](#requirements-of-azure-databricks)
@@ -25,7 +24,6 @@ We have different compute environments, depending on the kind of machine

Environments supported to run the notebooks on the DSVM:
* Python CPU
* Python GPU
* PySpark

Environments supported to run the notebooks on Azure Databricks:
@@ -35,16 +33,15 @@ Environments supported to run the notebooks on Azure Databricks:

### Setup Requirements

- [Anaconda Python 3.6](https://conda.io/miniconda.html)
- [Anaconda Python 3](https://conda.io/miniconda.html)
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- Machine with Spark (optional for Python environment but mandatory for PySpark environment).
- Machine with GPU (optional but desirable for computing acceleration).

### Dependencies setup

We install the dependencies with Conda. As a prerequisite, we may want to make sure that Conda is up to date:

conda update conda
conda update anaconda

We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use.
@@ -61,17 +58,6 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th

</details>

<details>
<summary><strong><em>Python GPU environment</em></strong></summary>

Assuming that you have a GPU machine, to install the Python GPU environment, which by default also installs the CPU environment:

cd Recommenders
./scripts/generate_conda_file.sh --gpu
conda env create -n reco_gpu -f conda_gpu.yaml

</details>

<details>
<summary><strong><em>PySpark environment</em></strong></summary>
@@ -81,9 +67,9 @@ To install the PySpark environment, which by default installs the CPU environmen
./scripts/generate_conda_file.sh --pyspark
conda env create -n reco_pyspark -f conda_pyspark.yaml

**NOTE** for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.

For setting these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/activate.d/env_vars.sh` and add:
To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
```bash
#!/bin/sh
@@ -91,7 +77,7 @@ export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```

This will export the variables every time we do `source activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/deactivate.d/env_vars.sh` and add:
This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:

```bash
#!/bin/sh
@@ -100,23 +86,12 @@ unset PYSPARK_DRIVER_PYTHON
```
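As an alternative sketch (an assumption about your setup, not a documented step of this guide), the same variables can also be set from Python before the Spark session is created, which is sometimes sufficient when running Spark in local standalone mode:

```python
# Sketch: set the PySpark interpreter variables from Python before starting Spark.
# The environment path below is an assumption; point it to wherever reco_pyspark lives.
import os

os.environ["PYSPARK_PYTHON"] = "/anaconda/envs/reco_pyspark/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/anaconda/envs/reco_pyspark/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("reco_pyspark").getOrCreate()
print(spark.version)
```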
</details>

<details>
<summary><strong><em>All environments</em></strong></summary>

To install all three environments:

cd Recommenders
./scripts/generate_conda_file.sh --gpu --pyspark
conda env create -n reco_full -f conda_full.yaml

</details>

### Register the conda environment in Jupyter notebook

We can register the conda environment we created so that it appears as a kernel in the Jupyter notebooks.

source activate my_env_name
conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

TESTS.md (2 changed lines)

@@ -1,6 +1,6 @@
# Tests

This project use unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
This project uses unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [setup](/SETUP.md)**. Click on the following menus to see more details:
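To illustrate the two mechanisms mentioned above, the sketch below executes a notebook with papermill the way a notebook test would; the notebook path comes from this repository's quick-start folder, while the output path and the injected parameter name are illustrative assumptions.

```python
# Sketch: run a notebook programmatically with papermill, as the notebook tests do.
# Plain Python utilities are exercised separately with pytest.
import papermill as pm

pm.execute_notebook(
    "notebooks/00_quick_start/als_pyspark_movielens.ipynb",   # input notebook (from this repo)
    "als_pyspark_movielens_output.ipynb",                     # executed copy with outputs
    parameters={"MOVIELENS_DATA_SIZE": "100k"},               # assumed parameter name
)
```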

<details>
<summary><strong><em>Unit tests</em></strong></summary>

@@ -3,3 +3,11 @@
In this directory, notebooks are provided to demonstrate how recommendation systems developed in a heterogeneous environment (e.g., Spark, GPU, etc.) can be operationalized.

## Workflow
The diagram below depicts how the best-practice examples help researchers / developers in the recommendation system development workflow.

![workflow](/reco_workflow.png)

A few Azure services are recommended for scalable data storage ([Azure Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)), model development ([Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (DSVM), [Azure Machine Learning Services](https://azure.microsoft.com/en-us/services/machine-learning-service/)), and model operationalization ([Azure Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/) (AKS)).

![architecture](/reco-arch.png)