This commit is contained in:
Nikhil Joglekar 2018-12-10 20:50:19 -08:00
Parent fe584eea86
Commit 7300cf5b9b
5 changed files with 11 additions and 11 deletions

View file

@@ -49,7 +49,7 @@ We provide several notebooks to show how recommendation algorithms can be design
- The [Operationalization Notebook](notebooks/04_operationalize) demonstrates how to deploy models in production systems.
In addition, We also provide a [comparison notebook](notebooks/03_evaluate/comparison.ipynb) to illustrate how different algorithms could be evaluated and compared. In this notebook, data (MovieLens 1M) is randomly split into train/test sets at at 75/25 ratio. A recommendation model is trained using each of the below collaborative filtering algorithms. We utilize empirical parameter values reported in literature reported [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use k = 10 (top 10 results). We benchmark on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Spark ALS is run in local standalone mode.
In addition, we provide a [comparison notebook](notebooks/03_evaluate/comparison.ipynb) to illustrate how different algorithms can be evaluated and compared. In this notebook, data (MovieLens 1M) is randomly split into train/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in the literature [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use k = 10 (top 10 results). We run the comparison on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Spark ALS is run in local standalone mode.
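To make the ranking setup concrete, here is a minimal sketch of computing precision@k with k = 10 from a table of top-k recommendations. It is not the repo's evaluation utility; the pandas column names and toy data are assumptions for illustration.

```python
import pandas as pd

# Toy ground-truth interactions and top-k recommendations; the column
# names (userID, itemID) are illustrative placeholders, not the repo's schema.
test = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 11, 12, 10, 13],
})
top_k = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 2],
    "itemID": [10, 12, 99, 13, 98, 97],
})

def precision_at_k(test, top_k, k=10):
    """Average over users of (# recommended items found in the user's test
    interactions) / (number of recommendations considered, capped at k)."""
    hits = top_k.merge(test, on=["userID", "itemID"], how="inner")
    hit_counts = hits.groupby("userID").size()
    rec_counts = top_k.groupby("userID").size().clip(upper=k)
    per_user = hit_counts.reindex(rec_counts.index, fill_value=0) / rec_counts
    return per_user.mean()

print(precision_at_k(test, top_k, k=10))
```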
**Preliminary Comparison**

View file

@@ -2,8 +2,7 @@
In this guide we show how to set up all the dependencies to run the notebooks of this repo on a local environment or [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).
<details>
<summary><strong><em>Click here to see the Table of Contents</em></strong></summary>
## Table of Contents
* [Compute environments](#compute-environments)
* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm)
@@ -25,7 +24,7 @@ We have different compute environments, depending on the kind of machine
Environments supported to run the notebooks on the DSVM:
* Python CPU
* PySpark
Environments supported to run the notebooks on Azure Databricks:
* PySpark
@@ -45,7 +44,7 @@ We install the dependencies with Conda. As a pre-requisite, we may want to make
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use. This will create the environment using Python 3.6 with all the correct dependencies.
To install each environment, first we need to generate a conda yml file and then install the environment. We can specify the environment name with the input `-n`.
To install each environment, first we need to generate a conda yaml file and then install the environment. We can specify the environment name with the input `-n`.
Click on the following menus to see more details:
@@ -102,8 +101,8 @@ We can register our created conda environment to appear as a kernel in the Jupyt
* We found that there could be problems if the Spark version installed on the machine is not the same as the one in the conda file. You will have to adapt the conda file to your machine.
* When running Spark on a single local node it is possible to run out of disk space, as temporary files are written to the user's home directory. To avoid this, we attached an additional disk to the DSVM and modified the Spark configuration by adding the following lines to `/dsvm/tools/spark/current/conf/spark-env.sh`.
```
SPARK_LOCAL_DIRS=/mnt/.spark/scratch
SPARK_MASTER_OPTS="-Dspark.worker.cleanup.enabled=true
SPARK_LOCAL_DIRS="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true"
```
## Setup guide for Azure Databricks

View file

@@ -7,6 +7,6 @@ data preparation, model building, and model evaluation by using the utility func
| Notebook | Description |
| --- | --- |
| [als_pyspark_movielens](als_pyspark_movielens.ipynb) | Utilizing the ALS algorithm to power movie ratings in a PySpark environment.
| [als_pyspark_movielens](als_pyspark_movielens.ipynb) | Utilizing the ALS algorithm to predict movie ratings in a PySpark environment.
| [sar_python_cpu_movielens](sar_single_node_movielens.ipynb) | Utilizing the Smart Adaptive Recommendations (SAR) algorithm to recommend movies in a Python+CPU environment.
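As a rough illustration of the ALS workflow referenced in the first row, here is a minimal PySpark sketch using `pyspark.ml.recommendation.ALS`. It is not the notebook's code: the toy ratings, column names, and hyperparameters are placeholders chosen for brevity.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Toy explicit ratings; the notebook loads MovieLens data instead.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 1.0), (2, 1, 3.0), (2, 2, 4.0)],
    ["userId", "movieId", "rating"],
)
train, test = ratings.randomSplit([0.75, 0.25], seed=42)

als = ALS(
    userCol="userId", itemCol="movieId", ratingCol="rating",
    rank=10, maxIter=5, regParam=0.1,
    coldStartStrategy="drop",  # drop users/items unseen during training
)
model = als.fit(train)

# Predict ratings on the held-out set and compute RMSE
# (with this toy data the number itself is not meaningful).
predictions = model.transform(test)
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print(f"RMSE: {rmse:.3f}")

spark.stop()
```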

View file

@@ -9,6 +9,6 @@ data preparation tasks encountered in recommendation system development.
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc.).
Three methods of splitting the data for training and testing are demonstrated in this notebook. Each supports both Spark and pandas DataFrames; a rough pandas sketch of the three strategies follows the list below.
1. Random Split: this is the simplest way to split the data, it randomly assigns entries to either the training set or the test set based on the allocation ratio desired.
2. Chronological Split: in many cases accounting for temporal variations when evaluating your model can provide more realistic measures of performance. This approach will split the train and test set based on timestamps for the user or item data.
3. Stratified Split: it may be preferable to ensure the same number of users or items are in the training and test sets, this method of splitting will ensure that is the case.
1. Random Split: the simplest way to split the data; entries are randomly assigned to either the train set or the test set according to the desired allocation ratio.
2. Chronological Split: in many cases, accounting for temporal variation when evaluating your model gives a more realistic measure of performance. This approach splits the train and test sets based on timestamps, by user or item.
3. Stratified Split: it may be preferable to ensure that the same set of users or items appears in both the training and test sets; this method of splitting ensures that is the case.
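To make the three strategies concrete, below is a minimal pandas sketch. It is not the repo's splitter implementation; the column names (`userID`, `itemID`, `timestamp`), the 75/25 ratio, and the toy data are assumptions for illustration.

```python
import pandas as pd

def random_split(df, ratio=0.75, seed=42):
    """Randomly assign each row to the train or test set according to `ratio`."""
    train = df.sample(frac=ratio, random_state=seed)
    return train, df.drop(train.index)

def chronological_split(df, ratio=0.75, col_user="userID", col_time="timestamp"):
    """Per user, keep the earliest `ratio` of interactions for training and
    the most recent ones for testing."""
    df = df.sort_values([col_user, col_time])
    rank = df.groupby(col_user).cumcount() + 1
    sizes = df.groupby(col_user)[col_time].transform("size")
    in_train = rank <= (sizes * ratio).round()
    return df[in_train], df[~in_train]

def stratified_split(df, ratio=0.75, col_user="userID", seed=42):
    """Sample `ratio` of each user's rows so every user with enough
    interactions appears in both the train and test sets."""
    train = df.groupby(col_user).sample(frac=ratio, random_state=seed)
    return train, df.drop(train.index)

# Tiny usage example with made-up interactions.
ratings = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 2, 2, 2, 2],
    "itemID":    [10, 11, 12, 13, 10, 11, 14, 15],
    "timestamp": [1, 2, 3, 4, 1, 2, 3, 4],
})
train, test = chronological_split(ratings)  # earliest 75% per user -> train
```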

View file

@@ -8,5 +8,6 @@ In this directory, notebooks are provided to give a deep dive into training mode
| --- | --- |
| [als_deep_dive](als_deep_dive.ipynb) | Deep dive on the ALS algorithm and implementation
| [surprise_svd_deep_dive](surprise_svd_deep_dive.ipynb) | Deep dive on an SVD algorithm and implementation
| [sar_single_node_deep_dive](sar_single_node_deep_dive.ipynb) | Deep dive on the SAR algorithm and implementation
Details on model training are best found inside each notebook.