# Setup guide 

This guide shows how to set up all the dependencies needed to run the notebooks in this repo on a local Linux system or a Linux [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/), and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).

## Table of Contents
 
* [Compute environments](#compute-environments)
* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm)
  * [Setup Requirements](#setup-requirements)
  * [Dependencies setup](#dependencies-setup)
  * [Register the conda environment in Jupyter notebook](#register-the-conda-environment-in-jupyter-notebook)
  * [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
  * [Requirements of Azure Databricks](#requirements-of-azure-databricks)
  * [Repository installation](#repository-installation)
  * [Troubleshooting for Azure Databricks](#troubleshooting-for-azure-databricks)

## Compute environments

We have different compute environments, depending on the kind of machine.

Environments supported to run the notebooks on the Linux DSVM:
* Python CPU
* Python GPU
* PySpark

Environments supported to run the notebooks on Azure Databricks:
* PySpark

## Setup guide for Local or DSVM

### Setup Requirements

- Anaconda with Python version >= 3.6. [Miniconda](https://conda.io/miniconda.html) is the fastest way to get started.
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- A machine with Spark (optional for the Python environments but mandatory for the PySpark environment).
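
The requirements above can be checked quickly from Python before creating any environment. This is a minimal sketch, not part of the repo: the helper name `check_requirements` is hypothetical, and looking up `SPARK_HOME` is an assumption about how Spark is exposed on the machine.

```python
import os
import shutil
import sys


def check_requirements():
    """Report whether the basic setup requirements appear to be met."""
    return {
        # The guide asks for Python >= 3.6.
        "python_ok": sys.version_info >= (3, 6),
        # Path to the conda executable, or None if it is not on PATH.
        "conda_path": shutil.which("conda"),
        # SPARK_HOME is an assumed convention; only relevant for PySpark.
        "spark_home": os.environ.get("SPARK_HOME"),
    }


if __name__ == "__main__":
    for name, value in check_requirements().items():
        print(f"{name}: {value}")
```

A value of `None` for `conda_path` suggests that conda is not on the `PATH`; a missing `spark_home` only matters for the PySpark environment.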

### Dependencies setup

We install the dependencies with Conda. As a prerequisite, we may want to make sure that Conda is up to date:

    conda update conda

We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use. The generated file defines an environment that uses Python 3.6 with all the correct dependencies.

To install each environment, we first generate a conda yaml file and then create the environment from it. We can specify the environment name with the `-n` option.

Click on the following menus to see more details:

<details>
<summary><strong><em>Python CPU environment</em></strong></summary>

Assuming the repo is cloned as `Recommenders` in the local system, to install the Python CPU environment:

    cd Recommenders
    ./scripts/generate_conda_file.sh
    conda env create -n reco_bare -f conda_bare.yaml 

</details>


<details>
<summary><strong><em>Python GPU environment</em></strong></summary>

Assuming that you have a GPU machine, to install the Python GPU environment, which by default also includes the CPU environment:

    cd Recommenders
    ./scripts/generate_conda_file.sh --gpu
    conda env create -n reco_gpu -f conda_gpu.yaml 

</details>

<details>
<summary><strong><em>PySpark environment</em></strong></summary>

To install the PySpark environment, which by default also includes the CPU environment:

    cd Recommenders
    ./scripts/generate_conda_file.sh --pyspark
    conda env create -n reco_pyspark -f conda_pyspark.yaml

**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.

To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:

```bash
#!/bin/sh
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```

This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:

```bash
#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
```
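
The two hook files above can also be generated with a short script. This is a minimal sketch, assuming the environment prefix is passed in explicitly; the helper name `write_pyspark_hooks` is hypothetical and not part of this repo.

```python
import os


def write_pyspark_hooks(env_prefix):
    """Create conda activate/deactivate hooks that set and unset the PySpark variables."""
    python_bin = os.path.join(env_prefix, "bin", "python")
    hooks = {
        "activate.d": (
            "#!/bin/sh\n"
            f"export PYSPARK_PYTHON={python_bin}\n"
            f"export PYSPARK_DRIVER_PYTHON={python_bin}\n"
        ),
        "deactivate.d": (
            "#!/bin/sh\n"
            "unset PYSPARK_PYTHON\n"
            "unset PYSPARK_DRIVER_PYTHON\n"
        ),
    }
    written = []
    for subdir, content in hooks.items():
        hook_dir = os.path.join(env_prefix, "etc", "conda", subdir)
        os.makedirs(hook_dir, exist_ok=True)  # create missing hook directories
        path = os.path.join(hook_dir, "env_vars.sh")
        with open(path, "w") as hook_file:
            hook_file.write(content)
        written.append(path)
    return written
```

For the example location used above, this would be called as `write_pyspark_hooks("/anaconda/envs/reco_pyspark")`.
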
</details>

<details>
<summary><strong><em>All environments</em></strong></summary>

To install all three environments:

    cd Recommenders
    ./scripts/generate_conda_file.sh  --gpu --pyspark
    conda env create -n reco_full -f conda_full.yaml

</details>


### Register the conda environment in Jupyter notebook

We can register the conda environment we created so that it appears as a kernel in Jupyter notebooks:

    conda activate my_env_name
    python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
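
To confirm that a notebook is really running on the registered kernel, a cell like the following can help. It is a minimal sketch that assumes the environment lives in a directory named after it (the default when using `-n`); the helper `running_in_env` is hypothetical.

```python
import os
import sys


def running_in_env(env_name, prefix=None):
    """Return True if the interpreter prefix ends with the given conda environment name."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix)) == env_name


# Show which interpreter the current kernel uses.
print(sys.executable)
print(running_in_env("my_env_name"))
```

If the second line prints `False`, the notebook is probably using a different kernel than the one registered above.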


### Troubleshooting for the DSVM

* We found that there could be problems if the Spark version of the machine is not the same as the one in the conda file. In that case, you will have to adapt the conda file to your machine.
* When running Spark on a single local node, it is possible to run out of disk space because temporary files are written to the user's home directory. To avoid this, we attached an additional disk to the DSVM and modified the Spark configuration by adding the following lines to the file `/dsvm/tools/spark/current/conf/spark-env.sh`:
```
SPARK_LOCAL_DIRS="/mnt"
SPARK_WORKER_DIR="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600 -Dspark.worker.cleanup.interval=300 -Dspark.storage.cleanupFilesAfterExecutorExit=true"
```
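
Before changing the Spark configuration, it may help to compare the free space in the home directory and on the additional disk. This is a minimal sketch using only the standard library; the helper name `free_gib` is hypothetical, and `/mnt` is the mount point assumed in the configuration above.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Compare the home directory with the additional disk.
    for path in (os.path.expanduser("~"), "/mnt"):
        if os.path.exists(path):
            print(f"{path}: {free_gib(path):.1f} GiB free")
```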

## Setup guide for Azure Databricks

### Requirements of Azure Databricks
* Runtime version 4.1 (Apache Spark 2.3.0, Scala 2.11)
* Python 3

### Repository installation
You can set up the repository as a library on Databricks either manually or by simply running an installation script.


<details>
<summary><strong><em>Quick install</em></strong></summary>

Prerequisite
* Install the [Azure Databricks CLI (command-line interface)](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#install-the-cli)
and set up CLI [authentication](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#set-up-authentication).

1. Start a target cluster and copy the target cluster ID. The cluster ID can be found with the following command:
    ```
    databricks clusters list
    
    <CLUSTER_ID> <CLUSTER_NAME> <STATUS>
    ...
    ```
2. If the cluster status is not *RUNNING*, start it with the command `databricks clusters start --cluster-id <CLUSTER_ID>`.
If the cluster is already running, skip this step.
3. Once the cluster status turns into *RUNNING*, use the following commands to install the repository:
    ```
    cd Recommenders
    ./scripts/databricks_install.sh <CLUSTER_ID>
    ```
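
When scripting these steps, the listing from step 1 can be parsed to pick out a running cluster. This is a minimal sketch that assumes the output is whitespace-separated in the `<CLUSTER_ID> <CLUSTER_NAME> <STATUS>` layout shown above; the function names and the sample listing are hypothetical.

```python
def parse_cluster_list(text):
    """Parse `databricks clusters list` output into (cluster_id, name, status) tuples."""
    clusters = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        # First column is the ID, last is the status; the name may contain spaces.
        clusters.append((parts[0], " ".join(parts[1:-1]), parts[-1]))
    return clusters


def running_cluster_id(text, name):
    """Return the ID of the named cluster if it is RUNNING, otherwise None."""
    for cluster_id, cluster_name, status in parse_cluster_list(text):
        if cluster_name == name and status == "RUNNING":
            return cluster_id
    return None


# Hypothetical sample output; real IDs and names will differ.
sample = """\
0000-000000-aaaa111  reco-cluster  RUNNING
0000-000000-bbbb222  other cluster  TERMINATED
"""
print(running_cluster_id(sample, "reco-cluster"))
```

The returned ID can then be passed to `./scripts/databricks_install.sh`.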

</details> 

<details>
<summary><strong><em>Manual setup</em></strong></summary>

To install the repo manually onto Databricks, follow the steps:
1. Clone the Microsoft Recommenders repo to your local computer.
2. Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the .egg suffix, so we don't use the standard .zip):
    ```
    cd Recommenders
    zip -r Recommenders.egg .
    ```
3. Once your cluster has started, go to the Databricks home workspace, then go to your user and press import.
4. In the next menu there is an option to import a library. It says: `To import a library, such as a jar or egg, click here`. Press `click here`.
5. Then, at the first drop-down menu, mark the option `Upload Python egg or PyPI`.
6. Then press on `Drop library egg here to upload` and select the file `Recommenders.egg` you just created.
7. Then press `Create library`. This will upload the zip and make it available in your workspace.
8. Finally, in the next menu, attach the library to your cluster.

</details>

To make sure it works, you can now create a new notebook and import the utilities from Databricks:
```
import reco_utils
...
```
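
If a bare import error is hard to read in a notebook, the check can be made explicit. This is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


print("reco_utils importable:", is_importable("reco_utils"))
```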

### Troubleshooting for Azure Databricks
* For the [reco_utils](reco_utils) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder. If you zip from the directory above the Recommenders folder, it won't work.
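
Whether the archive was built from the right directory can be checked before uploading it. This is a minimal sketch that treats the `.egg` file as a plain zip archive, which is how it is created in the manual setup; the helper name `has_top_level_package` is hypothetical.

```python
import zipfile


def has_top_level_package(archive_path, package="reco_utils"):
    """Return True if `package/` sits at the root of the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
    prefix = package + "/"
    for name in names:
        # Tolerate a leading "./" in entry names, in case the zip tool recorded one.
        if name.startswith("./"):
            name = name[2:]
        if name.startswith(prefix):
            return True
    return False
```

Calling `has_top_level_package("Recommenders.egg")` should return `True` for an archive zipped inside the Recommenders folder, and `False` when the entries are nested under an extra `Recommenders/` directory.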