# Setup guide
In this guide we show how to set up all the dependencies to run the notebooks of this repo in a local environment or on an [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/), and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).
<details>
<summary><strong><em>Click here to see the Table of Contents</em></strong></summary>

* [Compute environments](#compute-environments)
* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm)
  * [Setup Requirements](#setup-requirements)
  * [Dependencies setup](#dependencies-setup)
  * [Register the conda environment in Jupyter notebook](#register-the-conda-environment-in-jupyter-notebook)
  * [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
  * [Requirements of Azure Databricks](#requirements-of-azure-databricks)
  * [Repository upload](#repository-upload)
  * [Dependencies setup for Azure Databricks](#dependencies-setup-for-azure-databricks)
  * [Troubleshooting for Azure Databricks](#troubleshooting-for-azure-databricks)

</details>

## Compute environments
We support different compute environments, depending on the kind of machine.

Environments supported to run the notebooks on the DSVM:

* Python CPU
* PySpark

Environments supported to run the notebooks on Azure Databricks:

* PySpark
## Setup guide for Local or DSVM
### Setup Requirements
- [Anaconda Python 3](https://conda.io/miniconda.html)
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- Machine with Spark (optional for Python environment but mandatory for PySpark environment).
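As a quick sanity check of the requirements above, the following minimal Python sketch reports which command-line tools are missing from the `PATH`. It assumes that Conda and Spark expose the `conda` and `spark-submit` executables; the helper name `missing_tools` is hypothetical and not part of this repo.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumption: a Spark installation puts `spark-submit` on the PATH.
# Spark is only required for the PySpark environment.
for tool in missing_tools(["conda", "spark-submit"]):
    print(f"{tool} was not found on the PATH")
```

If nothing is printed, both tools were found. A missing `spark-submit` only matters for the PySpark environment.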
### Dependencies setup
We install the dependencies with Conda. As a prerequisite, we may want to make sure that Conda is up to date:

    conda update anaconda
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use.

To install an environment, we first generate a conda YAML file and then create the environment from it. We can specify the environment name with the `-n` option. Click on the following menus to see more details:
<details>
<summary><strong><em>Python CPU environment</em></strong></summary>

Assuming the repo is cloned as `Recommenders` in the local system, to install the Python CPU environment:

    cd Recommenders
    ./scripts/generate_conda_file.sh
    conda env create -n reco_bare -f conda_bare.yaml

</details>
<details>
<summary><strong><em>PySpark environment</em></strong></summary>

To install the PySpark environment, which by default installs the CPU environment:

    cd Recommenders
    ./scripts/generate_conda_file.sh --pyspark
    conda env create -n reco_pyspark -f conda_pyspark.yaml

**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda Python executable.

To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
```bash
#!/bin/sh
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```
This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:
```bash
#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
```
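To confirm that the activation script took effect, a minimal Python sketch such as the following can be run inside the activated environment. It checks that both variables resolve to the running interpreter; the helper name `check_pyspark_python` is hypothetical and not part of this repo.

```python
import os
import sys

PYSPARK_VARS = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def check_pyspark_python(environ=None, expected=None):
    """Map each PySpark variable to True if it points at the `expected` interpreter."""
    environ = os.environ if environ is None else environ
    # Resolve symlinks so that bin/python and bin/python3.x compare equal.
    expected = os.path.realpath(expected or sys.executable)
    result = {}
    for name in PYSPARK_VARS:
        value = environ.get(name)
        result[name] = value is not None and os.path.realpath(value) == expected
    return result


for name, ok in check_pyspark_python().items():
    print(f"{name}: {'OK' if ok else 'does not point to ' + sys.executable}")
```

Both variables should report `OK` after `conda activate reco_pyspark`, and neither should after deactivating the environment.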
</details>

### Register the conda environment in Jupyter notebook

We can register the conda environment we created so that it appears as a kernel in Jupyter notebooks:

    conda activate my_env_name
    python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
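To verify the registration, one option is to read back the kernel spec that the command wrote. The sketch below assumes the default per-user kernel directory on Linux (`~/.local/share/jupyter/kernels`); the location differs on macOS and Windows, and `jupyter kernelspec list` shows the actual one. The helper name `kernel_display_name` is hypothetical.

```python
import json
from pathlib import Path

# Assumption: default location of per-user kernels on Linux.
USER_KERNELS_DIR = Path.home() / ".local" / "share" / "jupyter" / "kernels"


def kernel_display_name(kernel_name, kernels_dir=USER_KERNELS_DIR):
    """Return the display name of a registered kernel, or None if it is not registered."""
    spec_file = Path(kernels_dir) / kernel_name / "kernel.json"
    if not spec_file.is_file():
        return None
    with spec_file.open() as handle:
        return json.load(handle).get("display_name")


print(kernel_display_name("my_env_name"))
```

A result of `None` means that no kernel with that name was found in the assumed directory.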
### Troubleshooting for the DSVM
* Problems can occur if the Spark version installed on the machine differs from the one specified in the conda file. In that case, adapt the conda file to match the Spark version of your machine.
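A mismatch of this kind can be spotted by comparing two version strings. The following is a minimal sketch that assumes agreement on the major and minor version is what matters, which is an assumption rather than something this guide states; the helper names are hypothetical.

```python
def major_minor(version):
    """Return (major, minor) as integers from a version string such as '2.3.1'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def spark_versions_match(machine_version, conda_version):
    """Assumption: two Spark versions are compatible if major and minor agree."""
    return major_minor(machine_version) == major_minor(conda_version)


# Example with made-up values: Spark on the machine vs. pyspark in the conda environment.
print(spark_versions_match("2.3.1", "2.3.0"))
```

The machine's version is printed by `spark-submit --version`, and the environment's by `python -c "import pyspark; print(pyspark.__version__)"`.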
## Setup guide for Azure Databricks
### Requirements of Azure Databricks
* Runtime version 4.3 (Apache Spark 2.3.1, Scala 2.11)
* Python 3
### Repository upload
We need to zip the repository and upload it to Databricks. The steps are the following:

* Clone the Microsoft Recommenders repo to your local computer.
* Zip the content inside the root folder:
  ```
  cd Recommenders
  zip -r Recommenders.zip .
  ```
* Once your cluster has started, go to the Databricks home workspace, then go to your user and press import.
* In the next menu there is an option to import a library. It says: `To import a library, such as a jar or egg, click here`. Press `click here`.
* Then, in the first drop-down menu, select the option `Upload Python egg or PyPI`.
* Then press `Drop library egg here to upload` and select the file `Recommenders.zip` you just created.
* Then press `Create library`. This will upload the zip and make it available in your workspace.
* Finally, in the next menu, attach the library to your cluster.

To make sure it works, you can now create a new notebook and import the utilities:
```
import reco_utils
```
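If the import fails, it can help to check whether Python can locate the package at all before trying to import it. The sketch below uses the standard `importlib` machinery; the helper name `find_module_paths` is hypothetical.

```python
import importlib.util


def find_module_paths(name):
    """Return where a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations is not None:
        return list(spec.submodule_search_locations)  # the module is a package
    return [spec.origin]  # the module is a single file


print(find_module_paths("reco_utils"))
```

A result of `None` suggests that the library is not attached to the cluster or that the zip was built from the wrong folder.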
### Troubleshooting for Azure Databricks
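* For the [utilities](reco_utils) to work on Databricks, it is important to zip the content correctly. The zip has to be created from inside the root folder. If you zip the root folder itself, the archive contains an extra top-level `Recommenders/` directory, `reco_utils` is no longer at the archive root, and the import won't work.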
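Before uploading, the layout of the archive can be checked locally. This is a minimal sketch using Python's `zipfile` module; the helper name `package_at_archive_root` is hypothetical and not part of this repo.

```python
import zipfile


def package_at_archive_root(zip_path, package="reco_utils"):
    """Return True if `package/` sits at the top level of the zip archive."""
    prefix = package + "/"
    with zipfile.ZipFile(zip_path) as archive:
        return any(name.startswith(prefix) for name in archive.namelist())


# Usage, assuming Recommenders.zip is in the current directory:
#     package_at_archive_root("Recommenders.zip")
```

A result of `False` indicates that the archive was probably created from the parent folder, so that its entries start with `Recommenders/`.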