In this guide we show how to set up all the dependencies needed to run the notebooks in this repo on a local Linux system or a Linux [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use. The environment created from this file uses Python 3.6 with all the correct dependencies.
To install each environment, first we need to generate a conda yaml file and then install the environment. We can specify the environment name with the input `-n`.
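The two steps can be sketched as follows. This is a minimal sketch, not the definitive procedure: it assumes we run it from the repository root, and the yaml file name `conda_pyspark.yaml` is a hypothetical placeholder that should be replaced with the file the script actually writes. Any script option that selects the environment type is not shown here.

```shell
# Assumptions: run from the repository root; CONDA_FILE is a hypothetical
# name and must match the yaml file that the script actually writes.
ENV_NAME="reco_pyspark"
CONDA_FILE="conda_pyspark.yaml"

# Step 1: generate the conda yaml file (add the script option for the
# environment type you want; none is shown in this sketch).
if [ -x scripts/generate_conda_file.sh ]; then
    ./scripts/generate_conda_file.sh
fi

# Step 2: create the environment from the generated file, naming it with -n.
if command -v conda >/dev/null 2>&1 && [ -f "$CONDA_FILE" ]; then
    conda env create -n "$ENV_NAME" -f "$CONDA_FILE"
fi
```

The guards only keep the sketch from failing when the script, conda, or the yaml file is missing; on a prepared machine the two commands can be run directly.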
**NOTE** - for the PySpark environment (`reco_pyspark`), we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
To set these variables every time the environment is activated, we can follow the steps in this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and export both variables in it.
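A minimal sketch of the activation file, assuming the environment's interpreter sits at `bin/python` inside the environment folder (the usual conda layout on Linux):

```shell
#!/bin/sh
# Contents of /anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh
# Assumption: the interpreter is at bin/python inside the environment folder.
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```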
Conda runs this file on activation, so the variables are exported every time we run `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and unset both variables in it.
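A minimal sketch of the deactivation file, which simply reverses the two exports:

```shell
#!/bin/sh
# Contents of /anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh
# Remove the variables so they do not leak into the shell after deactivation.
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
```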
* Problems can occur if the Spark version installed on the machine differs from the one in the conda file. In that case, adapt the conda file to match your machine.
* When running Spark on a single local node, it is possible to run out of disk space because temporary files are written to the user's home directory. To avoid this, we attached an additional disk to the DSVM and modified the Spark configuration in the file at `/dsvm/tools/spark/current/conf/spark-env.sh` so that Spark writes its temporary files to that disk.
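As an illustration of the kind of lines that go into `spark-env.sh`, the sketch below redirects Spark's scratch and worker directories and enables periodic cleanup. It assumes the additional disk is mounted at `/mnt`, and the cleanup intervals (in seconds) are illustrative values rather than recommendations; the worker cleanup options apply to Spark's standalone mode.

```shell
# Assumption: the additional disk is mounted at /mnt; adjust to your mount point.
SPARK_LOCAL_DIRS="/mnt"    # scratch space for shuffle and spill files
SPARK_WORKER_DIR="/mnt"    # working directory for applications run by the worker

# Periodically clean up old application directories (values are illustrative).
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600 -Dspark.worker.cleanup.interval=300 -Dspark.storage.cleanupFilesAfterExecutorExit=true"
```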
To install the repo manually onto Databricks, follow these steps:
1. Clone the Microsoft Recommenders repo to your local computer.
2. Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the .egg suffix, so we don't use the standard .zip):
```
cd Recommenders
zip -r Recommenders.egg .
```
3. Once your cluster has started, go to the Databricks home workspace, then go to your user and press import.
4. In the next menu there is an option to import a library. It says: `To import a library, such as a jar or egg, click here`. Press `click here`.
5. Then, in the first drop-down menu, select the option `Upload Python egg or PyPI`.
6. Then press `Drop library egg here to upload` and select the file `Recommenders.egg` you just created.
7. Then press `Create library`. This will upload the egg and make it available in your workspace.
8. Finally, in the next menu, attach the library to your cluster.
</details>
To make sure it works, you can now create a new notebook on Databricks and import the utilities, for example with `import reco_utils`.
* For the [reco_utils](reco_utils) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder; if you zip from the directory above the Recommenders folder, the import won't work.
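One way to check the layout before uploading is to list the archive and confirm that `reco_utils/` appears at the top level rather than under `Recommenders/`. The sketch below assumes `Recommenders.egg` is in the current directory and that Info-ZIP's `unzip` is installed; the helper name `has_top_level_reco_utils` is hypothetical.

```shell
# Succeeds when the entry names read from stdin include a top-level reco_utils/.
has_top_level_reco_utils() {
    grep -q '^reco_utils/'
}

EGG="Recommenders.egg"   # assumption: the egg is in the current directory
if [ -f "$EGG" ]; then
    # unzip -Z1 prints the archive's entry names, one per line.
    if unzip -Z1 "$EGG" | has_top_level_reco_utils; then
        echo "OK: reco_utils/ is at the top level of $EGG"
    else
        echo "Problem: reco_utils/ is not at the top level of $EGG"
    fi
fi
```

An archive zipped from the directory above would list entries such as `Recommenders/reco_utils/...`, which this check rejects.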