In this guide we show how to set up all the dependencies needed to run the notebooks in this repo on an [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).
To install each environment, we first generate a conda yml file and then create the environment from it. We can specify the environment name with the input `-n`. Click on the following menus to see more details:
**NOTE** for the PySpark environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda Python executable.
To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
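A minimal sketch of that activation script, assuming the environment's interpreter sits at `/anaconda/envs/reco_pyspark/bin/python` (the usual conda layout on Linux and macOS; adjust the path if your installation differs):

```shell
#!/bin/sh
# Assumption: the conda environment lives in /anaconda/envs/reco_pyspark
# and its interpreter is at bin/python inside that prefix.
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python
```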
This will export the variables every time we run `source activate reco_pyspark`. To unset these variables when we deactivate the environment, we create the file `/anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh` and add:
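A minimal sketch of the matching deactivation script, which simply removes the two variables from the shell:

```shell
#!/bin/sh
# Remove the PySpark interpreter settings when the environment is deactivated.
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
```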
This project uses unit, smoke and integration tests with Python files and notebooks. For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). Click on the following menus to see more details:
Unit tests ensure that each class or function behaves as it should. Every time a developer makes a pull request to the staging or master branch, a battery of unit tests is executed. To manually execute the unit tests in the different environments, first **make sure you are in the correct environment**.
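One way to check the active environment before running tests is sketched below. It relies on `CONDA_DEFAULT_ENV`, which conda sets when an environment is activated; `require_env` is a hypothetical helper name, not something provided by this repo.

```shell
#!/bin/sh
# Sketch: fail with a message when the active conda environment is not
# the expected one. require_env is a hypothetical helper for illustration.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "$1" ]; then
        echo "Expected conda environment '$1', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}

# Example: confirm the PySpark environment is active before testing.
if require_env reco_pyspark; then
    echo "environment OK"
fi
```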
To run the Python unit tests for the utilities:

    pytest tests/unit -m "not notebooks and not spark and not gpu"

To run the Python unit tests for the notebooks:

    pytest tests/unit -m "notebooks and not spark and not gpu"

To run the Python GPU unit tests for the utilities:

    pytest tests/unit -m "not notebooks and not spark and gpu"

To run the Python GPU unit tests for the notebooks:

    pytest tests/unit -m "notebooks and not spark and gpu"

To run the PySpark unit tests for the utilities:

    pytest tests/unit -m "not notebooks and spark and not gpu"

To run the PySpark unit tests for the notebooks:

    pytest tests/unit -m "notebooks and spark and not gpu"
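The six commands above differ only in the marker expression passed to `-m`: each of the markers `notebooks`, `spark` and `gpu` is either selected or negated with `not`. As a sketch, the expression can be built from three yes/no switches; `marker_expr` is a hypothetical helper for illustration and is not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper: build the pytest marker expression from three
# yes/no switches. Usage: marker_expr <notebooks> <spark> <gpu>
marker_expr() {
    markers=""
    for pair in "notebooks:$1" "spark:$2" "gpu:$3"; do
        name=${pair%%:*}   # text before the colon: the marker name
        flag=${pair#*:}    # text after the colon: yes or no
        if [ "$flag" = "yes" ]; then
            term="$name"
        else
            term="not $name"
        fi
        if [ -z "$markers" ]; then
            markers="$term"
        else
            markers="$markers and $term"
        fi
    done
    printf '%s\n' "$markers"
}

# Example: print the command for the PySpark unit tests for the utilities.
echo "pytest tests/unit -m \"$(marker_expr no yes no)\""
```

Dropping the `echo` and running `pytest tests/unit -m "$(marker_expr no yes no)"` directly would execute the selected tests.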
* We found that problems can occur if the Spark version on the machine is not the same as the one in the conda file. In that case, you will have to adapt the conda file to your machine.
* For the [utilities](reco_utils) to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the root folder; if you zip the root folder directly, it won't work.
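The zipping advice above can be sketched as follows, under the assumption that "inside the root folder" means the archive should contain `reco_utils/` at its top level rather than nested under the root folder's name. `zip_contents` is a hypothetical helper, and it requires the `zip` command to be installed.

```shell
#!/bin/sh
# Hypothetical helper: archive the CONTENTS of a folder so that its
# subfolders (e.g. reco_utils/) sit at the top level of the zip.
# Usage: zip_contents <root_folder> <output.zip>
# The output file should live outside <root_folder>.
zip_contents() {
    root=$1
    out=$2
    # Make the output path absolute, because we change directory below.
    case "$out" in
        /*) ;;
        *) out="$PWD/$out" ;;
    esac
    # Run zip from inside the root folder, in a subshell so the caller's
    # working directory is left unchanged.
    (cd "$root" && zip -r -q "$out" .)
}
```

For example, `zip_contents /path/to/repo /tmp/repo.zip` (both paths are placeholders) produces an archive whose entries start with `reco_utils/`, whereas zipping the folder from its parent would prefix every entry with the folder name.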