Setup guide

In this guide we show how to set up all the dependencies needed to run the notebooks in this repo.

Three environments are supported to run the notebooks in the repo:

  • Python CPU
  • Python GPU
  • PySpark

Requirements

  • Anaconda with Python 3.6
  • The Python library dependencies can be found in this script.
  • Machine with Spark (optional for the Python environments but mandatory for the PySpark environment).
  • Machine with a GPU (optional but desirable for accelerating computation).

Conda environments

As a prerequisite, we should make sure that Conda is up to date:

conda update conda

We provide a script that generates a conda file for the environment we want to use.

To install an environment, we first generate the conda YAML file and then create the environment from it. We can set the environment name with the -n option of conda env create. Each of the following examples uses an example name.

Python CPU environment

Assuming the repo is cloned as Recommenders in the local system, to install the Python CPU environment:

cd Recommenders
./scripts/generate_conda_file.sh
conda env create -n reco_bare -f conda_bare.yaml 

Python GPU environment

Assuming that you have a machine with a GPU, to install the Python GPU environment, which also includes the CPU environment:

cd Recommenders
./scripts/generate_conda_file.sh --gpu
conda env create -n reco_gpu -f conda_gpu.yaml 

PySpark environment

To install the PySpark environment, which also includes the CPU environment:

cd Recommenders
./scripts/generate_conda_file.sh --pyspark
conda env create -n reco_pyspark -f conda_pyspark.yaml

NOTE: for this environment, we need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to point to the conda python executable.

To set these variables every time the environment is activated, we can follow the steps of this guide. Assuming that we have installed the environment in /anaconda/envs/reco_pyspark, we create the file /anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh and add:

#!/bin/sh
export PYSPARK_PYTHON=/anaconda/envs/reco_pyspark/bin/python
export PYSPARK_DRIVER_PYTHON=/anaconda/envs/reco_pyspark/bin/python

This exports the variables every time we run source activate reco_pyspark. To unset them when we deactivate the environment, we create the file /anaconda/envs/reco_pyspark/etc/conda/deactivate.d/env_vars.sh and add:

#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
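
The two hook files can also be created in one step. The following is a minimal sketch, not part of the repo: write_pyspark_hooks is a hypothetical helper name, and it assumes conda's documented hook location, etc/conda/activate.d and etc/conda/deactivate.d, under the environment prefix.

```shell
# Hypothetical helper (not provided by the repo).
# Writes conda activate/deactivate hooks for the PySpark variables.
# Usage: write_pyspark_hooks <env_prefix>
write_pyspark_hooks() {
    prefix=$1
    mkdir -p "$prefix/etc/conda/activate.d" "$prefix/etc/conda/deactivate.d"

    # Unquoted EOF: $prefix is expanded now, so the hook stores the literal path.
    cat > "$prefix/etc/conda/activate.d/env_vars.sh" <<EOF
#!/bin/sh
export PYSPARK_PYTHON=$prefix/bin/python
export PYSPARK_DRIVER_PYTHON=$prefix/bin/python
EOF

    # Quoted 'EOF': the deactivate hook has nothing to expand.
    cat > "$prefix/etc/conda/deactivate.d/env_vars.sh" <<'EOF'
#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
EOF
}
```

For the example location used above, we would call write_pyspark_hooks /anaconda/envs/reco_pyspark and then reactivate the environment so that the hooks take effect.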

All environments

To install a single environment that covers all three configurations:

cd Recommenders
./scripts/generate_conda_file.sh  --gpu --pyspark
conda env create -n reco_full -f conda_full.yaml
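
The four variants above differ only in the flags passed to the script and in the generated file name. As a summary, here is a small sketch that prints the two commands for a given variant. env_commands is a hypothetical helper, not something the repo ships, and the default names simply mirror the examples in this guide.

```shell
# Hypothetical helper (not provided by the repo).
# Prints the two setup commands for one environment variant.
# Usage: env_commands <bare|gpu|pyspark|full> [env_name]
env_commands() {
    kind=$1
    case "$kind" in
        bare)    flags="" ;;
        gpu)     flags="--gpu" ;;
        pyspark) flags="--pyspark" ;;
        full)    flags="--gpu --pyspark" ;;
        *) echo "unknown environment kind: $kind" >&2; return 1 ;;
    esac
    # Default to the example names used in this guide, e.g. reco_gpu.
    name=${2:-reco_$kind}
    # ${flags:+ $flags} appends the flags only when there are any.
    echo "./scripts/generate_conda_file.sh${flags:+ $flags}"
    echo "conda env create -n $name -f conda_$kind.yaml"
}
```

Printing the commands first makes it easy to review them. To run them, we could pipe the output to sh from the root of the repo, for example env_commands pyspark | sh.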

Register the conda environment in Jupyter notebook

We can register the conda environment we created so that it appears as a kernel in Jupyter notebooks. In the commands below, replace my_env_name with the name of the environment:

source activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"

Tests

This project uses unit, smoke, and integration tests with Python files and notebooks. For more information, see a quick introduction to unit, smoke and integration tests.

Unit tests

Unit tests ensure that each class or function behaves as it should. Every time a developer makes a pull request to the staging or master branch, a battery of unit tests is executed. To execute the unit tests manually, first make sure you have activated the environment that matches the tests you want to run.

For executing the Python unit tests for the utilities:

pytest tests/unit -m "not notebooks and not spark and not gpu"

For executing the Python unit tests for the notebooks:

pytest tests/unit -m "notebooks and not spark and not gpu"

For executing the Python GPU unit tests for the utilities:

pytest tests/unit -m "not notebooks and not spark and gpu"

For executing the Python GPU unit tests for the notebooks:

pytest tests/unit -m "notebooks and not spark and gpu"

For executing the PySpark unit tests for the utilities:

pytest tests/unit -m "not notebooks and spark and not gpu"

For executing the PySpark unit tests for the notebooks:

pytest tests/unit -m "notebooks and spark and not gpu"
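
The six commands above combine one choice of target (utilities or notebooks) with one choice of environment (Python, GPU, or PySpark). The sketch below builds the marker expression from those two choices. unit_markers is a hypothetical helper, not part of the repo, and it only reproduces the expressions listed in this section.

```shell
# Hypothetical helper (not provided by the repo).
# Prints the pytest marker expression for the unit tests.
# Usage: unit_markers <python|gpu|spark> <utilities|notebooks>
unit_markers() {
    case "$2" in
        utilities) nb="not notebooks" ;;
        notebooks) nb="notebooks" ;;
        *) echo "unknown target: $2" >&2; return 1 ;;
    esac
    case "$1" in
        python) hw="not spark and not gpu" ;;
        gpu)    hw="not spark and gpu" ;;
        spark)  hw="spark and not gpu" ;;
        *) echo "unknown environment: $1" >&2; return 1 ;;
    esac
    echo "$nb and $hw"
}
```

With the helper defined, the GPU notebook tests could be run as pytest tests/unit -m "$(unit_markers gpu notebooks)".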

Smoke tests

Smoke tests make sure that the system works and are executed just before the integration tests every night.

For executing the Python smoke tests:

pytest tests/smoke -m "smoke and not spark and not gpu"

For executing the Python GPU smoke tests:

pytest tests/smoke -m "smoke and not spark and gpu"

For executing the PySpark smoke tests:

pytest tests/smoke -m "smoke and spark and not gpu"

Integration tests

Integration tests make sure that the program results are acceptable.

For executing the Python integration tests:

pytest tests/integration -m "integration and not spark and not gpu"

For executing the Python GPU integration tests:

pytest tests/integration -m "integration and not spark and gpu"

For executing the PySpark integration tests:

pytest tests/integration -m "integration and spark and not gpu"
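
To reproduce locally the order described above, smoke tests first and integration tests second, we can print both commands for one environment. This is only a sketch: nightly_commands is a hypothetical helper, and the configuration of the actual nightly build is not shown in this guide.

```shell
# Hypothetical helper (not provided by the repo).
# Prints the smoke and integration commands for one environment, smoke first.
# Usage: nightly_commands <python|gpu|spark>
nightly_commands() {
    case "$1" in
        python) hw="not spark and not gpu" ;;
        gpu)    hw="not spark and gpu" ;;
        spark)  hw="spark and not gpu" ;;
        *) echo "unknown environment: $1" >&2; return 1 ;;
    esac
    # The suite name is both the test directory and the first marker.
    for suite in smoke integration; do
        echo "pytest tests/$suite -m \"$suite and $hw\""
    done
}
```

Piping the output to sh -e, as in nightly_commands spark | sh -e, would run the two suites in order and stop at the first one that fails.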

Troubleshooting

  • Problems can occur if the Spark version installed on the machine differs from the one specified in the conda file. In that case, adapt the conda file to match your machine.
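
A quick way to spot such a mismatch is to compare the two versions before creating the environment. The sketch below makes two assumptions that this guide does not confirm: that the generated conda file pins the version in the form pyspark==X.Y.Z (or pyspark=X.Y.Z), and that spark-submit is on the PATH. All three function names are hypothetical.

```shell
# Hypothetical helpers (not provided by the repo).

# Extract the pyspark version pinned in a conda file (first match only).
# Assumes a pin such as "pyspark==2.3.1" or "pyspark=2.3.1".
pinned_pyspark_version() {
    sed -n 's/.*pyspark=\{1,2\}\([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Extract the Spark version from the spark-submit banner, which is
# printed to stderr; the first "version" line is taken as the Spark one.
installed_spark_version() {
    spark-submit --version 2>&1 |
        sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed when both versions are identical; otherwise report the mismatch.
# Usage: check_spark_match <installed_version> <pinned_version>
check_spark_match() {
    if [ "$1" = "$2" ]; then
        return 0
    fi
    echo "Spark $1 is installed but the conda file pins pyspark $2" >&2
    return 1
}
```

For example, after generating the PySpark conda file we could run check_spark_match "$(installed_spark_version)" "$(pinned_pyspark_version conda_pyspark.yaml)" and edit the pin in the file if the check fails.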