
# Setup Guide

The repo, including this guide, is tested on Linux. Where applicable, we document differences for [Windows](#windows-specific-instructions) and [macOS](#macos-specific-instructions), although such documentation may not always be up to date.

## Extras
In addition to the pip-installable package, several extras are provided, including:
+ `[examples]`: Needed for running examples.
+ `[gpu]`: Needed for running GPU models.  
+ `[spark]`: Needed for running Spark models.
+ `[dev]`: Needed for development.
+ `[all]`: `[examples]`|`[gpu]`|`[spark]`|`[dev]`
+ `[experimental]`: Models that are not thoroughly tested and/or may require additional installation steps.
+ `[nni]`: Needed for running models integrated with [NNI](https://nni.readthedocs.io/en/stable/).
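
Extras can be combined in one install by listing them inside a single pair of square brackets, as the Spark instructions in this guide do with `recommenders[examples,spark]`. The sketch below builds such a specifier from a list of extra names. `pip_spec` is a hypothetical helper introduced here for illustration; it is not part of the package.

```bash
# Hypothetical helper: build a pip requirement specifier from extra names.
pip_spec() {
    if [ "$#" -eq 0 ]; then
        # No extras requested: plain package name
        echo "recommenders"
        return 0
    fi
    # Join all arguments with commas: "examples spark" -> "examples,spark"
    extras="$(IFS=,; echo "$*")"
    echo "recommenders[$extras]"
}

pip_spec examples spark
```

The result can be passed to pip as `pip install "$(pip_spec examples spark)"`. The double quotes also keep shells such as zsh from treating the brackets as a glob pattern.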


## Setup for Core Package

Follow the [Getting Started](./README.md#Getting-Started) section in the [README](./README.md) to install the package and run the examples.


## Setup for Spark 

```bash
# 1. Make sure JDK is installed.  For example, OpenJDK 11 can be installed using the command
# sudo apt-get install openjdk-11-jdk

# 2. Follow Steps 1-5 in the Getting Started section in README.md to install the package and Jupyter kernel, adding the spark extra to the pip install command:
pip install recommenders[examples,spark]

# 3. Within VS Code:
#   a. Open a notebook with a Spark model, e.g., examples/00_quick_start/als_movielens.ipynb;  
#   b. Select Jupyter kernel <kernel_name>;
#   c. Run the notebook.
```
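
Before running a Spark notebook, it can help to confirm that a JDK is actually visible to the shell. The following is a minimal sketch, assuming a POSIX shell; `check_java` is a hypothetical helper name.

```bash
# Hypothetical helper: report the Java version, or explain what is missing.
check_java() {
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH; install a JDK (e.g. OpenJDK 11) first"
        return 1
    fi
    # java prints its version to stderr, so redirect it to stdout
    java -version 2>&1 | head -n 1
}

check_java || echo "Spark models will not run until Java is available"
```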

## Setup for developers

If you want to contribute to Recommenders, please first read the [Contributing Guide](./CONTRIBUTING.md). You will notice that our development branch is `staging`.

To start developing, check out the latest `staging` branch locally, then install the `dev` extra along with any other extras you need. For example, to develop GPU models, run:

```bash
git checkout staging
pip install -e .[dev,gpu]
```

You can choose which extras to install. To install all of them, run:

```bash
git checkout staging
pip install -e .[all]
```

## Setup for Azure Databricks

The following instructions were tested on Azure Databricks Runtime 12.2 LTS (Apache Spark version 3.3.2) and 11.3 LTS (Apache Spark version 3.3.0).
As of April 2023, Databricks Runtime 13 is not yet supported as it is on Python 3.10.

After an Azure Databricks cluster is provisioned:
```bash
# 1. Go to the "Compute" tab on the left of the page, click on the provisioned cluster and then click on "Libraries". 
# 2. Click the "Install new" button.  
# 3. In the popup window, select "PyPI" as the library source. Enter "recommenders[examples]" as the package name. Click "Install" to install the package.
```
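
Since Databricks Runtime 13 is excluded because of Python 3.10, it may be worth checking a cluster's Python version before installing. The sketch below compares a version string (for example, the output of `python --version`) against that limit. `is_supported_python` is a hypothetical helper, and the rule it encodes (Python 3 below 3.10) is an assumption derived only from the note above, with no lower bound checked.

```bash
# Hypothetical helper: succeeds for Python 3.x versions below 3.10.
# Assumption: no lower bound is checked, since this guide only states
# that Python 3.10 (Databricks Runtime 13) is unsupported.
is_supported_python() {
    major="${1%%.*}"     # "3.9.16" -> "3"
    rest="${1#*.}"       # "3.9.16" -> "9.16"
    minor="${rest%%.*}"  # "9.16"   -> "9"
    [ "$major" -eq 3 ] && [ "$minor" -lt 10 ]
}

for version in 3.9.16 3.10.6; do
    if is_supported_python "$version"; then
        echo "$version: supported"
    else
        echo "$version: not supported"
    fi
done
```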

### Prepare Azure Databricks for Operationalization
<!-- TO DO: This is to be verified/updated 23/04/16 -->
This repository includes an end-to-end example notebook that uses Azure Databricks to estimate a recommendation model using matrix factorization with Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that [notebook](examples/05_operationalize/als_movie_o16n.ipynb), you must install the Recommenders repository as a library (as described above), **AND** you must also install some additional dependencies. With the *Quick install* method, you just need to pass an additional option to the [installation script](tools/databricks_install.py).

<details>
<summary><strong><em>Quick install</em></strong></summary>

This option utilizes the installation script to do the setup. Just run the installation script
with an additional option. If you have already run the script once to upload and install the `Recommenders.egg` library, you can also add an `--overwrite` option:

```{shell}
python tools/databricks_install.py --overwrite --prepare-o16n <CLUSTER_ID>
```

This script does all of the steps described in the *Manual setup* section below.

</details>

<details>
<summary><strong><em>Manual setup</em></strong></summary>

You must install three packages as libraries from PyPI:

* `azure-cli==2.0.56`
* `azureml-sdk[databricks]==1.0.8`
* `pydocumentdb==2.3.3`

You can follow instructions [here](https://docs.azuredatabricks.net/user-guide/libraries.html#install-a-library-on-a-cluster) for details on how to install packages from PyPI.

Additionally, you must install the [spark-cosmosdb connector](https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html) on the cluster. The easiest way to manually do that is to:


1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar) from Maven. **NOTE:** This is the appropriate jar for Spark versions `3.1.X`, and is the appropriate version for the recommended Azure Databricks runtime detailed above. See the [Databricks installation script](https://github.com/microsoft/recommenders/blob/main/tools/databricks_install.py#L45) for other Databricks runtimes.
2. Upload and install the jar by:
   1. Log in to your `Azure Databricks` workspace.
   2. Select the `Clusters` button on the left.
   3. Select the cluster on which you want to import the library.
   4. Select the `Upload` and `Jar` options, and click in the box that has the text `Drop JAR here` in it.
   5. Navigate to the downloaded `.jar` file, select it, and click `Open`.
   6. Click on `Install`.
   7. Restart the cluster.

</details>


## Setup for Experimental 
<!-- FIXME FIXME 23/04/01 move to experimental. Have not tested -->
The `xlearn` package depends on `cmake`. If you use the `xlearn`-related notebooks or scripts, make sure `cmake` is installed on the system. The easiest way to install it on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/).

## Windows-Specific Instructions

For Spark features to work, make sure Java and Spark are installed and that the corresponding environment variables, such as `JAVA_HOME`, `SPARK_HOME` and `HADOOP_HOME`, are set properly. Also make sure the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are set to the same Python executable.
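
A quick way to verify these variables is sketched below. It assumes a Bash-compatible shell on Windows such as Git Bash (in PowerShell or `cmd` the syntax differs); `check_spark_env` is a hypothetical helper name.

```bash
# Hypothetical helper: list the Spark-related variables and flag problems.
check_spark_env() {
    status=0
    for name in JAVA_HOME SPARK_HOME HADOOP_HOME PYSPARK_PYTHON PYSPARK_DRIVER_PYTHON; do
        # Look up the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        else
            echo "$name=$value"
        fi
    done
    # Both PySpark variables must point to the same interpreter
    if [ "${PYSPARK_PYTHON:-}" != "${PYSPARK_DRIVER_PYTHON:-}" ]; then
        echo "PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON differ"
        status=1
    fi
    return "$status"
}

check_spark_env || echo "Fix the variables listed above before starting Spark"
```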

## macOS-Specific Instructions

We recommend using [Homebrew](https://brew.sh/) to install the dependencies on macOS, including conda (please remember to add conda's path to `$PATH`). You may also need to install lightgbm using Homebrew before installing the package with pip.

If you use zsh, quote the package specifier, e.g. `pip install 'recommenders[<extras>]'`, because zsh otherwise interprets the square brackets as a glob pattern.

For Spark features to work, make sure Java and Spark are installed first. Also make sure the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are set to the same Python executable.
<!-- TO DO: PyTorch M1 Mac GPU support -->
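
One way to point both PySpark variables at the same interpreter is sketched below for a POSIX shell (zsh and bash both work). It assumes the interpreter you want is the first `python3` or `python` on `PATH`, for example the one from an activated conda environment.

```bash
# Resolve the interpreter once so both variables get the identical path.
PYTHON_BIN="$(command -v python3 || command -v python || true)"

if [ -n "$PYTHON_BIN" ]; then
    export PYSPARK_PYTHON="$PYTHON_BIN"
    export PYSPARK_DRIVER_PYTHON="$PYTHON_BIN"
    echo "PySpark will use: $PYTHON_BIN"
else
    echo "No python3 or python found on PATH" >&2
fi
```

Adding these lines to `~/.zshrc` or `~/.bash_profile` would make the setting persistent across sessions.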

## Test Environments

Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements.

Currently, tests are done on **Python CPU** (the base environment), **Python GPU** (corresponding to `[gpu]` extra above) and **PySpark** (corresponding to `[spark]` extra above).

Another option is to build a Docker image and use the functions inside a [Docker container](#setup-guide-for-docker).

Alternatively, you can run all the recommender utilities directly from a local copy of the source code. This requires installing all the necessary dependencies from Anaconda and PyPI. For instructions on how to do this, see [this guide](conda.md).

## Setup for Making a Release

The process of making a new release and publishing it to PyPI is as follows:

First, make sure that the tag you want to add, e.g. `0.6.0`, is set in [`recommenders.py/__init__.py`](recommenders.py/__init__.py). Follow the [contribution guideline](CONTRIBUTING.md) to add the change.

1. Make sure that the code in main passes all the tests (unit and nightly tests).
1. Create a tag with the version number: e.g. `git tag -a 0.6.0 -m "Recommenders 0.6.0"`.
1. Push the tag to the remote server: `git push origin 0.6.0`.
1. When the new tag is pushed, a release pipeline is executed. This pipeline runs all the tests again (unit, smoke and integration) and generates a wheel and a tar.gz, which are uploaded to a [GitHub draft release](https://github.com/microsoft/recommenders/releases).
1. Fill in the draft release with all the recent changes in the code.
1. Download the wheel and tar.gz locally. These files should not have any bugs, since they passed all the tests.
1. Install twine: `pip install twine`
1. Publish the wheel and tar.gz to PyPI: `twine upload recommenders*`
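
The tagging steps can be collected into a small script. The sketch below only prints the commands by default and executes them when `DRY_RUN=0`. The `run` and `release_tag` helpers and the `DRY_RUN` variable are hypothetical names introduced here; the `git` commands are the ones listed in the steps above.

```bash
# Print each command; execute it only when DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Hypothetical helper: create and push an annotated tag for a version.
release_tag() {
    version="$1"
    run git tag -a "$version" -m "Recommenders $version"
    run git push origin "$version"
}

release_tag 0.6.0
```

Printing the commands first makes it easy to confirm the version number before anything is pushed to the remote.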