
# sarplus
Pronounced sUrplus, as it's simply better, if not best!

Features
* Scalable PySpark based implementation
* Fast C++ based prediction
* Reduced memory consumption: similarity matrix cached in-memory once per worker, shared across Python executors
* Easy setup using [Spark Packages](https://spark-packages.org/package/eisber/sarplus)

# Benchmarks

| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---------|---------|-----------|---------|-------------|---------|
| 2.5M    | 35k     | 100M      | 1.3h    | Databricks, 8 workers, [Azure Standard DS3 v2](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) | |

# Usage

```python
import pandas as pd
from pysarplus import SARPlus

# Spark DataFrame with user/item/rating/optional timestamp tuples
train_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 1, 2, 3, 3],
        'item_id': [1, 2, 1, 1, 3],
        'rating':  [1, 1, 1, 1, 1],
    }))

# Spark DataFrame with user/item tuples
test_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 3],
        'item_id': [1, 3],
        'rating':  [1, 1],
    }))

model = SARPlus(spark, col_user='user_id', col_item='item_id', col_rating='rating', col_timestamp='timestamp')
model.fit(train_df, similarity_type='jaccard')

model.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()
```
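
To give a feel for what `similarity_type='jaccard'` computes, here is a minimal pure-Python sketch on the toy training data above. It assumes the usual item-item Jaccard definition (number of users who interacted with both items divided by the number who interacted with either). The helper `jaccard_item_similarity` is hypothetical and not part of pysarplus, which performs this step at scale in Spark.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_item_similarity(interactions):
    """Return {(item_a, item_b): jaccard} for every unordered item pair.

    Hypothetical helper for illustration only; not part of pysarplus.
    """
    # Collect the set of users that interacted with each item.
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    similarity = {}
    for a, b in combinations(sorted(users_by_item), 2):
        both = len(users_by_item[a] & users_by_item[b])
        either = len(users_by_item[a] | users_by_item[b])
        similarity[(a, b)] = both / either
    return similarity


# Same (user_id, item_id) pairs as train_df above.
train_pairs = [(1, 1), (1, 2), (2, 1), (3, 1), (3, 3)]
item_similarity = jaccard_item_similarity(train_pairs)
for pair, score in item_similarity.items():
    print(pair, round(score, 3))
```

Items 1 and 2 share one of three users, so their similarity is 1/3; items 2 and 3 share none, so theirs is 0.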

# Jupyter Notebook

# PySpark Shell

```bash
pip install pysarplus
pyspark --packages eisber:sarplus:0.2.1 --conf spark.sql.crossJoin.enabled=true
```

# Databricks

Set the `spark.sql.crossJoin.enabled` property so that the similarity matrix can be calculated (Clusters / `<Cluster>` / Configuration / Spark Config):

```
spark.sql.crossJoin.enabled true
```

Then install the package as a library:

1. Navigate to your workspace
2. Create library
3. Under 'Source' select 'Maven Coordinate'
4. Enter `eisber:sarplus:0.2.1`
5. Hit 'Create Library'

This will install C++, Python and Scala code on your cluster.
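
The cross-join property shows up in two syntaxes in this README: `key=value` after `--conf` for the PySpark shell, and `key value` per line in the Databricks Spark Config box. The following is a small sketch that renders both from one dictionary, which can help keep the two in sync. The helper names `to_cli_flags` and `to_databricks_config` are hypothetical and exist only for illustration.

```python
# Settings this README asks for; extend the dict with your own properties.
SPARK_CONF = {"spark.sql.crossJoin.enabled": "true"}


def to_cli_flags(conf):
    """Render settings as `--conf key=value` arguments for the pyspark shell."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def to_databricks_config(conf):
    """Render settings as `key value` lines for the Databricks Spark Config box."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


command = ["pyspark", "--packages", "eisber:sarplus:0.2.1"] + to_cli_flags(SPARK_CONF)
print(" ".join(command))
print(to_databricks_config(SPARK_CONF))
```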

# Packaging

For [Databricks](https://databricks.com/) to properly install a [C++ extension](https://docs.python.org/3/extending/building.html), one must take a detour through [PyPI](https://pypi.org/).
Use [twine](https://github.com/pypa/twine) to upload the package to [PyPI](https://pypi.org/).

```bash
cd python

python setup.py sdist

twine upload dist/pysarplus-*.tar.gz
```

On [Spark](https://spark.apache.org/) one can install all 3 components (C++, Python, Scala) in one pass by creating a [Spark Package](https://spark-packages.org/). Documentation is rather sparse. Steps to install:

1. Package and publish the [pip package](python/setup.py) (see above)
2. Package the [Spark package](scala/build.sbt), which includes the [Scala formatter](scala/src/main/scala/eisber/sarplus) and references the [pip package](scala/python/requirements.txt) (see below)
3. Upload the zipped Scala package to [Spark Packages](https://spark-packages.org/) through a browser. [sbt spPublish](https://github.com/databricks/sbt-spark-package) has a few [issues](https://github.com/databricks/sbt-spark-package/issues/31), so it always fails for me. Don't use `spPublishLocal`: the packages are not created properly (names don't match up, see this [issue](https://github.com/databricks/sbt-spark-package/issues/17)) and fail to install if published to [Spark-Packages.org](https://spark-packages.org/).

```bash
cd scala
sbt spPublish
```

# Testing

To test the Python UDF + C++ backend:

```bash
cd python 
python setup.py install && pytest -s tests/
```

To test the Scala formatter:

```bash
cd scala
sbt test
```

Use `sbt ~test` to re-run the tests automatically whenever a source file changes (changes to `build.sbt` are not picked up).