# sarplus (preview)
pronounced sUrplus as it's simply better if not best!
[![Build Status](https://dev.azure.com/marcozo-sarplus/sarplus/_apis/build/status/eisber.sarplus)](https://dev.azure.com/marcozo-sarplus/sarplus/_build/latest?definitionId=1)
[![PyPI version](https://badge.fury.io/py/pysarplus.svg)](https://badge.fury.io/py/pysarplus)

# Features

* Scalable PySpark based [implementation](python/pysarplus/SARPlus.py)
* Fast C++ based [predictions](python/src/pysarplus.cpp)
* Reduced memory consumption: the similarity matrix is cached in memory once per worker and shared across Python executors
* Easy setup using [Spark Packages](https://spark-packages.org/package/eisber/sarplus)
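The shared-cache idea can be illustrated with plain `mmap` (a minimal sketch under an assumed flat float32 layout; this is not the actual sarplus cache format):

```python
import mmap
import os
import tempfile

import numpy as np

# toy item-item similarity matrix, written to disk once per worker node
sim = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.0]], dtype=np.float32)
cache_path = os.path.join(tempfile.gettempdir(), "sim_cache.bin")
with open(cache_path, "wb") as f:
    f.write(sim.tobytes())

# each executor maps the same file read-only; the OS page cache ensures
# all processes on the node share one physical copy of the data
with open(cache_path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
shared = np.frombuffer(mm, dtype=np.float32).reshape(3, 3)
print(shared[0, 1])  # -> 0.5
```

Because the mapping is read-only, every Python executor on a node that maps the file reuses the same physical pages instead of holding its own copy.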
# Benchmarks
| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---------|---------|-----------|---------|-------------|---------|
| 2.5M | 35k | 100M | 1.3h | Databricks, 8 workers, [Azure Standard DS3 v2](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) (4-core machines) | |
# Top-K Recommendation Optimization
There are a couple of key optimizations:
* map item ids (e.g. strings) to a contiguous range of indices to optimize storage and simplify access
* convert the similarity matrix to exactly the representation the C++ component needs, enabling simple shared memory mapping of the cache file and avoiding parsing. This requires a custom formatter, written in Scala
* shared read-only memory mapping allows us to re-use the same memory from multiple Python executors on the same worker node
* partition the input test users and past seen items by users, allowing for scale out
* perform as much of the work as possible in PySpark (much simpler)
* top-k computation
  * reverse the join: rather than scoring every item, join each user's past seen items with their related items and sum the similarities
  * always keep just the top-k items in memory
  * use a standard join with binary search between a user's past seen items and the related items
![Image of sarplus top-k recommendation optimization](images/sarplus_udf.svg)
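The scoring steps above can be sketched in plain Python (an illustrative toy only; `topk_recommend` is a hypothetical helper, and sarplus actually performs this in C++ against the memory-mapped similarity cache):

```python
import heapq

import numpy as np

def topk_recommend(seen_items, sim, k=3):
    """Score a user's unseen items by summing similarities from the
    user's past seen items (contiguous indices), keeping only top-k."""
    scores = {}
    for i in seen_items:                 # join seen items with related items
        for j, s in enumerate(sim[i]):
            if j not in seen_items and s > 0:
                scores[j] = scores.get(j, 0.0) + s
    # keep only the k best-scoring candidates
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

sim = np.array([[1.0, 0.5, 0.3, 0.0],
                [0.5, 1.0, 0.0, 0.4],
                [0.3, 0.0, 1.0, 0.1],
                [0.0, 0.4, 0.1, 1.0]], dtype=np.float32)
print(topk_recommend({0, 1}, sim, k=2))  # item 3 scores 0.4, item 2 scores 0.3
```

In sarplus the equivalent work is partitioned by user across Spark executors, so each partition only needs that user's seen items plus the shared similarity cache.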
# Usage
```python
import pandas as pd
from pysarplus import SARPlus

# spark dataframe with user/item/rating (and optional timestamp) tuples
train_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 1, 2, 3, 3],
        'item_id': [1, 2, 1, 1, 3],
        'rating': [1, 1, 1, 1, 1],
    }))

# spark dataframe with user/item tuples
test_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 3],
        'item_id': [1, 3],
        'rating': [1, 1],
    }))

model = SARPlus(spark, col_user='user_id', col_item='item_id', col_rating='rating', col_timestamp='timestamp')
model.fit(train_df, similarity_type='jaccard')

model.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()

# on Databricks, point the cache at mounted storage:
# model.recommend_k_items(test_df, 'dbfs:/mnt/sarpluscache', top_k=3).show()
```
## Jupyter Notebook
Insert this cell prior to the code above.
```python
import os

SUBMIT_ARGS = "--packages eisber:sarplus:0.2.2 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("sample")
    .master("local[*]")
    .config("spark.driver.memory", "1g")
    .config("spark.sql.shuffle.partitions", "1")
    .config("spark.sql.crossJoin.enabled", "true")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
)
```
## PySpark Shell
```bash
pip install pysarplus
pyspark --packages eisber:sarplus:0.2.2 --conf spark.sql.crossJoin.enabled=true
```
## Databricks
You must set the crossJoin property to enable calculation of the similarity matrix (Clusters / &lt;your cluster&gt; / Configuration / Spark Config):
```
spark.sql.crossJoin.enabled true
```
1. Navigate to your workspace
2. Create library
3. Under 'Source' select 'Maven Coordinate'
4. Enter 'eisber:sarplus:0.2.4'
5. Hit 'Create Library'
6. Attach to your cluster
7. Create 2nd library
8. Under 'Source' select 'Upload Python Egg or PyPI'
9. Enter 'pysarplus'
10. Hit 'Create Library'
This will install C++, Python and Scala code on your cluster.
You'll also have to mount shared storage:

1. Create an [Azure Storage account](https://ms.portal.azure.com/#create/Microsoft.StorageAccount-ARM) (e.g. &lt;accountname&gt;)
2. Create a container (e.g. sarpluscache)
3. Navigate to User / User Settings
4. Generate a new token: enter 'sarplus'
5. Install and use the Databricks CLI
6. Run `databricks configure --token`
   * Host: e.g. https://westus.azuredatabricks.net
7. Run `databricks secrets create-scope --scope all --initial-manage-principal users`
8. Run `databricks secrets put --scope all --key sarpluscache`
   * enter the Azure Storage Blob key of the storage account created before
9. Run the mount code
```python
dbutils.fs.mount(
    source = "wasbs://sarpluscache@<accountname>.blob.core.windows.net",
    mount_point = "/mnt/sarpluscache",
    extra_configs = {"fs.azure.account.key.<accountname>.blob.core.windows.net": dbutils.secrets.get(scope = "all", key = "sarpluscache")})
```
# Packaging
For [Databricks](https://databricks.com/) to properly install a [C++ extension](https://docs.python.org/3/extending/building.html), one must take a detour through [PyPI](https://pypi.org/).
Use [twine](https://github.com/pypa/twine) to upload the package to PyPI:
```bash
cd python
python setup.py sdist
twine upload dist/pysarplus-*.tar.gz
```
On [Spark](https://spark.apache.org/) one can install all 3 components (C++, Python, Scala) in one pass by creating a [Spark Package](https://spark-packages.org/). Documentation is rather sparse. Steps to install:
1. Package and publish the [pip package](python/setup.py) (see above)
2. Package the [Spark package](scala/build.sbt), which includes the [Scala formatter](scala/src/main/scala/eisber/sarplus) and references the [pip package](scala/python/requirements.txt) (see below)
3. Upload the zipped Scala package to [Spark Package](https://spark-packages.org/) through a browser. [sbt spPublish](https://github.com/databricks/sbt-spark-package) has a few [issues](https://github.com/databricks/sbt-spark-package/issues/31) so it always fails for me. Don't use spPublishLocal as the packages are not created properly (names don't match up, [issue](https://github.com/databricks/sbt-spark-package/issues/17)) and furthermore fail to install if published to [Spark-Packages.org](https://spark-packages.org/).
```bash
cd scala
sbt spPublish
```
# Testing
To test the Python UDF + C++ backend:

```bash
cd python
python setup.py install && pytest -s tests/
```
To test the Scala formatter:

```bash
cd scala
sbt test
```
(use `sbt ~test` to automatically re-run tests when source files change; changes to build.sbt are not picked up)