Mirror of https://github.com/microsoft/sarplus.git
sarplus
pronounced sUrplus as it's simply better if not best!
Features
- Scalable PySpark based implementation
- Fast C++ based prediction:
- Reduced memory consumption: similarity matrix cached in-memory once per worker, shared across Python executors
- Easy setup using Spark Packages
Benchmarks
| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---|---|---|---|---|---|
| 2.5mio | 35k | 100mio | 1.3h | Databricks, 8 workers, Azure Standard DS3 v2 | |
Usage
import pandas as pd
from pysarplus import SARPlus

# spark dataframe with user/item/rating/optional timestamp tuples
train_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 1, 2, 3, 3],
        'item_id': [1, 2, 1, 1, 3],
        'rating': [1, 1, 1, 1, 1],
    }))

# spark dataframe with user/item tuples
test_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 3],
        'item_id': [1, 3],
        'rating': [1, 1],
    }))

model = SARPlus(spark, col_user='user_id', col_item='item_id', col_rating='rating', col_timestamp='timestamp')
model.fit(train_df, similarity_type='jaccard')

model.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()
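recommend_k_items returns a Spark dataframe with the top-k items per user. For small result sets it can be pulled back to the driver for inspection; a minimal sketch, reusing the model, test_df and cache path from the example above:

```python
# Collect the top-k recommendations locally; suitable for small result
# sets only, since toPandas() materializes everything on the driver.
top_k = model.recommend_k_items(test_df, 'sarplus_cache', top_k=3)
print(top_k.toPandas())
```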
Jupyter Notebook
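To use sarplus from a Jupyter notebook, install the Python package with pip install pysarplus and create a SparkSession that pulls in the Spark package and enables cross joins. The sketch below is one way to do this; the app name and master setting are placeholders to adjust for your environment.

```python
import pyspark

# Build a SparkSession that loads the sarplus Spark package and enables
# the cross join needed to compute the similarity matrix.
spark = (
    pyspark.sql.SparkSession.builder
    .appName("sarplus-sample")
    .master("local[*]")
    .config("spark.jars.packages", "eisber:sarplus:0.2.1")
    .config("spark.sql.crossJoin.enabled", "true")
    .getOrCreate()
)
```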
PySpark Shell
pip install pysarplus
pyspark --packages eisber:sarplus:0.2.1 --conf spark.sql.crossJoin.enabled=true
Databricks
One must set the crossJoin property to enable calculation of the similarity matrix (Clusters / your cluster / Configuration / Spark Config):
spark.sql.crossJoin.enabled true
- Navigate to your workspace
- Create library
- Under 'Source' select 'Maven Coordinate'
- Enter eisber:sarplus:0.2.1
- Hit 'Create Library'
This will install C++, Python and Scala code on your cluster.
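To check that the library is actually usable, run a quick import smoke test in a notebook attached to the cluster; a minimal sketch (SARModel is assumed to be exported by the package alongside SARPlus):

```python
# Run in a notebook attached to the cluster after the library install
# completes; both imports should succeed without errors.
from pysarplus import SARPlus, SARModel
print("pysarplus is importable on this cluster")
```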
Packaging
For Databricks to properly install a C++ extension, one must take a detour through PyPI. Use twine to upload the package to PyPI.
cd python
python setup.py sdist
twine upload dist/pysarplus-*.tar.gz
On Spark, one can install all three components (C++, Python, Scala) in one pass by creating a Spark Package. Documentation is rather sparse. Steps to install:
- Package and publish the pip package (see above)
- Package the Spark package, which includes the Scala formatter and references the pip package (see below)
- Upload the zipped Scala package to Spark-Packages.org through a browser. sbt spPublish has a few issues, so it always fails for me. Don't use spPublishLocal, as the packages are not created properly (the names don't match up; see the upstream issue) and they furthermore fail to install if published to Spark-Packages.org.
cd scala
sbt spPublish
Testing
To test the Python UDF + C++ backend:
cd python
python setup.py install && pytest -s tests/
To test the Scala formatter:
cd scala
sbt test
(Use ~test and it will automatically re-run when source files change, but not when build.sbt changes.)