# sarplus
pronounced sUrplus as it's simply better if not best!
# Features
* Scalable PySpark based implementation
* Fast C++ based prediction:
  * Reduced memory consumption: similarity matrix cached in-memory once per worker, shared across Python executors
* Easy setup using [Spark Packages](https://spark-packages.org/package/eisber/sarplus)
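To give a feel for the end-to-end flow, here is a hypothetical usage sketch. The class and method names (`SARPlus`, `fit`, `recommend_k_items`) and their parameters are illustrative assumptions rather than an API reference, and a configured SparkSession `spark` (see the setup sections below) is assumed.

```python
# Illustrative sketch only: pysarplus class/method names below are assumptions.
from pysarplus import SARPlus  # hypothetical import

# toy interaction data: (user, item, rating)
train = spark.createDataFrame(
    [("u1", "i1", 1.0), ("u1", "i2", 1.0), ("u2", "i2", 1.0)],
    ["user_id", "item_id", "rating"],
)

model = SARPlus(spark, col_user="user_id", col_item="item_id", col_rating="rating")
model.fit(train)                                # PySpark-based similarity computation
model.recommend_k_items(train, top_k=2).show()  # fast C++-backed prediction
```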
# Benchmarks
| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---------|---------|-----------|---------|-------------|---------|
| 2.5 million | 35,000 | 100 million | 1.3 hours | Databricks, 8 workers, [Azure Standard DS3 v2](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) | |
# Jupyter Notebook Setup
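A minimal sketch for a local Jupyter/PySpark session, assuming `pysarplus` has been installed with pip (see Packaging) and that the `eisber:sarplus:0.2.1` coordinate from the Databricks section is resolvable as a Spark package:

```python
# Sketch only: the package coordinate and how it is resolved are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sarplus")
    # Scala/C++ side, published as a Spark package (see Databricks Setup)
    .config("spark.jars.packages", "eisber:sarplus:0.2.1")
    # required for the similarity matrix calculation (see Spark Setup)
    .config("spark.sql.crossJoin.enabled", "true")
    .getOrCreate()
)
```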
# Spark Setup
One must set the crossJoin property to enable calculation of the similarity matrix.
```
spark.sql.crossJoin.enabled true
```
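If the cluster-wide setting above is not available, the same flag can also be toggled at runtime from an existing session, for example:

```python
# equivalent runtime setting on an existing SparkSession named `spark`
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```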
# Databricks Setup
One must set the crossJoin property to enable calculation of the similarity matrix.
```
spark.sql.crossJoin.enabled true
```
1. Navigate to your workspace
2. Create library
3. Under 'Source' select 'Maven Coordinate'
4. Enter `eisber:sarplus:0.2.1`
5. Hit 'Create Library'
This will install C++, Python and Scala code on your cluster.
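To verify that the Python part reached the cluster, a notebook cell attached to the cluster can simply import the package:

```python
# should import without error once the library above is installed and attached
import pysarplus
```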
# Packaging
For [databricks](https://databricks.com/) to properly install a [C++ extension](https://docs.python.org/3/extending/building.html), one must take a detour through [pypi](https://pypi.org/).
Use [twine](https://github.com/pypa/twine) to upload the package to [pypi](https://pypi.org/).
```bash
cd python
python setup.py sdist
twine upload dist/pysarplus-*.tar.gz
```
On [Spark](https://spark.apache.org/) one can install all 3 components (C++, Python, Scala) in one pass by creating a [Spark Package](https://spark-packages.org/). Documentation is rather sparse. Steps to install:
1. Package and publish the [pip package](python/setup.py) (see above)
2. Package the [Spark package](scala/build.sbt), which includes the [Scala formatter](scala/src/main/scala/eisber/sarplus) and references the [pip package](scala/python/requirements.txt) (see below)
3. Upload the zipped Scala package to [Spark Package](https://spark-packages.org/) through a browser. [sbt spPublish](https://github.com/databricks/sbt-spark-package) has a few [issues](https://github.com/databricks/sbt-spark-package/issues/31) so it always fails for me. Don't use spPublishLocal as the packages are not created properly (names don't match up, [issue](https://github.com/databricks/sbt-spark-package/issues/17)) and furthermore fail to install if published to [Spark-Packages.org](https://spark-packages.org/).
```bash
cd scala
sbt spDist  # builds the distribution zip to upload manually
```
# Testing
To test the Python UDF + C++ backend
```bash
cd python
python setup.py install && pytest -s tests/
```
To test the Scala formatter
```bash
cd scala
sbt test
```
(Use `sbt ~test` to automatically re-run the tests when source files change; note that changes to `build.sbt` are not picked up.)