Build Type | Branch | Status | Branch | Status |
---|---|---|---|---|
Linux CPU | master | | staging | |
Linux Spark | master | | staging | |
NOTE: the tests are executed every night; we use pytest for testing Python utilities and papermill for testing notebooks.
# Recommenders
This repository provides examples and best practices for building recommendation systems, offered as Jupyter notebooks. The examples detail our learnings on four key tasks:
- Preparing and loading data for each recommender algorithm.
- Building recommender models using different algorithms such as Smart Adaptive Recommendation (SAR) and Alternating Least Squares (ALS).
- Evaluating algorithms with offline metrics.
- Operationalizing models in a production environment.

The examples work across Python + CPU and PySpark environments, and contain guidance on which algorithm to run in which environment based on scale and other requirements.
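To give a feel for the second task, here is a minimal numpy sketch of SAR's core scoring idea, assuming the Jaccard item-similarity option and toy interaction data; this is an illustration only, not the repository's implementation:

```python
import numpy as np

# Toy user-item interaction matrix (1 = user interacted with item).
# Rows are users, columns are items; the data is illustrative only.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Item-item co-occurrence: how often two items appear in the same user's history.
cooccurrence = interactions.T @ interactions

# Jaccard similarity between items, one of the similarity measures SAR supports.
item_counts = np.diag(cooccurrence)
union = item_counts[:, None] + item_counts[None, :] - cooccurrence
similarity = np.where(union > 0, cooccurrence / union, 0.0)

# Recommendation scores: each user's affinity vector times the similarity matrix.
scores = interactions @ similarity

# Mask items the user has already seen, then rank the rest by score.
scores[interactions > 0] = -np.inf
recommendations = np.argsort(-scores, axis=1)
```

In this toy example, the first user's top recommendation is the item most co-consumed with their history; the real SAR implementation additionally applies time-decayed affinities and other refinements.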
Several utilities are provided in reco_utils to help accelerate experimenting with and building recommendation systems. These utility functions are used to load datasets (e.g., via Pandas DataFrames in Python and via Spark DataFrames in PySpark) in the manner expected by different algorithms, evaluate different model outputs, split training data, and perform other common tasks. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
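The expected tabular form is simple; as a hedged illustration (the column names below are illustrative, not the exact reco_utils schema), MovieLens-style ratings can be loaded into a Pandas DataFrame like this:

```python
import io

import pandas as pd

# MovieLens 100k stores ratings as tab-separated userID, itemID, rating, timestamp.
# A small inline sample stands in for the real data file here.
raw = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = pd.read_csv(
    raw,
    sep="\t",
    names=["userID", "itemID", "rating", "timestamp"],
)

print(ratings.head())
```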
## Environment Setup
- Please see the setup guide.
## Notebooks Overview
- The Quick-Start Notebooks detail how you can quickly get up and running with state-of-the-art algorithms such as SAR.
- The Data Notebooks detail how to prepare and split data properly for recommendation systems.
- The Modeling Notebooks deep dive into implementations of different recommender algorithms.
- The Evaluate Notebooks discuss how to evaluate recommender algorithms for different ranking and rating metrics.
- The Operationalize Notebooks discuss how to deploy models in production systems.
Notebook | Description |
---|---|
als_pyspark_movielens | Utilizing the ALS algorithm to power movie ratings in a PySpark environment. |
sar_python_cpu_movielens | Utilizing the SAR algorithm to power movie ratings in a Python + CPU environment. |
sar_pyspark_movielens | Utilizing the SAR algorithm to power movie ratings in a PySpark environment. |
sarplus_movielens | Utilizing the SAR+ algorithm to power movie ratings in a PySpark environment. |
data_split | Details on splitting data (randomly, chronologically, etc). |
als_deep_dive | Deep dive on the ALS algorithm and implementation. |
sar_deep_dive | Deep dive on the SAR algorithm and implementation. |
evaluation | Examples of different rating and ranking metrics in Python + CPU and PySpark environments. |
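The splitting strategies covered in data_split can be sketched in plain pandas; as an example, a chronological split keeps each user's earliest interactions for training and holds out the most recent ones for testing. The function and column names below are illustrative, not the reco_utils API:

```python
import pandas as pd

def chrono_split(df, ratio=0.75, col_user="userID", col_time="timestamp"):
    """Split each user's interactions chronologically: the earliest
    `ratio` fraction goes to train, the remainder to test."""
    df = df.sort_values([col_user, col_time])
    # Rank each row within its user's history, normalized to (0, 1].
    rank = df.groupby(col_user).cumcount() + 1
    size = df.groupby(col_user)[col_time].transform("size")
    train_mask = (rank / size) <= ratio
    return df[train_mask], df[~train_mask]

# Toy ratings: user 1 has four interactions, user 2 has two.
ratings = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 2, 2],
    "itemID":    [10, 11, 12, 13, 10, 12],
    "rating":    [4, 3, 5, 2, 1, 5],
    "timestamp": [100, 200, 300, 400, 150, 250],
})

train, test = chrono_split(ratings, ratio=0.75)
```

Splitting per user (rather than globally by time) guarantees every user appears in the training set, which matters for algorithms that cannot score unseen users.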
## Benchmarks
Here we benchmark all the algorithms available in this repository.
NOTES:
- Training and testing times are measured in seconds.
- Ranking metrics (i.e., precision, recall, MAP, and NDCG) are evaluated with k equal to 10.
- Rating metrics (i.e., RMSE, MAE, explained variance, and R squared) are not applied to the SAR-family algorithms (SAR PySpark, SAR+, and SAR CPU), because these algorithms do not predict explicit ratings on the same scale as those in the original input data.
- The machine used is an Azure DSVM Standard NC6s_v2 with 6 vCPUs, 112 GB of memory, and 1 K80 GPU.
Dataset | Algorithm | Training time (s) | Testing time (s) | Precision | Recall | MAP | NDCG | RMSE | MAE | Exp Var | R squared |
---|---|---|---|---|---|---|---|---|---|---|---|
Movielens 100k | ALS | 5.730 | 0.326 | 0.096 | 0.079 | 0.026 | 0.100 | 1.110 | 0.860 | 0.025 | 0.023 |
| SAR PySpark | 0.838 | 9.560 | 0.327 | 0.179 | 0.110 | 0.379 | | | | |
| SAR+ | 7.660 | 16.700 | 0.327 | 0.176 | 0.106 | 0.373 | | | | |
| SAR CPU | 0.679 | 0.116 | 0.327 | 0.176 | 0.106 | 0.373 | | | | |
Movielens 1M | ALS | 18.000 | 0.339 | 0.120 | 0.062 | 0.022 | 0.119 | 0.950 | 0.735 | 0.280 | 0.280 |
| SAR PySpark | 9.230 | 38.300 | 0.278 | 0.108 | 0.064 | 0.309 | | | | |
| SAR+ | 38.000 | 108.000 | 0.278 | 0.108 | 0.064 | 0.309 | | | | |
| SAR CPU | 5.830 | 0.586 | 0.277 | 0.109 | 0.064 | 0.308 | | | | |
Movielens 10M | ALS | 92.000 | 0.169 | 0.090 | 0.057 | 0.015 | 0.084 | 0.850 | 0.647 | 0.359 | 0.359 |
| SAR PySpark | | | | | | | | | | |
| SAR+ | 170.000 | 80.000 | 0.256 | 0.129 | 0.081 | 0.295 | | | | |
| SAR CPU | 111.000 | 12.600 | 0.276 | 0.156 | 0.101 | 0.321 | | | | |
Movielens 20M | ALS | 142.000 | 0.345 | 0.081 | 0.052 | 0.014 | 0.076 | 0.830 | 0.633 | 0.372 | 0.371 |
| SAR PySpark | | | | | | | | | | |
| SAR+ | 400.000 | 221.000 | 0.203 | 0.071 | 0.041 | 0.226 | | | | |
| SAR CPU | 559.000 | 47.300 | 0.247 | 0.135 | 0.085 | 0.287 | | | | |
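For reference, the precision and recall at k reported above can be computed per user as follows; this is a minimal sketch, not the repository's evaluation code:

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of recommended item ids.
    relevant: set of held-out items the user actually interacted with.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the top-10 recommendations are among the user's 5 held-out items.
p, r = precision_recall_at_k(list(range(10)), {0, 3, 7, 20, 30}, k=10)
# p == 0.3, r == 0.6
```

The benchmark numbers average such per-user scores over all test users; MAP and NDCG are computed analogously, but additionally weight hits by their rank position.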
## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.