Best Practices on Recommendation Systems

| Build Type | Branch | Status | Branch | Status |
| --- | --- | --- | --- | --- |
| Linux CPU | master | Status | staging | Status |
| Linux Spark | master | Status | staging | Status |

NOTE: the tests are executed every night; we use pytest for testing Python utilities and papermill for testing notebooks.
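
As an illustration of that pattern (a sketch, not the repo's actual test code), a pytest test can execute a notebook end-to-end with papermill and fail if any cell raises; the notebook path and parameter name below are placeholders:

```python
import papermill as pm

def test_als_notebook_runs(tmp_path):
    # Run the notebook top to bottom; papermill raises on the first cell error,
    # which makes pytest report the test as failed.
    # The path and parameter name are placeholders, not necessarily what the
    # notebook declares in its parameters cell.
    pm.execute_notebook(
        "notebooks/als_pyspark_movielens.ipynb",
        str(tmp_path / "output.ipynb"),
        parameters={"MOVIELENS_DATA_SIZE": "100k"},
    )
```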

# Recommenders

This repository provides examples and best practices for building recommendation systems, offered as Jupyter notebooks. The examples detail our learnings, illustrating four key tasks:

  1. Preparing and loading data for each recommender algorithm.
  2. Using different algorithms such as Smart Adaptive Recommendations (SAR), Alternating Least Squares (ALS), etc., for building recommender models (a toy sketch of SAR-style scoring follows this list).
  3. Evaluating algorithms with offline metrics.
  4. Operationalizing models in a production environment. The examples work across Python + CPU and PySpark environments, and contain guidance as to which algorithm to run in which environment based on scale and other requirements.
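
To make the second task concrete, here is a toy numpy sketch of SAR-style scoring (a simplification, not the repo's implementation): recommendation scores come from multiplying a user-item affinity matrix by an item-item similarity matrix derived from co-occurrence. SAR itself supports more careful similarity measures (e.g., Jaccard) and time-decayed affinities.

```python
import numpy as np

# Toy user-item affinity matrix: rows are users, columns are items.
# In SAR these entries would typically be time-decayed interaction weights.
affinity = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [2.0, 1.0, 0.0],
])

# Item-item co-occurrence: how often two items are seen by the same user.
seen = (affinity > 0).astype(float)
cooccurrence = seen.T @ seen

# Crude row normalization stands in for SAR's similarity measures (Jaccard, lift, ...).
similarity = cooccurrence / cooccurrence.sum(axis=1, keepdims=True)

# Scores: each user's affinities propagated through item similarity.
scores = affinity @ similarity
print(np.argsort(-scores, axis=1))  # per-user item ranking, best first
```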

Several utilities are provided in reco_utils to help accelerate experimenting with and building recommendation systems. These utility functions are used to load datasets (i.e., via Pandas DataFrames in Python and via Spark DataFrames in PySpark) in the manner expected by different algorithms, evaluate different model outputs, split training data, and perform other common tasks. Reference implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
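
As a sketch of how these pieces fit together, the flow below loads MovieLens into a Pandas DataFrame and splits it for training and testing; the module paths are assumptions based on the repo layout and may differ between versions:

```python
# Module paths below are assumptions based on the reco_utils layout at the time
# of writing; check your checkout, as names may have moved between versions.
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import precision_at_k

# Load MovieLens 100k as a Pandas DataFrame and split it 75/25 for train/test.
df = movielens.load_pandas_df(size="100k")
train, test = python_random_split(df, ratio=0.75)

# `top_k` would be a DataFrame of each user's top-k recommendations produced by
# a model trained on `train` (omitted here):
# precision = precision_at_k(test, top_k, k=10)
```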

## Environment Setup

See SETUP.md for instructions on how to set up your environment.

## Notebooks Overview

- The Quick-Start Notebooks detail how you can quickly get up and running with state-of-the-art algorithms such as the SAR algorithm.
- The Data Notebooks detail how to prepare and split data properly for recommendation systems.
- The Modeling Notebooks deep dive into implementations of different recommender algorithms.
- The Evaluate Notebooks discuss how to evaluate recommender algorithms with different ranking and rating metrics.
- The Operationalize Notebooks discuss how to deploy models in production systems.

| Notebook | Description |
| --- | --- |
| als_pyspark_movielens | Utilizing the ALS algorithm to predict movie ratings in a PySpark environment. |
| sar_python_cpu_movielens | Utilizing the SAR algorithm to generate movie recommendations in a Python + CPU environment. |
| sar_pyspark_movielens | Utilizing the SAR algorithm to generate movie recommendations in a PySpark environment. |
| sarplus_movielens | Utilizing the SAR+ algorithm to generate movie recommendations in a PySpark environment. |
| data_split | Details on splitting data (randomly, chronologically, etc.). |
| als_deep_dive | Deep dive on the ALS algorithm and implementation. |
| sar_deep_dive | Deep dive on the SAR algorithm and implementation. |
| evaluation | Examples of different rating and ranking metrics in Python + CPU and PySpark environments. |
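
As an example of what these notebooks cover, the core of the als_pyspark_movielens workflow is Spark MLlib's ALS estimator. The sketch below assumes `train` and `test` Spark DataFrames with the column names shown; the hyperparameter values are illustrative, not the notebook's exact settings:

```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# `train` and `test` are Spark DataFrames with userId/movieId/rating columns;
# the column names are placeholders for whatever your data uses.
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=10,
    maxIter=15,
    regParam=0.05,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)
model = als.fit(train)
predictions = model.transform(test)

rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print(f"RMSE: {rmse:.3f}")
```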

## Benchmarks

Here we benchmark all the algorithms available in this repository.

NOTES:

- Training and testing times are measured in seconds.
- Ranking metrics (i.e., precision, recall, MAP, and NDCG) are evaluated with k equal to 10.
- Rating metrics (i.e., RMSE, MAE, explained variance, and R squared) are not applied to the SAR-family algorithms (SAR PySpark, SAR+, and SAR CPU), because these algorithms do not predict explicit ratings on the same scale as those in the original input data.
- The machine we used is an Azure DSVM Standard NC6s_v2 with 6 vCPUs, 112 GB of memory, and 1 K80 GPU.

| Dataset | Algorithm | Training time (s) | Testing time (s) | Precision | Recall | MAP | NDCG | RMSE | MAE | Exp Var | R squared |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Movielens 100k | ALS | 5.730 | 0.326 | 0.096 | 0.079 | 0.026 | 0.100 | 1.110 | 0.860 | 0.025 | 0.023 |
| Movielens 100k | SAR PySpark | 0.838 | 9.560 | 0.327 | 0.179 | 0.110 | 0.379 | | | | |
| Movielens 100k | SAR+ | 7.660 | 16.700 | 0.327 | 0.176 | 0.106 | 0.373 | | | | |
| Movielens 100k | SAR CPU | 0.679 | 0.116 | 0.327 | 0.176 | 0.106 | 0.373 | | | | |
| Movielens 1M | ALS | 18.000 | 0.339 | 0.120 | 0.062 | 0.022 | 0.119 | 0.950 | 0.735 | 0.280 | 0.280 |
| Movielens 1M | SAR PySpark | 9.230 | 38.300 | 0.278 | 0.108 | 0.064 | 0.309 | | | | |
| Movielens 1M | SAR+ | 38.000 | 108.000 | 0.278 | 0.108 | 0.064 | 0.309 | | | | |
| Movielens 1M | SAR CPU | 5.830 | 0.586 | 0.277 | 0.109 | 0.064 | 0.308 | | | | |
| Movielens 10M | ALS | 92.000 | 0.169 | 0.090 | 0.057 | 0.015 | 0.084 | 0.850 | 0.647 | 0.359 | 0.359 |
| Movielens 10M | SAR PySpark | | | | | | | | | | |
| Movielens 10M | SAR+ | 170.000 | 80.000 | 0.256 | 0.129 | 0.081 | 0.295 | | | | |
| Movielens 10M | SAR CPU | 111.000 | 12.600 | 0.276 | 0.156 | 0.101 | 0.321 | | | | |
| Movielens 20M | ALS | 142.000 | 0.345 | 0.081 | 0.052 | 0.014 | 0.076 | 0.830 | 0.633 | 0.372 | 0.371 |
| Movielens 20M | SAR PySpark | | | | | | | | | | |
| Movielens 20M | SAR+ | 400.000 | 221.000 | 0.203 | 0.071 | 0.041 | 0.226 | | | | |
| Movielens 20M | SAR CPU | 559.000 | 47.300 | 0.247 | 0.135 | 0.085 | 0.287 | | | | |
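
For reference, the ranking metrics above follow their standard definitions with k = 10. A minimal pure-Python version for a single user is sketched below; the repo's evaluators handle the DataFrame plumbing and the aggregation across users:

```python
import math

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k

def ndcg_at_k(recommended, relevant, k=10):
    """DCG of the top-k list divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: 3 of the 10 recommended items are relevant.
recs = [7, 3, 9, 1, 5, 2, 8, 4, 6, 0]
rel = {3, 5, 6}
print(precision_at_k(recs, rel))          # 0.3
print(round(ndcg_at_k(recs, rel), 3))     # ~0.619
```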

## Contributing

This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.