updating readme adding py script
Commit 622aa2f4c2
@@ -129,4 +129,4 @@ ml-1m/
ml-20m/
*.jar
*.item
*.pkl
*.pkl
30 README.md
@@ -4,7 +4,7 @@ This repository provides examples and best practices for building recommendation
- [Prepare Data](notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm
- [Model](notebooks/02_model/README.md): Building models using various recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)), Singular Value Decomposition ([SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)), etc.
- [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameteres for recommender models
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameters for recommender models
- [Operationalize](notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure

Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting train/test data. Implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.

@@ -22,7 +22,7 @@ To setup on your local machine:
```
cd Recommenders
python scripts/generate_conda_file.py
conda env create -f conda_base.yaml
conda env create -f reco_base.yaml
```
4. Activate the conda environment and register it with Jupyter:
```

@@ -57,20 +57,22 @@ The Quick-Start and Modeling notebooks showcase how to utilize the following alg

**Algorithms**

The table below lists recommender algorithms available in the repository at the moment.

| Algorithm | Environment | Type | Description |
| --- | --- | --- | --- |
| **`Classic Recommenders`** |
| [Surprise/Singular Value Decomposition (SVD)](notebooks/00_quick_start/sar_single_node_movielens.ipynb) | Python | Collaborative Filtering | General purpose algorithm for smaller datasets |
| [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_pyspark_movielens.ipynb) | Spark | Collaborative | General purpose algorithm for larger datasets, optimized with Spark |
| **`Microsoft Recommenders`** |
| [Smart Adaptive Recommendations (SAR)](notebooks/00_quick_start/sar_single_node_movielens.ipynb) | Python / Spark | Collaborative Filtering | Generalized algorithm utilizing item similarities and can easily adapt to new users |
| [Vowpal Wabbit Family (VW)](notebooks/02_model/vowpal_wabbit_deep_dive.ipynb) | Python / Online | Collaborative, Content Based | Fast online learning algorithms, great for scenarios where user features / context are constantly changing, like real-time bidding |
| [eXtreme Deep Factorization Machine (xDeepFM)](notebooks/00_quick_start/xdeepfm.ipynb) | Python / GPU | Hybrid | Deep learning model combining implicit and explicit features |
| [Deep Knowledge-Aware Network (DKN)](notebooks/00_quick_start/dkn.ipynb) | Python / GPU | Content Based | Deep learning model incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations |
| **`Deep Learning`** |
| **Classic Recommenders** |
| [Surprise/Singular Value Decomposition (SVD)](notebooks/00_quick_start/sar_movielens.ipynb) | Python | Collaborative Filtering | General purpose algorithm for smaller datasets |
| [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) | Spark | Collaborative Filtering | General purpose algorithm for larger datasets, optimized with Spark |
| **Microsoft Recommenders** |
| [Smart Adaptive Recommendations (SAR)](notebooks/00_quick_start/sar_movielens.ipynb) | Python / Spark | Collaborative Filtering | Generalized algorithm utilizing item similarities and can easily adapt to new users |
| [Vowpal Wabbit Family (VW)](notebooks/02_model/vowpal_wabbit_deep_dive.ipynb) | Python / Online | Collaborative, Content-based Filtering | Fast online learning algorithms, great for scenarios where user features / context are constantly changing, like real-time bidding |
| [eXtreme Deep Factorization Machine (xDeepFM)](notebooks/00_quick_start/xdeepfm_synthetic.ipynb) | Python / GPU | Hybrid | Deep learning model combining implicit and explicit features |
| [Deep Knowledge-Aware Network (DKN)](notebooks/00_quick_start/dkn_synthetic.ipynb) | Python / GPU | Content-based Filtering | Deep learning model incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations |
| **Deep Learning Recommenders** |
| [Neural Collaborative Filtering (NCF)](notebooks/00_quick_start/ncf_movielens.ipynb) | Python / GPU | Collaborative Filtering | General algorithm built using a multi-layer perceptron |
| [Restricted Boltzmann Machines (RBM)](notebooks/00_quick_start/rbm_movielens.ipynb) | Python / GPU | Collaborative Filtering | Generative neural network algorithm built to learn the underlying probability distribution for user/item affinity |
| [FastAI Embedding Dot Bias (FAST)](notebooks/00_quick_start/fastai_recommendation.ipynb) | Python / GPU | Collaborative Filtering | General purpose algorithm embedding dot biases for users and items |
| [FastAI Embedding Dot Bias (FAST)](notebooks/00_quick_start/fastai_movielens.ipynb) | Python / GPU | Collaborative Filtering | General purpose algorithm embedding dot biases for users and items |

In addition, we also provide a [comparison notebook](notebooks/03_evaluate/comparison.ipynb) to illustrate how different algorithms could be evaluated and compared. In this notebook, data (MovieLens 1M) is randomly split into train/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use k = 10 (top 10 results). We run the comparison on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Spark ALS is run in local standalone mode.

@@ -78,9 +80,9 @@ In addition, we also provide a [comparison notebook](notebooks/03_evaluate/compa

| Algo | MAP | nDCG@k | Precision@k | Recall@k | RMSE | MAE | R<sup>2</sup> | Explained Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [ALS](notebooks/00_quick_start/als_pyspark_movielens.ipynb) | 0.002020 | 0.024313 | 0.030677 | 0.009649 | 0.860502 | 0.680608 | 0.406014 | 0.411603 |
| [ALS](notebooks/00_quick_start/als_movielens.ipynb) | 0.002020 | 0.024313 | 0.030677 | 0.009649 | 0.860502 | 0.680608 | 0.406014 | 0.411603 |
| [SVD](notebooks/02_model/surprise_svd_deep_dive.ipynb) | 0.010915 | 0.102398 | 0.092996 | 0.025362 | 0.888991 | 0.696781 | 0.364178 | 0.364178 |
| [FastAI](notebooks/00_quick_start/fastai_recommendation.ipynb) | 0.023022 |0.168714 |0.154761 |0.050153 |0.887224 |0.705609 |0.371552 |0.374281 |
| [FastAI](notebooks/00_quick_start/fastai_movielens.ipynb) | 0.023022 |0.168714 |0.154761 |0.050153 |0.887224 |0.705609 |0.371552 |0.374281 |
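As a rough illustration of the methodology behind this table, the sketch below performs a random 75/25 split and computes precision@10 for a trivial popularity baseline in plain pandas/NumPy. The column names and the baseline are placeholders; the actual comparison notebook relies on `reco_utils` helpers such as `python_random_split` and the `*_at_k` metrics that appear in the notebook imports further down in this diff.

```python
import numpy as np
import pandas as pd

# Toy interaction data standing in for MovieLens (placeholder column names)
rng = np.random.default_rng(42)
ratings = pd.DataFrame({
    "UserId": rng.integers(1, 20, size=500),
    "MovieId": rng.integers(1, 50, size=500),
    "Rating": rng.integers(1, 6, size=500).astype(float),
})

# Random 75/25 train/test split, as in the comparison notebook
mask = rng.random(len(ratings)) < 0.75
train, test = ratings[mask], ratings[~mask]

def precision_at_k(test_df, topk_df, k=10):
    # Fraction of each user's top-k recommendations that appear in that user's test interactions
    users = topk_df["UserId"].unique()
    hits = 0.0
    for user in users:
        recommended = topk_df.loc[topk_df["UserId"] == user, "MovieId"].head(k)
        relevant = set(test_df.loc[test_df["UserId"] == user, "MovieId"])
        hits += len(set(recommended) & relevant) / k
    return hits / len(users)

# A trivial "recommend the globally most frequent items" baseline, just to exercise the metric
popular = train["MovieId"].value_counts().index[:10]
topk = pd.DataFrame([(u, m) for u in test["UserId"].unique() for m in popular],
                    columns=["UserId", "MovieId"])
print("precision@10:", precision_at_k(test, topk))
```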
12 SETUP.md
@@ -60,7 +60,7 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th

cd Recommenders
python scripts/generate_conda_file.py
conda env create -f conda_bare.yaml
conda env create -f reco_base.yaml

</details>

@@ -71,7 +71,7 @@ Assuming that you have a GPU machine, to install the Python GPU environment, whi

cd Recommenders
python scripts/generate_conda_file.py --gpu
conda env create -f conda_gpu.yaml
conda env create -f reco_gpu.yaml

</details>

@@ -82,11 +82,11 @@ To install the PySpark environment, which by default installs the CPU environmen

cd Recommenders
python scripts/generate_conda_file.py --pyspark
conda env create -f conda_pyspark.yaml
conda env create -f reo_pyspark.yaml

Additionally, if you want to test a particular version of spark, you may pass the --pyspark-version argument:

python /scripts/generate_conda_file.py --pyspark-version 2.4.0
python scripts/generate_conda_file.py --pyspark-version 2.4.0

**NOTE** - for a PySpark environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
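A minimal sketch of the note above, assuming the conda environment is already activated so that `sys.executable` resolves to its interpreter; the variables need to be in place before the Spark session is created.

```python
import os
import sys

# Point Spark's driver and workers at the conda environment's Python interpreter
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```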
@@ -113,8 +113,8 @@ unset PYSPARK_DRIVER_PYTHON
To install all three environments:

cd Recommenders
python /scripts/generate_conda_file.py --gpu --pyspark
conda env create -f conda_full.yaml
python scripts/generate_conda_file.py --gpu --pyspark
conda env create -f reco_full.yaml

</details>

@@ -5,17 +5,17 @@ In this directory, notebooks are provided to demonstrate the use of different al
data preparation, model building, and model evaluation by using the utility functions ([reco_utils](../../reco_utils))
available in the repo.

| Notebook | Description |
| --- | --- |
| [als_pyspark_movielens](als_pyspark_movielens.ipynb) | Utilizing ALS algorithm to predict movie ratings in a PySpark environment.
| [fastai_recommendation](fastai_recommendation.ipynb) | Utilizing FastAI recommender to predict movie ratings in a Python+GPU (PyTorch) environment.
| [ncf_movielens](ncf_movielens.ipynb) | Utilizing Neural Collaborative Filtering (NCF) [1] to predict movie ratings in a Python+GPU (TensorFlow) environment.
| [sar_python_cpu_movielens](sar_single_node_movielens.ipynb) | Utilizing Smart Adaptive Recommendations (SAR) algorithm to predict movie ratings in a Python+CPU environment.
| [dkn](dkn.ipynb) | Utilizing the Deep Knowledge-Aware Network (DKN) [2] algorithm for news recommendations using information from a knowledge graph, in a Python+GPU (TensorFlow) environment.
| [xdeepfm](xdeepfm.ipynb) | Utilizing the eXtreme Deep Factorization Machine (xDeepFM) [3] to learn both low and high order feature interactions for predicting CTR, in a Python+GPU (TensorFlow) environment.
| [rbm](rbm_movielens.ipynb)| Utilizing the Restricted Boltzmann Machine (rbm) [4] to predict movie ratings in a Python+GPU (TensorFlow) environment.<br>
| Notebook | Dataset | Environment | Description |
| --- | --- | --- | --- |
| [als](als_movielens.ipynb) | MovieLens | PySpark | Utilizing ALS algorithm to predict movie ratings in a PySpark environment.
| [dkn](dkn_synthetic.ipynb) | Synthetic Data | Python CPU, GPU | Utilizing the Deep Knowledge-Aware Network (DKN) [2] algorithm for news recommendations using information from a knowledge graph, in a Python+GPU (TensorFlow) environment.
| [fastai](fastai_movielens.ipynb) | MovieLens | Python CPU, GPU | Utilizing FastAI recommender to predict movie ratings in a Python+GPU (PyTorch) environment.
| [ncf](ncf_movielens.ipynb) | MovieLens | Python CPU, GPU | Utilizing Neural Collaborative Filtering (NCF) [1] to predict movie ratings in a Python+GPU (TensorFlow) environment.
| [rbm](rbm_movielens.ipynb)| MovieLens | Python CPU, GPU | Utilizing the Restricted Boltzmann Machine (rbm) [4] to predict movie ratings in a Python+GPU (TensorFlow) environment.<br>
| [sar](sar_movielens.ipynb) | MovieLens | Python CPU | Utilizing Smart Adaptive Recommendations (SAR) algorithm to predict movie ratings in a Python+CPU environment.
| [xdeepfm](xdeepfm_synthetic.ipynb) | Synthetic Data | Python CPU, GPU | Utilizing the eXtreme Deep Factorization Machine (xDeepFM) [3] to learn both low and high order feature interactions for predicting CTR, in a Python+GPU (TensorFlow) environment.

[1] _Neural Collaborative Filtering_, Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua. WWW 2017.<br>
[2] _DKN: Deep Knowledge-Aware Network for News Recommendation_, Hongwei Wang, Fuzheng Zhang, Xing Xie and Minyi Guo. WWW 2018.<br>
[3] _xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems_, Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie and Guangzhong Sun. KDD 2018.<br>
[4] _Restricted Boltzmann Machines for Collaborative Filtering_, Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton. ICML 2007.
[4] _Restricted Boltzmann Machines for Collaborative Filtering_, Ruslan Salakhutdinov, Andriy Mnih and Geoffrey Hinton. ICML 2007.
@ -1,11 +1,20 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# DKN : Deep Knowledge-Aware Network for News Recommendation\n",
|
||||
"DKN\\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX\\[2\\] method for knowledge graph representaion learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
|
||||
"DKN \\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX \\[2\\] method for knowledge graph representaion learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
|
||||
"\n",
|
||||
"## Properties of DKN:\n",
|
||||
"- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. \n",
|
||||
|
@ -241,9 +250,9 @@
|
|||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python (reco)",
|
||||
"display_name": "Python (reco_bare)",
|
||||
"language": "python",
|
||||
"name": "reco"
|
||||
"name": "reco_bare"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
@ -255,7 +264,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.0"
|
||||
"version": "3.6.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
|
@ -76,7 +76,7 @@
|
|||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"USER,ITEM,RATING,TIMESTAMP,PREDICTION,TITLE = 'UserId','MovieId','Rating','Timestamp','Prediction','Title'"
|
||||
"USER, ITEM, RATING, TIMESTAMP, PREDICTION, TITLE = 'UserId', 'MovieId', 'Rating', 'Timestamp', 'Prediction', 'Title'"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -141,7 +141,7 @@
|
|||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# fix random seeds to make sure out runs are reproducible\n",
|
||||
"# fix random seeds to make sure our runs are reproducible\n",
|
||||
"np.random.seed(101)\n",
|
||||
"torch.manual_seed(101)\n",
|
||||
"torch.cuda.manual_seed_all(101)"
|
||||
|
@ -582,7 +582,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The above numbers are lower than SAR, but expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next look at how well the model predicts how the user would rate the movie. Need to score `test_df`, but this time don't ask for top_k. "
|
||||
"The above numbers are lower than [SAR](../sar_single_node_movielens.ipynb), but expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next look at how well the model predicts how the user would rate the movie. Need to score `test_df`, but this time don't ask for top_k. "
|
||||
]
|
||||
},
|
||||
{
|
|
@ -15,7 +15,7 @@
|
|||
"source": [
|
||||
"# Neural Collaborative Filtering on Movielens dataset.\n",
|
||||
"\n",
|
||||
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalize the matrix factorization problem with multi-layer perceptron. \n",
|
||||
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalizes the matrix factorization problem with multi-layer perceptron. \n",
|
||||
"\n",
|
||||
"This notebook provides an example of how to utilize and evaluate NCF implementation in the `reco_utils`. We use a smaller dataset in this example to run NCF efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/)."
|
||||
]
|
||||
|
@ -143,7 +143,7 @@
|
|||
"source": [
|
||||
"### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data\n",
|
||||
"\n",
|
||||
"NCF is for implicity feedback typed recommender, and it generates prospensity of items to be recommended to users in the scale of 0 to 1. A recommended item list can then be generated based on the scores. NOTE this quickstart notebook is using a smaller number of epoch size to reduce time for training. As a consequence, the model performance will be slighlty deteriorated. "
|
||||
"NCF accepts implicit feedback and generates prospensity of items to be recommended to users in the scale of 0 to 1. A recommended item list can then be generated based on the scores. Note that this quickstart notebook is using a smaller number of epochs to reduce time for training. As a consequence, the model performance will be slighlty deteriorated. "
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -1,5 +1,23 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
|
@ -66,7 +84,7 @@
|
|||
"import papermill as pm\n",
|
||||
"\n",
|
||||
"from reco_utils.recommender.rbm.rbm import RBM\n",
|
||||
"from reco_utils.dataset.numpy_splitters import numpy_stratified_split\n",
|
||||
"from reco_utils.dataset.python_splitters import numpy_stratified_split\n",
|
||||
"from reco_utils.dataset.sparse import AffinityMatrix\n",
|
||||
"\n",
|
||||
"\n",
|
||||
|
|
|
@ -41,16 +41,16 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"System version: 3.6.0 | packaged by conda-forge | (default, Feb 9 2017, 14:36:55) \n",
|
||||
"[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\n",
|
||||
"Pandas version: 0.23.4\n"
|
||||
"System version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) \n",
|
||||
"[GCC 7.3.0]\n",
|
||||
"Pandas version: 0.24.1\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -58,17 +58,20 @@
|
|||
"# set the environment path to find Recommenders\n",
|
||||
"import sys\n",
|
||||
"sys.path.append(\"../../\")\n",
|
||||
"import time\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"import itertools\n",
|
||||
"import logging\n",
|
||||
"import os\n",
|
||||
"import time\n",
|
||||
"\n",
|
||||
"import numpy as np\n",
|
||||
"import pandas as pd\n",
|
||||
"import papermill as pm\n",
|
||||
"\n",
|
||||
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
|
||||
"from reco_utils.dataset import movielens\n",
|
||||
"from reco_utils.dataset.python_splitters import python_random_split\n",
|
||||
"from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
|
||||
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
|
||||
"\n",
|
||||
"print(\"System version: {}\".format(sys.version))\n",
|
||||
"print(\"Pandas version: {}\".format(pd.__version__))"
|
||||
|
@ -90,7 +93,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"parameters"
|
||||
|
@ -114,7 +117,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -193,7 +196,7 @@
|
|||
"4 166 346 1.0 886397596"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
|
@ -221,7 +224,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -246,7 +249,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -257,46 +260,15 @@
|
|||
" \"col_timestamp\": \"Timestamp\",\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"logging.basicConfig(level=logging.DEBUG, \n",
|
||||
" format='%(asctime)s %(levelname)-8s %(message)s')\n",
|
||||
"\n",
|
||||
"model = SARSingleNode(\n",
|
||||
" remove_seen=True, similarity_type=\"jaccard\", \n",
|
||||
" time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will hash users and items to smaller continuous space.\n",
|
||||
"This is an ordered set - it's discrete, but contiguous.\n",
|
||||
"This helps keep the matrices we keep in memory as small as possible."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"start_time = time.time()\n",
|
||||
"\n",
|
||||
"unique_users = data[\"UserId\"].unique()\n",
|
||||
"unique_items = data[\"MovieId\"].unique()\n",
|
||||
"enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))\n",
|
||||
"enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))\n",
|
||||
"\n",
|
||||
"item_map_dict = {x: i for i, x in enumerate_items_1}\n",
|
||||
"user_map_dict = {x: i for i, x in enumerate_users_1}\n",
|
||||
"# The reverse of the dictionary above - array index to actual ID\n",
|
||||
"index2user = dict(enumerate_users_2)\n",
|
||||
"index2item = dict(enumerate_items_2)\n",
|
||||
"\n",
|
||||
"# We need to index the train and test sets for SAR matrix operations to work\n",
|
||||
"model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)\n",
|
||||
"\n",
|
||||
"preprocess_time = time.time() - start_time"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
@ -314,29 +286,30 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Collecting user affinity matrix...\n",
|
||||
"Calculating time-decayed affinities...\n",
|
||||
"Creating index columns...\n",
|
||||
"Building user affinity sparse matrix...\n",
|
||||
"Calculating item cooccurrence...\n",
|
||||
"Calculating item similarity...\n",
|
||||
"Calculating jaccard...\n",
|
||||
"Calculating recommendation scores...\n",
|
||||
"done training\n"
|
||||
"2019-02-05 13:19:22,533 INFO Collecting user affinity matrix\n",
|
||||
"2019-02-05 13:19:22,538 INFO Calculating time-decayed affinities\n",
|
||||
"2019-02-05 13:19:22,589 INFO Creating index columns\n",
|
||||
"2019-02-05 13:19:22,607 INFO Building user affinity sparse matrix\n",
|
||||
"2019-02-05 13:19:22,615 INFO Calculating item co-occurrence\n",
|
||||
"2019-02-05 13:19:22,807 INFO Calculating item similarity\n",
|
||||
"2019-02-05 13:19:22,808 INFO Calculating jaccard\n",
|
||||
"2019-02-05 13:19:22,991 INFO Calculating recommendation scores\n",
|
||||
"2019-02-05 13:19:23,106 INFO Removing seen items\n",
|
||||
"2019-02-05 13:19:23,107 INFO Done training\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Took 0.5829987525939941 seconds for training.\n"
|
||||
"Took 0.5787224769592285 seconds for training.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -345,32 +318,27 @@
|
|||
"\n",
|
||||
"model.fit(train)\n",
|
||||
"\n",
|
||||
"train_time = time.time() - start_time + preprocess_time\n",
|
||||
"train_time = time.time() - start_time\n",
|
||||
"print(\"Took {} seconds for training.\".format(train_time))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Converting to dense matrix...\n",
|
||||
"Removing seen items...\n",
|
||||
"Getting top K...\n",
|
||||
"Select users from the test set\n",
|
||||
"Creating output dataframe...\n",
|
||||
"Formatting output\n"
|
||||
"2019-02-05 13:19:23,125 INFO Getting top K\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Took 0.13302063941955566 seconds for prediction.\n"
|
||||
"Took 0.06923317909240723 seconds for prediction.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -389,7 +357,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
|
@ -430,22 +398,22 @@
|
|||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>423</td>\n",
|
||||
" <td>12.991756</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>183</td>\n",
|
||||
" <td>13.106912</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>89</td>\n",
|
||||
" <td>13.163791</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>423</td>\n",
|
||||
" <td>12.991756</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>144</td>\n",
|
||||
|
@ -458,9 +426,9 @@
|
|||
"text/plain": [
|
||||
" UserId MovieId prediction\n",
|
||||
"0 600 69 12.984131\n",
|
||||
"1 600 423 12.991756\n",
|
||||
"2 600 183 13.106912\n",
|
||||
"3 600 89 13.163791\n",
|
||||
"1 600 183 13.106912\n",
|
||||
"2 600 89 13.163791\n",
|
||||
"3 600 423 12.991756\n",
|
||||
"4 600 144 13.489795"
|
||||
]
|
||||
},
|
||||
|
@ -483,7 +451,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -494,7 +462,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -505,7 +473,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -516,7 +484,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -527,7 +495,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -554,7 +522,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -596,7 +564,7 @@
|
|||
{
|
||||
"data": {
|
||||
"application/papermill.record+json": {
|
||||
"train_time": 0.5829987525939941
|
||||
"train_time": 0.5787224769592285
|
||||
}
|
||||
},
|
||||
"metadata": {},
|
||||
|
@ -605,7 +573,7 @@
|
|||
{
|
||||
"data": {
|
||||
"application/papermill.record+json": {
|
||||
"test_time": 0.13302063941955566
|
||||
"test_time": 0.06923317909240723
|
||||
}
|
||||
},
|
||||
"metadata": {},
|
||||
|
@ -626,9 +594,9 @@
|
|||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python (reco)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
"name": "reco"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
@ -640,7 +608,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.5"
|
||||
"version": "3.6.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
|
@ -1,5 +1,23 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
@ -463,9 +481,9 @@
|
|||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python (reco_gpu)",
|
||||
"display_name": "Python (reco_bare)",
|
||||
"language": "python",
|
||||
"name": "reco_gpu"
|
||||
"name": "reco_bare"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
@ -477,7 +495,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.7"
|
||||
"version": "3.6.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
|
@ -936,7 +936,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In many scenarios, time dependency plays a critical role in preparing dataset for building a collaborative filtering model that captures user interests drift over time. One of the common techniques for achieving time dependent count is to add a time decay factor in the counting. This technique is used in [SAR](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_single_node_deep_dive.ipynb). Formula for getting affinity score for each user-item pair is \n",
|
||||
"In many scenarios, time dependency plays a critical role in preparing dataset for building a collaborative filtering model that captures user interests drift over time. One of the common techniques for achieving time dependent count is to add a time decay factor in the counting. This technique is used in [SAR](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_deep_dive.ipynb). Formula for getting affinity score for each user-item pair is \n",
|
||||
"\n",
|
||||
"$$a_{ij}=\\sum_k (w_k \\text{exp}[-\\text{log}_2(\\frac{t_0-t_k}{T})] $$\n",
|
||||
"where $a_{ij}$ is the affinity score, $w_k$ is the interaction weight, $t_0$ is a reference time, $t_k$ is the timestamp for the $k$-th interaction, and $T$ is a hyperparameter that controls the speed of decay.\n",
|
||||
|
@ -1699,7 +1699,7 @@
|
|||
"\n",
|
||||
"1. X. He *et al*, Neural Collaborative Filtering, WWW 2017. \n",
|
||||
"2. Y. Hu *et al*, Collaborative filtering for implicit feedback datasets, ICDM 2008.\n",
|
||||
"3. Smart Adapative Recommendation (SAR), url: https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_single_node_deep_dive.ipynb\n",
|
||||
"3. Smart Adapative Recommendation (SAR), url: https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_deep_dive.ipynb\n",
|
||||
"4. Y. Koren and J. Sill, OrdRec: an ordinal model for predicting personalized item rating distributions, RecSys 2011."
|
||||
]
|
||||
}
|
||||
|
|
|
@@ -4,14 +4,14 @@ In this directory, notebooks are provided to give a deep dive into training mode
Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) and Singular Value Decomposition (SVD) using [Surprise](http://surpriselib.com/) python package. The notebooks make use of the utility functions ([reco_utils](../../reco_utils))
available in the repo.

| Notebook | Description |
| --- | --- |
| [als_deep_dive](als_deep_dive.ipynb) | Deep dive on the ALS algorithm and implementation.
| [baseline_deep_dive](baseline_deep_dive.ipynb) | Deep dive on baseline performance estimation.
| [ncf_deep_dive](ncf_deep_dive.ipynb) | Deep dive on a NCF algorithm and implementation.
| [surprise_svd_deep_dive](surprise_svd_deep_dive.ipynb) | Deep dive on a SVD algorithm and implementation.
| [sar_single_node_deep_dive](sar_single_node_deep_dive.ipynb) | Deep dive on the SAR algorithm and implementation.
| [vowpal_wabbit_deep_dive](vowpal_wabbit_deep_dive.ipynb) | Deep dive into using Vowpal Wabbit for regression and matrix factorization.
| [rbm_deep_dive](rbm_deep_dive.ipynb)| Deep dive on the rbm algorithm and its implementation.
| Notebook | Environment | Description |
| --- | --- | --- |
| [als_deep_dive](als_deep_dive.ipynb) | PySpark | Deep dive on the ALS algorithm and implementation.
| [baseline_deep_dive](baseline_deep_dive.ipynb) | --- | Deep dive on baseline performance estimation.
| [ncf_deep_dive](ncf_deep_dive.ipynb) | Python CPU, GPU | Deep dive on a NCF algorithm and implementation.
| [rbm_deep_dive](rbm_deep_dive.ipynb)| Python CPU, GPU | Deep dive on the rbm algorithm and its implementation.
| [sar_deep_dive](sar_deep_dive.ipynb) | Python CPU | Deep dive on the SAR algorithm and implementation.
| [surprise_svd_deep_dive](surprise_svd_deep_dive.ipynb) | Python CPU | Deep dive on a SVD algorithm and implementation.
| [vowpal_wabbit_deep_dive](vowpal_wabbit_deep_dive.ipynb) | Python CPU | Deep dive into using Vowpal Wabbit for regression and matrix factorization.

Details on model training are best found inside each notebook.
@ -116,7 +116,7 @@
|
|||
"\n",
|
||||
"### 1.2 The MLP model\n",
|
||||
"\n",
|
||||
"NCF adopts two pathways to model users and items: 1) element-wise product of vectors, 2) concatenation of vectors. To learn interactions after concatenating of users and items lantent features, the standard MLP model is applied. In this sense, we can endow the model a large level of flexibility and non-linearity to learn the interactions between $p_{u}$ and $q_{i}$. The details of MLP model are:\n",
|
||||
"NCF adopts two pathways to model users and items: 1) element-wise product of vectors, 2) concatenation of vectors. To learn interactions after concatenating of users and items latent features, the standard MLP model is applied. In this sense, we can endow the model a large level of flexibility and non-linearity to learn the interactions between $p_{u}$ and $q_{i}$. The details of MLP model are:\n",
|
||||
"\n",
|
||||
"For the input layer, there is concatention of user and item vectors:\n",
|
||||
"\n",
|
||||
|
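A minimal NumPy sketch of the MLP pathway described in the cell above, with arbitrary layer sizes and random weights (the repository's implementation uses TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8

# Latent factor tables for users and items
P = rng.normal(size=(n_users, dim))   # user embeddings p_u
Q = rng.normal(size=(n_items, dim))   # item embeddings q_i

# One hidden MLP layer (weights W, bias b) plus the output weights h
W = rng.normal(size=(2 * dim, 16))
b = np.zeros(16)
h = rng.normal(size=16)

def predict(u, i):
    z0 = np.concatenate([P[u], Q[i]])      # input layer: concatenation of p_u and q_i
    z1 = np.maximum(0.0, z0 @ W + b)       # ReLU activation on the hidden layer
    logit = z1 @ h
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid keeps the score in (0, 1)

print(predict(0, 3))
```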
@ -134,7 +134,7 @@
|
|||
"\\hat { r } _ { u , i } = \\sigma \\left( h ^ { T } \\phi \\left( z _ { L - 1 } \\right) \\right)\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"where ${ W }_{ l }$, ${ b }_{ l }$, and ${ a }_{ out }$ denote the weight matrix, bias vector, and activation function for the $l$-th layer’s perceptron, respectively. For activation functions of MLP layers, one can freely choose sigmoid, hyperbolic tangent (tanh), and Rectifier (ReLU), among others. Because of implicit data task, the activation function of the output layer is defined as sigmoid $\\sigma(x)=\\frac{1}{1+\\exp{(-x)}}$ to restrict the predicted score to be in (0,1).\n",
|
||||
"where ${ W }_{ l }$, ${ b }_{ l }$, and ${ a }_{ out }$ denote the weight matrix, bias vector, and activation function for the $l$-th layer’s perceptron, respectively. For activation functions of MLP layers, one can freely choose sigmoid, hyperbolic tangent (tanh), and Rectifier (ReLU), among others. Because of implicit data task, the activation function of the output layer is defined as sigmoid $\\sigma(x)=\\frac{1}{1+e^{-x}}$ to restrict the predicted score to be in (0,1).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### 1.3 Fusion of GMF and MLP\n",
|
||||
|
@ -159,11 +159,11 @@
|
|||
"\n",
|
||||
"$$P \\left( \\mathcal { R } , \\mathcal { R } ^ { - } | \\mathbf { P } , \\mathbf { Q } , \\Theta \\right) = \\prod _ { ( u , i ) \\in \\mathcal { R } } \\hat { r } _ { u , i } \\prod _ { ( u , j ) \\in \\mathcal { R } ^{ - } } \\left( 1 - \\hat { r } _ { u , j } \\right)$$\n",
|
||||
"\n",
|
||||
"Where $\\mathcal{R}$ denotes the set of observed interactions, and $\\mathcal{ R } ^ { - }$ denotes the set of negative instances. $\\mathbf{P}$ and $\\mathbf{Q}$ denotes the latent factor matrix for users and items, respectively; and $\\Theta$ denotes the model parameters. Taking the negative logarithm of the likelihood, we obatain the objective function to minimize for NCF method, which is known as *binary cross-entropy loss*:\n",
|
||||
"Where $\\mathcal{R}$ denotes the set of observed interactions, and $\\mathcal{ R } ^ { - }$ denotes the set of negative instances. $\\mathbf{P}$ and $\\mathbf{Q}$ denotes the latent factor matrix for users and items, respectively; and $\\Theta$ denotes the model parameters. Taking the negative logarithm of the likelihood, we obatain the objective function to minimize for NCF method, which is known as [binary cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy):\n",
|
||||
"\n",
|
||||
"$$L = - \\sum _ { ( u , i ) \\in \\mathcal { R } \\cup { \\mathcal { R } } ^ { - } } r _ { u , i } \\log \\hat { r } _ { u , i } + \\left( 1 - r _ { u , i } \\right) \\log \\left( 1 - \\hat { r } _ { u , i } \\right)$$\n",
|
||||
"\n",
|
||||
"The optimization can be done by performing Stochastic Gradient Descent (SGD), which has been introduced by the SVD algorithm in surprise svd deep dive notebook. Our SGD method is very similar to the SVD algorithm's."
|
||||
"The optimization can be done by performing Stochastic Gradient Descent (SGD), which is described in the [Surprise SVD deep dive notebook](../02_model/surprise_svd_deep_dive.ipynb). Our SGD method is very similar to the SVD algorithm's."
|
||||
]
|
||||
},
|
||||
{
|
||||
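A minimal NumPy sketch of the binary cross-entropy objective defined in the cell above, on made-up labels and propensities:

```python
import numpy as np

# r: observed labels (1 = interaction in R, 0 = sampled negative from R^-)
r = np.array([1, 1, 0, 0, 1], dtype=float)
# r_hat: predicted propensities, kept in (0, 1) by the sigmoid output layer
r_hat = np.array([0.9, 0.6, 0.2, 0.4, 0.8])

# L = - sum over (u, i) of [ r * log(r_hat) + (1 - r) * log(1 - r_hat) ]
loss = -np.sum(r * np.log(r_hat) + (1 - r) * np.log(1 - r_hat))
print(loss)
```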
|
|
|
@ -1,5 +1,14 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
|
@ -82,7 +91,7 @@
|
|||
"\n",
|
||||
"#RBM \n",
|
||||
"from reco_utils.recommender.rbm.rbm import RBM\n",
|
||||
"from reco_utils.dataset.numpy_splitters import numpy_stratified_split\n",
|
||||
"from reco_utils.dataset.python_splitters import numpy_stratified_split\n",
|
||||
"from reco_utils.dataset.sparse import AffinityMatrix\n",
|
||||
"\n",
|
||||
"#Evaluation libraries\n",
|
||||
|
|
|
@ -15,9 +15,9 @@
|
|||
"source": [
|
||||
"# SAR Single Node on MovieLens (Python, CPU)\n",
|
||||
"\n",
|
||||
"In this example, we will walkthrough each step of the Smart Adaptive Recommendations (SAR) algorithm with a Python single-node implementation.\n",
|
||||
"In this example, we will walk through each step of the Smart Adaptive Recommendations (SAR) algorithm using a Python single-node implementation.\n",
|
||||
"\n",
|
||||
"SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It is powered by understanding the similarity between items, and recommending similar items to ones a user has an existing affinity for."
|
||||
"SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It is powered by understanding the similarity between items, and recommending similar items to those a user has an existing affinity for."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -26,35 +26,53 @@
|
|||
"source": [
|
||||
"## 1 SAR algorithm\n",
|
||||
"\n",
|
||||
"In the next figure a high-level architecture of SAR is showed.\n",
|
||||
"The following figure presents a high-level architecture of SAR. \n",
|
||||
"\n",
|
||||
"At a very high level, two intermediate matrices are created and used to generate a set of recommendation scores:\n",
|
||||
"\n",
|
||||
"- An item similarity matrix $S$ estimates item-item relationships.\n",
|
||||
"- An affinity matrix $A$ estimates user-item relationships.\n",
|
||||
"\n",
|
||||
"Recommendation scores are then created by computing the matrix multiplication $A\\times S$.\n",
|
||||
"\n",
|
||||
"Optional steps (e.g. \"time decay\" and \"remove seen items\") are described in the details below.\n",
|
||||
"\n",
|
||||
"<img src=\"https://recodatasets.blob.core.windows.net/images/sar_schema.svg?sanitize=true\">\n",
|
||||
"\n",
|
||||
"### 1.1 Compute item co-occurrence and item similarity\n",
|
||||
"\n",
|
||||
"Central to how SAR defines similarity is an item-to-item co-occurrence matrix. Co-occurrence is defined as the number of times two items appear together for a given user. We can represent the co-occurrence of all items as a $m\\times m$ matrix $C$, where $c_{i,j}$ is the number of times item $i$ occurred with item $j$.\n",
|
||||
"SAR defines similarity based on item-to-item co-occurrence data. Co-occurrence is defined as the number of times two items appear together for a given user. We can represent the co-occurrence of all items as a $m\\times m$ matrix $C$, where $c_{i,j}$ is the number of times item $i$ occurred with item $j$, and $m$ is the total number of items.\n",
|
||||
"\n",
|
||||
"The co-occurence matric $C$ has the following properties:\n",
|
||||
"\n",
|
||||
"It is symmetric, so $c_{i,j} = c_{j,i}$\n",
|
||||
"It is nonnegative: $c_{i,j} \\geq 0$\n",
|
||||
"The occurrences are at least as large as the co-occurrences. I.e, the largest element for each row (and column) is on the main diagonal: $\\forall(i,j) C_{i,i},C_{j,j} \\geq C_{i,j}$.\n",
|
||||
"Once we have a co-occurrence matrix, an item similarity matrix $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include Jaccard, lift, and counts (meaning no rescaling).\n",
|
||||
"- It is symmetric, so $c_{i,j} = c_{j,i}$\n",
|
||||
"- It is nonnegative: $c_{i,j} \\geq 0$\n",
|
||||
"- The occurrences are at least as large as the co-occurrences. I.e., the largest element for each row (and column) is on the main diagonal: $\\forall(i,j) C_{i,i},C_{j,j} \\geq C_{i,j}$.\n",
|
||||
"\n",
|
||||
"The rescaling formula for Jaccard is $s_{ij}=c_{ij} / (c_{ii}+c_{jj}-c_{ij})$\n",
|
||||
"Once we have a co-occurrence matrix, an item similarity matrix $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include `Jaccard`, `lift`, and `counts` (meaning no rescaling).\n",
|
||||
"\n",
|
||||
"and that for lift is $s_{ij}=c_{ij} / (c_{ii} \\times c_{jj})$\n",
|
||||
"\n",
|
||||
"where $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$. In general, using counts as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. Lift by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.\n",
|
||||
"If $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$, the rescaling options are:\n",
|
||||
"\n",
|
||||
"- `Jaccard`: $s_{ij}=\\frac{c_{ij}}{(c_{ii}+c_{jj}-c_{ij})}$\n",
|
||||
"- `lift`: $s_{ij}=\\frac{c_{ij}}{(c_{ii} \\times c_{jj})}$\n",
|
||||
"- `counts`: $s_{ij}=c_{ij}$\n",
|
||||
"\n",
|
||||
"In general, using `counts` as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. `lift` by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. `Jaccard` is a compromise between the two.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### 1.2 Compute user affinity scores\n",
|
||||
"\n",
|
||||
"The affinity matrix in SAR captures the strength of the relationship between each individual user and each item. The event types and weights are used in computing this matrix: different event types (such as “rate” vs “view”) should be allowed to have an impact on a user’s affinity for an item. Similarly, the time of a transaction should have an impact; an event that takes place in the distant past can be thought of as being less important in determining the affinity.\n",
|
||||
"The affinity matrix in SAR captures the strength of the relationship between each individual user and the items that user has already interacted with. SAR incorporates two factors that can impact users' affinities: \n",
|
||||
"\n",
|
||||
"Combining these effects gives us an expression for user-item affinity:\n",
|
||||
"- It can consider information about the **type** of user-item interaction through differential weighting of different events (e.g. it may weigh events in which a user rated a particular item more heavily than events in which a user viewed the item).\n",
|
||||
"- It can consider information about **when** a user-item event occurred (e.g. it may discount the value of events that take place in the distant past.\n",
|
||||
"\n",
|
||||
"$$a_{ij}=\\sum_k (w_k \\text{exp}[-\\text{log}_2(\\frac{t_0-t_k}{T})] $$\n",
|
||||
"where the affinity for user $i$ and item $j$ is the sum of all events involving user $i$ and item $j$, and $w_k$ is the weight of event $k$. The presence of the $\\text{log}_{2}$ factor means that the parameter $T$ in the exponential decay term can be treated as a half-life: events this far before the reference date $t_0$ will be given half the weight as those taking place at $t_0$.\n",
|
||||
"Formalizing these factors produces us an expression for user-item affinity:\n",
|
||||
"\n",
|
||||
"$$a_{ij}=\\sum_k w_k e^{[-\\text{log}(2)\\frac{t_0-t_k}{T}]} $$\n",
|
||||
"\n",
|
||||
"where the affinity $a_{ij}$ for user $i$ and item $j$ is the weighted sum of all $k$ events involving user $i$ and item $j$. $w_k$ represents the weight of a particular event, and the exponential term reflects the temporally-discounted event. The $\\text{log}(2)$ scaling factor causes the parameter $T$ to serve as a half-life: events $T$ units before $t_0$ will be given half the weight as those taking place at $t_0$.\n",
|
||||
"\n",
|
||||
"Repeating this computation for all $n$ users and $m$ items results in an $n\\times m$ matrix $A$. Simplifications of the above expression can be obtained by setting all the weights equal to 1 (effectively ignoring event types), or by setting the half-life parameter $T$ to infinity (ignoring transaction times).\n",
|
||||
"\n",
|
||||
|
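A minimal NumPy sketch of the co-occurrence and rescaling steps described in this cell, on a toy user-item matrix (the full implementation lives in `reco_utils.recommender.sar.sar_singlenode`):

```python
import numpy as np

# Toy binary user-item interaction matrix U (3 users x 4 items): 1 = user interacted with item
U = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

# Item co-occurrence: C[i, j] = number of users who interacted with both i and j;
# the diagonal C[i, i] is the plain occurrence count of item i.
C = U.T @ U
occ = np.diag(C).astype(float)

# Rescale co-occurrences into similarities
S_counts = C.astype(float)                              # counts: no rescaling
S_lift = C / np.outer(occ, occ)                         # lift: c_ij / (c_ii * c_jj)
S_jaccard = C / (occ[:, None] + occ[None, :] - C)       # Jaccard: c_ij / (c_ii + c_jj - c_ij)

print(S_jaccard.round(2))
```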
@ -64,7 +82,7 @@
|
|||
"\n",
|
||||
"### 1.4 Top-k item calculation\n",
|
||||
"\n",
|
||||
"The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix ($A$) by the similarity matrix ($S$). The result is a recommendation score matrix, with one row per user / item pair; higher scores correspond to more strongly recommended items.\n",
|
||||
"The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix ($A$) by the similarity matrix ($S$). The result is a recommendation score matrix, where each row corresponds to a user, each column corresponds to an item, and each entry corresponds to a user / item pair. Higher scores correspond to more strongly recommended items.\n",
|
||||
"\n",
|
||||
"It is worth noting that the complexity of recommending operation depends on the data size. SAR algorithm itself has $O(n^3)$ complexity. Therefore the single-node implementation is not supposed to handle large dataset in a scalable manner. Whenever one uses the algorithm, it is recommended to run with sufficiently large memory. "
|
||||
]
|
||||
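With toy matrices of the right shapes, a minimal sketch of the scoring and top-k step described above, including the optional "remove seen items" step (the notebook itself delegates all of this to the `SARSingleNode` model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 4, 2

A = rng.random((n_users, n_items))   # user-item affinity matrix (e.g. time-decayed weights)
S = rng.random((n_items, n_items))   # item-item similarity matrix (e.g. Jaccard-rescaled)
seen = A > 0.5                       # stand-in for the items each user already interacted with

scores = A @ S                       # recommendation score for every user/item pair

# Optional step: never re-recommend items the user has already seen
scores[seen] = -np.inf

# Top-k items per user: indices of the k largest scores in each row
top_k = np.argsort(-scores, axis=1)[:, :k]
print(top_k)
```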
|
@ -87,16 +105,16 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"System version: 3.6.0 | packaged by conda-forge | (default, Feb 9 2017, 14:36:55) \n",
|
||||
"[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\n",
|
||||
"Pandas version: 0.23.4\n"
|
||||
"System version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) \n",
|
||||
"[GCC 7.3.0]\n",
|
||||
"Pandas version: 0.24.1\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -105,16 +123,18 @@
|
|||
"import sys\n",
|
||||
"sys.path.append(\"../../\")\n",
|
||||
"\n",
|
||||
"import os\n",
|
||||
"import itertools\n",
|
||||
"import logging\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"import numpy as np\n",
|
||||
"import pandas as pd\n",
|
||||
"import papermill as pm\n",
|
||||
"\n",
|
||||
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
|
||||
"from reco_utils.dataset import movielens\n",
|
||||
"from reco_utils.dataset.python_splitters import python_random_split\n",
|
||||
"from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
|
||||
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
|
||||
"\n",
|
||||
"print(\"System version: {}\".format(sys.version))\n",
|
||||
"print(\"Pandas version: {}\".format(pd.__version__))"
|
||||
|
@ -122,7 +142,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"parameters"
|
||||
|
@ -153,7 +173,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -232,7 +252,7 @@
|
|||
"4 166 346 1.0 886397596"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
|
@ -260,7 +280,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -269,7 +289,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -297,114 +317,43 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# set log level to INFO\n",
|
||||
"logging.basicConfig(level=logging.DEBUG, \n",
|
||||
" format='%(asctime)s %(levelname)-8s %(message)s')\n",
|
||||
"\n",
|
||||
"model = SARSingleNode(\n",
|
||||
" remove_seen=True, similarity_type=\"jaccard\", \n",
|
||||
" time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header\n",
|
||||
" remove_seen=True, \n",
|
||||
" similarity_type=\"jaccard\", \n",
|
||||
" time_decay_coefficient=30, \n",
|
||||
" time_now=None, \n",
|
||||
" timedecay_formula=True, \n",
|
||||
" **header\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"unique_users = data[\"UserId\"].unique()\n",
|
||||
"unique_items = data[\"MovieId\"].unique()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will hash users and items to smaller continuous space.\n",
|
||||
"This is an ordered set - it's discrete, but contiguous.\n",
|
||||
"This helps keep the matrices we keep in memory as small as possible."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))\n",
|
||||
"enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))\n",
|
||||
"item_map_dict = {x: i for i, x in enumerate_items_1}\n",
|
||||
"user_map_dict = {x: i for i, x in enumerate_users_1}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The reverse of the dictionary above - array index to actual ID\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"index2user = dict(enumerate_users_2)\n",
|
||||
"index2item = dict(enumerate_items_2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We need to index the train and test sets for SAR matrix operations to work"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Collecting user affinity matrix...\n",
|
||||
"Calculating time-decayed affinities...\n",
|
||||
"../../reco_utils/recommender/sar/sar_singlenode.py:219: SettingWithCopyWarning: \n",
|
||||
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
|
||||
"Try using .loc[row_indexer,col_indexer] = value instead\n",
|
||||
"\n",
|
||||
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
|
||||
" df[\"exponential\"] = expo_fun(df[self.col_timestamp].values)\n",
|
||||
"../../reco_utils/recommender/sar/sar_singlenode.py:221: SettingWithCopyWarning: \n",
|
||||
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
|
||||
"Try using .loc[row_indexer,col_indexer] = value instead\n",
|
||||
"\n",
|
||||
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
|
||||
" df[\"rating_exponential\"] = df[self.col_rating] * df[\"exponential\"]\n",
|
||||
"Creating index columns...\n",
|
||||
"../../reco_utils/recommender/sar/sar_singlenode.py:283: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.\n",
|
||||
" self.index = df.as_matrix([self._col_hashed_users, self._col_hashed_items])\n",
|
||||
"Building user affinity sparse matrix...\n",
|
||||
"Calculating item cooccurrence...\n",
|
||||
"Calculating item similarity...\n",
|
||||
"Calculating jaccard...\n",
|
||||
"/anaconda/envs/recommender/lib/python3.6/site-packages/scipy/sparse/base.py:594: RuntimeWarning: invalid value encountered in true_divide\n",
|
||||
" return np.true_divide(self.todense(), other)\n",
|
||||
"Calculating recommendation scores...\n",
|
||||
"done training\n"
|
||||
"2019-02-07 21:12:50,049 INFO Collecting user affinity matrix\n",
|
||||
"2019-02-07 21:12:50,055 INFO Calculating time-decayed affinities\n",
|
||||
"2019-02-07 21:12:50,135 INFO Creating index columns\n",
|
||||
"2019-02-07 21:12:50,164 INFO Building user affinity sparse matrix\n",
|
||||
"2019-02-07 21:12:50,174 INFO Calculating item co-occurrence\n",
|
||||
"2019-02-07 21:12:50,419 INFO Calculating item similarity\n",
|
||||
"2019-02-07 21:12:50,420 INFO Calculating jaccard\n",
|
||||
"2019-02-07 21:12:50,631 INFO Calculating recommendation scores\n",
|
||||
"2019-02-07 21:12:50,738 INFO Removing seen items\n",
|
||||
"2019-02-07 21:12:50,740 INFO Done training\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -414,7 +363,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
|
@ -423,18 +372,7 @@
|
|||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Converting to dense matrix...\n",
|
||||
"../../reco_utils/recommender/sar/sar_singlenode.py:422: SettingWithCopyWarning: \n",
|
||||
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
|
||||
"Try using .loc[row_indexer,col_indexer] = value instead\n",
|
||||
"\n",
|
||||
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
|
||||
" test[self._col_hashed_users] = test[self.col_user].map(self.user_map_dict)\n",
|
||||
"Removing seen items...\n",
|
||||
"Getting top K...\n",
|
||||
"Select users from the test set\n",
|
||||
"Creating output dataframe...\n",
|
||||
"Formatting output\n"
|
||||
"2019-02-07 21:12:50,762 INFO Getting top K\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -444,7 +382,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -462,7 +400,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
|
@ -503,22 +441,22 @@
|
|||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>423</td>\n",
|
||||
" <td>12.991756</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>183</td>\n",
|
||||
" <td>13.106912</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>89</td>\n",
|
||||
" <td>13.163791</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>423</td>\n",
|
||||
" <td>12.991756</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>600</td>\n",
|
||||
" <td>144</td>\n",
|
||||
|
@ -531,9 +469,9 @@
|
|||
"text/plain": [
|
||||
" UserId MovieId prediction\n",
|
||||
"0 600 69 12.984131\n",
|
||||
"1 600 423 12.991756\n",
|
||||
"2 600 183 13.106912\n",
|
||||
"3 600 89 13.163791\n",
|
||||
"1 600 183 13.106912\n",
|
||||
"2 600 89 13.163791\n",
|
||||
"3 600 423 12.991756\n",
|
||||
"4 600 144 13.489795"
|
||||
]
|
||||
},
|
||||
|
@ -558,52 +496,50 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"eval_map = map_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
|
||||
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
|
||||
" relevancy_method=\"top_k\", k=TOP_K)\n",
|
||||
"# all ranking metrics have the same arguments\n",
|
||||
"args = [test, top_k]\n",
|
||||
"kwargs = dict(col_user='UserId', \n",
|
||||
" col_item='MovieId', \n",
|
||||
" col_rating='Rating', \n",
|
||||
" col_prediction='prediction', \n",
|
||||
" relevancy_method='top_k', \n",
|
||||
" k=TOP_K)\n",
|
||||
"\n",
|
||||
"eval_ndcg = ndcg_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
|
||||
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
|
||||
" relevancy_method=\"top_k\", k=TOP_K)\n",
|
||||
"\n",
|
||||
"eval_precision = precision_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
|
||||
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
|
||||
" relevancy_method=\"top_k\", k=TOP_K)\n",
|
||||
"\n",
|
||||
"eval_recall = recall_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
|
||||
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
|
||||
" relevancy_method=\"top_k\", k=TOP_K)"
|
||||
"eval_map = map_at_k(*args, **kwargs)\n",
|
||||
"eval_ndcg = ndcg_at_k(*args, **kwargs)\n",
|
||||
"eval_precision = precision_at_k(*args, **kwargs)\n",
|
||||
"eval_recall = recall_at_k(*args, **kwargs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Model:\tsar_ref\n",
|
||||
"Top K:\t10\n",
|
||||
"MAP:\t0.105815\n",
|
||||
"NDCG:\t0.373197\n",
|
||||
"Precision@K:\t0.326617\n",
|
||||
"Recall@K:\t0.175957\n"
|
||||
"Model:\t\t sar_ref\n",
|
||||
"Top K:\t\t 10\n",
|
||||
"MAP:\t\t 0.105815\n",
|
||||
"NDCG:\t\t 0.373197\n",
|
||||
"Precision@K:\t 0.326617\n",
|
||||
"Recall@K:\t 0.175957\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(\"Model:\\t\" + model.model_str,\n",
|
||||
" \"Top K:\\t%d\" % TOP_K,\n",
|
||||
" \"MAP:\\t%f\" % eval_map,\n",
|
||||
" \"NDCG:\\t%f\" % eval_ndcg,\n",
|
||||
" \"Precision@K:\\t%f\" % eval_precision,\n",
|
||||
" \"Recall@K:\\t%f\" % eval_recall, sep='\\n')"
|
||||
"print(f\"Model:\\t\\t {model.model_str}\",\n",
|
||||
" f\"Top K:\\t\\t {TOP_K}\",\n",
|
||||
" f\"MAP:\\t\\t {eval_map:f}\",\n",
|
||||
" f\"NDCG:\\t\\t {eval_ndcg:f}\",\n",
|
||||
" f\"Precision@K:\\t {eval_precision:f}\",\n",
|
||||
" f\"Recall@K:\\t {eval_recall:f}\", sep='\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -620,11 +556,10 @@
|
|||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python (reco)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
"name": "reco"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
@ -636,7 +571,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.5"
|
||||
"version": "3.6.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
Diff for one file not shown because of its large size
|
@ -131,10 +131,10 @@
|
|||
"\n",
|
||||
"\n",
|
||||
"notebooks = {\n",
|
||||
" 'als': '../00_quick_start/als_pyspark_movielens.ipynb',\n",
|
||||
" 'sar': '../00_quick_start/sar_single_node_movielens.ipynb',\n",
|
||||
" 'als': '../00_quick_start/als_movielens.ipynb',\n",
|
||||
" 'sar': '../00_quick_start/sar_movielens.ipynb',\n",
|
||||
" 'svd': '../02_model/surprise_svd_deep_dive.ipynb',\n",
|
||||
" 'fast': '../00_quick_start/fastai_recommendation.ipynb',\n",
|
||||
" 'fast': '../00_quick_start/fastai_movielens.ipynb',\n",
|
||||
" 'ncf': '../00_quick_start/ncf_movielens.ipynb',\n",
|
||||
" 'rbm': '../00_quick_start/rbm_movielens.ipynb'\n",
|
||||
"}"
|
||||
|
|
|
@ -7,6 +7,7 @@ DEFAULT_ITEM_COL = "itemID"
|
|||
DEFAULT_RATING_COL = "rating"
|
||||
DEFAULT_TIMESTAMP_COL = "timestamp"
|
||||
PREDICTION_COL = "prediction"
|
||||
DEFAULT_PREDICTION_COL = PREDICTION_COL
|
||||
|
||||
# Filtering variables
|
||||
DEFAULT_K = 10
|
||||
|
|
|
@ -1,30 +1,51 @@
|
|||
import numpy as np
|
||||
from scipy.sparse import coo_matrix
|
||||
|
||||
|
||||
def exponential_decay(value, max_val, half_life):
|
||||
"""Compute decay factor for a given value based on an exponential decay
|
||||
Values greater than max_val will be set to 1
|
||||
Args:
|
||||
value (numeric): value for which the decay factor is computed
|
||||
max_val (numeric): value at which decay factor will be 1
|
||||
half_life (numeric): value at which decay factor will be 0.5
|
||||
Returns:
|
||||
float: decay factor
|
||||
"""
|
||||
|
||||
return np.minimum(1., np.exp(-np.log(2) * (max_val - value) / half_life))
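A quick numeric check (a sketch with made-up values, mirroring the expression above) of how the decay factor behaves for `max_val=100` and `half_life=50`:

```python
import numpy as np

values = np.array([0.0, 50.0, 100.0, 150.0])
decay = np.minimum(1., np.exp(-np.log(2) * (100.0 - values) / 50.0))
# two half-lives old -> 0.25, one half-life old -> 0.5, at max_val -> 1.0,
# and anything newer than max_val is capped at 1.0
print(decay)  # [0.25 0.5  1.   1.  ]
```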
|
||||
|
||||
|
||||
def jaccard(cooccurrence):
|
||||
"""Helper method to calculate the Jaccard similarity of a matrix of cooccurrences
|
||||
"""Helper method to calculate the Jaccard similarity of a matrix of co-occurrences
|
||||
Args:
|
||||
cooccurrence (scipy.sparse.csc_matrix): the symmetric matrix of cooccurrences of items
|
||||
cooccurrence (np.array): the symmetric matrix of co-occurrences of items
|
||||
Returns:
|
||||
scipy.sparse.coo_matrix: The matrix of Jaccard similarities between any two items
|
||||
np.array: The matrix of Jaccard similarities between any two items
|
||||
"""
|
||||
coo = cooccurrence.tocoo()
|
||||
denom = coo.diagonal()[coo.row] + coo.diagonal()[coo.col] - coo.data
|
||||
return coo_matrix((np.divide(coo.data, denom, out=np.zeros_like(coo.data), where=(denom != 0.0)),
|
||||
(coo.row, coo.col)),
|
||||
shape=coo.shape).tocsc()
|
||||
|
||||
diag = cooccurrence.diagonal()
|
||||
diag_rows = np.expand_dims(diag, axis=0)
|
||||
diag_cols = np.expand_dims(diag, axis=1)
|
||||
|
||||
with np.errstate(invalid='ignore', divide='ignore'):
|
||||
result = cooccurrence / (diag_rows + diag_cols - cooccurrence)
|
||||
|
||||
return np.array(result)
|
||||
|
||||
|
||||
def lift(cooccurrence):
|
||||
"""Helper method to calculate the Lift of a matrix of cooccurrences
|
||||
"""Helper method to calculate the Lift of a matrix of co-occurrences
|
||||
Args:
|
||||
cooccurrence (scipy.sparse.csc_matrix): the symmetric matrix of cooccurrences of items
|
||||
cooccurrence (np.array): the symmetric matrix of co-occurrences of items
|
||||
Returns:
|
||||
scipy.sparse.coo_matrix: The matrix of Lifts between any two items
|
||||
np.array: The matrix of Lifts between any two items
|
||||
"""
|
||||
coo = cooccurrence.tocoo()
|
||||
denom = coo.diagonal()[coo.row] * coo.diagonal()[coo.col]
|
||||
return coo_matrix((np.divide(coo.data, denom, out=np.zeros_like(coo.data), where=(denom != 0.0)),
|
||||
(coo.row, coo.col)),
|
||||
shape=coo.shape).tocsc()
|
||||
|
||||
diag = cooccurrence.diagonal()
|
||||
diag_rows = np.expand_dims(diag, axis=0)
|
||||
diag_cols = np.expand_dims(diag, axis=1)
|
||||
|
||||
with np.errstate(invalid='ignore', divide='ignore'):
|
||||
result = cooccurrence / (diag_rows * diag_cols)
|
||||
|
||||
return np.array(result)
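As an illustration of the dense implementations above, a sketch with a made-up 2x2 co-occurrence matrix (item 0 seen by 3 users, item 1 by 2 users, both together by 1 user):

```python
import numpy as np

cooccurrence = np.array([[3., 1.],
                         [1., 2.]])
diag = cooccurrence.diagonal()
diag_rows = np.expand_dims(diag, axis=0)
diag_cols = np.expand_dims(diag, axis=1)

jaccard_sim = cooccurrence / (diag_rows + diag_cols - cooccurrence)
# [[1.   0.25]
#  [0.25 1.  ]]
lift_sim = cooccurrence / (diag_rows * diag_cols)
# [[0.3333 0.1667]
#  [0.1667 0.5   ]]
```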
|
||||
|
|
|
@ -148,7 +148,7 @@ def load_pandas_df(
|
|||
|
||||
Args:
|
||||
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m")
|
||||
header (list or tuple): Rating dataset header. If None, ratings are not loaded.
|
||||
header (list or tuple or None): Rating dataset header. If None, ratings are not loaded.
|
||||
local_cache_path (str): Path where to cache the zip file locally
|
||||
title_col (str): Movie title column name. If None, the title column is not loaded.
|
||||
genres_col (str): Genres column name. Genres are '|' separated string.
|
||||
|
|
|
@ -1,92 +0,0 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
"""
|
||||
Collection of numpy based splitters
|
||||
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
def numpy_stratified_split(X, ratio=0.75, seed=123):
|
||||
|
||||
"""
|
||||
Split the user/item affinity matrix into train and test set matrices while mantaining
|
||||
local (i.e. per user) ratios.
|
||||
|
||||
Args:
|
||||
X (np.array, int): a sparse matrix
|
||||
ratio (scalar, float): fraction of the entire dataset to constitute the train set
|
||||
seed (scalar, int): random seed
|
||||
|
||||
Returns:
|
||||
Xtr (np.array, int): train set user/item affinity matrix
|
||||
Xtst (np.array, int): test set user/item affinity matrix
|
||||
|
||||
Basic mechanics:
|
||||
Main points :
|
||||
|
||||
1. In a typical recommender problem, different users rate a different number of items,
|
||||
and therefore the user/affinity matrix has a sparse structure with variable number
|
||||
of zeroes (unrated items) per row (user). Cutting a total amount of ratings will
|
||||
result in a non-homogenou distribution between train and test set, i.e. some test
|
||||
users may have many ratings while other very little if none.
|
||||
|
||||
2. In an unsupervised learning problem, no explicit answer is given. For this reason
|
||||
the split needs to be implemented in a different way then in supervised learningself.
|
||||
In the latter, one typically split the dataset by rows (by examples), ending up with
|
||||
the same number of feautures but different number of examples in the train/test setself.
|
||||
This scheme does not work in the unsupervised case, as part of the rated items needs to
|
||||
be used as a test set for fixed number of users.
|
||||
|
||||
Solution:
|
||||
|
||||
1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated
|
||||
items. For example, if user1 has rated 4 items and user2 10, cutting 25% will correspond to
|
||||
1 and 2.6 ratings in the test set, approximated as 1 and 3 according to the round() function.
|
||||
In this way, the 0.75 ratio is satified both locally and globally, preserving the original
|
||||
distribution of ratings across the train and test set.
|
||||
|
||||
2. It is easy (and fast) to satisfy this requirements by creating the test via element subtraction
|
||||
from the original datatset X. We first create two copies of X; for each user we select a random
|
||||
sample of local size ratio (point 1) and erase the remaining ratings, obtaining in this way the
|
||||
train set matrix Xtst. The train set matrix is obtained in the opposite way.
|
||||
|
||||
|
||||
"""
|
||||
|
||||
np.random.seed(seed) # set the random seed
|
||||
|
||||
test_cut = int((1 - ratio) * 100) # percentage of ratings to go in the test set
|
||||
|
||||
# initialize train and test set matrices
|
||||
Xtr = X.copy()
|
||||
Xtst = X.copy()
|
||||
|
||||
# find the number of rated movies per user
|
||||
rated = np.sum(Xtr != 0, axis=1)
|
||||
|
||||
# for each user, cut down a test_size% for the test set
|
||||
tst = np.around((rated * test_cut) / 100).astype(int)
|
||||
|
||||
Nusers, Nitems = X.shape # total number of users and items
|
||||
|
||||
for u in range(Nusers):
|
||||
# For each user obtain the index of rated movies
|
||||
idx = np.asarray(np.where(Xtr[u] != 0))[0].tolist()
|
||||
|
||||
# extract a random subset of size n from the set of rated movies without repetition
|
||||
idx_tst = np.random.choice(idx, tst[u], replace=False)
|
||||
idx_train = list(set(idx).difference(set(idx_tst)))
|
||||
|
||||
Xtr[
|
||||
u, idx_tst
|
||||
] = 0 # change the selected rated movies to unrated in the train set
|
||||
Xtst[
|
||||
u, idx_train
|
||||
] = 0 # set the movies that appear already in the train set as 0
|
||||
|
||||
del idx, idx_train, idx_tst
|
||||
|
||||
return Xtr, Xtst
|
|
@ -1,6 +1,6 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sklearn.model_selection import train_test_split as sk_split
|
||||
|
||||
|
@ -173,3 +173,86 @@ def python_stratified_split(
|
|||
splits[x] = pd.concat([splits[x], group_splits[x]])
|
||||
|
||||
return splits
|
||||
|
||||
|
||||
def numpy_stratified_split(X, ratio=0.75, seed=123):
|
||||
|
||||
"""
|
||||
Split the user/item affinity matrix (sparse matrix) into train and test set matrices while maintaining
|
||||
local (i.e. per user) ratios.
|
||||
|
||||
Args:
|
||||
X (np.array, int): a sparse matrix to be split
|
||||
ratio (scalar, float): fraction of the entire dataset to constitute the train set
|
||||
seed (scalar, int): random seed
|
||||
|
||||
Returns:
|
||||
Xtr (np.array, int): train set user/item affinity matrix
|
||||
Xtst (np.array, int): test set user/item affinity matrix
|
||||
|
||||
Basic mechanics:
|
||||
Main points :
|
||||
|
||||
1. In a typical recommender problem, different users rate a different number of items,
|
||||
and therefore the user/affinity matrix has a sparse structure with variable number
|
||||
of zeroes (unrated items) per row (user). Cutting a total amount of ratings will
|
||||
result in a non-homogeneous distribution between train and test set, i.e. some test
users may have many ratings while others may have very few or none.
|
||||
|
||||
2. In an unsupervised learning problem, no explicit answer is given. For this reason
|
||||
the split needs to be implemented in a different way than in supervised learning.
In the latter, one typically splits the dataset by rows (by examples), ending up with
the same number of features but a different number of examples in the train/test sets.
This scheme does not work in the unsupervised case, as part of the rated items needs to
be used as a test set for a fixed number of users.
|
||||
|
||||
Solution:
|
||||
|
||||
1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated
|
||||
items. For example, if user1 has rated 4 items and user2 10, cutting 25% will correspond to
1 and 2.5 ratings in the test set, approximated as 1 and 2 by the round() function.
In this way, the 0.75 ratio is approximately satisfied both locally and globally, preserving the original
|
||||
distribution of ratings across the train and test set.
|
||||
|
||||
2. It is easy (and fast) to satisfy this requirement by creating the test set via element subtraction
from the original dataset X. We first create two copies of X; for each user we select a random
sample of local size ratio (point 1) and erase the remaining ratings, obtaining in this way the
test set matrix Xtst. The train set matrix Xtr is obtained in the opposite way.
|
||||
|
||||
|
||||
"""
|
||||
|
||||
np.random.seed(seed) # set the random seed
|
||||
|
||||
test_cut = int((1 - ratio) * 100) # percentage of ratings to go in the test set
|
||||
|
||||
# initialize train and test set matrices
|
||||
Xtr = X.copy()
|
||||
Xtst = X.copy()
|
||||
|
||||
# find the number of rated movies per user
|
||||
rated = np.sum(Xtr != 0, axis=1)
|
||||
|
||||
# for each user, cut down a test_size% for the test set
|
||||
tst = np.around((rated * test_cut) / 100).astype(int)
|
||||
|
||||
Nusers, Nitems = X.shape # total number of users and items
|
||||
|
||||
for u in range(Nusers):
|
||||
# For each user obtain the index of rated movies
|
||||
idx = np.asarray(np.where(Xtr[u] != 0))[0].tolist()
|
||||
|
||||
# extract a random subset of size n from the set of rated movies without repetition
|
||||
idx_tst = np.random.choice(idx, tst[u], replace=False)
|
||||
idx_train = list(set(idx).difference(set(idx_tst)))
|
||||
|
||||
Xtr[
|
||||
u, idx_tst
|
||||
] = 0 # change the selected rated movies to unrated in the train set
|
||||
Xtst[
|
||||
u, idx_train
|
||||
] = 0 # set the movies that appear already in the train set as 0
|
||||
|
||||
del idx, idx_train, idx_tst
|
||||
|
||||
return Xtr, Xtst
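A minimal usage sketch of the splitter added above (toy affinity matrix, assuming `numpy_stratified_split` is in scope):

```python
import numpy as np

# 3 users x 4 items; zeros mean "not rated"
X = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 5, 4],
])

Xtr, Xtst = numpy_stratified_split(X, ratio=0.75, seed=42)

# every rating ends up in exactly one of the two matrices
assert np.array_equal(Xtr + Xtst, X)
# per-user held-out counts follow the 25% local ratio: here [1, 0, 1]
print((Xtst != 0).sum(axis=1))
```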
|
||||
|
|
|
@ -56,7 +56,7 @@ def spark_chrono_split(
|
|||
Args:
|
||||
data (spark.DataFrame): Spark DataFrame to be split.
|
||||
ratio (float or list): Ratio for splitting data. If it is a single float number
|
||||
it splits data into two halfs and the ratio argument indicates the ratio of
|
||||
it splits data into two sets and the ratio argument indicates the ratio of
|
||||
training data set; if it is a list of float numbers, the splitter splits
|
||||
data into several portions corresponding to the split ratios. If a list is
|
||||
provided and the ratios are not summed to 1, they will be normalized.
|
||||
|
@ -93,7 +93,7 @@ def spark_chrono_split(
|
|||
ratio = ratio if multi_split else [ratio, 1 - ratio]
|
||||
ratio_index = np.cumsum(ratio)
|
||||
|
||||
window_spec = Window.partitionBy(split_by_column).orderBy(col(col_timestamp).desc())
|
||||
window_spec = Window.partitionBy(split_by_column).orderBy(col(col_timestamp))
|
||||
|
||||
rating_grouped = (
|
||||
data.groupBy(split_by_column)
|
||||
|
@ -141,6 +141,8 @@ def spark_stratified_split(
|
|||
training data set; if it is a list of float numbers, the splitter splits
|
||||
data into several portions corresponding to the split ratios. If a list is
|
||||
provided and the ratios are not summed to 1, they will be normalized.
|
||||
Earlier indexed splits will have earlier times
|
||||
(e.g. the latest time per user or item in split[0] <= the earliest time per user or item in split[1])
|
||||
seed (int): Seed.
|
||||
min_rating (int): minimum number of ratings for user or item.
|
||||
filter_by (str): either "user" or "item", depending on which of the two is to filter
|
||||
|
@ -216,10 +218,12 @@ def spark_timestamp_split(
|
|||
Args:
|
||||
data (spark.DataFrame): Spark DataFrame to be split.
|
||||
ratio (float or list): Ratio for splitting data. If it is a single float number
|
||||
it splits data into two halfs and the ratio argument indicates the ratio of
|
||||
it splits data into two sets and the ratio argument indicates the ratio of
|
||||
training data set; if it is a list of float numbers, the splitter splits
|
||||
data into several portions corresponding to the split ratios. If a list is
|
||||
provided and the ratios are not summed to 1, they will be normalized.
|
||||
Earlier indexed splits will have earlier times
|
||||
(e.g. the latest time in split[0] <= the earliest time in split[1])
|
||||
col_user (str): column name of user IDs.
|
||||
col_item (str): column name of item IDs.
|
||||
col_timestamp (str): column name of timestamps. Float number represented in
|
||||
|
@ -233,7 +237,7 @@ def spark_timestamp_split(
|
|||
ratio = ratio if multi_split else [ratio, 1 - ratio]
|
||||
ratio_index = np.cumsum(ratio)
|
||||
|
||||
window_spec = Window.orderBy(col(col_timestamp).desc())
|
||||
window_spec = Window.orderBy(col(col_timestamp))
|
||||
rating = data.withColumn("rank", row_number().over(window_spec))
|
||||
|
||||
data_count = rating.count()
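A compressed sketch (toy data, hypothetical column names, assuming a local SparkSession) of why the ascending `orderBy` matters: ranking by ascending timestamp and cutting at the ratio boundary puts the earliest interactions into split[0] and the latest into split[1], which is what the updated docstrings promise.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("u1", "i1", 100), ("u1", "i2", 200), ("u1", "i3", 300), ("u1", "i4", 400)],
    ["userID", "itemID", "timestamp"],
)

# rank 1 is the oldest interaction per user because the ordering is ascending
window_spec = Window.partitionBy("userID").orderBy(col("timestamp"))
ranked = df.withColumn("rank", row_number().over(window_spec))

# the real splitter derives the cut from the cumulative ratio;
# for this toy frame and ratio=0.75 that is the three oldest rows per user
train = ranked.filter(col("rank") <= 3).drop("rank")
test = ranked.filter(col("rank") > 3).drop("rank")
```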
|
||||
|
|
|
@ -26,19 +26,6 @@ log = logging.getLogger(__name__)
|
|||
|
||||
|
||||
class AffinityMatrix:
|
||||
"""
|
||||
|
||||
Args:
|
||||
df (pd.DataFrame): a dataframe containing the data
|
||||
col_user (str): default name for user column
|
||||
col_item (str): default name for item column
|
||||
col_rating (str): default name for rating columns
|
||||
col_time (str): default name for timestamp columns
|
||||
save_model (Bool): if True it saves the item/user maps
|
||||
save_path (str): default path to save item/user maps
|
||||
|
||||
"""
|
||||
|
||||
# initialize class parameters
|
||||
def __init__(
|
||||
self,
|
||||
|
@ -47,11 +34,18 @@ class AffinityMatrix:
|
|||
col_item=DEFAULT_ITEM_COL,
|
||||
col_rating=DEFAULT_RATING_COL,
|
||||
col_pred=PREDICTION_COL,
|
||||
col_time=DEFAULT_TIMESTAMP_COL,
|
||||
save_path=None,
|
||||
debug=False,
|
||||
):
|
||||
"""Generate the user/item affinity matrix from a pandas dataframe and vice versa
|
||||
|
||||
Args:
|
||||
DF (pd.DataFrame): a dataframe containing the data
|
||||
col_user (str): default name for user column
|
||||
col_item (str): default name for item column
|
||||
col_rating (str): default name for rating columns
|
||||
save_path (str): default path to save item/user maps
|
||||
|
||||
"""
|
||||
self.df = DF # dataframe
|
||||
|
||||
# pandas DF parameters
|
||||
|
@ -63,12 +57,10 @@ class AffinityMatrix:
|
|||
# Options to save the model for future use
|
||||
self.save_path = save_path
|
||||
|
||||
def gen_index(self):
|
||||
def _gen_index(self):
|
||||
|
||||
"""
|
||||
Generate the user/item index
|
||||
|
||||
Returns:
|
||||
Generate the user/item index:
|
||||
map_users, map_items: dictionaries mapping the original user/item index to matrix indices
|
||||
map_back_users, map_back_items: dictionaries to map back the matrix elements to the original
|
||||
dataframe indices
|
||||
|
@ -105,13 +97,13 @@ class AffinityMatrix:
|
|||
self.df_.loc[:, "hashedUsers"] = self.df_[self.col_user].map(self.map_users)
|
||||
|
||||
# optionally save the inverse dictionary to work with trained models
|
||||
if self.save_path != None:
|
||||
if self.save_path is not None:
|
||||
|
||||
np.save(self.save_path_ + "/user_dict", self.map_users)
|
||||
np.save(self.save_path_ + "/item_dict", self.map_items)
|
||||
np.save(self.save_path + "/user_dict", self.map_users)
|
||||
np.save(self.save_path + "/item_dict", self.map_items)
|
||||
|
||||
np.save(self.save_path_ + "/user_back_dict", self.map_back_users)
|
||||
np.save(self.save_path_ + "/item_back_dict", self.map_back_items)
|
||||
np.save(self.save_path + "/user_back_dict", self.map_back_users)
|
||||
np.save(self.save_path + "/item_back_dict", self.map_back_items)
|
||||
|
||||
def gen_affinity_matrix(self):
|
||||
|
||||
|
@ -135,7 +127,7 @@ class AffinityMatrix:
|
|||
|
||||
log.info("Generating the user/item affinity matrix...")
|
||||
|
||||
self.gen_index()
|
||||
self._gen_index()
|
||||
|
||||
ratings = self.df_[self.col_rating] # ratings
|
||||
itm_id = self.df_["hashedItems"] # itm_id serving as columns
|
||||
|
|
|
@ -582,10 +582,6 @@ def get_top_k_items(dataframe, col_user=DEFAULT_USER_COL, col_rating=DEFAULT_RAT
|
|||
Return:
|
||||
pd.DataFrame: DataFrame of top k items for each user.
|
||||
"""
|
||||
tmp = dataframe.copy()
|
||||
tmp[col_rating] = tmp[col_rating].astype(float)
|
||||
return (
|
||||
tmp.groupby(col_user, as_index=False)
|
||||
.apply(lambda x: x.nlargest(k, col_rating))
|
||||
.reset_index()
|
||||
)
|
||||
return (dataframe.groupby(col_user, as_index=False)
|
||||
.apply(lambda x: x.nlargest(k, col_rating))
|
||||
.reset_index(drop=True))
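For reference, a small sketch (toy ratings, hypothetical column names) of what the simplified groupby/nlargest pattern above returns:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 11, 12, 10, 13],
    "rating": [3.0, 5.0, 4.0, 2.0, 5.0],
})

top_2 = (ratings.groupby("userID", as_index=False)
         .apply(lambda x: x.nlargest(2, "rating"))
         .reset_index(drop=True))
# keeps the two highest-rated rows per user:
#    userID  itemID  rating
# 0       1      11     5.0
# 1       1      12     4.0
# 2       2      13     5.0
# 3       2      10     2.0
```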
|
||||
|
|
|
@ -16,6 +16,6 @@ SIM_COOCCUR = "cooccurrence"
|
|||
SIM_JACCARD = "jaccard"
|
||||
SIM_LIFT = "lift"
|
||||
|
||||
HASHED_ITEMS = "hashedItems"
|
||||
HASHED_USERS = "hashedUsers"
|
||||
INDEXED_ITEMS = "indexedItems"
|
||||
INDEXED_USERS = "indexedUsers"
|
||||
|
||||
|
|
|
@ -5,43 +5,20 @@
|
|||
Reference implementation of SAR in python/numpy/pandas.
|
||||
|
||||
This is not meant to be particularly performant or scalable, just
|
||||
as a simple and readable implementation.
|
||||
a simple and readable implementation.
|
||||
"""
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import logging
|
||||
from scipy import sparse
|
||||
|
||||
from reco_utils.common.python_utils import jaccard, lift
|
||||
from reco_utils.common.python_utils import jaccard, lift, exponential_decay
|
||||
|
||||
from reco_utils.common.constants import (
|
||||
DEFAULT_USER_COL,
|
||||
DEFAULT_ITEM_COL,
|
||||
DEFAULT_RATING_COL,
|
||||
DEFAULT_TIMESTAMP_COL,
|
||||
PREDICTION_COL,
|
||||
)
|
||||
from reco_utils.common import constants
|
||||
from reco_utils.recommender import sar
|
||||
|
||||
from reco_utils.recommender.sar import (
|
||||
SIM_JACCARD,
|
||||
SIM_LIFT,
|
||||
SIM_COOCCUR,
|
||||
HASHED_USERS,
|
||||
HASHED_ITEMS,
|
||||
)
|
||||
from reco_utils.recommender.sar import (
|
||||
TIME_DECAY_COEFFICIENT,
|
||||
TIME_NOW,
|
||||
TIMEDECAY_FORMULA,
|
||||
THRESHOLD,
|
||||
)
|
||||
|
||||
"""
|
||||
enable logging here, or set it manually with --log=INFO when running an example file:
disabled because logging output contaminates stdout on Databricks Spark clusters
|
||||
"""
|
||||
# logging.basicConfig(level=logging.INFO)
|
||||
log = logging.getLogger(__name__)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class SARSingleNode:
|
||||
|
@ -50,125 +27,73 @@ class SARSingleNode:
|
|||
def __init__(
|
||||
self,
|
||||
remove_seen=True,
|
||||
col_user=DEFAULT_USER_COL,
|
||||
col_item=DEFAULT_ITEM_COL,
|
||||
col_rating=DEFAULT_RATING_COL,
|
||||
col_timestamp=DEFAULT_TIMESTAMP_COL,
|
||||
similarity_type=SIM_JACCARD,
|
||||
time_decay_coefficient=TIME_DECAY_COEFFICIENT,
|
||||
time_now=TIME_NOW,
|
||||
timedecay_formula=TIMEDECAY_FORMULA,
|
||||
threshold=THRESHOLD,
|
||||
debug=False,
|
||||
col_user=constants.DEFAULT_USER_COL,
|
||||
col_item=constants.DEFAULT_ITEM_COL,
|
||||
col_rating=constants.DEFAULT_RATING_COL,
|
||||
col_timestamp=constants.DEFAULT_TIMESTAMP_COL,
|
||||
col_prediction=constants.PREDICTION_COL,
|
||||
similarity_type=sar.SIM_JACCARD,
|
||||
time_decay_coefficient=sar.TIME_DECAY_COEFFICIENT,
|
||||
time_now=sar.TIME_NOW,
|
||||
timedecay_formula=sar.TIMEDECAY_FORMULA,
|
||||
threshold=sar.THRESHOLD,
|
||||
):
|
||||
"""Initialize model parameters
|
||||
|
||||
Args:
|
||||
remove_seen (bool): whether to remove items observed in training when making recommendations
|
||||
col_user (str): user column name
|
||||
col_item (str): item column name
|
||||
col_rating (str): rating column name
|
||||
col_timestamp (str): timestamp column name
|
||||
col_prediction (str): prediction column name
|
||||
similarity_type (str): [None, 'jaccard', 'lift'] option for computing item-item similarity
|
||||
time_decay_coefficient (float): number of days till ratings are decayed by 1/2
|
||||
time_now (int): current time for time decay calculation
|
||||
timedecay_formula (bool): flag to apply time decay
|
||||
threshold (int): item-item co-occurrences below this threshold will be removed
|
||||
"""
|
||||
self.col_rating = col_rating
|
||||
self.col_item = col_item
|
||||
self.col_user = col_user
|
||||
# default values for all SAR algos
|
||||
self.col_timestamp = col_timestamp
|
||||
self.col_prediction = col_prediction
|
||||
|
||||
self.remove_seen = remove_seen
|
||||
|
||||
# time of item-item similarity
|
||||
self.similarity_type = similarity_type
|
||||
# denominator in time decay. Zero makes time decay irrelevant
|
||||
self.time_decay_coefficient = time_decay_coefficient
|
||||
# toggle the computation of time decay group by formula
|
||||
self.timedecay_formula = timedecay_formula
|
||||
# current time for time decay calculation
|
||||
# convert to seconds
|
||||
self.time_decay_half_life = time_decay_coefficient * 24 * 60 * 60
|
||||
self.time_decay_flag = timedecay_formula
|
||||
self.time_now = time_now
|
||||
# cooccurrence matrix threshold
|
||||
self.threshold = threshold
|
||||
# debug the code
|
||||
self.debug = debug
|
||||
# log the length of operations
|
||||
self.timer_log = []
|
||||
|
||||
# array of indexes for rows and columns of users and items in training set
|
||||
self.index = None
|
||||
self.model_str = "sar_ref"
|
||||
self.model = self
|
||||
self.user_affinity = None
|
||||
self.item_similarity = None
|
||||
|
||||
# threshold - items below this number get set to zero in coocurrence counts
|
||||
assert self.threshold > 0
|
||||
# threshold - items below this number get set to zero in co-occurrence counts
|
||||
if self.threshold <= 0:
|
||||
raise ValueError('Threshold cannot be <= 0')
|
||||
|
||||
# more columns which are used internally
|
||||
self._col_hashed_items = HASHED_ITEMS
|
||||
self._col_hashed_users = HASHED_USERS
|
||||
# Column for mapping user / item ids to internal indices
|
||||
self.col_item_id = sar.INDEXED_ITEMS
|
||||
self.col_user_id = sar.INDEXED_USERS
|
||||
|
||||
# Obtain all the users and items from both training and test data
|
||||
self.unique_users = None
|
||||
self.unique_items = None
|
||||
# store training set index for future use during prediction
|
||||
self.index = None
|
||||
self.n_users = None
|
||||
self.n_items = None
|
||||
|
||||
# user2rowID map for prediction method to look up user affinity vectors
|
||||
self.user_map_dict = None
|
||||
# mapping for item to matrix element
|
||||
self.item_map_dict = None
|
||||
self.user2index = None
|
||||
self.item2index = None
|
||||
|
||||
# the opposite of the above map - map array index to actual string ID
|
||||
self.index2user = None
|
||||
self.index2item = None
|
||||
|
||||
# affinity scores for the recommendation
|
||||
self.scores = None
|
||||
|
||||
def set_index(
|
||||
self,
|
||||
unique_users,
|
||||
unique_items,
|
||||
user_map_dict,
|
||||
item_map_dict,
|
||||
index2user,
|
||||
index2item,
|
||||
):
|
||||
"""MVP2 temporary function to set the index of the sparse dataframe.
|
||||
In future releases this will be carried out into the data object and index will be provided
|
||||
with the data"""
|
||||
|
||||
# original IDs of users and items in a list
|
||||
# later as we modify the algorithm these might not be needed (can use dictionary keys
|
||||
# instead)
|
||||
self.unique_users = unique_users
|
||||
self.unique_items = unique_items
|
||||
|
||||
# mapping of original IDs to actual matrix elements
|
||||
self.user_map_dict = user_map_dict
|
||||
self.item_map_dict = item_map_dict
|
||||
|
||||
# reverse mapping of matrix index to an item
|
||||
# TODO: we can make this into an array as well
|
||||
self.index2user = index2user
|
||||
self.index2item = index2item
|
||||
|
||||
# stateful time function
|
||||
def time(self):
|
||||
"""
|
||||
Time a particular section of the code - call this once to set the state somewhere
|
||||
in the code, then call it again to return the elapsed time since last call.
|
||||
Call again to set the time and so on...
|
||||
|
||||
Returns:
|
||||
None if we're not in debug mode - doesn't do anything
|
||||
False if timer started
|
||||
time in seconds since the last time time function was called
|
||||
"""
|
||||
if self.debug:
|
||||
from time import time
|
||||
|
||||
if self.start_time is None:
|
||||
self.start_time = time()
|
||||
return False
|
||||
else:
|
||||
answer = time() - self.start_time
|
||||
# reset state
|
||||
self.start_time = None
|
||||
return answer
|
||||
else:
|
||||
return None
|
||||
|
||||
def compute_affinity_matrix(self, df, n_users, n_items):
|
||||
""" Affinity matrix
|
||||
The user-affinity matrix can be constructed by treating the users and items as
|
||||
|
@ -176,407 +101,269 @@ class SARSingleNode:
|
|||
the ratings as the event weights. We convert between different sparse-matrix
|
||||
formats to de-duplicate user-item pairs, otherwise they will get added up.
|
||||
Args:
|
||||
df (pd.DataFrame): Hashed df of users and items.
|
||||
df (pd.DataFrame): Indexed df of users and items.
|
||||
n_users (int): Number of users.
|
||||
n_items (int): Number of items.
|
||||
Returns:
|
||||
scipy.csr: Affinity matrix in Compressed Sparse Row (CSR) format.
|
||||
sparse.csr: Affinity matrix in Compressed Sparse Row (CSR) format.
|
||||
"""
|
||||
user_affinity = (
|
||||
sparse.coo_matrix(
|
||||
(
|
||||
df[self.col_rating],
|
||||
(df[self._col_hashed_users], df[self._col_hashed_items]),
|
||||
),
|
||||
shape=(n_users, n_items),
|
||||
)
|
||||
.todok()
|
||||
.tocsr()
|
||||
)
|
||||
return user_affinity
|
||||
|
||||
return sparse.coo_matrix(
|
||||
(df[self.col_rating], (df[self.col_user_id], df[self.col_item_id])),
|
||||
shape=(n_users, n_items),
|
||||
).tocsr()
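A small sketch (made-up index arrays and ratings) of how `scipy.sparse.coo_matrix` assembles the user/item affinity matrix from the indexed columns, as in the method above:

```python
import numpy as np
from scipy import sparse

ratings = np.array([5.0, 3.0, 1.0])
user_ids = np.array([0, 0, 1])  # row indices (users)
item_ids = np.array([0, 2, 1])  # column indices (items)

affinity = sparse.coo_matrix(
    (ratings, (user_ids, item_ids)), shape=(2, 3)
).tocsr()

print(affinity.toarray())
# [[5. 0. 3.]
#  [0. 1. 0.]]
```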
|
||||
|
||||
def compute_coocurrence_matrix(self, df, n_users, n_items):
|
||||
""" Coocurrence matrix
|
||||
""" Co-occurrence matrix
|
||||
C = U'.transpose() * U'
|
||||
where U' is the user_affinity matrix with 1's as values (instead of ratings).
|
||||
|
||||
Args:
|
||||
df (pd.DataFrame): Hashed df of users and items.
|
||||
df (pd.DataFrame): Indexed df of users and items.
|
||||
n_users (int): Number of users.
|
||||
n_items (int): Number of items.
|
||||
Returns:
|
||||
np.array: Coocurrence matrix
|
||||
np.array: Co-occurrence matrix
|
||||
"""
|
||||
self.time()
|
||||
float_type = df[self.col_rating].dtype
|
||||
|
||||
user_item_hits = (
|
||||
sparse.coo_matrix(
|
||||
(
|
||||
np.array([1.0] * len(df[self._col_hashed_users])).astype(float_type),
|
||||
(df[self._col_hashed_users], df[self._col_hashed_items]),
|
||||
np.repeat(1, df.shape[0]),
|
||||
(df[self.col_user_id], df[self.col_item_id]),
|
||||
),
|
||||
shape=(n_users, n_items)
|
||||
shape=(n_users, n_items),
|
||||
)
|
||||
.todok()
|
||||
.tocsr()
|
||||
.tocsr()
|
||||
.astype(df[self.col_rating].dtype)
|
||||
)
|
||||
|
||||
item_cooccurrence = user_item_hits.transpose().dot(user_item_hits)
|
||||
|
||||
if self.debug:
|
||||
cnt = df.shape[0]
|
||||
elapsed_time = self.time()
|
||||
self.timer_log += [
|
||||
"Item cooccurrence calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
|
||||
self.time()
|
||||
item_cooccurrence = item_cooccurrence.multiply(
|
||||
item_cooccurrence >= self.threshold
|
||||
)
|
||||
if self.debug:
|
||||
elapsed_time = self.time()
|
||||
self.timer_log += [
|
||||
"Applying threshold:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
|
||||
return item_cooccurrence
|
||||
|
||||
def set_index(self, df):
|
||||
"""Generate continuous indices for users and items to reduce memory usage
|
||||
|
||||
Args:
|
||||
df (pd.DataFrame): dataframe with user and item ids
|
||||
"""
|
||||
|
||||
# Generate a map of continuous index values to items
|
||||
self.index2item = dict(enumerate(df[self.col_item].unique()))
|
||||
|
||||
# Invert the mapping from above
|
||||
self.item2index = {v: k for k, v in self.index2item.items()}
|
||||
|
||||
# Create mapping of users to continuous indices
|
||||
self.user2index = {x[1]: x[0] for x in enumerate(df[self.col_user].unique())}
|
||||
|
||||
# set values for the total count of users and items
|
||||
self.n_users = len(self.user2index)
|
||||
self.n_items = len(self.index2item)
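A sketch of what `set_index` computes on a toy dataframe (made-up IDs; `userID`/`itemID` are the default column names assumed here), mirroring the dictionary construction above:

```python
import pandas as pd

df = pd.DataFrame({
    "userID": ["u5", "u5", "u9"],
    "itemID": ["iB", "iA", "iB"],
    "rating": [4.0, 2.0, 5.0],
})

index2item = dict(enumerate(df["itemID"].unique()))                  # {0: 'iB', 1: 'iA'}
item2index = {v: k for k, v in index2item.items()}                   # {'iB': 0, 'iA': 1}
user2index = {x[1]: x[0] for x in enumerate(df["userID"].unique())}  # {'u5': 0, 'u9': 1}

n_users, n_items = len(user2index), len(index2item)                  # 2, 2
```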
|
||||
|
||||
def fit(self, df):
|
||||
"""Main fit method for SAR"""
|
||||
"""Main fit method for SAR
|
||||
|
||||
log.info("Collecting user affinity matrix...")
|
||||
self.time()
|
||||
# use the same floating type for the computations as input
|
||||
float_type = df[self.col_rating].dtype
|
||||
if not np.issubdtype(float_type, np.floating):
|
||||
raise ValueError(
|
||||
"Only floating point data types are accepted for the rating column. Data type was {} "
|
||||
"instead.".format(float_type)
|
||||
)
|
||||
Args:
|
||||
df (pd.DataFrame): User item rating dataframe
|
||||
"""
|
||||
|
||||
if self.timedecay_formula:
|
||||
# WARNING: previously we would take the last value in training dataframe and set it
|
||||
# as a matrix U element
|
||||
# for each user-item pair. Now with time decay, we compute a sum over ratings given
|
||||
# by a user in the case
|
||||
# when T=np.inf, so user gets a cumulative sum of ratings for a particular item and
|
||||
# not the last rating.
|
||||
log.info("Calculating time-decayed affinities...")
|
||||
# Time Decay
|
||||
# do a group by on user item pairs and apply the formula for time decay there
|
||||
# Time T parameter is in days and input time is in seconds
|
||||
# so we do dt/60/(T*24*60)=dt/(T*24*3600)
|
||||
# Generate continuous indices if this hasn't been done
|
||||
if self.index2item is None:
|
||||
self.set_index(df)
|
||||
|
||||
# if time_now is None - get the default behaviour
|
||||
logger.info("Collecting user affinity matrix")
|
||||
if not np.issubdtype(df[self.col_rating].dtype, np.floating):
|
||||
raise TypeError("Rating column data type must be floating point")
|
||||
|
||||
# Copy the DataFrame to avoid modification of the input
|
||||
temp_df = df[[self.col_user, self.col_item, self.col_rating]].copy()
|
||||
|
||||
if self.time_decay_flag:
|
||||
logger.info("Calculating time-decayed affinities")
|
||||
# if time_now is None use the latest time
|
||||
if not self.time_now:
|
||||
self.time_now = df[self.col_timestamp].max()
|
||||
|
||||
# optimization - pre-compute time decay exponential which multiplies the ratings
|
||||
expo_fun = lambda x: np.exp(
|
||||
-np.log(2.0)
|
||||
* (self.time_now - x)
|
||||
/ (self.time_decay_coefficient * 24.0 * 3600)
|
||||
# apply time decay to each rating
|
||||
temp_df[self.col_rating] *= exponential_decay(
|
||||
value=df[self.col_timestamp],
|
||||
max_val=self.time_now,
|
||||
half_life=self.time_decay_half_life,
|
||||
)
|
||||
|
||||
rating_exponential = df[self.col_rating].values * expo_fun(
|
||||
df[self.col_timestamp].values
|
||||
).astype(float_type)
|
||||
# update df with the affinities after the timestamp calculation
|
||||
# copy part of the data frame to avoid modification of the input
|
||||
temp_df = pd.DataFrame(
|
||||
data={
|
||||
self.col_user: df[self.col_user],
|
||||
self.col_item: df[self.col_item],
|
||||
self.col_rating: rating_exponential,
|
||||
}
|
||||
# group time decayed ratings by user-item and take the sum as the user-item affinity
|
||||
temp_df = (
|
||||
temp_df.groupby([self.col_user, self.col_item]).sum().reset_index()
|
||||
)
|
||||
newdf = temp_df.groupby([self.col_user, self.col_item]).sum().reset_index()
|
||||
|
||||
"""
|
||||
# experimental implementation of multiprocessing - in practice for smaller datasets this is not needed
|
||||
# leaving here in case anyone wants to actually try this
|
||||
# to enable, you need:
|
||||
# conda install dill>=0.2.8.1
|
||||
# pip install multiprocess>=0.70.6.1
|
||||
# from multiprocess import Pool, cpu_count
|
||||
#
|
||||
# multiprocess uses dill for python3 to serialize lambda functions
|
||||
#
|
||||
# helper function to parallelize the operation on groups
|
||||
def applyParallel(dfGrouped, func):
|
||||
with Pool(cpu_count()*2) as p:
|
||||
ret_list = p.map(func, [group for name, group in dfGrouped])
|
||||
return pd.concat(ret_list)
|
||||
|
||||
from types import MethodType
|
||||
grouped.applyParallel = MethodType(applyParallel, grouped)
|
||||
|
||||
# then replace df.apply with df.applyParallel
|
||||
"""
|
||||
|
||||
"""
|
||||
Original implementation of groupby and apply - without optimization
|
||||
rating_series = grouped.apply(lambda x: np.sum(np.array(x[self.col_rating]) * np.exp(
|
||||
-np.log(2.) * (self.time_now - np.array(x[self.col_timestamp])) / (
|
||||
self.time_decay_coefficient * 24. * 3600))))
|
||||
"""
|
||||
|
||||
else:
|
||||
# without time decay we take the last user-provided rating supplied in the dataset as the
|
||||
# final rating for the user-item pair
|
||||
log.info("Deduplicating the user-item counts")
|
||||
newdf = df.drop_duplicates([self.col_user, self.col_item])[
|
||||
[self.col_user, self.col_item, self.col_rating]
|
||||
]
|
||||
# without time decay use the latest user-item rating in the dataset as the affinity score
|
||||
logger.info("De-duplicating the user-item counts")
|
||||
temp_df = temp_df.drop_duplicates(
|
||||
[self.col_user, self.col_item], keep="last"
|
||||
)
|
||||
|
||||
if self.debug:
|
||||
elapsed_time = self.time()
|
||||
cnt = newdf.shape[0]
|
||||
self.timer_log += [
|
||||
"Affinity calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
logger.info("Creating index columns")
|
||||
# Map users and items according to the two dicts. Add the two new columns to temp_df.
|
||||
temp_df.loc[:, self.col_item_id] = temp_df[self.col_item].map(self.item2index)
|
||||
temp_df.loc[:, self.col_user_id] = temp_df[self.col_user].map(self.user2index)
|
||||
|
||||
self.time()
|
||||
log.info("Creating index columns...")
|
||||
# Hash users and items according to the two dicts. Add the two new columns to newdf.
|
||||
newdf.loc[:, self._col_hashed_items] = newdf[self.col_item].map(
|
||||
self.item_map_dict
|
||||
)
|
||||
newdf.loc[:, self._col_hashed_users] = newdf[self.col_user].map(
|
||||
self.user_map_dict
|
||||
)
|
||||
|
||||
# store training set index for future use during prediction
|
||||
# DO NOT USE .values as the warning message suggests
|
||||
self.index = newdf[[self._col_hashed_users, self._col_hashed_items]].values
|
||||
|
||||
n_items = len(self.unique_items)
|
||||
n_users = len(self.unique_users)
|
||||
seen_items = None
|
||||
if self.remove_seen:
|
||||
# retain seen items for removal at prediction time
|
||||
seen_items = temp_df[[self.col_user_id, self.col_item_id]].values
|
||||
|
||||
# Affinity matrix
|
||||
log.info("Building user affinity sparse matrix...")
|
||||
self.user_affinity = self.compute_affinity_matrix(newdf, n_users, n_items)
|
||||
|
||||
if self.debug:
|
||||
elapsed_time = self.time()
|
||||
self.timer_log += [
|
||||
"Indexing and affinity matrix construction:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
|
||||
# Calculate item cooccurrence
|
||||
log.info("Calculating item cooccurrence...")
|
||||
item_cooccurrence = self.compute_coocurrence_matrix(newdf, n_users, n_items)
|
||||
|
||||
log.info("Calculating item similarity...")
|
||||
similarity_type = (
|
||||
SIM_COOCCUR if self.similarity_type is None else self.similarity_type
|
||||
logger.info("Building user affinity sparse matrix")
|
||||
self.user_affinity = self.compute_affinity_matrix(
|
||||
temp_df, self.n_users, self.n_items
|
||||
)
|
||||
|
||||
self.time()
|
||||
if similarity_type == SIM_COOCCUR:
|
||||
# Calculate item co-occurrence
|
||||
logger.info("Calculating item co-occurrence")
|
||||
item_cooccurrence = self.compute_coocurrence_matrix(
|
||||
temp_df, self.n_users, self.n_items
|
||||
)
|
||||
|
||||
# Free up some space
|
||||
del temp_df
|
||||
|
||||
logger.info("Calculating item similarity")
|
||||
if self.similarity_type == sar.SIM_COOCCUR:
|
||||
self.item_similarity = item_cooccurrence
|
||||
elif similarity_type == SIM_JACCARD:
|
||||
log.info("Calculating jaccard ...")
|
||||
elif self.similarity_type == sar.SIM_JACCARD:
|
||||
logger.info("Calculating jaccard")
|
||||
self.item_similarity = jaccard(item_cooccurrence)
|
||||
elif similarity_type == SIM_LIFT:
|
||||
log.info("Calculating lift ...")
|
||||
# Free up some space
|
||||
del item_cooccurrence
|
||||
elif self.similarity_type == sar.SIM_LIFT:
|
||||
logger.info("Calculating lift")
|
||||
self.item_similarity = lift(item_cooccurrence)
|
||||
# Free up some space
|
||||
del item_cooccurrence
|
||||
else:
|
||||
raise ValueError("Unknown similarity type: {0}".format(similarity_type))
|
||||
raise ValueError(
|
||||
"Unknown similarity type: {0}".format(self.similarity_type)
|
||||
)
|
||||
|
||||
if self.debug and (
|
||||
similarity_type == SIM_JACCARD or similarity_type == SIM_LIFT
|
||||
):
|
||||
elapsed_time = self.time()
|
||||
self.timer_log += [
|
||||
"Item similarity calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
|
||||
# Calculate raw scores with a matrix multiplication.
|
||||
log.info("Calculating recommendation scores...")
|
||||
self.time()
|
||||
# Calculate raw scores with a matrix multiplication
|
||||
logger.info("Calculating recommendation scores")
|
||||
self.scores = self.user_affinity.dot(self.item_similarity)
|
||||
|
||||
if self.debug:
|
||||
elapsed_time = self.time()
|
||||
self.timer_log += [
|
||||
"Score calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
|
||||
% (cnt, elapsed_time, float(cnt) / elapsed_time)
|
||||
]
|
||||
# Remove items in the train set so recommended items are always novel
|
||||
if self.remove_seen:
|
||||
logger.info("Removing seen items")
|
||||
self.scores[seen_items[:, 0], seen_items[:, 1]] = -np.inf
|
||||
|
||||
log.info("done training")
|
||||
logger.info("Done training")
|
||||
|
||||
def recommend_k_items(self, test, top_k=10, sort_top_k=False):
|
||||
"""Recommend top K items for all users which are in the test set
|
||||
|
||||
Args:
|
||||
|
||||
test (pd.DataFrame): user to test
|
||||
top_k (int): number of top items to recommend
|
||||
sort_top_k (bool): flag to sort top k results
|
||||
Returns:
|
||||
pd.DataFrame: A DataFrame that contains top k recommendation items for each user.
|
||||
pd.DataFrame: top k recommendation items for each user
|
||||
"""
|
||||
|
||||
# pick users from test set and
|
||||
test_users = test[self.col_user].unique()
|
||||
try:
|
||||
test_users_training_ids = np.array(
|
||||
[self.user_map_dict[user] for user in test_users]
|
||||
)
|
||||
except KeyError():
|
||||
msg = "SAR cannot score test set users which are not in the training set"
|
||||
log.error(msg)
|
||||
raise ValueError(msg)
|
||||
# get user / item indices from test set
|
||||
user_ids = test[self.col_user].drop_duplicates().map(self.user2index).values
|
||||
if any(np.isnan(user_ids)):
|
||||
raise ValueError("SAR cannot score users that are not in the training set")
|
||||
|
||||
# shorthand
|
||||
scores = self.scores
|
||||
# extract only the scores for the test users
|
||||
test_scores = self.scores[user_ids, :]
|
||||
|
||||
# Convert to dense, the following operations are easier.
|
||||
log.info("Converting to dense matrix...")
|
||||
if isinstance(scores, np.matrixlib.defmatrix.matrix):
|
||||
scores_dense = np.array(scores)
|
||||
else:
|
||||
scores_dense = scores.todense()
|
||||
# ensure we're working with a dense matrix
|
||||
if isinstance(test_scores, sparse.spmatrix):
|
||||
test_scores = test_scores.todense()
|
||||
|
||||
# Mask out items in the train set. This only makes sense for some
|
||||
# problems (where a user wouldn't interact with an item more than once).
|
||||
if self.remove_seen:
|
||||
log.info("Removing seen items...")
|
||||
scores_dense[self.index[:, 0], self.index[:, 1]] = 0
|
||||
# get top K items and scores
|
||||
logger.info("Getting top K")
|
||||
# this determines the un-ordered top-k item indices for each user
|
||||
top_items = np.argpartition(test_scores, -top_k, axis=1)[:, -top_k:]
|
||||
top_scores = test_scores[np.arange(test_scores.shape[0])[:, None], top_items]
|
||||
|
||||
# Get top K items and scores.
|
||||
log.info("Getting top K...")
|
||||
top_items = np.argpartition(scores_dense, -top_k, axis=1)[:, -top_k:]
|
||||
top_scores = scores_dense[np.arange(scores_dense.shape[0])[:, None], top_items]
|
||||
if sort_top_k:
|
||||
sort_ind = np.argsort(-top_scores)
|
||||
top_items = top_items[np.arange(top_items.shape[0])[:, None], sort_ind]
|
||||
top_scores = top_scores[np.arange(top_scores.shape[0])[:, None], sort_ind]
|
||||
|
||||
log.info("Select users from the test set")
|
||||
top_items = top_items[test_users_training_ids, :]
|
||||
top_scores = top_scores[test_users_training_ids, :]
|
||||
|
||||
log.info("Creating output dataframe...")
|
||||
|
||||
# Convert to np.array (from view) and flatten
|
||||
top_items = np.reshape(np.array(top_items), -1)
|
||||
top_scores = np.reshape(np.array(top_scores), -1)
|
||||
|
||||
userids = []
|
||||
for u in test_users:
|
||||
userids.extend([u] * top_k)
|
||||
|
||||
results = pd.DataFrame.from_dict(
|
||||
df = pd.DataFrame(
|
||||
{
|
||||
self.col_user: userids,
|
||||
self.col_item: top_items,
|
||||
self.col_rating: top_scores,
|
||||
self.col_user: np.repeat(
|
||||
test[self.col_user].drop_duplicates().values, top_k
|
||||
),
|
||||
self.col_item: [
|
||||
self.index2item[item] for item in np.array(top_items).flatten()
|
||||
],
|
||||
self.col_prediction: np.array(top_scores).flatten(),
|
||||
}
|
||||
)
|
||||
|
||||
# remap user and item indices to IDs
|
||||
results[self.col_item] = results[self.col_item].map(self.index2item)
|
||||
|
||||
# do final sort
|
||||
if sort_top_k:
|
||||
results = (
|
||||
results.sort_values(
|
||||
by=[self.col_user, self.col_rating], ascending=False
|
||||
)
|
||||
.groupby(self.col_user)
|
||||
.apply(lambda x: x)
|
||||
)
|
||||
|
||||
# format the dataframe in the end to conform to Suprise return type
|
||||
log.info("Formatting output")
|
||||
|
||||
# modify test to make it compatible with
|
||||
|
||||
return (
|
||||
results[[self.col_user, self.col_item, self.col_rating]]
|
||||
.rename(columns={self.col_rating: PREDICTION_COL})
|
||||
.astype(
|
||||
{
|
||||
self.col_user: _user_item_return_type(),
|
||||
self.col_item: _user_item_return_type(),
|
||||
PREDICTION_COL: self.scores.dtype,
|
||||
}
|
||||
)
|
||||
# ensure datatypes are correct
|
||||
df = df.astype(
|
||||
dtype={
|
||||
self.col_user: str,
|
||||
self.col_item: str,
|
||||
self.col_prediction: self.scores.dtype,
|
||||
}
|
||||
)
|
||||
|
||||
# drop seen items
|
||||
return df.replace(-np.inf, np.nan).dropna()
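The top-k selection in `recommend_k_items` uses `np.argpartition` to pick the k best columns per row without a full sort, then `np.argsort` only when `sort_top_k=True`; a small sketch with made-up scores:

```python
import numpy as np

test_scores = np.array([[0.1, 0.9, 0.4, 0.7],
                        [0.8, 0.2, 0.6, 0.3]])
top_k = 2

# unordered indices of the two largest scores in each row
top_items = np.argpartition(test_scores, -top_k, axis=1)[:, -top_k:]
top_scores = test_scores[np.arange(test_scores.shape[0])[:, None], top_items]

# optional ordering from best to worst, as done when sort_top_k=True
sort_ind = np.argsort(-top_scores)
top_items = top_items[np.arange(top_items.shape[0])[:, None], sort_ind]
top_scores = top_scores[np.arange(top_scores.shape[0])[:, None], sort_ind]
# top_items -> [[1 3] [0 2]], top_scores -> [[0.9 0.7] [0.8 0.6]]
```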
|
||||
|
||||
def predict(self, test):
|
||||
"""Output SAR scores for only the users-items pairs which are in the test set
|
||||
|
||||
Args:
|
||||
test (pd.DataFrame): DataFrame that contains ground-truth of user-item ratings.
|
||||
|
||||
test (pd.DataFrame): DataFrame that contains users and items to test
|
||||
Return:
|
||||
pd.DataFrame: DataFrame contains the prediction results.
|
||||
pd.DataFrame: DataFrame contains the prediction results
|
||||
"""
|
||||
# pick users from test set and
|
||||
test_users = test[self.col_user].unique()
|
||||
try:
|
||||
training_ids = np.array([self.user_map_dict[user] for user in test_users])
|
||||
assert training_ids is not None
|
||||
except KeyError():
|
||||
msg = "SAR cannot score test set users which are not in the training set"
|
||||
log.error(msg)
|
||||
raise ValueError(msg)
|
||||
|
||||
# shorthand
|
||||
scores = self.scores
|
||||
# get user / item indices from test set
|
||||
user_ids = test[self.col_user].map(self.user2index).values
|
||||
if any(np.isnan(user_ids)):
|
||||
raise ValueError("SAR cannot score users that are not in the training set")
|
||||
|
||||
# Convert to dense, the following operations are easier.
|
||||
log.info("Converting to dense array ...")
|
||||
scores_dense = scores.toarray()
|
||||
# extract only the scores for the test users
|
||||
test_scores = self.scores[user_ids, :]
|
||||
|
||||
# take the intersection between train test items and items we actually need
|
||||
test_col_hashed_users = test[self.col_user].map(self.user_map_dict)
|
||||
test_col_hashed_items = test[self.col_item].map(self.item_map_dict)
|
||||
# convert and flatten scores into an array
|
||||
if isinstance(test_scores, sparse.spmatrix):
|
||||
test_scores = test_scores.todense()
|
||||
|
||||
test_index = pd.concat(
|
||||
[test_col_hashed_users, test_col_hashed_items], axis=1
|
||||
).values
|
||||
aset = set([tuple(x) for x in self.index])
|
||||
bset = set([tuple(x) for x in test_index])
|
||||
item_ids = test[self.col_item].map(self.item2index).values
|
||||
nans = np.isnan(item_ids)
|
||||
if any(nans):
|
||||
# predict 0 for items not seen during training
|
||||
test_scores = np.append(test_scores, np.zeros((self.n_users, 1)), axis=1)
|
||||
item_ids[nans] = self.n_items
|
||||
item_ids = item_ids.astype("int64")
|
||||
|
||||
common_index = np.array([x for x in aset & bset])
|
||||
|
||||
# Mask out items in the train set. This only makes sense for some
|
||||
# problems (where a user wouldn't interact with an item more than once).
|
||||
if self.remove_seen and len(aset & bset) > 0:
|
||||
log.info("Removing seen items...")
|
||||
scores_dense[common_index[:, 0], common_index[:, 1]] = 0
|
||||
|
||||
final_scores = scores_dense[test_index[:, 0], test_index[:, 1]]
|
||||
|
||||
results = pd.DataFrame.from_dict(
|
||||
df = pd.DataFrame(
|
||||
{
|
||||
self.col_user: test_index[:, 0],
|
||||
self.col_item: test_index[:, 1],
|
||||
self.col_rating: final_scores
|
||||
self.col_user: test[self.col_user].values,
|
||||
self.col_item: test[self.col_item].values,
|
||||
self.col_prediction: test_scores[
|
||||
np.arange(test_scores.shape[0]), item_ids
|
||||
],
|
||||
}
|
||||
)
|
||||
|
||||
# remap user and item indices to IDs
|
||||
results[self.col_user] = results[self.col_user].map(self.index2user)
|
||||
results[self.col_item] = results[self.col_item].map(self.index2item)
|
||||
|
||||
# format the dataframe in the end to conform to Suprise return type
|
||||
log.info("Formatting output")
|
||||
|
||||
# modify test to make it compatible with
|
||||
return (
|
||||
results[[self.col_user, self.col_item, self.col_rating]]
|
||||
.rename(columns={self.col_rating: PREDICTION_COL})
|
||||
.astype(
|
||||
{
|
||||
self.col_user: _user_item_return_type(),
|
||||
self.col_item: _user_item_return_type(),
|
||||
PREDICTION_COL: self.scores.dtype,
|
||||
}
|
||||
)
|
||||
# ensure datatypes are correct
|
||||
df = df.astype(
|
||||
dtype={
|
||||
self.col_user: str,
|
||||
self.col_item: str,
|
||||
self.col_prediction: self.scores.dtype,
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
def _user_item_return_type():
|
||||
return str
|
||||
return df
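How `predict` deals with test items never seen during training deserves a small sketch (toy arrays, simplified to one score row per test row): a zero column is appended to the score matrix and every unseen item id is pointed at it, so unseen items receive a prediction of 0.

```python
import numpy as np

n_users, n_items = 2, 3
test_scores = np.array([[6.5, 1.0, 5.5],
                        [0.2, 1.0, 0.0]])

# item ids of the test rows; NaN marks an item absent from training
item_ids = np.array([0.0, np.nan])
nans = np.isnan(item_ids)

# append a zero column and point the unseen item at it
test_scores = np.append(test_scores, np.zeros((n_users, 1)), axis=1)
item_ids[nans] = n_items
item_ids = item_ids.astype("int64")

predictions = test_scores[np.arange(test_scores.shape[0]), item_ids]
print(predictions)  # [6.5 0. ]
```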
|
||||
|
|
|
@ -0,0 +1,139 @@
|
|||
#!/usr/bin/python
|
||||
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
# This script creates yaml files to build conda environments
|
||||
# For generating a conda file for running only python code:
|
||||
# $ python generate_conda_file.py
|
||||
# For generating a conda file for running python gpu:
|
||||
# $ python generate_conda_file.py --gpu
|
||||
# For generating a conda file for running pyspark:
|
||||
# $ python generate_conda_file.py --pyspark
|
||||
# For generating a conda file for running python gpu and pyspark:
|
||||
# $ python generate_conda_file.py --gpu --pyspark
|
||||
# For generating a conda file for running python gpu and pyspark with a particular version:
|
||||
# $ python generate_conda_file.py --gpu --pyspark-version 2.4.0
|
||||
|
||||
import argparse
|
||||
|
||||
|
||||
CHANNELS = [
|
||||
'conda-forge',
|
||||
'pytorch',
|
||||
'fastai',
|
||||
'defaults',
|
||||
]
|
||||
|
||||
CONDA_BASE = {
|
||||
'dask': 'dask>=0.17.1',
|
||||
'fastai': 'fastai>=1.0.40',
|
||||
'fastparquet': 'fastparquet>=0.1.6',
|
||||
'gitpython': 'gitpython>=2.1.8',
|
||||
'ipykernel': 'ipykernel>=4.6.1',
|
||||
'jupyter': 'jupyter>=1.0.0',
|
||||
'matplotlib': 'matplotlib>=2.2.2',
|
||||
'numpy': 'numpy>=1.13.3',
|
||||
'pandas': 'pandas>=0.23.4',
|
||||
'pymongo': 'pymongo>=3.6.1',
|
||||
'python': 'python==3.6.8',
|
||||
'pytest': 'pytest>=3.6.4',
|
||||
'seaborn': 'seaborn>=0.8.1',
|
||||
'scikit-learn': 'scikit-learn==0.19.1',
|
||||
'scipy': 'scipy>=1.0.0',
|
||||
'scikit-surprise': 'scikit-surprise>=1.0.6',
|
||||
'tensorflow': 'tensorflow==1.12.0',
|
||||
}
|
||||
|
||||
CONDA_PYSPARK = {
|
||||
'pyarrow': 'pyarrow>=0.8.0',
|
||||
'pyspark': 'pyspark==2.3.1',
|
||||
}
|
||||
|
||||
CONDA_GPU = {
|
||||
'numba': 'numba>=0.38.1',
|
||||
'tensorflow': 'tensorflow-gpu==1.12.0',
|
||||
}
|
||||
|
||||
PIP_BASE = {
|
||||
'azureml-sdk[notebooks,contrib]': 'azureml-sdk[notebooks,contrib]>=1.0.8',
|
||||
'azure-storage': 'azure-storage>=0.36.0',
|
||||
'black': 'black>=18.6b4',
|
||||
'dataclasses': 'dataclasses>=0.6',
|
||||
'hyperopt': 'hyperopt==0.1.1',
|
||||
'idna': 'idna==2.7',
|
||||
'memory-profiler': 'memory-profiler>=0.54.0',
|
||||
'nvidia-ml-py3': 'nvidia-ml-py3>=7.352.0',
|
||||
'papermill': 'papermill>=0.15.0',
|
||||
'pydocumentdb': 'pydocumentdb>=2.3.3',
|
||||
}
|
||||
|
||||
PIP_PYSPARK = {}
|
||||
PIP_GPU = {}
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser(description='This script generates a conda file for different environments. Plain python is the default, flags can be used to add packages needed to support pyspark and gpu functionality')
|
||||
parser.add_argument('--name', help='specify name of conda environment')
|
||||
parser.add_argument('--gpu', action="store_true", help='include packages for gpu support')
|
||||
parser.add_argument('--pyspark', action="store_true", help='include packages for pyspark support')
|
||||
parser.add_argument('--pyspark-version', help='provide specific version of pyspark to use')
|
||||
args = parser.parse_args()
|
||||
|
||||
# check pyspark version
|
||||
if args.pyspark_version is not None:
|
||||
args.pyspark = True
|
||||
pyspark_version_info = args.pyspark_version.split('.')
|
||||
if len(pyspark_version_info) != 3 or any([not x.isdigit() for x in pyspark_version_info]):
|
||||
raise TypeError('Pyspark version input must be valid numeric format (e.g. --pyspark-version=2.3.1)')
|
||||
|
||||
# set name for environment and output yaml file
|
||||
conda_env = 'reco_base'
|
||||
if args.gpu and args.pyspark:
|
||||
conda_env = 'reco_full'
|
||||
elif args.gpu:
|
||||
conda_env = 'reco_gpu'
|
||||
elif args.pyspark:
|
||||
conda_env = 'reco_pyspark'
|
||||
|
||||
# overwrite environment name with user input
|
||||
if args.name is not None:
|
||||
conda_env = args.name
|
||||
|
||||
# update conda and pip packages based on flags provided
|
||||
conda_packages = CONDA_BASE
|
||||
pip_packages = PIP_BASE
|
||||
if args.pyspark:
|
||||
conda_packages.update(CONDA_PYSPARK)
|
||||
conda_packages['pyspark'] = 'pyspark=={}'.format(args.pyspark_version)
|
||||
pip_packages.update(PIP_PYSPARK)
|
||||
if args.gpu:
|
||||
conda_packages.update(CONDA_GPU)
|
||||
pip_packages.update(PIP_GPU)
|
||||
|
||||
# write out yaml file
|
||||
conda_file = '{}.yaml'.format(conda_env)
|
||||
with open(conda_file, 'w') as f:
|
||||
f.write('name: {}\n'.format(conda_env))
|
||||
f.write('channels:\n')
|
||||
for channel in CHANNELS:
|
||||
f.write('- {}\n'.format(channel))
|
||||
f.write('dependencies:\n')
|
||||
for conda_package in conda_packages.values():
|
||||
f.write('- {}\n'.format(conda_package))
|
||||
f.write('- pip:\n')
|
||||
for pip_package in pip_packages.values():
|
||||
f.write(' - {}\n'.format(pip_package))
|
||||
|
||||
print("""Generated conda file: {conda_file}
|
||||
|
||||
To create the conda environment:
|
||||
$ conda env create -f {conda_file}
|
||||
|
||||
To update the conda environment:
|
||||
$ conda env update -f {conda_file}
|
||||
|
||||
To register the conda environment in Jupyter:
|
||||
$ conda activate {conda_env}
|
||||
$ python -m ipykernel install --user --name {conda_env} --display-name "Python ({conda_env})"
|
||||
""".format(conda_env=conda_env, conda_file=conda_file))
|
|
@ -133,10 +133,7 @@ def demo_usage_data(header, sar_settings):
|
|||
@pytest.fixture(scope="module")
|
||||
def demo_usage_data_spark(spark, demo_usage_data, header):
|
||||
data_local = demo_usage_data[[x[1] for x in header.items()]]
|
||||
# TODO: install pyArrow in DS VM
|
||||
# spark.conf.set("spark.sql.execution.arrow.enabled", "true")
|
||||
data = spark.createDataFrame(data_local)
|
||||
return data
|
||||
return spark.createDataFrame(data_local)
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
|
@ -147,14 +144,20 @@ def notebooks():
|
|||
paths = {
|
||||
"template": os.path.join(folder_notebooks, "template.ipynb"),
|
||||
"sar_single_node": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "sar_single_node_movielens.ipynb"
|
||||
folder_notebooks, "00_quick_start", "sar_movielens.ipynb"
|
||||
),
|
||||
"ncf": os.path.join(folder_notebooks, "00_quick_start", "ncf_movielens.ipynb"),
|
||||
"als_pyspark": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "als_pyspark_movielens.ipynb"
|
||||
folder_notebooks, "00_quick_start", "als_movielens.ipynb"
|
||||
),
|
||||
"fastai": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "fastai_recommendation.ipynb"
|
||||
folder_notebooks, "00_quick_start", "fastai_movielens.ipynb"
|
||||
),
|
||||
"xdeepfm_quickstart": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "xdeepfm_synthetic.ipynb"
|
||||
),
|
||||
"dkn_quickstart": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "dkn_synthetic.ipynb"
|
||||
),
|
||||
"data_split": os.path.join(
|
||||
folder_notebooks, "01_prepare_data", "data_split.ipynb"
|
||||
|
@ -171,14 +174,13 @@ def notebooks():
|
|||
"ncf_deep_dive": os.path.join(
|
||||
folder_notebooks, "02_model", "ncf_deep_dive.ipynb"
|
||||
),
|
||||
"sar_deep_dive": os.path.join(
|
||||
folder_notebooks, "02_model", "sar_deep_dive.ipynb"
|
||||
),
|
||||
"vowpal_wabbit_deep_dive": os.path.join(
|
||||
folder_notebooks, "02_model", "vowpal_wabbit_deep_dive.ipynb"
|
||||
),
|
||||
"evaluation": os.path.join(folder_notebooks, "03_evaluate", "evaluation.ipynb"),
|
||||
"fastai": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "fastai_recommendation.ipynb"
|
||||
),
|
||||
"xdeepfm_quickstart": os.path.join(
|
||||
folder_notebooks, "00_quick_start", "xdeepfm.ipynb"
|
||||
),
|
||||
"dkn_quickstart": os.path.join(folder_notebooks, "00_quick_start", "dkn.ipynb"),
|
||||
}
|
||||
return paths
|
||||
|
||||
|
|
|
@ -23,13 +23,14 @@ except ImportError:
|
|||
@pytest.mark.parametrize(
|
||||
"size, num_samples, num_movies, title_example, genres_example",
|
||||
[
|
||||
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
("1m", 1000209, 3883, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
],
|
||||
)
|
||||
def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_example):
|
||||
"""Test MovieLens dataset load into pd.DataFrame
|
||||
"""
|
||||
"""Test MovieLens dataset load into pd.DataFrame"""
|
||||
|
||||
df = movielens.load_pandas_df(size=size)
|
||||
assert len(df) == num_samples
|
||||
assert len(df.columns) == 4
|
||||
|
@ -70,8 +71,9 @@ def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_exa
|
|||
@pytest.mark.parametrize(
|
||||
"size, num_samples, num_movies, title_example, genres_example",
|
||||
[
|
||||
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
("1m", 1000209, 3883, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
],
|
||||
)
|
||||
def test_load_spark_df(size, num_samples, num_movies, title_example, genres_example):
|
|
@ -82,3 +82,30 @@ def test_surprise_svd_integration(notebooks, size, expected_values):
|
|||
for key, value in expected_values.items():
|
||||
assert results[key] == pytest.approx(value, rel=TOL)
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
@pytest.mark.parametrize(
|
||||
"size, expected_values",
|
||||
[
|
||||
("1m", dict(rmse=0.9555,
|
||||
mae=0.68493,
|
||||
rsquared=0.26547,
|
||||
exp_var=0.26615,
|
||||
map=0.50635,
|
||||
ndcg=0.99966,
|
||||
precision=0.92684,
|
||||
recall=0.50635)),
|
||||
],
|
||||
)
|
||||
def test_vw_deep_dive_integration(notebooks, size, expected_values):
|
||||
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
|
||||
pm.execute_notebook(
|
||||
notebook_path,
|
||||
OUTPUT_NOTEBOOK,
|
||||
kernel_name=KERNEL_NAME,
|
||||
parameters=dict(MOVIELENS_DATA_SIZE=size, TOP_K=10),
|
||||
)
|
||||
results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]
|
||||
|
||||
for key, value in expected_values.items():
|
||||
assert results[key] == pytest.approx(value, rel=TOL)
|
||||
|
|
|
@ -4,12 +4,11 @@
|
|||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from reco_utils.dataset.numpy_splitters import numpy_stratified_split
|
||||
from reco_utils.dataset.python_splitters import numpy_stratified_split
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def test_specs():
|
||||
|
||||
return {
|
||||
"users": 30,
|
||||
"items": 53,
|
||||
|
@ -22,20 +21,18 @@ def test_specs():
|
|||
|
||||
@pytest.fixture(scope="module")
|
||||
def affinity_matrix(test_specs):
|
||||
|
||||
"""
|
||||
Generate a random user/item affinity matrix. By increasing the likehood of 0 elements we simulate
|
||||
a typical recommeding situation where the input matrix is highly sparse.
|
||||
"""Generate a random user/item affinity matrix. By increasing the likehood of 0 elements we simulate
|
||||
a typical recommending situation where the input matrix is highly sparse.
|
||||
|
||||
Args:
|
||||
users (int): number of users (rows).
|
||||
items (int): number of items (columns).
|
||||
ratings (int): rating scale, e.g. 5 meaning rates are from 1 to 5.
|
||||
spars: probablity of obtaining zero. This roughly correponds to the sparseness.
|
||||
spars: probability of obtaining zero. This roughly corresponds to the sparseness.
|
||||
of the generated matrix. If spars = 0 then the affinity matrix is dense.
|
||||
|
||||
Returns:
|
||||
X (np array, int): sparse user/affinity matrix
|
||||
np.array: sparse user/affinity matrix of integers.
|
||||
|
||||
"""
|
||||
|
||||
|
|
|
@ -6,10 +6,6 @@ import urllib.request
|
|||
import csv
|
||||
import codecs
|
||||
|
||||
import logging
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _csv_reader_url(url, delimiter=",", encoding="utf-8"):
|
||||
ftpstream = urllib.request.urlopen(url)
|
||||
|
|
|
@ -24,8 +24,7 @@ except ImportError:
|
|||
@pytest.mark.parametrize(
|
||||
"size, num_samples, num_movies, title_example, genres_example",
|
||||
[
|
||||
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
],
|
||||
)
|
||||
def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_example):
|
||||
|
@ -72,8 +71,7 @@ def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_exa
|
|||
@pytest.mark.parametrize(
|
||||
"size, num_samples, num_movies, title_example, genres_example",
|
||||
[
|
||||
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
|
||||
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
|
||||
],
|
||||
)
|
||||
def test_load_spark_df(size, num_samples, num_movies, title_example, genres_example):
|
||||
|
|
|
@ -8,6 +8,7 @@ from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
|
|||
|
||||
TOL = 0.05
|
||||
|
||||
|
||||
@pytest.mark.smoke
|
||||
def test_sar_single_node_smoke(notebooks):
|
||||
notebook_path = notebooks["sar_single_node"]
|
||||
|
@ -68,4 +69,24 @@ def test_surprise_svd_smoke(notebooks):
|
|||
assert results["ndcg"] == pytest.approx(0.1, TOL)
|
||||
assert results["precision"] == pytest.approx(0.095, TOL)
|
||||
assert results["recall"] == pytest.approx(0.032, TOL)
|
||||
|
||||
|
||||
|
||||
def test_vw_deep_dive_smoke(notebooks):
|
||||
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
|
||||
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
|
||||
pm.execute_notebook(
|
||||
notebook_path,
|
||||
OUTPUT_NOTEBOOK,
|
||||
kernel_name=KERNEL_NAME,
|
||||
parameters=dict(MOVIELENS_DATA_SIZE="100k"),
|
||||
)
|
||||
results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]
|
||||
|
||||
assert results["rmse"] == pytest.approx(0.99575, TOL)
|
||||
assert results["mae"] == pytest.approx(0.72024, TOL)
|
||||
assert results["rsquared"] == pytest.approx(0.22961, TOL)
|
||||
assert results["exp_var"] == pytest.approx(0.22967, TOL)
|
||||
assert results["map"] == pytest.approx(0.25684, TOL)
|
||||
assert results["ndcg"] == pytest.approx(0.65339, TOL)
|
||||
assert results["precision"] == pytest.approx(0.514738, TOL)
|
||||
assert results["recall"] == pytest.approx(0.25684, TOL)
|
||||
|
|
|
@ -4,23 +4,15 @@
|
|||
import os
|
||||
import sys
|
||||
import pytest
|
||||
|
||||
# TODO: better solution??
|
||||
root = os.path.abspath(
|
||||
os.path.join(os.path.dirname(__file__), os.path.pardir, os.path.pardir)
|
||||
)
|
||||
sys.path.append(root)
|
||||
from reco_utils.dataset.url_utils import maybe_download
|
||||
|
||||
|
||||
def test_maybe_download():
|
||||
# TODO: change this file to the repo license when it is public
|
||||
file_url = "https://raw.githubusercontent.com/Microsoft/vscode/master/LICENSE.txt"
|
||||
file_url = "https://raw.githubusercontent.com/Microsoft/Recommenders/master/LICENSE"
|
||||
filepath = "license.txt"
|
||||
assert not os.path.exists(filepath)
|
||||
filepath = maybe_download(file_url, "license.txt", expected_bytes=1110)
|
||||
filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
|
||||
assert os.path.exists(filepath)
|
||||
# TODO: download again and test that the file is already there, grab the log??
|
||||
os.remove(filepath)
|
||||
with pytest.raises(IOError):
|
||||
filepath = maybe_download(file_url, "license.txt", expected_bytes=0)
|
||||
|
|
|
@ -1,28 +1,32 @@
|
|||
|
||||
import pytest
|
||||
import os
|
||||
from reco_utils.recommender.deeprec.deeprec_utils import *
|
||||
from reco_utils.recommender.deeprec.models.xDeepFM import *
|
||||
from reco_utils.recommender.deeprec.models.dkn import *
|
||||
from reco_utils.recommender.deeprec.IO.iterator import *
|
||||
from reco_utils.recommender.deeprec.IO.dkn_iterator import *
|
||||
from reco_utils.recommender.deeprec.deeprec_utils import prepare_hparams, download_deeprec_resources
|
||||
from reco_utils.recommender.deeprec.models.xDeepFM import XDeepFMModel
|
||||
from reco_utils.recommender.deeprec.models.dkn import DKN
|
||||
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator
|
||||
from reco_utils.recommender.deeprec.IO.dkn_iterator import DKNTextIterator
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def resource_path():
|
||||
return os.path.dirname(os.path.realpath(__file__))
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_xdeepfm_component_definition(resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
|
||||
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
|
||||
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
|
||||
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"xdeepfmresources.zip",
|
||||
)
|
||||
|
||||
hparams = prepare_hparams(yaml_file)
|
||||
input_creator = FFMTextIterator
|
||||
model = XDeepFMModel(hparams, input_creator)
|
||||
model = XDeepFMModel(hparams, FFMTextIterator)
|
||||
|
||||
assert model.logit is not None
|
||||
assert model.update is not None
|
||||
|
@ -32,19 +36,27 @@ def test_xdeepfm_component_definition(resource_path):
|
|||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_dkn_component_definition(resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/dkn')
|
||||
yaml_file = os.path.join(data_path, r'dkn.yaml')
|
||||
wordEmb_file = os.path.join(data_path, r'word_embeddings_100.npy')
|
||||
entityEmb_file = os.path.join(data_path, r'TransE_entity2vec_100.npy')
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "dkn")
|
||||
yaml_file = os.path.join(data_path, "dkn.yaml")
|
||||
wordEmb_file = os.path.join(data_path, "word_embeddings_100.npy")
|
||||
entityEmb_file = os.path.join(data_path, "TransE_entity2vec_100.npy")
|
||||
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'dknresources.zip')
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"dknresources.zip",
|
||||
)
|
||||
|
||||
hparams = prepare_hparams(yaml_file, wordEmb_file=wordEmb_file,
|
||||
entityEmb_file=entityEmb_file, epochs=5, learning_rate=0.0001)
|
||||
hparams = prepare_hparams(
|
||||
yaml_file,
|
||||
wordEmb_file=wordEmb_file,
|
||||
entityEmb_file=entityEmb_file,
|
||||
epochs=5,
|
||||
learning_rate=0.0001,
|
||||
)
|
||||
assert hparams is not None
|
||||
input_creator = DKNTextIterator
|
||||
model = DKN(hparams, input_creator)
|
||||
model = DKN(hparams, DKNTextIterator)
|
||||
|
||||
assert model.logit is not None
|
||||
assert model.update is not None
|
||||
|
|
|
@ -1,54 +1,68 @@
|
|||
|
||||
import pytest
|
||||
import os
|
||||
from reco_utils.recommender.deeprec.deeprec_utils import *
|
||||
from reco_utils.recommender.deeprec.IO.iterator import *
|
||||
from reco_utils.recommender.deeprec.IO.dkn_iterator import *
|
||||
import tensorflow as tf
|
||||
from reco_utils.recommender.deeprec.deeprec_utils import (
|
||||
prepare_hparams,
|
||||
download_deeprec_resources,
|
||||
load_yaml_file
|
||||
)
|
||||
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator
|
||||
from reco_utils.recommender.deeprec.IO.dkn_iterator import DKNTextIterator
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def resource_path():
|
||||
return os.path.dirname(os.path.realpath(__file__))
|
||||
|
||||
|
||||
@pytest.mark.parametrize("must_exist_attributes", [
|
||||
"FEATURE_COUNT", "data_format", "dim"
|
||||
])
|
||||
@pytest.mark.parametrize(
|
||||
"must_exist_attributes", ["FEATURE_COUNT", "data_format", "dim"]
|
||||
)
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_prepare_hparams(must_exist_attributes,resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
|
||||
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
|
||||
|
||||
def test_prepare_hparams(must_exist_attributes, resource_path):
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
|
||||
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')
|
||||
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"xdeepfmresources.zip",
|
||||
)
|
||||
hparams = prepare_hparams(yaml_file)
|
||||
assert hasattr(hparams, must_exist_attributes)
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_load_yaml_file(resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
|
||||
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
|
||||
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
|
||||
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
|
||||
'xdeepfmresources.zip')
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"xdeepfmresources.zip",
|
||||
)
|
||||
|
||||
config = load_yaml_file(yaml_file)
|
||||
assert config is not None
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_FFM_iterator(resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
|
||||
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
|
||||
data_file = os.path.join(data_path, r'sample_FFM_data.txt')
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
|
||||
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
|
||||
data_file = os.path.join(data_path, "sample_FFM_data.txt")
|
||||
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
|
||||
'xdeepfmresources.zip')
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"xdeepfmresources.zip",
|
||||
)
|
||||
|
||||
hparams = prepare_hparams(yaml_file)
|
||||
iterator = FFMTextIterator(hparams, tf.Graph())
|
||||
|
@ -56,17 +70,21 @@ def test_FFM_iterator(resource_path):
|
|||
for res in iterator.load_data_from_file(data_file):
|
||||
assert isinstance(res, dict)
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.deeprec
|
||||
def test_DKN_iterator(resource_path):
|
||||
data_path = os.path.join(resource_path, '../resources/deeprec/dkn')
|
||||
data_file = os.path.join(data_path, r'final_test_with_entity.txt')
|
||||
yaml_file = os.path.join(data_path, r'dkn.yaml')
|
||||
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "dkn")
|
||||
data_file = os.path.join(data_path, "final_test_with_entity.txt")
|
||||
yaml_file = os.path.join(data_path, "dkn.yaml")
|
||||
if not os.path.exists(yaml_file):
|
||||
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
|
||||
'dknresources.zip')
|
||||
download_deeprec_resources(
|
||||
"https://recodatasets.blob.core.windows.net/deeprec/",
|
||||
data_path,
|
||||
"dknresources.zip",
|
||||
)
|
||||
|
||||
hparams = prepare_hparams(yaml_file, wordEmb_file='', entityEmb_file='')
|
||||
hparams = prepare_hparams(yaml_file, wordEmb_file="", entityEmb_file="")
|
||||
iterator = DKNTextIterator(hparams, tf.Graph())
|
||||
assert iterator is not None
|
||||
for res in iterator.load_data_from_file(data_file):
|
||||
|
|
|
@ -10,20 +10,17 @@ from reco_utils.common.constants import (
|
|||
DEFAULT_RATING_COL,
|
||||
DEFAULT_TIMESTAMP_COL,
|
||||
)
|
||||
|
||||
from reco_utils.recommender.ncf.dataset import Dataset
|
||||
|
||||
from tests.ncf_common import python_dataset_ncf, test_specs_ncf
|
||||
|
||||
|
||||
N_NEG = 5
|
||||
N_NEG_TEST = 10
|
||||
BATCH_SIZE = 32
|
||||
|
||||
def test_data_preprocessing(python_dataset_ncf):
|
||||
# test dataset._data_preprocessing and dataset._reindex
|
||||
|
||||
def test_data_preprocessing(python_dataset_ncf):
|
||||
train, test = python_dataset_ncf
|
||||
|
||||
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
|
||||
|
||||
# shape
|
||||
|
@ -43,11 +40,9 @@ def test_data_preprocessing(python_dataset_ncf):
|
|||
assert data_row[1][DEFAULT_ITEM_COL] == data.item2id[row[1][DEFAULT_ITEM_COL]]
|
||||
assert row[1][DEFAULT_ITEM_COL] == data.id2item[data_row[1][DEFAULT_ITEM_COL]]
|
||||
|
||||
def test_train_loader(python_dataset_ncf):
|
||||
# test dataset.train_loader()
|
||||
|
||||
def test_train_loader(python_dataset_ncf):
|
||||
train, test = python_dataset_ncf
|
||||
|
||||
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
|
||||
|
||||
# collect positvie user-item dict
|
||||
|
@ -62,7 +57,6 @@ def test_train_loader(python_dataset_ncf):
|
|||
assert len(user) == BATCH_SIZE
|
||||
assert len(item) == BATCH_SIZE
|
||||
assert len(labels) == BATCH_SIZE
|
||||
|
||||
assert max(labels) == min(labels)
|
||||
|
||||
# right labels
|
||||
|
@ -73,12 +67,8 @@ def test_train_loader(python_dataset_ncf):
|
|||
assert i not in positive_pool[u]
|
||||
|
||||
data.negative_sampling()
|
||||
|
||||
label_list = []
|
||||
|
||||
batches = []
|
||||
|
||||
|
||||
for idx, batch in enumerate(data.train_loader(batch_size=1)):
|
||||
user, item, labels = batch
|
||||
assert len(user) == 1
|
||||
|
@ -99,23 +89,18 @@ def test_train_loader(python_dataset_ncf):
|
|||
|
||||
|
||||
def test_test_loader(python_dataset_ncf):
|
||||
# test for dataset.test_loader()
|
||||
|
||||
train, test = python_dataset_ncf
|
||||
|
||||
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
|
||||
|
||||
# positive user-item dict, noting that the pool is train+test
|
||||
positive_pool = {}
|
||||
df = train.append(test)
|
||||
for u in df[DEFAULT_USER_COL].unique():
|
||||
|
||||
for u in df[DEFAULT_USER_COL].unique():
|
||||
positive_pool[u] = set(df[df[DEFAULT_USER_COL] == u][DEFAULT_ITEM_COL])
|
||||
|
||||
for batch in data.test_loader():
|
||||
user, item, labels = batch
|
||||
# shape
|
||||
|
||||
assert len(user) == N_NEG_TEST + 1
|
||||
assert len(item) == N_NEG_TEST + 1
|
||||
assert len(labels) == N_NEG_TEST + 1
|
||||
|
|
|
@ -9,19 +9,19 @@ import os
|
|||
import shutil
|
||||
from reco_utils.recommender.ncf.ncf_singlenode import NCF
|
||||
from reco_utils.recommender.ncf.dataset import Dataset
|
||||
|
||||
N_NEG = 5
|
||||
N_NEG_TEST = 10
|
||||
|
||||
from reco_utils.common.constants import (
|
||||
DEFAULT_USER_COL,
|
||||
DEFAULT_ITEM_COL,
|
||||
DEFAULT_RATING_COL,
|
||||
DEFAULT_TIMESTAMP_COL,
|
||||
)
|
||||
|
||||
from tests.ncf_common import python_dataset_ncf, test_specs_ncf
|
||||
|
||||
|
||||
N_NEG = 5
|
||||
N_NEG_TEST = 10
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.parametrize(
|
||||
"model_type, n_users, n_items", [("NeuMF", 1, 1), ("GMF", 10, 10), ("MLP", 4, 8)]
|
||||
|
@ -45,6 +45,7 @@ def test_init(model_type, n_users, n_items):
|
|||
|
||||
# TODO: more parameters
|
||||
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.parametrize(
|
||||
"model_type, n_users, n_items", [("NeuMF", 5, 5), ("GMF", 5, 5), ("MLP", 5, 5)]
|
||||
|
|
|
@ -9,8 +9,6 @@ from reco_utils.common.notebook_utils import is_jupyter, is_databricks
|
|||
|
||||
@pytest.mark.notebooks
|
||||
def test_is_jupyter():
|
||||
"""Test if the module is running on Jupyter
|
||||
"""
|
||||
# Test on the terminal
|
||||
assert is_jupyter() is False
|
||||
assert is_databricks() is False
|
||||
|
|
|
@ -28,6 +28,12 @@ def test_sar_single_node_runs(notebooks):
|
|||
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
|
||||
|
||||
|
||||
@pytest.mark.notebooks
|
||||
def test_sar_deep_dive_runs(notebooks):
|
||||
notebook_path = notebooks["sar_deep_dive"]
|
||||
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
|
||||
|
||||
|
||||
@pytest.mark.notebooks
|
||||
def test_baseline_deep_dive_runs(notebooks):
|
||||
notebook_path = notebooks["baseline_deep_dive"]
|
||||
|
@ -38,3 +44,9 @@ def test_baseline_deep_dive_runs(notebooks):
|
|||
def test_surprise_deep_dive_runs(notebooks):
|
||||
notebook_path = notebooks["surprise_svd_deep_dive"]
|
||||
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
|
||||
|
||||
|
||||
@pytest.mark.notebooks
|
||||
def test_vw_deep_dive_runs(notebooks):
|
||||
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
|
||||
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
|
||||
|
|
|
@ -1,159 +0,0 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from reco_utils.dataset.numpy_splitters import numpy_stratified_split
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def test_specs():
|
||||
|
||||
return {
|
||||
"number_of_items": 50,
|
||||
"number_of_users": 20,
|
||||
"seed": 123,
|
||||
"ratio": 0.6,
|
||||
"tolerance": 0.01,
|
||||
"fluctuation": 0.02,
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_int_dataset(test_specs):
|
||||
|
||||
"""Generate a test user/item affinity Matrix"""
|
||||
|
||||
# fix the the random seed
|
||||
np.random.seed(test_specs["seed"])
|
||||
|
||||
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
|
||||
X = np.random.randint(
|
||||
low=0,
|
||||
high=6,
|
||||
size=(test_specs["number_of_users"], test_specs["number_of_items"]),
|
||||
)
|
||||
|
||||
return X
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_float_dataset(test_specs):
|
||||
|
||||
"""Generate a test user/item affinity Matrix"""
|
||||
|
||||
# fix the the random seed
|
||||
np.random.seed(test_specs["seed"])
|
||||
|
||||
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
|
||||
X = (
|
||||
np.random.random(
|
||||
size=(test_specs["number_of_users"], test_specs["number_of_items"])
|
||||
)
|
||||
* 5
|
||||
)
|
||||
|
||||
return X
|
||||
|
||||
|
||||
def test_int_numpy_stratified_splitter(test_specs, python_int_dataset):
|
||||
"""
|
||||
Test the random stratified splitter.
|
||||
"""
|
||||
|
||||
# generate a syntetic dataset
|
||||
X = python_int_dataset
|
||||
|
||||
# the splitter returns (in order): train and test user/affinity matrices, train and test datafarmes and user/items to matrix maps
|
||||
Xtr, Xtst = numpy_stratified_split(
|
||||
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
# Tests
|
||||
# check that the generated matrices have the correct dimensions
|
||||
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
|
||||
|
||||
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
|
||||
|
||||
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
|
||||
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
|
||||
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
|
||||
|
||||
# global split: check that the all dataset is split in the correct ratio
|
||||
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
1 - test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
|
||||
# that also this more stringent condition is verified. Note that user to user fluctuations in the split ratio
|
||||
# are stronger than for the entire dataset due to the random nature of the per user splitting.
|
||||
# For this reason we allow a slightly bigger tollerance, as specified in the test_specs()
|
||||
|
||||
assert (
|
||||
(Xtr_rated / X_rated <= test_specs["ratio"] + test_specs["fluctuation"]).all()
|
||||
& (Xtr_rated / X_rated >= test_specs["ratio"] - test_specs["fluctuation"]).all()
|
||||
)
|
||||
|
||||
assert (
|
||||
(
|
||||
Xtst_rated / X_rated
|
||||
<= (1 - test_specs["ratio"]) + test_specs["fluctuation"]
|
||||
).all()
|
||||
& (
|
||||
Xtst_rated / X_rated
|
||||
>= (1 - test_specs["ratio"]) - test_specs["fluctuation"]
|
||||
).all()
|
||||
)
|
||||
|
||||
|
||||
def test_float_numpy_stratified_splitter(test_specs, python_float_dataset):
|
||||
"""
|
||||
Test the random stratified splitter.
|
||||
"""
|
||||
|
||||
# generate a syntetic dataset
|
||||
X = python_float_dataset
|
||||
|
||||
# the splitter returns (in order): train and test user/affinity matrices, train and test datafarmes and user/items to matrix maps
|
||||
Xtr, Xtst = numpy_stratified_split(
|
||||
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
# Tests
|
||||
# check that the generated matrices have the correct dimensions
|
||||
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
|
||||
|
||||
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
|
||||
|
||||
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
|
||||
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
|
||||
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
|
||||
|
||||
# global split: check that the all dataset is split in the correct ratio
|
||||
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
1 - test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
|
||||
# that also this more stringent condition is verified. Note that user to user fluctuations in the split ratio
|
||||
# are stronger than for the entire dataset due to the random nature of the per user splitting.
|
||||
# For this reason we allow a slightly bigger tollerance, as specified in the test_specs()
|
||||
|
||||
assert Xtr_rated / X_rated == pytest.approx(
|
||||
test_specs["ratio"], rel=test_specs["fluctuation"]
|
||||
)
|
||||
|
||||
assert Xtst_rated / X_rated == pytest.approx(
|
||||
(1 - test_specs["ratio"]), rel=test_specs["fluctuation"]
|
||||
)
|
|
@ -18,7 +18,7 @@ TOL = 0.0001
|
|||
|
||||
|
||||
@pytest.fixture
|
||||
def target_metrics():
|
||||
def target_metrics(scope="module"):
|
||||
return {
|
||||
"rmse": pytest.approx(7.254309, TOL),
|
||||
"mae": pytest.approx(6.375, TOL),
|
||||
|
@ -92,7 +92,6 @@ def python_data():
|
|||
],
|
||||
}
|
||||
)
|
||||
|
||||
return rating_true, rating_pred, rating_nohit
|
||||
|
||||
|
||||
|
@ -120,7 +119,6 @@ def test_python_rsquared(python_data, target_metrics):
|
|||
assert rsquared(
|
||||
rating_true=rating_true, rating_pred=rating_true, col_prediction="rating"
|
||||
) == pytest.approx(1.0, TOL)
|
||||
|
||||
assert rsquared(rating_true, rating_pred) == target_metrics["rsquared"]
|
||||
|
||||
|
||||
|
@ -130,7 +128,6 @@ def test_python_exp_var(python_data, target_metrics):
|
|||
assert exp_var(
|
||||
rating_true=rating_true, rating_pred=rating_true, col_prediction="rating"
|
||||
) == pytest.approx(1.0, TOL)
|
||||
|
||||
assert exp_var(rating_true, rating_pred) == target_metrics["exp_var"]
|
||||
|
||||
|
||||
|
|
|
@ -10,10 +10,12 @@ from reco_utils.dataset.split_utils import (
|
|||
min_rating_filter_pandas,
|
||||
split_pandas_data_with_ratios,
|
||||
)
|
||||
|
||||
from reco_utils.dataset.python_splitters import (
|
||||
python_chrono_split,
|
||||
python_random_split,
|
||||
python_stratified_split,
|
||||
numpy_stratified_split,
|
||||
)
|
||||
|
||||
from reco_utils.common.constants import (
|
||||
|
@ -34,13 +36,14 @@ def test_specs():
|
|||
"ratios": [0.2, 0.3, 0.5],
|
||||
"split_numbers": [2, 3, 5],
|
||||
"tolerance": 0.01,
|
||||
"number_of_items": 50,
|
||||
"number_of_users": 20,
|
||||
"fluctuation": 0.02,
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_dataset(test_specs):
|
||||
"""Get Python labels"""
|
||||
|
||||
def random_date_generator(start_date, range_in_days):
|
||||
"""Helper function to generate random timestamps.
|
||||
|
||||
|
@ -59,13 +62,13 @@ def python_dataset(test_specs):
|
|||
|
||||
rating = pd.DataFrame(
|
||||
{
|
||||
DEFAULT_USER_COL: np.random.random_integers(
|
||||
DEFAULT_USER_COL: np.random.randint(
|
||||
1, 5, test_specs["number_of_rows"]
|
||||
),
|
||||
DEFAULT_ITEM_COL: np.random.random_integers(
|
||||
DEFAULT_ITEM_COL: np.random.randint(
|
||||
1, 15, test_specs["number_of_rows"]
|
||||
),
|
||||
DEFAULT_RATING_COL: np.random.random_integers(
|
||||
DEFAULT_RATING_COL: np.random.randint(
|
||||
1, 5, test_specs["number_of_rows"]
|
||||
),
|
||||
DEFAULT_TIMESTAMP_COL: random_date_generator(
|
||||
|
@ -73,32 +76,22 @@ def python_dataset(test_specs):
|
|||
),
|
||||
}
|
||||
)
|
||||
|
||||
return rating
|
||||
|
||||
|
||||
def test_split_pandas_data(pandas_dummy_timestamp):
|
||||
"""Test split pandas data
|
||||
"""
|
||||
df_rating = pandas_dummy_timestamp
|
||||
|
||||
splits = split_pandas_data_with_ratios(df_rating, ratios=[0.5, 0.5])
|
||||
|
||||
splits = split_pandas_data_with_ratios(pandas_dummy_timestamp, ratios=[0.5, 0.5])
|
||||
assert len(splits[0]) == 5
|
||||
assert len(splits[1]) == 5
|
||||
|
||||
splits = split_pandas_data_with_ratios(df_rating, ratios=[0.12, 0.36, 0.52])
|
||||
|
||||
assert len(splits[0]) == round(df_rating.shape[0] * 0.12)
|
||||
assert len(splits[1]) == round(df_rating.shape[0] * 0.36)
|
||||
assert len(splits[2]) == round(df_rating.shape[0] * 0.52)
|
||||
splits = split_pandas_data_with_ratios(pandas_dummy_timestamp, ratios=[0.12, 0.36, 0.52])
|
||||
shape = pandas_dummy_timestamp.shape[0]
|
||||
assert len(splits[0]) == round(shape * 0.12)
|
||||
assert len(splits[1]) == round(shape * 0.36)
|
||||
assert len(splits[2]) == round(shape * 0.52)
|
||||
|
||||
|
||||
def test_min_rating_filter(python_dataset):
|
||||
"""Test min rating filter
|
||||
"""
|
||||
df_rating = python_dataset
|
||||
|
||||
def count_filtered_rows(data, filter_by="user"):
|
||||
split_by_column = DEFAULT_USER_COL if filter_by == "user" else DEFAULT_ITEM_COL
|
||||
data_grouped = data.groupby(split_by_column)
|
||||
|
@ -110,9 +103,8 @@ def test_min_rating_filter(python_dataset):
|
|||
|
||||
return row_counts
|
||||
|
||||
df_user = min_rating_filter_pandas(df_rating, min_rating=5, filter_by="user")
|
||||
df_item = min_rating_filter_pandas(df_rating, min_rating=5, filter_by="item")
|
||||
|
||||
df_user = min_rating_filter_pandas(python_dataset, min_rating=5, filter_by="user")
|
||||
df_item = min_rating_filter_pandas(python_dataset, min_rating=5, filter_by="item")
|
||||
user_rating_counts = count_filtered_rows(df_user, filter_by="user")
|
||||
item_rating_counts = count_filtered_rows(df_item, filter_by="item")
|
||||
|
||||
|
@ -128,10 +120,8 @@ def test_random_splitter(test_specs, python_dataset):
|
|||
the testing data. A approximate match with certain level of tolerance is therefore used
|
||||
instead for tests.
|
||||
"""
|
||||
df_rating = python_dataset
|
||||
|
||||
splits = python_random_split(
|
||||
df_rating, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
python_dataset, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
)
|
||||
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
|
||||
test_specs["ratio"], test_specs["tolerance"]
|
||||
|
@ -141,7 +131,7 @@ def test_random_splitter(test_specs, python_dataset):
|
|||
)
|
||||
|
||||
splits = python_random_split(
|
||||
df_rating, ratio=test_specs["ratios"], seed=test_specs["seed"]
|
||||
python_dataset, ratio=test_specs["ratios"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
assert len(splits) == 3
|
||||
|
@ -156,7 +146,7 @@ def test_random_splitter(test_specs, python_dataset):
|
|||
)
|
||||
|
||||
splits = python_random_split(
|
||||
df_rating, ratio=test_specs["split_numbers"], seed=test_specs["seed"]
|
||||
python_dataset, ratio=test_specs["split_numbers"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
assert len(splits) == 3
|
||||
|
@ -172,12 +162,8 @@ def test_random_splitter(test_specs, python_dataset):
|
|||
|
||||
|
||||
def test_chrono_splitter(test_specs, python_dataset):
|
||||
"""Test chronological splitter for Spark dataframes.
|
||||
"""
|
||||
df_rating = python_dataset
|
||||
|
||||
splits = python_chrono_split(
|
||||
df_rating, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
|
||||
python_dataset, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
|
||||
)
|
||||
|
||||
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
|
||||
|
@ -187,27 +173,21 @@ def test_chrono_splitter(test_specs, python_dataset):
|
|||
1 - test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# Test all time stamps in test are later than that in train for all users.
|
||||
# This is for single-split case.
|
||||
all_later = []
|
||||
for user in test_specs["user_ids"]:
|
||||
df_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
|
||||
df_test = splits[1][splits[1][DEFAULT_USER_COL] == user]
|
||||
|
||||
p = product(df_train[DEFAULT_TIMESTAMP_COL], df_test[DEFAULT_TIMESTAMP_COL])
|
||||
user_later = [a <= b for (a, b) in p]
|
||||
|
||||
all_later.append(user_later)
|
||||
assert all(all_later)
|
||||
|
||||
# Test if both contains the same user list. This is because chrono split is stratified.
|
||||
users_train = splits[0][DEFAULT_USER_COL].unique()
|
||||
users_test = splits[1][DEFAULT_USER_COL].unique()
|
||||
|
||||
assert set(users_train) == set(users_test)
|
||||
|
||||
# Test all time stamps in test are later than that in train for all users.
|
||||
# This is for single-split case.
|
||||
max_train_times = splits[0][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
|
||||
min_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
|
||||
check_times = max_train_times.join(min_test_times, lsuffix='_0', rsuffix='_1')
|
||||
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_0'] < check_times[DEFAULT_TIMESTAMP_COL + '_1']).values)
|
||||
|
||||
# Test multi-split case
|
||||
splits = python_chrono_split(
|
||||
df_rating, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
|
||||
python_dataset, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
|
||||
)
|
||||
|
||||
assert len(splits) == 3
|
||||
|
@ -221,30 +201,28 @@ def test_chrono_splitter(test_specs, python_dataset):
|
|||
test_specs["ratios"][2], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# Test if all splits contain the same user list. This is because chrono split is stratified.
|
||||
users_train = splits[0][DEFAULT_USER_COL].unique()
|
||||
users_test = splits[1][DEFAULT_USER_COL].unique()
|
||||
users_val = splits[2][DEFAULT_USER_COL].unique()
|
||||
assert set(users_train) == set(users_test)
|
||||
assert set(users_train) == set(users_val)
|
||||
|
||||
# Test if timestamps are correctly split. This is for multi-split case.
|
||||
all_later = []
|
||||
for user in test_specs["user_ids"]:
|
||||
df_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
|
||||
df_valid = splits[1][splits[1][DEFAULT_USER_COL] == user]
|
||||
df_test = splits[2][splits[2][DEFAULT_USER_COL] == user]
|
||||
max_train_times = splits[0][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
|
||||
min_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
|
||||
check_times = max_train_times.join(min_test_times, lsuffix='_0', rsuffix='_1')
|
||||
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_0'] < check_times[DEFAULT_TIMESTAMP_COL + '_1']).values)
|
||||
|
||||
p1 = product(df_train[DEFAULT_TIMESTAMP_COL], df_valid[DEFAULT_TIMESTAMP_COL])
|
||||
p2 = product(df_valid[DEFAULT_TIMESTAMP_COL], df_test[DEFAULT_TIMESTAMP_COL])
|
||||
user_later_1 = [a <= b for (a, b) in p1]
|
||||
user_later_2 = [a <= b for (a, b) in p2]
|
||||
|
||||
all_later.append(user_later_1)
|
||||
all_later.append(user_later_2)
|
||||
assert all(all_later)
|
||||
max_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
|
||||
min_val_times = splits[2][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
|
||||
check_times = max_test_times.join(min_val_times, lsuffix='_1', rsuffix='_2')
|
||||
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_1'] < check_times[DEFAULT_TIMESTAMP_COL + '_2']).values)
|
||||
|
||||
|
||||
def test_stratified_splitter(test_specs, python_dataset):
|
||||
"""Test stratified splitter.
|
||||
"""
|
||||
df_rating = python_dataset
|
||||
|
||||
splits = python_stratified_split(
|
||||
df_rating, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
|
||||
python_dataset, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
|
||||
)
|
||||
|
||||
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
|
||||
|
@ -261,7 +239,7 @@ def test_stratified_splitter(test_specs, python_dataset):
|
|||
assert set(users_train) == set(users_test)
|
||||
|
||||
splits = python_stratified_split(
|
||||
df_rating, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
|
||||
python_dataset, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
|
||||
)
|
||||
|
||||
assert len(splits) == 3
|
||||
|
@ -275,3 +253,117 @@ def test_stratified_splitter(test_specs, python_dataset):
|
|||
test_specs["ratios"][2], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_int_dataset(test_specs):
|
||||
# fix the the random seed
|
||||
np.random.seed(test_specs["seed"])
|
||||
|
||||
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
|
||||
return np.random.randint(
|
||||
low=0,
|
||||
high=6,
|
||||
size=(test_specs["number_of_users"], test_specs["number_of_items"]),
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_float_dataset(test_specs):
|
||||
# fix the the random seed
|
||||
np.random.seed(test_specs["seed"])
|
||||
|
||||
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
|
||||
return np.random.random(
|
||||
size=(test_specs["number_of_users"], test_specs["number_of_items"])
|
||||
) * 5
|
||||
|
||||
|
||||
def test_int_numpy_stratified_splitter(test_specs, python_int_dataset):
|
||||
# generate a syntetic dataset
|
||||
X = python_int_dataset
|
||||
|
||||
# the splitter returns (in order): train and test user/affinity matrices, train and test datafarmes and user/items to matrix maps
|
||||
Xtr, Xtst = numpy_stratified_split(
|
||||
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
# check that the generated matrices have the correct dimensions
|
||||
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
|
||||
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
|
||||
|
||||
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
|
||||
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
|
||||
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
|
||||
|
||||
# global split: check that the all dataset is split in the correct ratio
|
||||
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
1 - test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
|
||||
# that also this more stringent condition is verified. Note that user to user fluctuations in the split ratio
|
||||
# are stronger than for the entire dataset due to the random nature of the per user splitting.
|
||||
# For this reason we allow a slightly bigger tollerance, as specified in the test_specs()
|
||||
|
||||
assert (
|
||||
(Xtr_rated / X_rated <= test_specs["ratio"] + test_specs["fluctuation"]).all()
|
||||
& (Xtr_rated / X_rated >= test_specs["ratio"] - test_specs["fluctuation"]).all()
|
||||
)
|
||||
|
||||
assert (
|
||||
(
|
||||
Xtst_rated / X_rated
|
||||
<= (1 - test_specs["ratio"]) + test_specs["fluctuation"]
|
||||
).all()
|
||||
& (
|
||||
Xtst_rated / X_rated
|
||||
>= (1 - test_specs["ratio"]) - test_specs["fluctuation"]
|
||||
).all()
|
||||
)
|
||||
|
||||
|
||||
def test_float_numpy_stratified_splitter(test_specs, python_float_dataset):
|
||||
# generate a syntetic dataset
|
||||
X = python_float_dataset
|
||||
|
||||
# the splitter returns (in order): train and test user/affinity matrices, train and test datafarmes and user/items to matrix maps
|
||||
Xtr, Xtst = numpy_stratified_split(
|
||||
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
|
||||
)
|
||||
|
||||
# Tests
|
||||
# check that the generated matrices have the correct dimensions
|
||||
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
|
||||
|
||||
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
|
||||
|
||||
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
|
||||
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
|
||||
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
|
||||
|
||||
# global split: check that the all dataset is split in the correct ratio
|
||||
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
|
||||
1 - test_specs["ratio"], test_specs["tolerance"]
|
||||
)
|
||||
|
||||
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
|
||||
# that also this more stringent condition is verified. Note that user to user fluctuations in the split ratio
|
||||
# are stronger than for the entire dataset due to the random nature of the per user splitting.
|
||||
# For this reason we allow a slightly bigger tollerance, as specified in the test_specs()
|
||||
|
||||
assert Xtr_rated / X_rated == pytest.approx(
|
||||
test_specs["ratio"], rel=test_specs["fluctuation"]
|
||||
)
|
||||
|
||||
assert Xtst_rated / X_rated == pytest.approx(
|
||||
(1 - test_specs["ratio"]), rel=test_specs["fluctuation"]
|
||||
)
|
||||
|
||||
|
|
|
@ -7,8 +7,8 @@ Test common python utils
|
|||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from scipy.sparse import csc, csc_matrix
|
||||
from reco_utils.common.python_utils import (
|
||||
exponential_decay,
|
||||
jaccard,
|
||||
lift
|
||||
)
|
||||
|
@ -17,21 +17,21 @@ TOL = 0.0001
|
|||
|
||||
|
||||
@pytest.fixture
|
||||
def target_matrices():
|
||||
J1 = np.mat('1.0, 0.0, 0.5; '
|
||||
'0.0, 1.0, 0.33333; '
|
||||
'0.5, 0.33333, 1.0')
|
||||
J2 = np.mat('1.0, 0.0, 0.0, 0.2; '
|
||||
'0.0, 1.0, 0.0, 0.0; '
|
||||
'0.0, 0.0, 1.0, 0.5; '
|
||||
'0.2, 0.0, 0.5, 1.0')
|
||||
L1 = np.mat('1.0, 0.0, 0.5; '
|
||||
'0.0, 0.5, 0.25; '
|
||||
'0.5, 0.25, 0.5')
|
||||
L2 = np.mat('0.5, 0.0, 0.0, 0.125; '
|
||||
'0.0, 0.33333, 0.0, 0.0; '
|
||||
'0.0, 0.0, 0.5, 0.25; '
|
||||
'0.125, 0.0, 0.25, 0.25')
|
||||
def target_matrices(scope="module"):
|
||||
J1 = np.array([[1.0, 0.0, 0.5],
|
||||
[0.0, 1.0, 0.33333],
|
||||
[0.5, 0.33333, 1.0]])
|
||||
J2 = np.array([[1.0, 0.0, 0.0, 0.2],
|
||||
[0.0, 1.0, 0.0, 0.0],
|
||||
[0.0, 0.0, 1.0, 0.5],
|
||||
[0.2, 0.0, 0.5, 1.0]])
|
||||
L1 = np.array([[1.0, 0.0, 0.5],
|
||||
[0.0, 0.5, 0.25],
|
||||
[0.5, 0.25, 0.5]])
|
||||
L2 = np.array([[0.5, 0.0, 0.0, 0.125],
|
||||
[0.0, 0.33333, 0.0, 0.0],
|
||||
[0.0, 0.0, 0.5, 0.25],
|
||||
[0.125, 0.0, 0.25, 0.25]])
|
||||
return {
|
||||
"jaccard1": pytest.approx(J1, TOL),
|
||||
"jaccard2": pytest.approx(J2, TOL),
|
||||
|
@ -42,36 +42,40 @@ def target_matrices():
|
|||
|
||||
@pytest.fixture(scope="module")
|
||||
def python_data():
|
||||
D1 = np.mat('1.0, 0.0, 1.0; '
|
||||
'0.0, 2.0, 1.0; '
|
||||
'1.0, 1.0, 2.0')
|
||||
cooccurrence1 = csc_matrix(D1)
|
||||
D2 = np.mat('2.0, 0.0, 0.0, 1.0; '
|
||||
'0.0, 3.0, 0.0, 0.0; '
|
||||
'0.0, 0.0, 2.0, 2.0; '
|
||||
'1.0, 0.0, 2.0, 4.0')
|
||||
cooccurrence2 = csc_matrix(D2)
|
||||
|
||||
cooccurrence1 = np.array([[1.0, 0.0, 1.0],
|
||||
[0.0, 2.0, 1.0],
|
||||
[1.0, 1.0, 2.0]])
|
||||
cooccurrence2 = np.array([[2.0, 0.0, 0.0, 1.0],
|
||||
[0.0, 3.0, 0.0, 0.0],
|
||||
[0.0, 0.0, 2.0, 2.0],
|
||||
[1.0, 0.0, 2.0, 4.0]])
|
||||
return cooccurrence1, cooccurrence2
|
||||
|
||||
|
||||
def test_python_jaccard(python_data, target_matrices):
|
||||
cooccurrence1, cooccurrence2 = python_data
|
||||
J1 = jaccard(cooccurrence1)
|
||||
assert type(J1) == csc.csc_matrix
|
||||
assert J1.todense() == target_matrices["jaccard1"]
|
||||
assert type(J1) == np.ndarray
|
||||
assert J1 == target_matrices["jaccard1"]
|
||||
|
||||
J2 = jaccard(cooccurrence2)
|
||||
assert type(J2) == csc.csc_matrix
|
||||
assert J2.todense() == target_matrices["jaccard2"]
|
||||
assert type(J2) == np.ndarray
|
||||
assert J2 == target_matrices["jaccard2"]
|
||||
|
||||
|
||||
def test_python_lift(python_data, target_matrices):
|
||||
cooccurrence1, cooccurrence2 = python_data
|
||||
L1 = lift(cooccurrence1)
|
||||
assert type(L1) == csc.csc_matrix
|
||||
assert L1.todense() == target_matrices["lift1"]
|
||||
assert type(L1) == np.ndarray
|
||||
assert L1 == target_matrices["lift1"]
|
||||
|
||||
L2 = lift(cooccurrence2)
|
||||
assert type(L2) == csc.csc_matrix
|
||||
assert L2.todense() == target_matrices["lift2"]
|
||||
assert type(L2) == np.ndarray
|
||||
assert L2 == target_matrices["lift2"]
|
||||
|
||||
|
||||
def test_exponential_decay():
|
||||
values = np.array([1, 2, 3, 4, 5, 6])
|
||||
expected = np.array([0.25, 0.35355339, 0.5, 0.70710678, 1., 1.])
|
||||
actual = exponential_decay(value=values, max_val=5, half_life=2)
|
||||
assert np.allclose(actual, expected, atol=TOL)
|
||||
|
|
|
@ -3,14 +3,12 @@
|
|||
|
||||
import pytest
|
||||
import numpy as np
|
||||
|
||||
from reco_utils.recommender.rbm.rbm import RBM
|
||||
from tests.rbm_common import test_specs, affinity_matrix
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def init_rbm():
|
||||
|
||||
return {
|
||||
"n_hidden": 100,
|
||||
"epochs": 10,
|
||||
|
@ -25,9 +23,6 @@ def init_rbm():
|
|||
|
||||
@pytest.mark.gpu
|
||||
def test_class_init(init_rbm):
|
||||
"""
|
||||
Test the init of the model class
|
||||
"""
|
||||
model = RBM(
|
||||
hidden_units=init_rbm["n_hidden"],
|
||||
training_epoch=init_rbm["epochs"],
|
||||
|
@ -59,9 +54,6 @@ def test_class_init(init_rbm):
|
|||
|
||||
@pytest.mark.gpu
|
||||
def test_train_param_init(init_rbm, affinity_matrix):
|
||||
"""
|
||||
Test the dimension of the learning parameters
|
||||
"""
|
||||
# obtain the train/test set matrices
|
||||
Xtr, Xtst = affinity_matrix
|
||||
|
||||
|
@ -86,9 +78,6 @@ def test_train_param_init(init_rbm, affinity_matrix):
|
|||
|
||||
@pytest.mark.gpu
|
||||
def test_sampling_funct(init_rbm, affinity_matrix):
|
||||
"""
|
||||
Test the sampling functions
|
||||
"""
|
||||
# obtain the train/test set matrices
|
||||
Xtr, Xtst = affinity_matrix
|
||||
|
||||
|
|
|
@@ -22,40 +22,6 @@ def _rearrange_to_test(array, row_ids, col_ids, row_map, col_map):
    return array


def _apply_sar_hash_index(model, train, test, header, pandas_new=False):
    # TODO: review this function
    # index all users and items which SAR will compute scores for
    # bugfix to get around different pandas vesions in build servers
    if test is not None:
        if pandas_new:
            df_all = pd.concat([train, test], sort=False)
        else:
            df_all = pd.concat([train, test])
    else:
        df_all = train

    # hash SAR
    # Obtain all the users and items from both training and test data
    unique_users = df_all[header["col_user"]].unique()
    unique_items = df_all[header["col_item"]].unique()

    # Hash users and items to smaller continuous space.
    # Actually, this is an ordered set - it's discrete, but .
    # This helps keep the matrices we keep in memory as small as possible.
    enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))
    enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))
    item_map_dict = {x: i for i, x in enumerate_items_1}
    user_map_dict = {x: i for i, x in enumerate_users_1}

    # the reverse of the dictionary above - array index to actual ID
    index2user = dict(enumerate_users_2)
    index2item = dict(enumerate_items_2)

    model.set_index(
        unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item
    )
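Note (not part of this commit): the helper above, which this commit removes, only mapped raw user and item IDs into a compact contiguous index space before fitting; the updated tests below suggest the model now keeps equivalent user2index / item2index lookups itself. A minimal standalone sketch of that mapping idea, with hypothetical names:

import pandas as pd

def build_index_maps(df, col_user="userID", col_item="itemID"):
    # Map raw IDs to contiguous integers (and back) so similarity/affinity
    # matrices can be indexed by small dense positions instead of raw IDs.
    unique_users = df[col_user].unique()
    unique_items = df[col_item].unique()
    user2index = {user: i for i, user in enumerate(unique_users)}
    item2index = {item: i for i, item in enumerate(unique_items)}
    index2user = dict(enumerate(unique_users))
    index2item = dict(enumerate(unique_items))
    return user2index, item2index, index2user, index2item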
def test_init(header):
    model = SARSingleNode(
        remove_seen=True, similarity_type="jaccard", **header

@@ -78,8 +44,6 @@ def test_fit(similarity_type, timedecay_formula, train_test_dummy_timestamp, hea
        **header
    )
    trainset, testset = train_test_dummy_timestamp
    _apply_sar_hash_index(model, trainset, testset, header)

    model.fit(trainset)


@@ -96,9 +60,6 @@ def test_predict(
        **header
    )
    trainset, testset = train_test_dummy_timestamp

    _apply_sar_hash_index(model, trainset, testset, header)

    model.fit(trainset)
    preds = model.predict(testset)


@@ -109,11 +70,6 @@ def test_predict(
    assert preds[PREDICTION_COL].dtype == trainset[header["col_rating"]].dtype


"""
Main SAR tests are below - load test files which are used for both Scala SAR and Python reference implementations
"""

# Tests 1-6
@pytest.mark.parametrize(
    "threshold,similarity_type,file",
    [

@@ -139,8 +95,6 @@ def test_sar_item_similarity(
        **header
    )

    _apply_sar_hash_index(model, demo_usage_data, None, header)

    model.fit(demo_usage_data)

    true_item_similarity, row_ids, col_ids = read_matrix(

@@ -152,8 +106,8 @@ def test_sar_item_similarity(
            model.item_similarity.todense(),
            row_ids,
            col_ids,
            model.item_map_dict,
            model.item_map_dict,
            model.item2index,
            model.item2index,
        )
        assert np.array_equal(
            true_item_similarity.astype(test_item_similarity.dtype),

@@ -161,11 +115,11 @@ def test_sar_item_similarity(
        )
    else:
        test_item_similarity = _rearrange_to_test(
            model.item_similarity.toarray(),
            model.item_similarity,
            row_ids,
            col_ids,
            model.item_map_dict,
            model.item_map_dict,
            model.item2index,
            model.item2index,
        )
        assert np.allclose(
            true_item_similarity.astype(test_item_similarity.dtype),

@@ -174,7 +128,6 @@ def test_sar_item_similarity(
    )


# Test 7
def test_user_affinity(demo_usage_data, sar_settings, header):
    time_now = demo_usage_data[header["col_timestamp"]].max()
    model = SARSingleNode(

@@ -185,15 +138,14 @@ def test_user_affinity(demo_usage_data, sar_settings, header):
        time_now=time_now,
        **header
    )
    _apply_sar_hash_index(model, demo_usage_data, None, header)
    model.fit(demo_usage_data)

    true_user_affinity, items = load_affinity(sar_settings["FILE_DIR"] + "user_aff.csv")
    user_index = model.user_map_dict[sar_settings["TEST_USER_ID"]]
    user_index = model.user2index[sar_settings["TEST_USER_ID"]]
    test_user_affinity = np.reshape(
        np.array(
            _rearrange_to_test(
                model.user_affinity, None, items, None, model.item_map_dict
                model.user_affinity, None, items, None, model.item2index
            )[user_index,].todense()
        ),
        -1,

@@ -205,7 +157,6 @@ def test_user_affinity(demo_usage_data, sar_settings, header):
    )


# Tests 8-10
@pytest.mark.parametrize(
    "threshold,similarity_type,file",
    [(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],

@@ -223,7 +174,6 @@ def test_userpred(
        threshold=threshold,
        **header
    )
    _apply_sar_hash_index(model, demo_usage_data, None, header)
    model.fit(demo_usage_data)

    true_items, true_scores = load_userpred(
@@ -19,6 +19,7 @@ try:
except ImportError:
    pass # skip this import if we are in pure python environment


TOL = 0.0001


@@ -93,7 +94,6 @@ def test_init_spark(spark):
def test_init_spark_rating_eval(spark_data):
    df_true, df_pred = spark_data
    evaluator = SparkRatingEvaluation(df_true, df_pred)

    assert evaluator is not None


@@ -230,7 +230,6 @@ def test_spark_map(spark_data, target_metrics):
@pytest.mark.spark
def test_spark_python_match(python_data, spark):
    # Test on the original data with k = 10.

    df_true, df_pred = python_data

    dfs_true = spark.createDataFrame(df_true)

@@ -247,11 +246,9 @@ def test_spark_python_match(python_data, spark):
        == pytest.approx(eval_spark1.ndcg_at_k(), TOL),
        map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark1.map_at_k(), TOL),
    ]

    assert all(match1)

    # Test on the original data with k = 3.

    dfs_true = spark.createDataFrame(df_true)
    dfs_pred = spark.createDataFrame(df_pred)


@@ -265,7 +262,6 @@ def test_spark_python_match(python_data, spark):
        ndcg_at_k(df_true, df_pred, k=3) == pytest.approx(eval_spark2.ndcg_at_k(), TOL),
        map_at_k(df_true, df_pred, k=3) == pytest.approx(eval_spark2.map_at_k(), TOL),
    ]

    assert all(match2)

    # Remove the first row from the original data.

@@ -285,11 +281,9 @@ def test_spark_python_match(python_data, spark):
        == pytest.approx(eval_spark3.ndcg_at_k(), TOL),
        map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark3.map_at_k(), TOL),
    ]

    assert all(match3)

    # Test with one user

    df_pred = df_pred.loc[df_pred["userID"] == 3]
    df_true = df_true.loc[df_true["userID"] == 3]


@@ -307,5 +301,4 @@ def test_spark_python_match(python_data, spark):
        == pytest.approx(eval_spark4.ndcg_at_k(), TOL),
        map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark4.map_at_k(), TOL),
    ]

    assert all(match4)
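Note (not part of this commit): the match lists above hinge on pytest.approx with a relative tolerance, so the pandas metric and the Spark metric are accepted as equal when they agree to within TOL. A quick standalone illustration of that comparison (the numbers here are made up):

import pytest

TOL = 0.0001  # same relative tolerance used in the tests above

# pytest.approx(expected, TOL) treats the second positional argument as a
# relative tolerance, so values within 0.01% of each other compare equal.
assert 0.123456 == pytest.approx(0.123449, TOL)    # within 0.01% -> passes
assert not (0.1236 == pytest.approx(0.1234, TOL))  # ~0.16% apart -> fails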
@@ -3,7 +3,6 @@

import pandas as pd
import numpy as np
from itertools import product
import pytest
from reco_utils.dataset.split_utils import min_rating_filter_spark
from reco_utils.common.constants import (

@@ -14,6 +13,8 @@ from reco_utils.common.constants import (
)

try:
    from pyspark.sql import functions as F
    from pyspark.sql.functions import col
    from reco_utils.common.spark_utils import start_or_get_spark
    from reco_utils.dataset.spark_splitters import (
        spark_chrono_split,

@@ -44,9 +45,7 @@ def python_data(test_specs):

    def random_date_generator(start_date, range_in_days):
        """Helper function to generate random timestamps.

        Reference: https://stackoverflow.com/questions/41006182/generate-random-dates-within-a
        -range-in-numpy
        Reference: https://stackoverflow.com/questions/41006182/generate-random-dates-within-a-range-in-numpy
        """
        days_to_add = np.arange(0, range_in_days)
        random_dates = []

@@ -58,13 +57,13 @@ def python_data(test_specs):

    rating = pd.DataFrame(
        {
            DEFAULT_USER_COL: np.random.random_integers(
            DEFAULT_USER_COL: np.random.randint(
                1, 5, test_specs["number_of_rows"]
            ),
            DEFAULT_ITEM_COL: np.random.random_integers(
            DEFAULT_ITEM_COL: np.random.randint(
                1, 15, test_specs["number_of_rows"]
            ),
            DEFAULT_RATING_COL: np.random.random_integers(
            DEFAULT_RATING_COL: np.random.randint(
                1, 5, test_specs["number_of_rows"]
            ),
            DEFAULT_TIMESTAMP_COL: random_date_generator(

@@ -78,22 +77,14 @@ def python_data(test_specs):

@pytest.fixture(scope="module")
def spark_dataset(python_data):
    """Get Python labels"""
    rating = python_data
    spark = start_or_get_spark("SplitterTesting")
    df_rating = spark.createDataFrame(rating)

    return df_rating
    return spark.createDataFrame(python_data)


@pytest.mark.spark
def test_min_rating_filter(spark_dataset):
    """Test min rating filter
    """
    dfs_rating = spark_dataset

    dfs_user = min_rating_filter_spark(dfs_rating, min_rating=5, filter_by="user")
    dfs_item = min_rating_filter_spark(dfs_rating, min_rating=5, filter_by="item")
    dfs_user = min_rating_filter_spark(spark_dataset, min_rating=5, filter_by="user")
    dfs_item = min_rating_filter_spark(spark_dataset, min_rating=5, filter_by="item")

    user_rating_counts = [
        x["count"] >= 5 for x in dfs_user.groupBy(DEFAULT_USER_COL).count().collect()

@@ -115,10 +106,8 @@ def test_random_splitter(test_specs, spark_dataset):
    the testing data. A approximate match with certain level of tolerance is therefore used
    instead for tests.
    """
    df_rating = spark_dataset

    splits = spark_random_split(
        df_rating, ratio=test_specs["ratio"], seed=test_specs["seed"]
        spark_dataset, ratio=test_specs["ratio"], seed=test_specs["seed"]
    )

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(

@@ -129,7 +118,7 @@ def test_random_splitter(test_specs, spark_dataset):
    )

    splits = spark_random_split(
        df_rating, ratio=test_specs["ratios"], seed=test_specs["seed"]
        spark_dataset, ratio=test_specs["ratios"], seed=test_specs["seed"]
    )

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
@@ -145,11 +134,8 @@ def test_random_splitter(test_specs, spark_dataset):

@pytest.mark.spark
def test_chrono_splitter(test_specs, spark_dataset):
    """Test chronological splitter for Spark dataframes"""
    dfs_rating = spark_dataset

    splits = spark_chrono_split(
        dfs_rating, ratio=test_specs["ratio"], filter_by="user", min_rating=10
        spark_dataset, ratio=test_specs["ratio"], filter_by="user", min_rating=10
    )

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(

@@ -169,18 +155,9 @@ def test_chrono_splitter(test_specs, spark_dataset):

    assert set(users_train) == set(users_test)

    # Test all time stamps in test are later than that in train for all users.
    all_later = []
    for user in test_specs["user_ids"]:
        dfs_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
        dfs_test = splits[1][splits[1][DEFAULT_USER_COL] == user]
    assert _if_later(splits[0], splits[1])

        user_later = _if_later(dfs_train, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)

        all_later.append(user_later)
    assert all(all_later)

    splits = spark_chrono_split(dfs_rating, ratio=test_specs["ratios"])
    splits = spark_chrono_split(spark_dataset, ratio=test_specs["ratios"])

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
        test_specs["ratios"][0], test_specs["tolerance"]

@@ -192,28 +169,14 @@ def test_chrono_splitter(test_specs, spark_dataset):
        test_specs["ratios"][2], test_specs["tolerance"]
    )

    # Test if timestamps are correctly split. This is for multi-split case.
    all_later = []
    for user in test_specs["user_ids"]:
        dfs_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
        dfs_valid = splits[1][splits[1][DEFAULT_USER_COL] == user]
        dfs_test = splits[2][splits[2][DEFAULT_USER_COL] == user]

        user_later_1 = _if_later(dfs_train, dfs_valid, col_timestamp=DEFAULT_TIMESTAMP_COL)
        user_later_2 = _if_later(dfs_valid, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)

        all_later.append(user_later_1)
        all_later.append(user_later_2)
    assert all(all_later)
    assert _if_later(splits[0], splits[1])
    assert _if_later(splits[1], splits[2])


@pytest.mark.spark
def test_stratified_splitter(test_specs, spark_dataset):
    """Test stratified splitter for Spark dataframes"""
    dfs_rating = spark_dataset

    splits = spark_stratified_split(
        dfs_rating, ratio=test_specs["ratio"], filter_by="user", min_rating=10
        spark_dataset, ratio=test_specs["ratio"], filter_by="user", min_rating=10
    )

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(

@@ -233,7 +196,7 @@ def test_stratified_splitter(test_specs, spark_dataset):

    assert set(users_train) == set(users_test)

    splits = spark_stratified_split(dfs_rating, ratio=test_specs["ratios"])
    splits = spark_stratified_split(spark_dataset, ratio=test_specs["ratios"])

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
        test_specs["ratios"][0], test_specs["tolerance"]

@@ -248,11 +211,7 @@ def test_stratified_splitter(test_specs, spark_dataset):

@pytest.mark.spark
def test_timestamp_splitter(test_specs, spark_dataset):
    """Test timestamp splitter for Spark dataframes"""
    from pyspark.sql.functions import col

    dfs_rating = spark_dataset
    dfs_rating = dfs_rating.withColumn(DEFAULT_TIMESTAMP_COL, col(DEFAULT_TIMESTAMP_COL).cast("float"))
    dfs_rating = spark_dataset.withColumn(DEFAULT_TIMESTAMP_COL, col(DEFAULT_TIMESTAMP_COL).cast("float"))

    splits = spark_timestamp_split(
        dfs_rating, ratio=test_specs["ratio"], col_timestamp=DEFAULT_TIMESTAMP_COL

@@ -265,8 +224,12 @@ def test_timestamp_splitter(test_specs, spark_dataset):
        1 - test_specs["ratio"], test_specs["tolerance"]
    )

    max_split0 = splits[0].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
    min_split1 = splits[1].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
    assert(max_split0 <= min_split1)

    # Test multi split
    splits = spark_stratified_split(dfs_rating, ratio=test_specs["ratios"])
    splits = spark_timestamp_split(dfs_rating, ratio=test_specs["ratios"])

    assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
        test_specs["ratios"][0], test_specs["tolerance"]

@@ -278,37 +241,34 @@ def test_timestamp_splitter(test_specs, spark_dataset):
        test_specs["ratios"][2], test_specs["tolerance"]
    )

    dfs_train = splits[0]
    dfs_valid = splits[1]
    dfs_test = splits[2]
    max_split0 = splits[0].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
    min_split1 = splits[1].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
    assert(max_split0 <= min_split1)

    # if valid is later than train.
    all_later_1 = _if_later(dfs_train, dfs_valid, col_timestamp=DEFAULT_TIMESTAMP_COL)
    assert all_later_1

    # if test is later than valid.
    all_later_2 = _if_later(dfs_valid, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)
    assert all_later_2
    max_split1 = splits[1].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
    min_split2 = splits[2].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
    assert(max_split1 <= min_split2)


def _if_later(data1, data2, col_timestamp=DEFAULT_TIMESTAMP_COL):
    '''Helper function to test if records in data1 are later than that in data2.
def _if_later(data1, data2):
    """Helper function to test if records in data1 are earlier than that in data2.

    Returns:
        bool: True or False indicating if data1 is earlier than data2.
    """
    x = (data1.select(DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL)
         .groupBy(DEFAULT_USER_COL)
         .agg(F.max(DEFAULT_TIMESTAMP_COL).cast('long').alias('max'))
         .collect())
    max_times = {row[DEFAULT_USER_COL]: row['max'] for row in x}

    Return:
        True or False indicating if data1 is later than data2.
    '''
    p = product(
        [
            x[col_timestamp]
            for x in data1.select(col_timestamp).collect()
        ],
        [
            x[col_timestamp]
            for x in data2.select(col_timestamp).collect()
        ],
    )
    y = (data2.select(DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL)
         .groupBy(DEFAULT_USER_COL)
         .agg(F.min(DEFAULT_TIMESTAMP_COL).cast('long').alias('min'))
         .collect())
    min_times = {row[DEFAULT_USER_COL]: row['min'] for row in y}

    if_late = [a <= b for (a, b) in p]

    return if_late
    result = True
    for user, max_time in max_times.items():
        result = result and min_times[user] >= max_time

    return result
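Note (not part of this commit): the rewritten _if_later checks, per user, that the earliest timestamp in the second dataframe is no earlier than the latest timestamp in the first. A toy pandas analogue of the same check (illustrative only, with made-up data and column names):

import pandas as pd

# Two tiny "splits" for users 1 and 2; per user, the train timestamps must
# not exceed the corresponding test timestamps for the check to pass.
train = pd.DataFrame({"userID": [1, 1, 2], "timestamp": [10, 20, 5]})
test = pd.DataFrame({"userID": [1, 2], "timestamp": [30, 7]})

max_train = train.groupby("userID")["timestamp"].max()
min_test = test.groupby("userID")["timestamp"].min()

# Same per-user comparison as the Spark version, expressed with pandas.
assert bool((min_test >= max_train).all())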
@@ -4,10 +4,8 @@

import pandas as pd
import numpy as np
import pytest
from sklearn.utils import shuffle

from reco_utils.dataset.sparse import AffinityMatrix

from reco_utils.common.constants import (
    DEFAULT_USER_COL,
    DEFAULT_ITEM_COL,

@@ -18,7 +16,6 @@ from reco_utils.common.constants import (

@pytest.fixture(scope="module")
def test_specs():

    return {"number_of_items": 50, "number_of_users": 20, "seed": 123}


@@ -40,7 +37,6 @@ def python_dataset(test_specs):
    random_dates = []

    for i in range(range_in_days):

        random_date = np.datetime64(start_date) + np.random.choice(days_to_add)
        random_dates.append(random_date)


@@ -87,10 +83,6 @@ def python_dataset(test_specs):


def test_df_to_sparse(test_specs, python_dataset):

    # generate a syntetic dataset
    df_rating = python_dataset

    # initialize the splitter
    header = {
        "col_user": DEFAULT_USER_COL,

@@ -99,25 +91,18 @@ def test_df_to_sparse(test_specs, python_dataset):
    }

    # instantiate the affinity matrix
    am = AffinityMatrix(DF=df_rating, **header)
    am = AffinityMatrix(DF=python_dataset, **header)

    # obtain the sparse matrix representation of the input dataframe
    X = am.gen_affinity_matrix()

    # Tests
    # check that the generated matrix has the correct dimensions
    assert (X.shape[0] == df_rating.userID.unique().shape[0]) & (
        X.shape[1] == df_rating.itemID.unique().shape[0]
    assert (X.shape[0] == python_dataset.userID.unique().shape[0]) & (
        X.shape[1] == python_dataset.itemID.unique().shape[0]
    )


# Test inverse mapping: from sparse matrix to dataframe


def test_sparse_to_df(test_specs, python_dataset):

    df_rating = python_dataset

    # initialize the splitter
    header = {
        "col_user": DEFAULT_USER_COL,

@@ -126,7 +111,7 @@ def test_sparse_to_df(test_specs, python_dataset):
    }

    # instantiate the the affinity matrix
    am = AffinityMatrix(DF=df_rating, **header)
    am = AffinityMatrix(DF=python_dataset, **header)

    # generate the sparse matrix representation
    X = am.gen_affinity_matrix()

@@ -137,15 +122,15 @@ def test_sparse_to_df(test_specs, python_dataset):
    # tests: check that the two dataframes have the same elements in the same positions.
    assert (
        DF.userID.values.all()
        == df_rating.sort_values(by=["userID"]).userID.values.all()
        == python_dataset.sort_values(by=["userID"]).userID.values.all()
    )

    assert (
        DF.itemID.values.all()
        == df_rating.sort_values(by=["userID"]).itemID.values.all()
        == python_dataset.sort_values(by=["userID"]).itemID.values.all()
    )

    assert (
        DF.rating.values.all()
        == df_rating.sort_values(by=["userID"]).rating.values.all()
        == python_dataset.sort_values(by=["userID"]).rating.values.all()
    )
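Note (not part of this commit): an affinity matrix of this kind is simply a users x items matrix with the ratings as entries. A self-contained sketch of building one with scipy (illustrative only; AffinityMatrix in reco_utils has its own API and extra bookkeeping such as the inverse mapping tested above):

import pandas as pd
from scipy.sparse import coo_matrix

# Tiny ratings frame with the same column names the tests use.
df = pd.DataFrame(
    {"userID": [1, 1, 2], "itemID": [10, 20, 10], "rating": [4.0, 5.0, 3.0]}
)

# Map raw IDs to row/column positions, then scatter ratings into a sparse matrix.
user_pos = {u: i for i, u in enumerate(df["userID"].unique())}
item_pos = {it: j for j, it in enumerate(df["itemID"].unique())}
X = coo_matrix(
    (
        df["rating"].to_numpy(),
        (df["userID"].map(user_pos).to_numpy(), df["itemID"].map(item_pos).to_numpy()),
    ),
    shape=(len(user_pos), len(item_pos)),
)

assert X.shape == (2, 2)  # 2 unique users x 2 unique items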
@@ -3,6 +3,7 @@ import pytest

from reco_utils.evaluation.parameter_sweep import generate_param_grid


@pytest.fixture(scope="module")
def parameter_dictionary():
    params = {

@@ -13,6 +14,7 @@ def parameter_dictionary():

    return params


def test_param_sweep(parameter_dictionary):
    params_grid = generate_param_grid(parameter_dictionary)
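Note (not part of this commit): generate_param_grid presumably expands a dictionary of candidate values into the cross-product of hyperparameter combinations. A minimal stand-in using itertools (hypothetical helper name and parameters, not the reco_utils implementation):

from itertools import product

def param_grid(params):
    # Expand {"rank": [10, 20], "reg": [0.01, 0.1]} into the 4 possible
    # hyperparameter combinations, one dict per combination.
    keys = list(params)
    return [dict(zip(keys, values)) for values in product(*params.values())]

grid = param_grid({"rank": [10, 20], "reg": [0.01, 0.1]})
assert len(grid) == 4
assert {"rank": 10, "reg": 0.01} in grid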