Updating README and adding py script

This commit is contained in:
Scott Graham 2019-02-12 08:36:37 -05:00
Parents 5a7df4d948 dbc6ab5bcb
Commit 622aa2f4c2
53 changed files with 1714 additions and 1908 deletions

2
.gitignore vendored
View file

@ -129,4 +129,4 @@ ml-1m/
ml-20m/
*.jar
*.item
*.pkl
*.pkl

View file

@ -4,7 +4,7 @@ This repository provides examples and best practices for building recommendation
- [Prepare Data](notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm
- [Model](notebooks/02_model/README.md): Building models using various recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)), Singular Value Decomposition ([SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)), etc.
- [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameteres for recommender models
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameters for recommender models
- [Operationalize](notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting train/test data. Implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
@ -22,7 +22,7 @@ To setup on your local machine:
```
cd Recommenders
python scripts/generate_conda_file.py
conda env create -f conda_base.yaml
conda env create -f reco_base.yaml
```
4. Activate the conda environment and register it with Jupyter:
```
@ -57,20 +57,22 @@ The Quick-Start and Modeling notebooks showcase how to utilize the following alg
**Algorithms**
The table below lists recommender algorithms available in the repository at the moment.
| Algorithm | Environment | Type | Description |
| --- | --- | --- | --- |
| **`Classic Recommenders`** |
| [Surprise/Singular Value Decomposition (SVD)](notebooks/00_quick_start/sar_single_node_movielens.ipynb) | Python | Collaborative Filtering | General purpose algorithm for smaller datasets |
| [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_pyspark_movielens.ipynb) | Spark | Collaborative | General purpose algorithm for larger datasets, optimized with Spark |
| **`Microsoft Recommenders`** |
| [Smart Adaptive Recommendations (SAR)](notebooks/00_quick_start/sar_single_node_movielens.ipynb) | Python / Spark | Collaborative Filtering | Generalized algorithm utilizing item similarities and can easily adapt to new users |
| [Vowpal Wabbit Family (VW)](notebooks/02_model/vowpal_wabbit_deep_dive.ipynb) | Python / Online | Collaborative, Content Based | Fast online learning algorithms, great for scenarios where user features / context are constantly changing, like real-time bidding |
| [eXtreme Deep Factorization Machine (xDeepFM)](notebooks/00_quick_start/xdeepfm.ipynb) | Python / GPU | Hybrid | Deep learning model combining implicit and explicit features |
| [Deep Knowledge-Aware Network (DKN)](notebooks/00_quick_start/dkn.ipynb) | Python / GPU | Content Based | Deep learning model incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations |
| **`Deep Learning`** |
| **Classic Recommenders** |
| [Surprise/Singular Value Decomposition (SVD)](notebooks/00_quick_start/sar_movielens.ipynb) | Python | Collaborative Filtering | General purpose algorithm for smaller datasets |
| [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) | Spark | Collaborative Filtering | General purpose algorithm for larger datasets, optimized with Spark |
| **Microsoft Recommenders** |
| [Smart Adaptive Recommendations (SAR)](notebooks/00_quick_start/sar_movielens.ipynb) | Python / Spark | Collaborative Filtering | Generalized algorithm utilizing item similarities and can easily adapt to new users |
| [Vowpal Wabbit Family (VW)](notebooks/02_model/vowpal_wabbit_deep_dive.ipynb) | Python / Online | Collaborative, Content-based Filtering | Fast online learning algorithms, great for scenarios where user features / context are constantly changing, like real-time bidding |
| [eXtreme Deep Factorization Machine (xDeepFM)](notebooks/00_quick_start/xdeepfm_synthetic.ipynb) | Python / GPU | Hybrid | Deep learning model combining implicit and explicit features |
| [Deep Knowledge-Aware Network (DKN)](notebooks/00_quick_start/dkn_synthetic.ipynb) | Python / GPU | Content-based Filtering | Deep learning model incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations |
| **Deep Learning Recommenders** |
| [Neural Collaborative Filtering (NCF)](notebooks/00_quick_start/ncf_movielens.ipynb) | Python / GPU | Collaborative Filtering | General algorithm built using a multi-layer perceptron |
| [Restricted Boltzmann Machines (RBM)](notebooks/00_quick_start/rbm_movielens.ipynb) | Python / GPU | Collaborative Filtering | Generative neural network algorithm built to learn the underlying probability distribution for user/item affinity |
| [FastAI Embedding Dot Bias (FAST)](notebooks/00_quick_start/fastai_recommendation.ipynb) | Python / GPU | Collaborative Filtering | General purpose algorithm embedding dot biases for users and items |
| [FastAI Embedding Dot Bias (FAST)](notebooks/00_quick_start/fastai_movielens.ipynb) | Python / GPU | Collaborative Filtering | General purpose algorithm embedding dot biases for users and items |
In addition, we also provide a [comparison notebook](notebooks/03_evaluate/comparison.ipynb) to illustrate how different algorithms could be evaluated and compared. In this notebook, data (MovieLens 1M) is randomly split into train/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature [here](http://mymedialite.net/examples/datasets.html). For ranking metrics we use k = 10 (top 10 results). We run the comparison on a Standard NC6s_v2 [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (6 vCPUs, 112 GB memory and 1 K80 GPU). Spark ALS is run in local standalone mode.
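
For orientation, here is a minimal sketch of the evaluation protocol just described, using the `reco_utils` helpers that appear throughout this commit; the popularity "recommender" is only a stand-in for the compared algorithms, and MovieLens 100k is used instead of 1M to keep it quick.

```
import pandas as pd

# Assumes the repo root is on the path (e.g. run from the Recommenders folder).
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k

TOP_K = 10

# Load ratings and make the 75/25 random split described above.
data = movielens.load_pandas_df(size="100k", header=["UserId", "MovieId", "Rating", "Timestamp"])
train, test = python_random_split(data, ratio=0.75)

# Stand-in model: recommend the same most-popular training items to every test user,
# scored by training-set popularity.
popular = train["MovieId"].value_counts().head(TOP_K)
top_k = pd.DataFrame(
    [(u, item, float(n)) for u in test["UserId"].unique() for item, n in zip(popular.index, popular.values)],
    columns=["UserId", "MovieId", "prediction"],
)

# Ranking metrics at k = 10, as reported in the table below.
kwargs = dict(col_user="UserId", col_item="MovieId", col_rating="Rating",
              col_prediction="prediction", relevancy_method="top_k", k=TOP_K)
print("MAP:", map_at_k(test, top_k, **kwargs))
print("nDCG@k:", ndcg_at_k(test, top_k, **kwargs))
print("Precision@k:", precision_at_k(test, top_k, **kwargs))
print("Recall@k:", recall_at_k(test, top_k, **kwargs))
```
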
@ -78,9 +80,9 @@ In addition, we also provide a [comparison notebook](notebooks/03_evaluate/compa
| Algo | MAP | nDCG@k | Precision@k | Recall@k | RMSE | MAE | R<sup>2</sup> | Explained Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [ALS](notebooks/00_quick_start/als_pyspark_movielens.ipynb) | 0.002020 | 0.024313 | 0.030677 | 0.009649 | 0.860502 | 0.680608 | 0.406014 | 0.411603 |
| [ALS](notebooks/00_quick_start/als_movielens.ipynb) | 0.002020 | 0.024313 | 0.030677 | 0.009649 | 0.860502 | 0.680608 | 0.406014 | 0.411603 |
| [SVD](notebooks/02_model/surprise_svd_deep_dive.ipynb) | 0.010915 | 0.102398 | 0.092996 | 0.025362 | 0.888991 | 0.696781 | 0.364178 | 0.364178 |
| [FastAI](notebooks/00_quick_start/fastai_recommendation.ipynb) | 0.023022 |0.168714 |0.154761 |0.050153 |0.887224 |0.705609 |0.371552 |0.374281 |
| [FastAI](notebooks/00_quick_start/fastai_movielens.ipynb) | 0.023022 |0.168714 |0.154761 |0.050153 |0.887224 |0.705609 |0.371552 |0.374281 |

View file

@ -60,7 +60,7 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th
cd Recommenders
python scripts/generate_conda_file.py
conda env create -f conda_bare.yaml
conda env create -f reco_base.yaml
</details>
@ -71,7 +71,7 @@ Assuming that you have a GPU machine, to install the Python GPU environment, whi
cd Recommenders
python scripts/generate_conda_file.py --gpu
conda env create -f conda_gpu.yaml
conda env create -f reco_gpu.yaml
</details>
@ -82,11 +82,11 @@ To install the PySpark environment, which by default installs the CPU environmen
cd Recommenders
python scripts/generate_conda_file.py --pyspark
conda env create -f conda_pyspark.yaml
conda env create -f reco_pyspark.yaml
Additionally, if you want to test a particular version of spark, you may pass the --pyspark-version argument:
python /scripts/generate_conda_file.py --pyspark-version 2.4.0
python scripts/generate_conda_file.py --pyspark-version 2.4.0
**NOTE** - for a PySpark environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.
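
As an illustrative sketch (not part of the setup docs), the same variables can also be set from Python before any Spark code runs:

```
import os
import sys

# Point both the PySpark workers and the driver at the python executable of the
# active conda environment (the reco_pyspark environment created above).
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```
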
@ -113,8 +113,8 @@ unset PYSPARK_DRIVER_PYTHON
To install all three environments:
cd Recommenders
python /scripts/generate_conda_file.py --gpu --pyspark
conda env create -f conda_full.yaml
python scripts/generate_conda_file.py --gpu --pyspark
conda env create -f reco_full.yaml
</details>

View file

@ -5,17 +5,17 @@ In this directory, notebooks are provided to demonstrate the use of different al
data preparation, model building, and model evaluation by using the utility functions ([reco_utils](../../reco_utils))
available in the repo.
| Notebook | Description |
| --- | --- |
| [als_pyspark_movielens](als_pyspark_movielens.ipynb) | Utilizing ALS algorithm to predict movie ratings in a PySpark environment.
| [fastai_recommendation](fastai_recommendation.ipynb) | Utilizing FastAI recommender to predict movie ratings in a Python+GPU (PyTorch) environment.
| [ncf_movielens](ncf_movielens.ipynb) | Utilizing Neural Collaborative Filtering (NCF) [1] to predict movie ratings in a Python+GPU (TensorFlow) environment.
| [sar_python_cpu_movielens](sar_single_node_movielens.ipynb) | Utilizing Smart Adaptive Recommendations (SAR) algorithm to predict movie ratings in a Python+CPU environment.
| [dkn](dkn.ipynb) | Utilizing the Deep Knowledge-Aware Network (DKN) [2] algorithm for news recommendations using information from a knowledge graph, in a Python+GPU (TensorFlow) environment.
| [xdeepfm](xdeepfm.ipynb) | Utilizing the eXtreme Deep Factorization Machine (xDeepFM) [3] to learn both low and high order feature interactions for predicting CTR, in a Python+GPU (TensorFlow) environment.
| [rbm](rbm_movielens.ipynb)| Utilizing the Restricted Boltzmann Machine (rbm) [4] to predict movie ratings in a Python+GPU (TensorFlow) environment.<br>
| Notebook | Dataset | Environment | Description |
| --- | --- | --- | --- |
| [als](als_movielens.ipynb) | MovieLens | PySpark | Utilizing ALS algorithm to predict movie ratings in a PySpark environment.
| [dkn](dkn_synthetic.ipynb) | Synthetic Data | Python CPU, GPU | Utilizing the Deep Knowledge-Aware Network (DKN) [2] algorithm for news recommendations using information from a knowledge graph, in a Python+GPU (TensorFlow) environment.
| [fastai](fastai_movielens.ipynb) | MovieLens | Python CPU, GPU | Utilizing FastAI recommender to predict movie ratings in a Python+GPU (PyTorch) environment.
| [ncf](ncf_movielens.ipynb) | MovieLens | Python CPU, GPU | Utilizing Neural Collaborative Filtering (NCF) [1] to predict movie ratings in a Python+GPU (TensorFlow) environment.
| [rbm](rbm_movielens.ipynb)| MovieLens | Python CPU, GPU | Utilizing the Restricted Boltzmann Machine (rbm) [4] to predict movie ratings in a Python+GPU (TensorFlow) environment.<br>
| [sar](sar_movielens.ipynb) | MovieLens | Python CPU | Utilizing Smart Adaptive Recommendations (SAR) algorithm to predict movie ratings in a Python+CPU environment.
| [xdeepfm](xdeepfm_synthetic.ipynb) | Synthetic Data | Python CPU, GPU | Utilizing the eXtreme Deep Factorization Machine (xDeepFM) [3] to learn both low and high order feature interactions for predicting CTR, in a Python+GPU (TensorFlow) environment.
[1] _Neural Collaborative Filtering_, Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua. WWW 2017.<br>
[2] _DKN: Deep Knowledge-Aware Network for News Recommendation_, Hongwei Wang, Fuzheng Zhang, Xing Xie and Minyi Guo. WWW 2018.<br>
[3] _xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems_, Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie and Guangzhong Sun. KDD 2018.<br>
[4] _Restricted Boltzmann Machines for Collaborative Filtering_, Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton. ICML 2007.
[4] _Restricted Boltzmann Machines for Collaborative Filtering_, Ruslan Salakhutdinov, Andriy Mnih and Geoffrey Hinton. ICML 2007.
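
Each of these notebooks carries a papermill `parameters` cell (visible in the notebook diffs further down), so they can also be executed headlessly. A sketch for the SAR quick start; the parameter names `TOP_K` and `MOVIELENS_DATA_SIZE` are assumed to match that cell:

```
import papermill as pm

# Run the notebook end-to-end and save the executed copy alongside it.
# Parameter names are assumed to match the notebook's "parameters" cell.
pm.execute_notebook(
    "sar_movielens.ipynb",
    "sar_movielens_output.ipynb",
    parameters=dict(TOP_K=10, MOVIELENS_DATA_SIZE="100k"),
)
```
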

View file

@ -1,11 +1,20 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DKN : Deep Knowledge-Aware Network for News Recommendation\n",
"DKN\\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX\\[2\\] method for knowledge graph representaion learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
"DKN \\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX \\[2\\] method for knowledge graph representaion learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
"\n",
"## Properties of DKN:\n",
"- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. \n",
@ -241,9 +250,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python (reco)",
"display_name": "Python (reco_bare)",
"language": "python",
"name": "reco"
"name": "reco_bare"
},
"language_info": {
"codemirror_mode": {
@ -255,7 +264,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"version": "3.6.8"
}
},
"nbformat": 4,

View file

@ -76,7 +76,7 @@
"metadata": {},
"outputs": [],
"source": [
"USER,ITEM,RATING,TIMESTAMP,PREDICTION,TITLE = 'UserId','MovieId','Rating','Timestamp','Prediction','Title'"
"USER, ITEM, RATING, TIMESTAMP, PREDICTION, TITLE = 'UserId', 'MovieId', 'Rating', 'Timestamp', 'Prediction', 'Title'"
]
},
{
@ -141,7 +141,7 @@
"metadata": {},
"outputs": [],
"source": [
"# fix random seeds to make sure out runs are reproducible\n",
"# fix random seeds to make sure our runs are reproducible\n",
"np.random.seed(101)\n",
"torch.manual_seed(101)\n",
"torch.cuda.manual_seed_all(101)"
@ -582,7 +582,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The above numbers are lower than SAR, but expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next look at how well the model predicts how the user would rate the movie. Need to score `test_df`, but this time don't ask for top_k. "
"The above numbers are lower than [SAR](../sar_single_node_movielens.ipynb), but expected, since the model is explicitly trying to generalize the users and items to the latent factors. Next look at how well the model predicts how the user would rate the movie. Need to score `test_df`, but this time don't ask for top_k. "
]
},
{

View file

@ -15,7 +15,7 @@
"source": [
"# Neural Collaborative Filtering on Movielens dataset.\n",
"\n",
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalize the matrix factorization problem with multi-layer perceptron. \n",
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalizes the matrix factorization problem with multi-layer perceptron. \n",
"\n",
"This notebook provides an example of how to utilize and evaluate NCF implementation in the `reco_utils`. We use a smaller dataset in this example to run NCF efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/)."
]
@ -143,7 +143,7 @@
"source": [
"### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data\n",
"\n",
"NCF is for implicity feedback typed recommender, and it generates prospensity of items to be recommended to users in the scale of 0 to 1. A recommended item list can then be generated based on the scores. NOTE this quickstart notebook is using a smaller number of epoch size to reduce time for training. As a consequence, the model performance will be slighlty deteriorated. "
"NCF accepts implicit feedback and generates prospensity of items to be recommended to users in the scale of 0 to 1. A recommended item list can then be generated based on the scores. Note that this quickstart notebook is using a smaller number of epochs to reduce time for training. As a consequence, the model performance will be slighlty deteriorated. "
]
},
{
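
For context on the cell above, turning per-pair propensity scores into a recommended item list is a simple per-user top-k selection; a toy pandas sketch, not the notebook's actual code:

```
import pandas as pd

# Toy NCF-style output: one propensity score in [0, 1] per user/item pair.
scores = pd.DataFrame({
    "UserId":     [1, 1, 1, 2, 2, 2],
    "MovieId":    [10, 20, 30, 10, 20, 30],
    "prediction": [0.91, 0.15, 0.62, 0.08, 0.77, 0.55],
})

# Keep the k highest-scoring items per user (k = 2 here).
top_k = (scores.sort_values("prediction", ascending=False)
               .groupby("UserId")
               .head(2)
               .sort_values("UserId"))
print(top_k)
```
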

View file

@ -1,5 +1,23 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -66,7 +84,7 @@
"import papermill as pm\n",
"\n",
"from reco_utils.recommender.rbm.rbm import RBM\n",
"from reco_utils.dataset.numpy_splitters import numpy_stratified_split\n",
"from reco_utils.dataset.python_splitters import numpy_stratified_split\n",
"from reco_utils.dataset.sparse import AffinityMatrix\n",
"\n",
"\n",

View file

@ -41,16 +41,16 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.0 | packaged by conda-forge | (default, Feb 9 2017, 14:36:55) \n",
"[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\n",
"Pandas version: 0.23.4\n"
"System version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) \n",
"[GCC 7.3.0]\n",
"Pandas version: 0.24.1\n"
]
}
],
@ -58,17 +58,20 @@
"# set the environment path to find Recommenders\n",
"import sys\n",
"sys.path.append(\"../../\")\n",
"import time\n",
"import os\n",
"\n",
"import itertools\n",
"import logging\n",
"import os\n",
"import time\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import papermill as pm\n",
"\n",
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
"from reco_utils.dataset import movielens\n",
"from reco_utils.dataset.python_splitters import python_random_split\n",
"from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"Pandas version: {}\".format(pd.__version__))"
@ -90,7 +93,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
@ -114,7 +117,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@ -193,7 +196,7 @@
"4 166 346 1.0 886397596"
]
},
"execution_count": 13,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@ -221,7 +224,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@ -246,7 +249,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@ -257,46 +260,15 @@
" \"col_timestamp\": \"Timestamp\",\n",
"}\n",
"\n",
"logging.basicConfig(level=logging.DEBUG, \n",
" format='%(asctime)s %(levelname)-8s %(message)s')\n",
"\n",
"model = SARSingleNode(\n",
" remove_seen=True, similarity_type=\"jaccard\", \n",
" time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will hash users and items to smaller continuous space.\n",
"This is an ordered set - it's discrete, but contiguous.\n",
"This helps keep the matrices we keep in memory as small as possible."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"start_time = time.time()\n",
"\n",
"unique_users = data[\"UserId\"].unique()\n",
"unique_items = data[\"MovieId\"].unique()\n",
"enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))\n",
"enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))\n",
"\n",
"item_map_dict = {x: i for i, x in enumerate_items_1}\n",
"user_map_dict = {x: i for i, x in enumerate_users_1}\n",
"# The reverse of the dictionary above - array index to actual ID\n",
"index2user = dict(enumerate_users_2)\n",
"index2item = dict(enumerate_items_2)\n",
"\n",
"# We need to index the train and test sets for SAR matrix operations to work\n",
"model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)\n",
"\n",
"preprocess_time = time.time() - start_time"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -314,29 +286,30 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Collecting user affinity matrix...\n",
"Calculating time-decayed affinities...\n",
"Creating index columns...\n",
"Building user affinity sparse matrix...\n",
"Calculating item cooccurrence...\n",
"Calculating item similarity...\n",
"Calculating jaccard...\n",
"Calculating recommendation scores...\n",
"done training\n"
"2019-02-05 13:19:22,533 INFO Collecting user affinity matrix\n",
"2019-02-05 13:19:22,538 INFO Calculating time-decayed affinities\n",
"2019-02-05 13:19:22,589 INFO Creating index columns\n",
"2019-02-05 13:19:22,607 INFO Building user affinity sparse matrix\n",
"2019-02-05 13:19:22,615 INFO Calculating item co-occurrence\n",
"2019-02-05 13:19:22,807 INFO Calculating item similarity\n",
"2019-02-05 13:19:22,808 INFO Calculating jaccard\n",
"2019-02-05 13:19:22,991 INFO Calculating recommendation scores\n",
"2019-02-05 13:19:23,106 INFO Removing seen items\n",
"2019-02-05 13:19:23,107 INFO Done training\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Took 0.5829987525939941 seconds for training.\n"
"Took 0.5787224769592285 seconds for training.\n"
]
}
],
@ -345,32 +318,27 @@
"\n",
"model.fit(train)\n",
"\n",
"train_time = time.time() - start_time + preprocess_time\n",
"train_time = time.time() - start_time\n",
"print(\"Took {} seconds for training.\".format(train_time))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Converting to dense matrix...\n",
"Removing seen items...\n",
"Getting top K...\n",
"Select users from the test set\n",
"Creating output dataframe...\n",
"Formatting output\n"
"2019-02-05 13:19:23,125 INFO Getting top K\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Took 0.13302063941955566 seconds for prediction.\n"
"Took 0.06923317909240723 seconds for prediction.\n"
]
}
],
@ -389,7 +357,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 8,
"metadata": {
"scrolled": true
},
@ -430,22 +398,22 @@
" <tr>\n",
" <th>1</th>\n",
" <td>600</td>\n",
" <td>423</td>\n",
" <td>12.991756</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>600</td>\n",
" <td>183</td>\n",
" <td>13.106912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>2</th>\n",
" <td>600</td>\n",
" <td>89</td>\n",
" <td>13.163791</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>600</td>\n",
" <td>423</td>\n",
" <td>12.991756</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>600</td>\n",
" <td>144</td>\n",
@ -458,9 +426,9 @@
"text/plain": [
" UserId MovieId prediction\n",
"0 600 69 12.984131\n",
"1 600 423 12.991756\n",
"2 600 183 13.106912\n",
"3 600 89 13.163791\n",
"1 600 183 13.106912\n",
"2 600 89 13.163791\n",
"3 600 423 12.991756\n",
"4 600 144 13.489795"
]
},
@ -483,7 +451,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@ -494,7 +462,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
@ -505,7 +473,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@ -516,7 +484,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
@ -527,7 +495,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 13,
"metadata": {},
"outputs": [
{
@ -554,7 +522,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 14,
"metadata": {},
"outputs": [
{
@ -596,7 +564,7 @@
{
"data": {
"application/papermill.record+json": {
"train_time": 0.5829987525939941
"train_time": 0.5787224769592285
}
},
"metadata": {},
@ -605,7 +573,7 @@
{
"data": {
"application/papermill.record+json": {
"test_time": 0.13302063941955566
"test_time": 0.06923317909240723
}
},
"metadata": {},
@ -626,9 +594,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python (reco)",
"language": "python",
"name": "python3"
"name": "reco"
},
"language_info": {
"codemirror_mode": {
@ -640,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.7"
}
},
"nbformat": 4,

View file

@ -1,5 +1,23 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -463,9 +481,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python (reco_gpu)",
"display_name": "Python (reco_bare)",
"language": "python",
"name": "reco_gpu"
"name": "reco_bare"
},
"language_info": {
"codemirror_mode": {
@ -477,7 +495,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.6.8"
}
},
"nbformat": 4,

View file

@ -936,7 +936,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In many scenarios, time dependency plays a critical role in preparing dataset for building a collaborative filtering model that captures user interests drift over time. One of the common techniques for achieving time dependent count is to add a time decay factor in the counting. This technique is used in [SAR](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_single_node_deep_dive.ipynb). Formula for getting affinity score for each user-item pair is \n",
"In many scenarios, time dependency plays a critical role in preparing dataset for building a collaborative filtering model that captures user interests drift over time. One of the common techniques for achieving time dependent count is to add a time decay factor in the counting. This technique is used in [SAR](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_deep_dive.ipynb). Formula for getting affinity score for each user-item pair is \n",
"\n",
"$$a_{ij}=\\sum_k (w_k \\text{exp}[-\\text{log}_2(\\frac{t_0-t_k}{T})] $$\n",
"where $a_{ij}$ is the affinity score, $w_k$ is the interaction weight, $t_0$ is a reference time, $t_k$ is the timestamp for the $k$-th interaction, and $T$ is a hyperparameter that controls the speed of decay.\n",
@ -1699,7 +1699,7 @@
"\n",
"1. X. He *et al*, Neural Collaborative Filtering, WWW 2017. \n",
"2. Y. Hu *et al*, Collaborative filtering for implicit feedback datasets, ICDM 2008.\n",
"3. Smart Adapative Recommendation (SAR), url: https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_single_node_deep_dive.ipynb\n",
"3. Smart Adapative Recommendation (SAR), url: https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/sar_deep_dive.ipynb\n",
"4. Y. Koren and J. Sill, OrdRec: an ordinal model for predicting personalized item rating distributions, RecSys 2011."
]
}

View file

@ -4,14 +4,14 @@ In this directory, notebooks are provided to give a deep dive into training mode
Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) and Singular Value Decomposition (SVD) using [Surprise](http://surpriselib.com/) python package. The notebooks make use of the utility functions ([reco_utils](../../reco_utils))
available in the repo.
| Notebook | Description |
| --- | --- |
| [als_deep_dive](als_deep_dive.ipynb) | Deep dive on the ALS algorithm and implementation.
| [baseline_deep_dive](baseline_deep_dive.ipynb) | Deep dive on baseline performance estimation.
| [ncf_deep_dive](ncf_deep_dive.ipynb) | Deep dive on a NCF algorithm and implementation.
| [surprise_svd_deep_dive](surprise_svd_deep_dive.ipynb) | Deep dive on a SVD algorithm and implementation.
| [sar_single_node_deep_dive](sar_single_node_deep_dive.ipynb) | Deep dive on the SAR algorithm and implementation.
| [vowpal_wabbit_deep_dive](vowpal_wabbit_deep_dive.ipynb) | Deep dive into using Vowpal Wabbit for regression and matrix factorization.
| [rbm_deep_dive](rbm_deep_dive.ipynb)| Deep dive on the rbm algorithm and its implementation.
| Notebook | Environment | Description |
| --- | --- | --- |
| [als_deep_dive](als_deep_dive.ipynb) | PySpark | Deep dive on the ALS algorithm and implementation.
| [baseline_deep_dive](baseline_deep_dive.ipynb) | --- | Deep dive on baseline performance estimation.
| [ncf_deep_dive](ncf_deep_dive.ipynb) | Python CPU, GPU | Deep dive on a NCF algorithm and implementation.
| [rbm_deep_dive](rbm_deep_dive.ipynb)| Python CPU, GPU | Deep dive on the rbm algorithm and its implementation.
| [sar_deep_dive](sar_deep_dive.ipynb) | Python CPU | Deep dive on the SAR algorithm and implementation.
| [surprise_svd_deep_dive](surprise_svd_deep_dive.ipynb) | Python CPU | Deep dive on a SVD algorithm and implementation.
| [vowpal_wabbit_deep_dive](vowpal_wabbit_deep_dive.ipynb) | Python CPU | Deep dive into using Vowpal Wabbit for regression and matrix factorization.
Details on model training are best found inside each notebook.

View file

@ -116,7 +116,7 @@
"\n",
"### 1.2 The MLP model\n",
"\n",
"NCF adopts two pathways to model users and items: 1) element-wise product of vectors, 2) concatenation of vectors. To learn interactions after concatenating of users and items lantent features, the standard MLP model is applied. In this sense, we can endow the model a large level of flexibility and non-linearity to learn the interactions between $p_{u}$ and $q_{i}$. The details of MLP model are:\n",
"NCF adopts two pathways to model users and items: 1) element-wise product of vectors, 2) concatenation of vectors. To learn interactions after concatenating of users and items latent features, the standard MLP model is applied. In this sense, we can endow the model a large level of flexibility and non-linearity to learn the interactions between $p_{u}$ and $q_{i}$. The details of MLP model are:\n",
"\n",
"For the input layer, there is concatention of user and item vectors:\n",
"\n",
@ -134,7 +134,7 @@
"\\hat { r } _ { u , i } = \\sigma \\left( h ^ { T } \\phi \\left( z _ { L - 1 } \\right) \\right)\n",
"$$\n",
"\n",
"where ${ W }_{ l }$, ${ b }_{ l }$, and ${ a }_{ out }$ denote the weight matrix, bias vector, and activation function for the $l$-th layers perceptron, respectively. For activation functions of MLP layers, one can freely choose sigmoid, hyperbolic tangent (tanh), and Rectifier (ReLU), among others. Because of implicit data task, the activation function of the output layer is defined as sigmoid $\\sigma(x)=\\frac{1}{1+\\exp{(-x)}}$ to restrict the predicted score to be in (0,1).\n",
"where ${ W }_{ l }$, ${ b }_{ l }$, and ${ a }_{ out }$ denote the weight matrix, bias vector, and activation function for the $l$-th layers perceptron, respectively. For activation functions of MLP layers, one can freely choose sigmoid, hyperbolic tangent (tanh), and Rectifier (ReLU), among others. Because of implicit data task, the activation function of the output layer is defined as sigmoid $\\sigma(x)=\\frac{1}{1+e^{-x}}$ to restrict the predicted score to be in (0,1).\n",
"\n",
"\n",
"### 1.3 Fusion of GMF and MLP\n",
@ -159,11 +159,11 @@
"\n",
"$$P \\left( \\mathcal { R } , \\mathcal { R } ^ { - } | \\mathbf { P } , \\mathbf { Q } , \\Theta \\right) = \\prod _ { ( u , i ) \\in \\mathcal { R } } \\hat { r } _ { u , i } \\prod _ { ( u , j ) \\in \\mathcal { R } ^{ - } } \\left( 1 - \\hat { r } _ { u , j } \\right)$$\n",
"\n",
"Where $\\mathcal{R}$ denotes the set of observed interactions, and $\\mathcal{ R } ^ { - }$ denotes the set of negative instances. $\\mathbf{P}$ and $\\mathbf{Q}$ denotes the latent factor matrix for users and items, respectively; and $\\Theta$ denotes the model parameters. Taking the negative logarithm of the likelihood, we obatain the objective function to minimize for NCF method, which is known as *binary cross-entropy loss*:\n",
"Where $\\mathcal{R}$ denotes the set of observed interactions, and $\\mathcal{ R } ^ { - }$ denotes the set of negative instances. $\\mathbf{P}$ and $\\mathbf{Q}$ denotes the latent factor matrix for users and items, respectively; and $\\Theta$ denotes the model parameters. Taking the negative logarithm of the likelihood, we obatain the objective function to minimize for NCF method, which is known as [binary cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy):\n",
"\n",
"$$L = - \\sum _ { ( u , i ) \\in \\mathcal { R } \\cup { \\mathcal { R } } ^ { - } } r _ { u , i } \\log \\hat { r } _ { u , i } + \\left( 1 - r _ { u , i } \\right) \\log \\left( 1 - \\hat { r } _ { u , i } \\right)$$\n",
"\n",
"The optimization can be done by performing Stochastic Gradient Descent (SGD), which has been introduced by the SVD algorithm in surprise svd deep dive notebook. Our SGD method is very similar to the SVD algorithm's."
"The optimization can be done by performing Stochastic Gradient Descent (SGD), which is described in the [Surprise SVD deep dive notebook](../02_model/surprise_svd_deep_dive.ipynb). Our SGD method is very similar to the SVD algorithm's."
]
},
{
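
A small numpy sketch of the binary cross-entropy objective quoted above (illustrative only; the repo's NCF implementation itself runs on TensorFlow):

```
import numpy as np

def bce_loss(r_true, r_pred, eps=1e-12):
    """Binary cross-entropy over observed (r=1) and sampled negative (r=0) pairs."""
    r_pred = np.clip(r_pred, eps, 1 - eps)
    return -np.sum(r_true * np.log(r_pred) + (1 - r_true) * np.log(1 - r_pred))

r_true = np.array([1, 1, 0, 0])          # interactions from R and R^-
r_pred = np.array([0.9, 0.6, 0.2, 0.4])  # sigmoid outputs, i.e. predicted r_hat
print(bce_loss(r_true, r_pred))
```
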

View file

@ -1,5 +1,14 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -82,7 +91,7 @@
"\n",
"#RBM \n",
"from reco_utils.recommender.rbm.rbm import RBM\n",
"from reco_utils.dataset.numpy_splitters import numpy_stratified_split\n",
"from reco_utils.dataset.python_splitters import numpy_stratified_split\n",
"from reco_utils.dataset.sparse import AffinityMatrix\n",
"\n",
"#Evaluation libraries\n",

View file

@ -15,9 +15,9 @@
"source": [
"# SAR Single Node on MovieLens (Python, CPU)\n",
"\n",
"In this example, we will walkthrough each step of the Smart Adaptive Recommendations (SAR) algorithm with a Python single-node implementation.\n",
"In this example, we will walk through each step of the Smart Adaptive Recommendations (SAR) algorithm using a Python single-node implementation.\n",
"\n",
"SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It is powered by understanding the similarity between items, and recommending similar items to ones a user has an existing affinity for."
"SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It is powered by understanding the similarity between items, and recommending similar items to those a user has an existing affinity for."
]
},
{
@ -26,35 +26,53 @@
"source": [
"## 1 SAR algorithm\n",
"\n",
"In the next figure a high-level architecture of SAR is showed.\n",
"The following figure presents a high-level architecture of SAR. \n",
"\n",
"At a very high level, two intermediate matrices are created and used to generate a set of recommendation scores:\n",
"\n",
"- An item similarity matrix $S$ estimates item-item relationships.\n",
"- An affinity matrix $A$ estimates user-item relationships.\n",
"\n",
"Recommendation scores are then created by computing the matrix multiplication $A\\times S$.\n",
"\n",
"Optional steps (e.g. \"time decay\" and \"remove seen items\") are described in the details below.\n",
"\n",
"<img src=\"https://recodatasets.blob.core.windows.net/images/sar_schema.svg?sanitize=true\">\n",
"\n",
"### 1.1 Compute item co-occurrence and item similarity\n",
"\n",
"Central to how SAR defines similarity is an item-to-item co-occurrence matrix. Co-occurrence is defined as the number of times two items appear together for a given user. We can represent the co-occurrence of all items as a $m\\times m$ matrix $C$, where $c_{i,j}$ is the number of times item $i$ occurred with item $j$.\n",
"SAR defines similarity based on item-to-item co-occurrence data. Co-occurrence is defined as the number of times two items appear together for a given user. We can represent the co-occurrence of all items as a $m\\times m$ matrix $C$, where $c_{i,j}$ is the number of times item $i$ occurred with item $j$, and $m$ is the total number of items.\n",
"\n",
"The co-occurence matric $C$ has the following properties:\n",
"\n",
"It is symmetric, so $c_{i,j} = c_{j,i}$\n",
"It is nonnegative: $c_{i,j} \\geq 0$\n",
"The occurrences are at least as large as the co-occurrences. I.e, the largest element for each row (and column) is on the main diagonal: $\\forall(i,j)C_{i,i},C_{j,j} \\geq C_{i,j}$.\n",
"Once we have a co-occurrence matrix, an item similarity matrix $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include Jaccard, lift, and counts (meaning no rescaling).\n",
"- It is symmetric, so $c_{i,j} = c_{j,i}$\n",
"- It is nonnegative: $c_{i,j} \\geq 0$\n",
"- The occurrences are at least as large as the co-occurrences. I.e., the largest element for each row (and column) is on the main diagonal: $\\forall(i,j) C_{i,i},C_{j,j} \\geq C_{i,j}$.\n",
"\n",
"The rescaling formula for Jaccard is $s_{ij}=c_{ij} / (c_{ii}+c_{jj}-c_{ij})$\n",
"Once we have a co-occurrence matrix, an item similarity matrix $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include `Jaccard`, `lift`, and `counts` (meaning no rescaling).\n",
"\n",
"and that for lift is $s_{ij}=c_{ij} / (c_{ii} \\times c_{jj})$\n",
"\n",
"where $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$. In general, using counts as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. Lift by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.\n",
"If $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$, the rescaling options are:\n",
"\n",
"- `Jaccard`: $s_{ij}=\\frac{c_{ij}}{(c_{ii}+c_{jj}-c_{ij})}$\n",
"- `lift`: $s_{ij}=\\frac{c_{ij}}{(c_{ii} \\times c_{jj})}$\n",
"- `counts`: $s_{ij}=c_{ij}$\n",
"\n",
"In general, using `counts` as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. `lift` by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. `Jaccard` is a compromise between the two.\n",
"\n",
"\n",
"### 1.2 Compute user affinity scores\n",
"\n",
"The affinity matrix in SAR captures the strength of the relationship between each individual user and each item. The event types and weights are used in computing this matrix: different event types (such as “rate” vs “view”) should be allowed to have an impact on a users affinity for an item. Similarly, the time of a transaction should have an impact; an event that takes place in the distant past can be thought of as being less important in determining the affinity.\n",
"The affinity matrix in SAR captures the strength of the relationship between each individual user and the items that user has already interacted with. SAR incorporates two factors that can impact users' affinities: \n",
"\n",
"Combining these effects gives us an expression for user-item affinity:\n",
"- It can consider information about the **type** of user-item interaction through differential weighting of different events (e.g. it may weigh events in which a user rated a particular item more heavily than events in which a user viewed the item).\n",
"- It can consider information about **when** a user-item event occurred (e.g. it may discount the value of events that take place in the distant past.\n",
"\n",
"$$a_{ij}=\\sum_k (w_k \\text{exp}[-\\text{log}_2(\\frac{t_0-t_k}{T})] $$\n",
"where the affinity for user $i$ and item $j$ is the sum of all events involving user $i$ and item $j$, and $w_k$ is the weight of event $k$. The presence of the $\\text{log}_{2}$ factor means that the parameter $T$ in the exponential decay term can be treated as a half-life: events this far before the reference date $t_0$ will be given half the weight as those taking place at $t_0$.\n",
"Formalizing these factors produces us an expression for user-item affinity:\n",
"\n",
"$$a_{ij}=\\sum_k w_k e^{[-\\text{log}(2)\\frac{t_0-t_k}{T}]} $$\n",
"\n",
"where the affinity $a_{ij}$ for user $i$ and item $j$ is the weighted sum of all $k$ events involving user $i$ and item $j$. $w_k$ represents the weight of a particular event, and the exponential term reflects the temporally-discounted event. The $\\text{log}(2)$ scaling factor causes the parameter $T$ to serve as a half-life: events $T$ units before $t_0$ will be given half the weight as those taking place at $t_0$.\n",
"\n",
"Repeating this computation for all $n$ users and $m$ items results in an $n\\times m$ matrix $A$. Simplifications of the above expression can be obtained by setting all the weights equal to 1 (effectively ignoring event types), or by setting the half-life parameter $T$ to infinity (ignoring transaction times).\n",
"\n",
@ -64,7 +82,7 @@
"\n",
"### 1.4 Top-k item calculation\n",
"\n",
"The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix ($A$) by the similarity matrix ($S$). The result is a recommendation score matrix, with one row per user / item pair; higher scores correspond to more strongly recommended items.\n",
"The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix ($A$) by the similarity matrix ($S$). The result is a recommendation score matrix, where each row corresponds to a user, each column corresponds to an item, and each entry corresponds to a user / item pair. Higher scores correspond to more strongly recommended items.\n",
"\n",
"It is worth noting that the complexity of recommending operation depends on the data size. SAR algorithm itself has $O(n^3)$ complexity. Therefore the single-node implementation is not supposed to handle large dataset in a scalable manner. Whenever one uses the algorithm, it is recommended to run with sufficiently large memory. "
]
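
To make sections 1.1-1.4 concrete, here is a toy numpy sketch of the matrices involved; it is illustrative only, the production code lives under `reco_utils/recommender/sar`:

```
import numpy as np

# Binary user/item interaction matrix: 3 users x 4 items, 1 = user interacted with item.
U = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 1, 1, 1]], dtype=float)

# 1.1 Item co-occurrence C and its Jaccard rescaling S.
C = U.T @ U                          # c_ij = number of users who interacted with both i and j
diag = C.diagonal()
with np.errstate(invalid="ignore", divide="ignore"):
    S = C / (diag[None, :] + diag[:, None] - C)   # s_ij = c_ij / (c_ii + c_jj - c_ij)

# 1.2 User affinity A; with unit event weights and no time decay this is just U.
A = U

# 1.4 Recommendation scores A x S: higher entries mean more strongly recommended items.
print(np.round(A @ S, 2))
```
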
@ -87,16 +105,16 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.0 | packaged by conda-forge | (default, Feb 9 2017, 14:36:55) \n",
"[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\n",
"Pandas version: 0.23.4\n"
"System version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) \n",
"[GCC 7.3.0]\n",
"Pandas version: 0.24.1\n"
]
}
],
@ -105,16 +123,18 @@
"import sys\n",
"sys.path.append(\"../../\")\n",
"\n",
"import os\n",
"import itertools\n",
"import logging\n",
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import papermill as pm\n",
"\n",
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
"from reco_utils.dataset import movielens\n",
"from reco_utils.dataset.python_splitters import python_random_split\n",
"from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k\n",
"from reco_utils.recommender.sar.sar_singlenode import SARSingleNode\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"Pandas version: {}\".format(pd.__version__))"
@ -122,7 +142,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
@ -153,7 +173,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@ -232,7 +252,7 @@
"4 166 346 1.0 886397596"
]
},
"execution_count": 10,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@ -260,7 +280,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@ -269,7 +289,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@ -297,114 +317,43 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# set log level to INFO\n",
"logging.basicConfig(level=logging.DEBUG, \n",
" format='%(asctime)s %(levelname)-8s %(message)s')\n",
"\n",
"model = SARSingleNode(\n",
" remove_seen=True, similarity_type=\"jaccard\", \n",
" time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header\n",
" remove_seen=True, \n",
" similarity_type=\"jaccard\", \n",
" time_decay_coefficient=30, \n",
" time_now=None, \n",
" timedecay_formula=True, \n",
" **header\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"unique_users = data[\"UserId\"].unique()\n",
"unique_items = data[\"MovieId\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will hash users and items to smaller continuous space.\n",
"This is an ordered set - it's discrete, but contiguous.\n",
"This helps keep the matrices we keep in memory as small as possible."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))\n",
"enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))\n",
"item_map_dict = {x: i for i, x in enumerate_items_1}\n",
"user_map_dict = {x: i for i, x in enumerate_users_1}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reverse of the dictionary above - array index to actual ID\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"index2user = dict(enumerate_users_2)\n",
"index2item = dict(enumerate_items_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to index the train and test sets for SAR matrix operations to work"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Collecting user affinity matrix...\n",
"Calculating time-decayed affinities...\n",
"../../reco_utils/recommender/sar/sar_singlenode.py:219: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" df[\"exponential\"] = expo_fun(df[self.col_timestamp].values)\n",
"../../reco_utils/recommender/sar/sar_singlenode.py:221: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" df[\"rating_exponential\"] = df[self.col_rating] * df[\"exponential\"]\n",
"Creating index columns...\n",
"../../reco_utils/recommender/sar/sar_singlenode.py:283: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.\n",
" self.index = df.as_matrix([self._col_hashed_users, self._col_hashed_items])\n",
"Building user affinity sparse matrix...\n",
"Calculating item cooccurrence...\n",
"Calculating item similarity...\n",
"Calculating jaccard...\n",
"/anaconda/envs/recommender/lib/python3.6/site-packages/scipy/sparse/base.py:594: RuntimeWarning: invalid value encountered in true_divide\n",
" return np.true_divide(self.todense(), other)\n",
"Calculating recommendation scores...\n",
"done training\n"
"2019-02-07 21:12:50,049 INFO Collecting user affinity matrix\n",
"2019-02-07 21:12:50,055 INFO Calculating time-decayed affinities\n",
"2019-02-07 21:12:50,135 INFO Creating index columns\n",
"2019-02-07 21:12:50,164 INFO Building user affinity sparse matrix\n",
"2019-02-07 21:12:50,174 INFO Calculating item co-occurrence\n",
"2019-02-07 21:12:50,419 INFO Calculating item similarity\n",
"2019-02-07 21:12:50,420 INFO Calculating jaccard\n",
"2019-02-07 21:12:50,631 INFO Calculating recommendation scores\n",
"2019-02-07 21:12:50,738 INFO Removing seen items\n",
"2019-02-07 21:12:50,740 INFO Done training\n"
]
}
],
@ -414,7 +363,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 8,
"metadata": {
"scrolled": true
},
@ -423,18 +372,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Converting to dense matrix...\n",
"../../reco_utils/recommender/sar/sar_singlenode.py:422: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" test[self._col_hashed_users] = test[self.col_user].map(self.user_map_dict)\n",
"Removing seen items...\n",
"Getting top K...\n",
"Select users from the test set\n",
"Creating output dataframe...\n",
"Formatting output\n"
"2019-02-07 21:12:50,762 INFO Getting top K\n"
]
}
],
@ -444,7 +382,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@ -462,7 +400,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 10,
"metadata": {
"scrolled": true
},
@ -503,22 +441,22 @@
" <tr>\n",
" <th>1</th>\n",
" <td>600</td>\n",
" <td>423</td>\n",
" <td>12.991756</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>600</td>\n",
" <td>183</td>\n",
" <td>13.106912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>2</th>\n",
" <td>600</td>\n",
" <td>89</td>\n",
" <td>13.163791</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>600</td>\n",
" <td>423</td>\n",
" <td>12.991756</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>600</td>\n",
" <td>144</td>\n",
@ -531,9 +469,9 @@
"text/plain": [
" UserId MovieId prediction\n",
"0 600 69 12.984131\n",
"1 600 423 12.991756\n",
"2 600 183 13.106912\n",
"3 600 89 13.163791\n",
"1 600 183 13.106912\n",
"2 600 89 13.163791\n",
"3 600 423 12.991756\n",
"4 600 144 13.489795"
]
},
@ -558,52 +496,50 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"eval_map = map_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
" relevancy_method=\"top_k\", k=TOP_K)\n",
"# all ranking metrics have the same arguments\n",
"args = [test, top_k]\n",
"kwargs = dict(col_user='UserId', \n",
" col_item='MovieId', \n",
" col_rating='Rating', \n",
" col_prediction='prediction', \n",
" relevancy_method='top_k', \n",
" k=TOP_K)\n",
"\n",
"eval_ndcg = ndcg_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
" relevancy_method=\"top_k\", k=TOP_K)\n",
"\n",
"eval_precision = precision_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
" relevancy_method=\"top_k\", k=TOP_K)\n",
"\n",
"eval_recall = recall_at_k(test, top_k, col_user=\"UserId\", col_item=\"MovieId\", \n",
" col_rating=\"Rating\", col_prediction=\"prediction\", \n",
" relevancy_method=\"top_k\", k=TOP_K)"
"eval_map = map_at_k(*args, **kwargs)\n",
"eval_ndcg = ndcg_at_k(*args, **kwargs)\n",
"eval_precision = precision_at_k(*args, **kwargs)\n",
"eval_recall = recall_at_k(*args, **kwargs)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model:\tsar_ref\n",
"Top K:\t10\n",
"MAP:\t0.105815\n",
"NDCG:\t0.373197\n",
"Precision@K:\t0.326617\n",
"Recall@K:\t0.175957\n"
"Model:\t\t sar_ref\n",
"Top K:\t\t 10\n",
"MAP:\t\t 0.105815\n",
"NDCG:\t\t 0.373197\n",
"Precision@K:\t 0.326617\n",
"Recall@K:\t 0.175957\n"
]
}
],
"source": [
"print(\"Model:\\t\" + model.model_str,\n",
" \"Top K:\\t%d\" % TOP_K,\n",
" \"MAP:\\t%f\" % eval_map,\n",
" \"NDCG:\\t%f\" % eval_ndcg,\n",
" \"Precision@K:\\t%f\" % eval_precision,\n",
" \"Recall@K:\\t%f\" % eval_recall, sep='\\n')"
"print(f\"Model:\\t\\t {model.model_str}\",\n",
" f\"Top K:\\t\\t {TOP_K}\",\n",
" f\"MAP:\\t\\t {eval_map:f}\",\n",
" f\"NDCG:\\t\\t {eval_ndcg:f}\",\n",
" f\"Precision@K:\\t {eval_precision:f}\",\n",
" f\"Recall@K:\\t {eval_recall:f}\", sep='\\n')"
]
},
{
@ -620,11 +556,10 @@
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python (reco)",
"language": "python",
"name": "python3"
"name": "reco"
},
"language_info": {
"codemirror_mode": {
@ -636,7 +571,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.7"
}
},
"nbformat": 4,

Diff not shown because of its large size. Load diff

View file

@ -131,10 +131,10 @@
"\n",
"\n",
"notebooks = {\n",
" 'als': '../00_quick_start/als_pyspark_movielens.ipynb',\n",
" 'sar': '../00_quick_start/sar_single_node_movielens.ipynb',\n",
" 'als': '../00_quick_start/als_movielens.ipynb',\n",
" 'sar': '../00_quick_start/sar_movielens.ipynb',\n",
" 'svd': '../02_model/surprise_svd_deep_dive.ipynb',\n",
" 'fast': '../00_quick_start/fastai_recommendation.ipynb',\n",
" 'fast': '../00_quick_start/fastai_movielens.ipynb',\n",
" 'ncf': '../00_quick_start/ncf_movielens.ipynb',\n",
" 'rbm': '../00_quick_start/rbm_movielens.ipynb'\n",
"}"

View file

@ -7,6 +7,7 @@ DEFAULT_ITEM_COL = "itemID"
DEFAULT_RATING_COL = "rating"
DEFAULT_TIMESTAMP_COL = "timestamp"
PREDICTION_COL = "prediction"
DEFAULT_PREDICTION_COL = PREDICTION_COL
# Filtering variables
DEFAULT_K = 10

View file

@ -1,30 +1,51 @@
import numpy as np
from scipy.sparse import coo_matrix
def exponential_decay(value, max_val, half_life):
"""Compute decay factor for a given value based on an exponential decay
Values greater than max_val will be set to 1
Args:
value (numeric): value to calculate decay factor
max_val (numeric): value at which decay factor will be 1
half_life (numeric): value at which decay factor will be 0.5
Returns:
float: decay factor
"""
return np.minimum(1., np.exp(-np.log(2) * (max_val - value) / half_life))
def jaccard(cooccurrence):
"""Helper method to calculate the Jaccard similarity of a matrix of cooccurrences
"""Helper method to calculate the Jaccard similarity of a matrix of co-occurrences
Args:
cooccurrence (scipy.sparse.csc_matrix): the symmetric matrix of cooccurrences of items
cooccurrence (np.array): the symmetric matrix of co-occurrences of items
Returns:
scipy.sparse.coo_matrix: The matrix of Jaccard similarities between any two items
np.array: The matrix of Jaccard similarities between any two items
"""
coo = cooccurrence.tocoo()
denom = coo.diagonal()[coo.row] + coo.diagonal()[coo.col] - coo.data
return coo_matrix((np.divide(coo.data, denom, out=np.zeros_like(coo.data), where=(denom != 0.0)),
(coo.row, coo.col)),
shape=coo.shape).tocsc()
diag = cooccurrence.diagonal()
diag_rows = np.expand_dims(diag, axis=0)
diag_cols = np.expand_dims(diag, axis=1)
with np.errstate(invalid='ignore', divide='ignore'):
result = cooccurrence / (diag_rows + diag_cols - cooccurrence)
return np.array(result)
def lift(cooccurrence):
"""Helper method to calculate the Lift of a matrix of cooccurrences
"""Helper method to calculate the Lift of a matrix of co-occurrences
Args:
cooccurrence (scipy.sparse.csc_matrix): the symmetric matrix of cooccurrences of items
cooccurrence (np.array): the symmetric matrix of co-occurrences of items
Returns:
scipy.sparse.coo_matrix: The matrix of Lifts between any two items
np.array: The matrix of Lifts between any two items
"""
coo = cooccurrence.tocoo()
denom = coo.diagonal()[coo.row] * coo.diagonal()[coo.col]
return coo_matrix((np.divide(coo.data, denom, out=np.zeros_like(coo.data), where=(denom != 0.0)),
(coo.row, coo.col)),
shape=coo.shape).tocsc()
diag = cooccurrence.diagonal()
diag_rows = np.expand_dims(diag, axis=0)
diag_cols = np.expand_dims(diag, axis=1)
with np.errstate(invalid='ignore', divide='ignore'):
result = cooccurrence / (diag_rows * diag_cols)
return np.array(result)
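To see what the dense jaccard/lift formulas above produce, a small standalone check with a made-up 3x3 co-occurrence matrix (the diagonal holds per-item occurrence counts); this is a sketch that simply restates the two divisions.
```
import numpy as np

# hypothetical symmetric co-occurrence matrix: diagonal = item occurrence counts
cooccurrence = np.array([[4.0, 2.0, 0.0],
                         [2.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]])

diag = cooccurrence.diagonal()
diag_rows = np.expand_dims(diag, axis=0)
diag_cols = np.expand_dims(diag, axis=1)

with np.errstate(invalid="ignore", divide="ignore"):
    jaccard_sim = cooccurrence / (diag_rows + diag_cols - cooccurrence)  # |A and B| / |A or B|
    lift_sim = cooccurrence / (diag_rows * diag_cols)                    # co-counts scaled by popularity

print(jaccard_sim[0, 1])  # 2 / (4 + 3 - 2) = 0.4
print(lift_sim[0, 1])     # 2 / (4 * 3) ~= 0.1667
```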

View File

@ -148,7 +148,7 @@ def load_pandas_df(
Args:
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m")
header (list or tuple): Rating dataset header. If None, ratings are not loaded.
header (list or tuple or None): Rating dataset header. If None, ratings are not loaded.
local_cache_path (str): Path where to cache the zip file locally
title_col (str): Movie title column name. If None, the title column is not loaded.
genres_col (str): Genres column name. Genres are '|' separated string.

View File

@ -1,92 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Collection of numpy based splitters
"""
import numpy as np
def numpy_stratified_split(X, ratio=0.75, seed=123):
"""
Split the user/item affinity matrix into train and test set matrices while maintaining
local (i.e. per user) ratios.
Args:
X (np.array, int): a sparse matrix
ratio (scalar, float): fraction of the entire dataset to constitute the train set
seed (scalar, int): random seed
Returns:
Xtr (np.array, int): train set user/item affinity matrix
Xtst (np.array, int): test set user/item affinity matrix
Basic mechanics:
Main points:
1. In a typical recommender problem, different users rate a different number of items,
and therefore the user/item affinity matrix has a sparse structure with a variable number
of zeroes (unrated items) per row (user). Cutting a total amount of ratings will
result in a non-homogeneous distribution between train and test set, i.e. some test
users may have many ratings while others have very few, if any.
2. In an unsupervised learning problem, no explicit answer is given. For this reason
the split needs to be implemented in a different way than in supervised learning.
In the latter, one typically splits the dataset by rows (by examples), ending up with
the same number of features but a different number of examples in the train/test set.
This scheme does not work in the unsupervised case, as part of the rated items needs to
be used as a test set for a fixed number of users.
Solution:
1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated
items. For example, if user1 has rated 4 items and user2 10, cutting 25% will correspond to
1 and 2.6 ratings in the test set, approximated as 1 and 3 according to the round() function.
In this way, the 0.75 ratio is satisfied both locally and globally, preserving the original
distribution of ratings across the train and test set.
2. It is easy (and fast) to satisfy these requirements by creating the test set via element
subtraction from the original dataset X. We first create two copies of X; for each user we
select a random sample of local size ratio (point 1) and erase the remaining ratings, obtaining
in this way the train set matrix Xtr. The test set matrix Xtst is obtained in the opposite way.
"""
np.random.seed(seed) # set the random seed
test_cut = int((1 - ratio) * 100) # percentage of ratings to go in the test set
# initialize train and test set matrices
Xtr = X.copy()
Xtst = X.copy()
# find the number of rated movies per user
rated = np.sum(Xtr != 0, axis=1)
# for each user, cut down a test_size% for the test set
tst = np.around((rated * test_cut) / 100).astype(int)
Nusers, Nitems = X.shape # total number of users and items
for u in range(Nusers):
# For each user obtain the index of rated movies
idx = np.asarray(np.where(Xtr[u] != 0))[0].tolist()
# extract a random subset of size n from the set of rated movies without repetition
idx_tst = np.random.choice(idx, tst[u], replace=False)
idx_train = list(set(idx).difference(set(idx_tst)))
Xtr[
u, idx_tst
] = 0 # change the selected rated movies to unrated in the train set
Xtst[
u, idx_train
] = 0 # set the movies that appear already in the train set as 0
del idx, idx_train, idx_tst
return Xtr, Xtst

View File

@ -1,6 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as sk_split
@ -173,3 +173,86 @@ def python_stratified_split(
splits[x] = pd.concat([splits[x], group_splits[x]])
return splits
def numpy_stratified_split(X, ratio=0.75, seed=123):
"""
Split the user/item affinity matrix (sparse matrix) into train and test set matrices while maintaining
local (i.e. per user) ratios.
Args:
X (np.array, int): a sparse matrix to be split
ratio (scalar, float): fraction of the entire dataset to constitute the train set
seed (scalar, int): random seed
Returns:
Xtr (np.array, int): train set user/item affinity matrix
Xtst (np.array, int): test set user/item affinity matrix
Basic mechanics:
Main points:
1. In a typical recommender problem, different users rate a different number of items,
and therefore the user/item affinity matrix has a sparse structure with a variable number
of zeroes (unrated items) per row (user). Cutting a total amount of ratings will
result in a non-homogeneous distribution between train and test set, i.e. some test
users may have many ratings while others have very few, if any.
2. In an unsupervised learning problem, no explicit answer is given. For this reason
the split needs to be implemented in a different way than in supervised learning.
In the latter, one typically splits the dataset by rows (by examples), ending up with
the same number of features but a different number of examples in the train/test set.
This scheme does not work in the unsupervised case, as part of the rated items needs to
be used as a test set for a fixed number of users.
Solution:
1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated
items. For example, if user1 has rated 4 items and user2 10, cutting 25% will correspond to
1 and 2.6 ratings in the test set, approximated as 1 and 3 according to the round() function.
In this way, the 0.75 ratio is satisfied both locally and globally, preserving the original
distribution of ratings across the train and test set.
2. It is easy (and fast) to satisfy these requirements by creating the test set via element
subtraction from the original dataset X. We first create two copies of X; for each user we
select a random sample of local size ratio (point 1) and erase the remaining ratings, obtaining
in this way the train set matrix Xtr. The test set matrix Xtst is obtained in the opposite way.
"""
np.random.seed(seed) # set the random seed
test_cut = int((1 - ratio) * 100) # percentage of ratings to go in the test set
# initialize train and test set matrices
Xtr = X.copy()
Xtst = X.copy()
# find the number of rated movies per user
rated = np.sum(Xtr != 0, axis=1)
# for each user, cut down a test_size% for the test set
tst = np.around((rated * test_cut) / 100).astype(int)
Nusers, Nitems = X.shape # total number of users and items
for u in range(Nusers):
# For each user obtain the index of rated movies
idx = np.asarray(np.where(Xtr[u] != 0))[0].tolist()
# extract a random subset of size n from the set of rated movies without repetition
idx_tst = np.random.choice(idx, tst[u], replace=False)
idx_train = list(set(idx).difference(set(idx_tst)))
Xtr[
u, idx_tst
] = 0 # change the selected rated movies to unrated in the train set
Xtst[
u, idx_train
] = 0 # set the movies that appear already in the train set as 0
del idx, idx_train, idx_tst
return Xtr, Xtst
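A minimal usage sketch of the split defined above. The import path follows where this diff places the function (reco_utils.dataset.python_splitters); the toy affinity matrix is made up.
```
import numpy as np
from reco_utils.dataset.python_splitters import numpy_stratified_split

# toy user/item affinity matrix: 3 users x 4 items, zeros are unrated items
X = np.array([[5, 0, 3, 4],
              [0, 2, 0, 1],
              [1, 4, 5, 0]])

Xtr, Xtst = numpy_stratified_split(X, ratio=0.75, seed=42)

# each rating ends up in exactly one of the two matrices
assert np.array_equal(X, Xtr + Xtst)
print((Xtr != 0).sum(axis=1), (Xtst != 0).sum(axis=1))  # per-user rating counts
```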

View File

@ -56,7 +56,7 @@ def spark_chrono_split(
Args:
data (spark.DataFrame): Spark DataFrame to be split.
ratio (float or list): Ratio for splitting data. If it is a single float number
it splits data into two halfs and the ratio argument indicates the ratio of
it splits data into two sets and the ratio argument indicates the ratio of
training data set; if it is a list of float numbers, the splitter splits
data into several portions corresponding to the split ratios. If a list is
provided and the ratios are not summed to 1, they will be normalized.
@ -93,7 +93,7 @@ def spark_chrono_split(
ratio = ratio if multi_split else [ratio, 1 - ratio]
ratio_index = np.cumsum(ratio)
window_spec = Window.partitionBy(split_by_column).orderBy(col(col_timestamp).desc())
window_spec = Window.partitionBy(split_by_column).orderBy(col(col_timestamp))
rating_grouped = (
data.groupBy(split_by_column)
@ -141,6 +141,8 @@ def spark_stratified_split(
training data set; if it is a list of float numbers, the splitter splits
data into several portions corresponding to the split ratios. If a list is
provided and the ratios are not summed to 1, they will be normalized.
Earlier indexed splits will have earlier times
(e.g the latest time per user or item in split[0] <= the earliest time per user or item in split[1])
seed (int): Seed.
min_rating (int): minimum number of ratings for user or item.
filter_by (str): either "user" or "item", depending on which of the two is to filter
@ -216,10 +218,12 @@ def spark_timestamp_split(
Args:
data (spark.DataFrame): Spark DataFrame to be split.
ratio (float or list): Ratio for splitting data. If it is a single float number
it splits data into two halfs and the ratio argument indicates the ratio of
it splits data into two sets and the ratio argument indicates the ratio of
training data set; if it is a list of float numbers, the splitter splits
data into several portions corresponding to the split ratios. If a list is
provided and the ratios are not summed to 1, they will be normalized.
Earlier indexed splits will have earlier times
(e.g the latest time in split[0] <= the earliest time in split[1])
col_user (str): column name of user IDs.
col_item (str): column name of item IDs.
col_timestamp (str): column name of timestamps. Float number represented in
@ -233,7 +237,7 @@ def spark_timestamp_split(
ratio = ratio if multi_split else [ratio, 1 - ratio]
ratio_index = np.cumsum(ratio)
window_spec = Window.orderBy(col(col_timestamp).desc())
window_spec = Window.orderBy(col(col_timestamp))
rating = data.withColumn("rank", row_number().over(window_spec))
data_count = rating.count()
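A hedged usage sketch for the timestamp splitter touched in this hunk. The argument names follow the docstring above, but the module path (reco_utils.dataset.spark_splitters) and the tiny DataFrame are assumptions for illustration only.
```
from pyspark.sql import SparkSession
from reco_utils.dataset.spark_splitters import spark_timestamp_split

spark = SparkSession.builder.master("local[*]").getOrCreate()

ratings = spark.createDataFrame(
    [(1, 10, 4.0, 100), (1, 11, 3.0, 200), (2, 10, 5.0, 150), (2, 12, 2.0, 400)],
    schema=["UserId", "MovieId", "Rating", "Timestamp"],
)

# earlier interactions go to the first split, the most recent ones to the second
train, test = spark_timestamp_split(
    ratings, ratio=0.75, col_user="UserId", col_item="MovieId", col_timestamp="Timestamp"
)
print(train.count(), test.count())
```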

View File

@ -26,19 +26,6 @@ log = logging.getLogger(__name__)
class AffinityMatrix:
"""
Args:
df (pd.DataFrame): a dataframe containing the data
col_user (str): default name for user column
col_item (str): default name for item column
col_rating (str): default name for rating columns
col_time (str): default name for timestamp columns
save_model (Bool): if True it saves the item/user maps
save_path (str): default path to save item/user maps
"""
# initialize class parameters
def __init__(
self,
@ -47,11 +34,18 @@ class AffinityMatrix:
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
col_pred=PREDICTION_COL,
col_time=DEFAULT_TIMESTAMP_COL,
save_path=None,
debug=False,
):
"""Generate the user/item affinity matrix from a pandas dataframe and vice versa
Args:
DF (pd.DataFrame): a dataframe containing the data
col_user (str): default name for user column
col_item (str): default name for item column
col_rating (str): default name for rating columns
save_path (str): default path to save item/user maps
"""
self.df = DF # dataframe
# pandas DF parameters
@ -63,12 +57,10 @@ class AffinityMatrix:
# Options to save the model for future use
self.save_path = save_path
def gen_index(self):
def _gen_index(self):
"""
Generate the user/item index
Returns:
Generate the user/item index:
map_users, map_items: dictionaries mapping the original user/item index to matrix indices
map_back_users, map_back_items: dictionaries to map back the matrix elements to the original
dataframe indices
@ -105,13 +97,13 @@ class AffinityMatrix:
self.df_.loc[:, "hashedUsers"] = self.df_[self.col_user].map(self.map_users)
# optionally save the inverse dictionary to work with trained models
if self.save_path != None:
if self.save_path is not None:
np.save(self.save_path_ + "/user_dict", self.map_users)
np.save(self.save_path_ + "/item_dict", self.map_items)
np.save(self.save_path + "/user_dict", self.map_users)
np.save(self.save_path + "/item_dict", self.map_items)
np.save(self.save_path_ + "/user_back_dict", self.map_back_users)
np.save(self.save_path_ + "/item_back_dict", self.map_back_items)
np.save(self.save_path + "/user_back_dict", self.map_back_users)
np.save(self.save_path + "/item_back_dict", self.map_back_items)
def gen_affinity_matrix(self):
@ -135,7 +127,7 @@ class AffinityMatrix:
log.info("Generating the user/item affinity matrix...")
self.gen_index()
self._gen_index()
ratings = self.df_[self.col_rating] # ratings
itm_id = self.df_["hashedItems"] # itm_id serving as columns

View File

@ -582,10 +582,6 @@ def get_top_k_items(dataframe, col_user=DEFAULT_USER_COL, col_rating=DEFAULT_RAT
Return:
pd.DataFrame: DataFrame of top k items for each user.
"""
tmp = dataframe.copy()
tmp[col_rating] = tmp[col_rating].astype(float)
return (
tmp.groupby(col_user, as_index=False)
.apply(lambda x: x.nlargest(k, col_rating))
.reset_index()
)
return (dataframe.groupby(col_user, as_index=False)
.apply(lambda x: x.nlargest(k, col_rating))
.reset_index(drop=True))
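The reworked helper above reduces to a groupby plus nlargest; a quick standalone check with made-up predictions (column names are illustrative).
```
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 11, 12, 10, 13],
    "prediction": [0.9, 0.5, 0.7, 0.3, 0.8],
})

# keep the 2 highest-scoring items per user, mirroring get_top_k_items
top2 = (df.groupby("userID", as_index=False)
          .apply(lambda x: x.nlargest(2, "prediction"))
          .reset_index(drop=True))
print(top2)
# user 1 keeps items 10 and 12, user 2 keeps items 13 and 10
```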

View File

@ -16,6 +16,6 @@ SIM_COOCCUR = "cooccurrence"
SIM_JACCARD = "jaccard"
SIM_LIFT = "lift"
HASHED_ITEMS = "hashedItems"
HASHED_USERS = "hashedUsers"
INDEXED_ITEMS = "indexedItems"
INDEXED_USERS = "indexedUsers"

View File

@ -5,43 +5,20 @@
Reference implementation of SAR in python/numpy/pandas.
This is not meant to be particularly performant or scalable, just
as a simple and readable implementation.
a simple and readable implementation.
"""
import numpy as np
import pandas as pd
import logging
from scipy import sparse
from reco_utils.common.python_utils import jaccard, lift
from reco_utils.common.python_utils import jaccard, lift, exponential_decay
from reco_utils.common.constants import (
DEFAULT_USER_COL,
DEFAULT_ITEM_COL,
DEFAULT_RATING_COL,
DEFAULT_TIMESTAMP_COL,
PREDICTION_COL,
)
from reco_utils.common import constants
from reco_utils.recommender import sar
from reco_utils.recommender.sar import (
SIM_JACCARD,
SIM_LIFT,
SIM_COOCCUR,
HASHED_USERS,
HASHED_ITEMS,
)
from reco_utils.recommender.sar import (
TIME_DECAY_COEFFICIENT,
TIME_NOW,
TIMEDECAY_FORMULA,
THRESHOLD,
)
"""
enable or set manually with --log=INFO when running example file if you want logging:
disabled because logging output contaminates stdout output on Databricks Spark clusters
"""
# logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
class SARSingleNode:
@ -50,125 +27,73 @@ class SARSingleNode:
def __init__(
self,
remove_seen=True,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
col_timestamp=DEFAULT_TIMESTAMP_COL,
similarity_type=SIM_JACCARD,
time_decay_coefficient=TIME_DECAY_COEFFICIENT,
time_now=TIME_NOW,
timedecay_formula=TIMEDECAY_FORMULA,
threshold=THRESHOLD,
debug=False,
col_user=constants.DEFAULT_USER_COL,
col_item=constants.DEFAULT_ITEM_COL,
col_rating=constants.DEFAULT_RATING_COL,
col_timestamp=constants.DEFAULT_TIMESTAMP_COL,
col_prediction=constants.PREDICTION_COL,
similarity_type=sar.SIM_JACCARD,
time_decay_coefficient=sar.TIME_DECAY_COEFFICIENT,
time_now=sar.TIME_NOW,
timedecay_formula=sar.TIMEDECAY_FORMULA,
threshold=sar.THRESHOLD,
):
"""Initialize model parameters
Args:
remove_seen (bool): whether to remove items observed in training when making recommendations
col_user (str): user column name
col_item (str): item column name
col_rating (str): rating column name
col_timestamp (str): timestamp column name
col_prediction (str): prediction column name
similarity_type (str): [None, 'jaccard', 'lift'] option for computing item-item similarity
time_decay_coefficient (float): number of days till ratings are decayed by 1/2
time_now (int): current time for time decay calculation
timedecay_formula (bool): flag to apply time decay
threshold (int): item-item co-occurrences below this threshold will be removed
"""
self.col_rating = col_rating
self.col_item = col_item
self.col_user = col_user
# default values for all SAR algos
self.col_timestamp = col_timestamp
self.col_prediction = col_prediction
self.remove_seen = remove_seen
# time of item-item similarity
self.similarity_type = similarity_type
# denominator in time decay. Zero makes time decay irrelevant
self.time_decay_coefficient = time_decay_coefficient
# toggle the computation of time decay group by formula
self.timedecay_formula = timedecay_formula
# current time for time decay calculation
# convert to seconds
self.time_decay_half_life = time_decay_coefficient * 24 * 60 * 60
self.time_decay_flag = timedecay_formula
self.time_now = time_now
# cooccurrence matrix threshold
self.threshold = threshold
# debug the code
self.debug = debug
# log the length of operations
self.timer_log = []
# array of indexes for rows and columns of users and items in training set
self.index = None
self.model_str = "sar_ref"
self.model = self
self.user_affinity = None
self.item_similarity = None
# threshold - items below this number get set to zero in coocurrence counts
assert self.threshold > 0
# threshold - items below this number get set to zero in co-occurrence counts
if self.threshold <= 0:
raise ValueError('Threshold cannot be < 1')
# more columns which are used internally
self._col_hashed_items = HASHED_ITEMS
self._col_hashed_users = HASHED_USERS
# Column for mapping user / item ids to internal indices
self.col_item_id = sar.INDEXED_ITEMS
self.col_user_id = sar.INDEXED_USERS
# Obtain all the users and items from both training and test data
self.unique_users = None
self.unique_items = None
# store training set index for future use during prediction
self.index = None
self.n_users = None
self.n_items = None
# user2rowID map for prediction method to look up user affinity vectors
self.user_map_dict = None
# mapping for item to matrix element
self.item_map_dict = None
self.user2index = None
self.item2index = None
# the opposite of the above map - map array index to actual string ID
self.index2user = None
self.index2item = None
# affinity scores for the recommendation
self.scores = None
def set_index(
self,
unique_users,
unique_items,
user_map_dict,
item_map_dict,
index2user,
index2item,
):
"""MVP2 temporary function to set the index of the sparse dataframe.
In future releases this will be carried over into the data object and the index will be provided
with the data"""
# original IDs of users and items in a list
# later as we modify the algorithm these might not be needed (can use dictionary keys
# instead)
self.unique_users = unique_users
self.unique_items = unique_items
# mapping of original IDs to actual matrix elements
self.user_map_dict = user_map_dict
self.item_map_dict = item_map_dict
# reverse mapping of matrix index to an item
# TODO: we can make this into an array as well
self.index2user = index2user
self.index2item = index2item
# stateful time function
def time(self):
"""
Time a particular section of the code - call this once to set the state somewhere
in the code, then call it again to return the elapsed time since last call.
Call again to set the time and so on...
Returns:
None if we're not in debug mode - doesn't do anything
False if timer started
time in seconds since the last time time function was called
"""
if self.debug:
from time import time
if self.start_time is None:
self.start_time = time()
return False
else:
answer = time() - self.start_time
# reset state
self.start_time = None
return answer
else:
return None
def compute_affinity_matrix(self, df, n_users, n_items):
""" Affinity matrix
The user-affinity matrix can be constructed by treating the users and items as
@ -176,407 +101,269 @@ class SARSingleNode:
the ratings as the event weights. We convert between different sparse-matrix
formats to de-duplicate user-item pairs, otherwise they will get added up.
Args:
df (pd.DataFrame): Hashed df of users and items.
df (pd.DataFrame): Indexed df of users and items.
n_users (int): Number of users.
n_items (int): Number of items.
Returns:
scipy.csr: Affinity matrix in Compressed Sparse Row (CSR) format.
sparse.csr: Affinity matrix in Compressed Sparse Row (CSR) format.
"""
user_affinity = (
sparse.coo_matrix(
(
df[self.col_rating],
(df[self._col_hashed_users], df[self._col_hashed_items]),
),
shape=(n_users, n_items),
)
.todok()
.tocsr()
)
return user_affinity
return sparse.coo_matrix(
(df[self.col_rating], (df[self.col_user_id], df[self.col_item_id])),
shape=(n_users, n_items),
).tocsr()
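The affinity matrix built above is a plain COO-to-CSR construction over integer user/item indices; a self-contained sketch with made-up triples shows the shape of the result.
```
import numpy as np
from scipy import sparse

user_ids = np.array([0, 0, 1, 2])     # integer row indices (users)
item_ids = np.array([0, 2, 1, 2])     # integer column indices (items)
ratings = np.array([4.0, 3.0, 5.0, 1.0])

user_affinity = sparse.coo_matrix(
    (ratings, (user_ids, item_ids)), shape=(3, 3)
).tocsr()

print(user_affinity.toarray())
# [[4. 0. 3.]
#  [0. 5. 0.]
#  [0. 0. 1.]]
```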
def compute_coocurrence_matrix(self, df, n_users, n_items):
""" Coocurrence matrix
""" Co-occurrence matrix
C = U'.transpose() * U'
where U' is the user_affinity matrix with 1's as values (instead of ratings).
Args:
df (pd.DataFrame): Hashed df of users and items.
df (pd.DataFrame): Indexed df of users and items.
n_users (int): Number of users.
n_items (int): Number of items.
Returns:
np.array: Coocurrence matrix
np.array: Co-occurrence matrix
"""
self.time()
float_type = df[self.col_rating].dtype
user_item_hits = (
sparse.coo_matrix(
(
np.array([1.0] * len(df[self._col_hashed_users])).astype(float_type),
(df[self._col_hashed_users], df[self._col_hashed_items]),
np.repeat(1, df.shape[0]),
(df[self.col_user_id], df[self.col_item_id]),
),
shape=(n_users, n_items)
shape=(n_users, n_items),
)
.todok()
.tocsr()
.tocsr()
.astype(df[self.col_rating].dtype)
)
item_cooccurrence = user_item_hits.transpose().dot(user_item_hits)
if self.debug:
cnt = df.shape[0]
elapsed_time = self.time()
self.timer_log += [
"Item cooccurrence calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
self.time()
item_cooccurrence = item_cooccurrence.multiply(
item_cooccurrence >= self.threshold
)
if self.debug:
elapsed_time = self.time()
self.timer_log += [
"Applying threshold:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
return item_cooccurrence
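Following the C = U'.transpose() * U' formula above, a small standalone example with a binarised interaction matrix (values made up), including the thresholding step used in the method.
```
import numpy as np
from scipy import sparse

# binary user-item interaction matrix: 3 users x 3 items
hits = sparse.csr_matrix(np.array([[1, 0, 1],
                                   [1, 1, 0],
                                   [0, 1, 1]]))

item_cooccurrence = hits.transpose().dot(hits)
print(item_cooccurrence.toarray())
# diagonal = how many users touched each item, off-diagonal = users shared by two items
# [[2 1 1]
#  [1 2 1]
#  [1 1 2]]

# thresholding as in the code above: drop co-occurrence counts below a cut-off
threshold = 2
print(item_cooccurrence.multiply(item_cooccurrence >= threshold).toarray())
```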
def set_index(self, df):
"""Generate continuous indices for users and items to reduce memory usage
Args:
df (pd.DataFrame): dataframe with user and item ids
"""
# Generate a map of continuous index values to items
self.index2item = dict(enumerate(df[self.col_item].unique()))
# Invert the mapping from above
self.item2index = {v: k for k, v in self.index2item.items()}
# Create mapping of users to continuous indices
self.user2index = {x[1]: x[0] for x in enumerate(df[self.col_user].unique())}
# set values for the total count of users and items
self.n_users = len(self.user2index)
self.n_items = len(self.index2item)
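The index maps built above are plain enumerate-based dictionaries; a quick standalone look at what they hold for a made-up dataframe (the column names are illustrative, not the model defaults).
```
import pandas as pd

df = pd.DataFrame({"UserId": ["u1", "u2", "u1"], "MovieId": ["a", "b", "c"]})

index2item = dict(enumerate(df["MovieId"].unique()))                 # {0: 'a', 1: 'b', 2: 'c'}
item2index = {v: k for k, v in index2item.items()}                   # {'a': 0, 'b': 1, 'c': 2}
user2index = {x[1]: x[0] for x in enumerate(df["UserId"].unique())}  # {'u1': 0, 'u2': 1}

print(index2item, item2index, user2index)
```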
def fit(self, df):
"""Main fit method for SAR"""
"""Main fit method for SAR
log.info("Collecting user affinity matrix...")
self.time()
# use the same floating type for the computations as input
float_type = df[self.col_rating].dtype
if not np.issubdtype(float_type, np.floating):
raise ValueError(
"Only floating point data types are accepted for the rating column. Data type was {} "
"instead.".format(float_type)
)
Args:
df (pd.DataFrame): User item rating dataframe
"""
if self.timedecay_formula:
# WARNING: previously we would take the last value in training dataframe and set it
# as a matrix U element
# for each user-item pair. Now with time decay, we compute a sum over ratings given
# by a user in the case
# when T=np.inf, so user gets a cumulative sum of ratings for a particular item and
# not the last rating.
log.info("Calculating time-decayed affinities...")
# Time Decay
# do a group by on user item pairs and apply the formula for time decay there
# Time T parameter is in days and input time is in seconds
# so we do dt/60/(T*24*60)=dt/(T*24*3600)
# Generate continuous indices if this hasn't been done
if self.index2item is None:
self.set_index(df)
# if time_now is None - get the default behaviour
logger.info("Collecting user affinity matrix")
if not np.issubdtype(df[self.col_rating].dtype, np.floating):
raise TypeError("Rating column data type must be floating point")
# Copy the DataFrame to avoid modification of the input
temp_df = df[[self.col_user, self.col_item, self.col_rating]].copy()
if self.time_decay_flag:
logger.info("Calculating time-decayed affinities")
# if time_now is None use the latest time
if not self.time_now:
self.time_now = df[self.col_timestamp].max()
# optimization - pre-compute time decay exponential which multiplies the ratings
expo_fun = lambda x: np.exp(
-np.log(2.0)
* (self.time_now - x)
/ (self.time_decay_coefficient * 24.0 * 3600)
# apply time decay to each rating
temp_df[self.col_rating] *= exponential_decay(
value=df[self.col_timestamp],
max_val=self.time_now,
half_life=self.time_decay_half_life,
)
rating_exponential = df[self.col_rating].values * expo_fun(
df[self.col_timestamp].values
).astype(float_type)
# update df with the affinities after the timestamp calculation
# copy part of the data frame to avoid modification of the input
temp_df = pd.DataFrame(
data={
self.col_user: df[self.col_user],
self.col_item: df[self.col_item],
self.col_rating: rating_exponential,
}
# group time decayed ratings by user-item and take the sum as the user-item affinity
temp_df = (
temp_df.groupby([self.col_user, self.col_item]).sum().reset_index()
)
newdf = temp_df.groupby([self.col_user, self.col_item]).sum().reset_index()
"""
# experimental implementation of multiprocessing - in practice for smaller datasets this is not needed
# leaving here in case anyone wants to actually try this
# to enable, you need:
# conda install dill>=0.2.8.1
# pip install multiprocess>=0.70.6.1
# from multiprocess import Pool, cpu_count
#
# multiprocess uses dill for python3 to serialize lambda functions
#
# helper function to parallelize the operation on groups
def applyParallel(dfGrouped, func):
with Pool(cpu_count()*2) as p:
ret_list = p.map(func, [group for name, group in dfGrouped])
return pd.concat(ret_list)
from types import MethodType
grouped.applyParallel = MethodType(applyParallel, grouped)
# then replace df.apply with df.applyParallel
"""
"""
Original implementation of groupby and apply - without optimization
rating_series = grouped.apply(lambda x: np.sum(np.array(x[self.col_rating]) * np.exp(
-np.log(2.) * (self.time_now - np.array(x[self.col_timestamp])) / (
self.time_decay_coefficient * 24. * 3600))))
"""
else:
# without time decay we take the last user-provided rating supplied in the dataset as the
# final rating for the user-item pair
log.info("Deduplicating the user-item counts")
newdf = df.drop_duplicates([self.col_user, self.col_item])[
[self.col_user, self.col_item, self.col_rating]
]
# without time decay use the latest user-item rating in the dataset as the affinity score
logger.info("De-duplicating the user-item counts")
temp_df = temp_df.drop_duplicates(
[self.col_user, self.col_item], keep="last"
)
if self.debug:
elapsed_time = self.time()
cnt = newdf.shape[0]
self.timer_log += [
"Affinity calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
logger.info("Creating index columns")
# Map users and items according to the two dicts. Add the two new columns to temp_df.
temp_df.loc[:, self.col_item_id] = temp_df[self.col_item].map(self.item2index)
temp_df.loc[:, self.col_user_id] = temp_df[self.col_user].map(self.user2index)
self.time()
log.info("Creating index columns...")
# Hash users and items according to the two dicts. Add the two new columns to newdf.
newdf.loc[:, self._col_hashed_items] = newdf[self.col_item].map(
self.item_map_dict
)
newdf.loc[:, self._col_hashed_users] = newdf[self.col_user].map(
self.user_map_dict
)
# store training set index for future use during prediction
# DO NOT USE .values as the warning message suggests
self.index = newdf[[self._col_hashed_users, self._col_hashed_items]].values
n_items = len(self.unique_items)
n_users = len(self.unique_users)
seen_items = None
if self.remove_seen:
# retain seen items for removal at prediction time
seen_items = temp_df[[self.col_user_id, self.col_item_id]].values
# Affinity matrix
log.info("Building user affinity sparse matrix...")
self.user_affinity = self.compute_affinity_matrix(newdf, n_users, n_items)
if self.debug:
elapsed_time = self.time()
self.timer_log += [
"Indexing and affinity matrix construction:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
# Calculate item cooccurrence
log.info("Calculating item cooccurrence...")
item_cooccurrence = self.compute_coocurrence_matrix(newdf, n_users, n_items)
log.info("Calculating item similarity...")
similarity_type = (
SIM_COOCCUR if self.similarity_type is None else self.similarity_type
logger.info("Building user affinity sparse matrix")
self.user_affinity = self.compute_affinity_matrix(
temp_df, self.n_users, self.n_items
)
self.time()
if similarity_type == SIM_COOCCUR:
# Calculate item co-occurrence
logger.info("Calculating item co-occurrence")
item_cooccurrence = self.compute_coocurrence_matrix(
temp_df, self.n_users, self.n_items
)
# Free up some space
del temp_df
logger.info("Calculating item similarity")
if self.similarity_type == sar.SIM_COOCCUR:
self.item_similarity = item_cooccurrence
elif similarity_type == SIM_JACCARD:
log.info("Calculating jaccard ...")
elif self.similarity_type == sar.SIM_JACCARD:
logger.info("Calculating jaccard")
self.item_similarity = jaccard(item_cooccurrence)
elif similarity_type == SIM_LIFT:
log.info("Calculating lift ...")
# Free up some space
del item_cooccurrence
elif self.similarity_type == sar.SIM_LIFT:
logger.info("Calculating lift")
self.item_similarity = lift(item_cooccurrence)
# Free up some space
del item_cooccurrence
else:
raise ValueError("Unknown similarity type: {0}".format(similarity_type))
raise ValueError(
"Unknown similarity type: {0}".format(self.similarity_type)
)
if self.debug and (
similarity_type == SIM_JACCARD or similarity_type == SIM_LIFT
):
elapsed_time = self.time()
self.timer_log += [
"Item similarity calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
# Calculate raw scores with a matrix multiplication.
log.info("Calculating recommendation scores...")
self.time()
# Calculate raw scores with a matrix multiplication
logger.info("Calculating recommendation scores")
self.scores = self.user_affinity.dot(self.item_similarity)
if self.debug:
elapsed_time = self.time()
self.timer_log += [
"Score calculation:\t%d\trows in\t%s\tseconds -\t%f\trows per second."
% (cnt, elapsed_time, float(cnt) / elapsed_time)
]
# Remove items in the train set so recommended items are always novel
if self.remove_seen:
logger.info("Removing seen items")
self.scores[seen_items[:, 0], seen_items[:, 1]] = -np.inf
log.info("done training")
logger.info("Done training")
def recommend_k_items(self, test, top_k=10, sort_top_k=False):
"""Recommend top K items for all users which are in the test set
Args:
test (pd.DataFrame): user to test
top_k (int): number of top items to recommend
sort_top_k (bool): flag to sort top k results
Returns:
pd.DataFrame: A DataFrame that contains top k recommendation items for each user.
pd.DataFrame: top k recommendation items for each user
"""
# pick users from test set and
test_users = test[self.col_user].unique()
try:
test_users_training_ids = np.array(
[self.user_map_dict[user] for user in test_users]
)
except KeyError():
msg = "SAR cannot score test set users which are not in the training set"
log.error(msg)
raise ValueError(msg)
# get user / item indices from test set
user_ids = test[self.col_user].drop_duplicates().map(self.user2index).values
if any(np.isnan(user_ids)):
raise ValueError("SAR cannot score users that are not in the training set")
# shorthand
scores = self.scores
# extract only the scores for the test users
test_scores = self.scores[user_ids, :]
# Convert to dense, the following operations are easier.
log.info("Converting to dense matrix...")
if isinstance(scores, np.matrixlib.defmatrix.matrix):
scores_dense = np.array(scores)
else:
scores_dense = scores.todense()
# ensure we're working with a dense matrix
if isinstance(test_scores, sparse.spmatrix):
test_scores = test_scores.todense()
# Mask out items in the train set. This only makes sense for some
# problems (where a user wouldn't interact with an item more than once).
if self.remove_seen:
log.info("Removing seen items...")
scores_dense[self.index[:, 0], self.index[:, 1]] = 0
# get top K items and scores
logger.info("Getting top K")
# this determines the un-ordered top-k item indices for each user
top_items = np.argpartition(test_scores, -top_k, axis=1)[:, -top_k:]
top_scores = test_scores[np.arange(test_scores.shape[0])[:, None], top_items]
# Get top K items and scores.
log.info("Getting top K...")
top_items = np.argpartition(scores_dense, -top_k, axis=1)[:, -top_k:]
top_scores = scores_dense[np.arange(scores_dense.shape[0])[:, None], top_items]
if sort_top_k:
sort_ind = np.argsort(-top_scores)
top_items = top_items[np.arange(top_items.shape[0])[:, None], sort_ind]
top_scores = top_scores[np.arange(top_scores.shape[0])[:, None], sort_ind]
log.info("Select users from the test set")
top_items = top_items[test_users_training_ids, :]
top_scores = top_scores[test_users_training_ids, :]
log.info("Creating output dataframe...")
# Convert to np.array (from view) and flatten
top_items = np.reshape(np.array(top_items), -1)
top_scores = np.reshape(np.array(top_scores), -1)
userids = []
for u in test_users:
userids.extend([u] * top_k)
results = pd.DataFrame.from_dict(
df = pd.DataFrame(
{
self.col_user: userids,
self.col_item: top_items,
self.col_rating: top_scores,
self.col_user: np.repeat(
test[self.col_user].drop_duplicates().values, top_k
),
self.col_item: [
self.index2item[item] for item in np.array(top_items).flatten()
],
self.col_prediction: np.array(top_scores).flatten(),
}
)
# remap user and item indices to IDs
results[self.col_item] = results[self.col_item].map(self.index2item)
# do final sort
if sort_top_k:
results = (
results.sort_values(
by=[self.col_user, self.col_rating], ascending=False
)
.groupby(self.col_user)
.apply(lambda x: x)
)
# format the dataframe in the end to conform to Surprise return type
log.info("Formatting output")
# modify test to make it compatible with
return (
results[[self.col_user, self.col_item, self.col_rating]]
.rename(columns={self.col_rating: PREDICTION_COL})
.astype(
{
self.col_user: _user_item_return_type(),
self.col_item: _user_item_return_type(),
PREDICTION_COL: self.scores.dtype,
}
)
# ensure datatypes are correct
df = df.astype(
dtype={
self.col_user: str,
self.col_item: str,
self.col_prediction: self.scores.dtype,
}
)
# drop seen items
return df.replace(-np.inf, np.nan).dropna()
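The top-k selection used in recommend_k_items is np.argpartition (unordered top-k) followed by an optional argsort; a compact standalone illustration with made-up scores.
```
import numpy as np

test_scores = np.array([[0.1, 0.9, 0.3, 0.7],
                        [0.8, 0.2, 0.6, 0.4]])
top_k = 2

# unordered indices of the k largest scores per row
top_items = np.argpartition(test_scores, -top_k, axis=1)[:, -top_k:]
top_scores = test_scores[np.arange(test_scores.shape[0])[:, None], top_items]

# optional ordering, as done when sort_top_k=True
sort_ind = np.argsort(-top_scores)
top_items = top_items[np.arange(top_items.shape[0])[:, None], sort_ind]
top_scores = top_scores[np.arange(top_scores.shape[0])[:, None], sort_ind]

print(top_items)   # [[1 3], [0 2]]
print(top_scores)  # [[0.9 0.7], [0.8 0.6]]
```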
def predict(self, test):
"""Output SAR scores for only the users-items pairs which are in the test set
Args:
test (pd.DataFrame): DataFrame that contains ground-truth of user-item ratings.
test (pd.DataFrame): DataFrame that contains users and items to test
Return:
pd.DataFrame: DataFrame contains the prediction results.
pd.DataFrame: DataFrame contains the prediction results
"""
# pick users from test set and
test_users = test[self.col_user].unique()
try:
training_ids = np.array([self.user_map_dict[user] for user in test_users])
assert training_ids is not None
except KeyError():
msg = "SAR cannot score test set users which are not in the training set"
log.error(msg)
raise ValueError(msg)
# shorthand
scores = self.scores
# get user / item indices from test set
user_ids = test[self.col_user].map(self.user2index).values
if any(np.isnan(user_ids)):
raise ValueError("SAR cannot score users that are not in the training set")
# Convert to dense, the following operations are easier.
log.info("Converting to dense array ...")
scores_dense = scores.toarray()
# extract only the scores for the test users
test_scores = self.scores[user_ids, :]
# take the intersection between train test items and items we actually need
test_col_hashed_users = test[self.col_user].map(self.user_map_dict)
test_col_hashed_items = test[self.col_item].map(self.item_map_dict)
# convert and flatten scores into an array
if isinstance(test_scores, sparse.spmatrix):
test_scores = test_scores.todense()
test_index = pd.concat(
[test_col_hashed_users, test_col_hashed_items], axis=1
).values
aset = set([tuple(x) for x in self.index])
bset = set([tuple(x) for x in test_index])
item_ids = test[self.col_item].map(self.item2index).values
nans = np.isnan(item_ids)
if any(nans):
# predict 0 for items not seen during training
test_scores = np.append(test_scores, np.zeros((self.n_users, 1)), axis=1)
item_ids[nans] = self.n_items
item_ids = item_ids.astype("int64")
common_index = np.array([x for x in aset & bset])
# Mask out items in the train set. This only makes sense for some
# problems (where a user wouldn't interact with an item more than once).
if self.remove_seen and len(aset & bset) > 0:
log.info("Removing seen items...")
scores_dense[common_index[:, 0], common_index[:, 1]] = 0
final_scores = scores_dense[test_index[:, 0], test_index[:, 1]]
results = pd.DataFrame.from_dict(
df = pd.DataFrame(
{
self.col_user: test_index[:, 0],
self.col_item: test_index[:, 1],
self.col_rating: final_scores
self.col_user: test[self.col_user].values,
self.col_item: test[self.col_item].values,
self.col_prediction: test_scores[
np.arange(test_scores.shape[0]), item_ids
],
}
)
# remap user and item indices to IDs
results[self.col_user] = results[self.col_user].map(self.index2user)
results[self.col_item] = results[self.col_item].map(self.index2item)
# format the dataframe in the end to conform to Surprise return type
log.info("Formatting output")
# modify test to make it compatible with
return (
results[[self.col_user, self.col_item, self.col_rating]]
.rename(columns={self.col_rating: PREDICTION_COL})
.astype(
{
self.col_user: _user_item_return_type(),
self.col_item: _user_item_return_type(),
PREDICTION_COL: self.scores.dtype,
}
)
# ensure datatypes are correct
df = df.astype(
dtype={
self.col_user: str,
self.col_item: str,
self.col_prediction: self.scores.dtype,
}
)
def _user_item_return_type():
return str
return df
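predict handles items unseen during training by appending an all-zero column to the score matrix and pointing the unknown item indices at it, so they score 0 instead of raising a KeyError; a minimal sketch of that trick with made-up values.
```
import numpy as np

n_users, n_items = 2, 3
test_scores = np.array([[0.2, 0.5, 0.1],
                        [0.4, 0.3, 0.9]])

# item ids looked up via item2index; NaN marks an item unseen during training
item_ids = np.array([1.0, np.nan])

nans = np.isnan(item_ids)
if any(nans):
    # the extra all-zero column serves as the score for every unknown item
    test_scores = np.append(test_scores, np.zeros((n_users, 1)), axis=1)
    item_ids[nans] = n_items
item_ids = item_ids.astype("int64")

print(test_scores[np.arange(test_scores.shape[0]), item_ids])  # [0.5 0. ]
```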

View File

@ -0,0 +1,139 @@
#!/usr/bin/python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# This script creates yaml files to build conda environments
# For generating a conda file for running only python code:
# $ python generate_conda_file.py
# For generating a conda file for running python gpu:
# $ python generate_conda_file.py --gpu
# For generating a conda file for running pyspark:
# $ python generate_conda_file.py --pyspark
# For generating a conda file for running python gpu and pyspark:
# $ python generate_conda_file.py --gpu --pyspark
# For generating a conda file for running python gpu and pyspark with a particular version:
# $ python generate_conda_file.py --gpu --pyspark-version 2.4.0
import argparse
CHANNELS = [
'conda-forge',
'pytorch',
'fastai',
'defaults',
]
CONDA_BASE = {
'dask': 'dask>=0.17.1',
'fastai': 'fastai>=1.0.40',
'fastparquet': 'fastparquet>=0.1.6',
'gitpython': 'gitpython>=2.1.8',
'ipykernel': 'ipykernel>=4.6.1',
'jupyter': 'jupyter>=1.0.0',
'matplotlib': 'matplotlib>=2.2.2',
'numpy': 'numpy>=1.13.3',
'pandas': 'pandas>=0.23.4',
'pymongo': 'pymongo>=3.6.1',
'python': 'python==3.6.8',
'pytest': 'pytest>=3.6.4',
'seaborn': 'seaborn>=0.8.1',
'scikit-learn': 'scikit-learn==0.19.1',
'scipy': 'scipy>=1.0.0',
'scikit-surprise': 'scikit-surprise>=1.0.6',
'tensorflow': 'tensorflow==1.12.0',
}
CONDA_PYSPARK = {
'pyarrow': 'pyarrow>=0.8.0',
'pyspark': 'pyspark==2.3.1',
}
CONDA_GPU = {
'numba': 'numba>=0.38.1',
'tensorflow': 'tensorflow-gpu==1.12.0',
}
PIP_BASE = {
'azureml-sdk[notebooks,contrib]': 'azureml-sdk[notebooks,contrib]>=1.0.8',
'azure-storage': 'azure-storage>=0.36.0',
'black': 'black>=18.6b4',
'dataclasses': 'dataclasses>=0.6',
'hyperopt': 'hyperopt==0.1.1',
'idna': 'idna==2.7',
'memory-profiler': 'memory-profiler>=0.54.0',
'nvidia-ml-py3': 'nvidia-ml-py3>=7.352.0',
'papermill': 'papermill>=0.15.0',
'pydocumentdb': 'pydocumentdb>=2.3.3',
}
PIP_PYSPARK = {}
PIP_GPU = {}
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='This script generates a conda file for different environments. Plain python is the default, flags can be used to add packages needed to support pyspark and gpu functionality')
parser.add_argument('--name', help='specify name of conda environment')
parser.add_argument('--gpu', action="store_true", help='include packages for gpu support')
parser.add_argument('--pyspark', action="store_true", help='include packages for pyspark support')
parser.add_argument('--pyspark-version', help='provide specific version of pyspark to use')
args = parser.parse_args()
# check pyspark version
if args.pyspark_version is not None:
args.pyspark = True
pyspark_version_info = args.pyspark_version.split('.')
if len(pyspark_version_info) != 3 or any([not x.isdigit() for x in pyspark_version_info]):
raise TypeError('Pyspark version input must be valid numeric format (e.g. --pyspark-version=2.3.1)')
# set name for environment and output yaml file
conda_env = 'reco_base'
if args.gpu and args.pyspark:
conda_env = 'reco_full'
elif args.gpu:
conda_env = 'reco_gpu'
elif args.pyspark:
conda_env = 'reco_pyspark'
# overwrite environment name with user input
if args.name is not None:
conda_env = args.name
# update conda and pip packages based on flags provided
conda_packages = CONDA_BASE
pip_packages = PIP_BASE
if args.pyspark:
conda_packages.update(CONDA_PYSPARK)
conda_packages['pyspark'] = 'pyspark=={}'.format(args.pyspark_version)
pip_packages.update(PIP_PYSPARK)
if args.gpu:
conda_packages.update(CONDA_GPU)
pip_packages.update(PIP_GPU)
# write out yaml file
conda_file = '{}.yaml'.format(conda_env)
with open(conda_file, 'w') as f:
f.write('name: {}\n'.format(conda_env))
f.write('channels:\n')
for channel in CHANNELS:
f.write('- {}\n'.format(channel))
f.write('dependencies:\n')
for conda_package in conda_packages.values():
f.write('- {}\n'.format(conda_package))
f.write('- pip:\n')
for pip_package in pip_packages.values():
f.write(' - {}\n'.format(pip_package))
print("""Generated conda file: {conda_file}
To create the conda environment:
$ conda env create -f {conda_file}
To update the conda environment:
$ conda env update -f {conda_file}
To register the conda environment in Jupyter:
$ conda activate {conda_env}
$ python -m ipykernel install --user --name {conda_env} --display-name "Python ({conda_env})"
""".format(conda_env=conda_env, conda_file=conda_file))

View File

@ -133,10 +133,7 @@ def demo_usage_data(header, sar_settings):
@pytest.fixture(scope="module")
def demo_usage_data_spark(spark, demo_usage_data, header):
data_local = demo_usage_data[[x[1] for x in header.items()]]
# TODO: install pyArrow in DS VM
# spark.conf.set("spark.sql.execution.arrow.enabled", "true")
data = spark.createDataFrame(data_local)
return data
return spark.createDataFrame(data_local)
@pytest.fixture(scope="module")
@ -147,14 +144,20 @@ def notebooks():
paths = {
"template": os.path.join(folder_notebooks, "template.ipynb"),
"sar_single_node": os.path.join(
folder_notebooks, "00_quick_start", "sar_single_node_movielens.ipynb"
folder_notebooks, "00_quick_start", "sar_movielens.ipynb"
),
"ncf": os.path.join(folder_notebooks, "00_quick_start", "ncf_movielens.ipynb"),
"als_pyspark": os.path.join(
folder_notebooks, "00_quick_start", "als_pyspark_movielens.ipynb"
folder_notebooks, "00_quick_start", "als_movielens.ipynb"
),
"fastai": os.path.join(
folder_notebooks, "00_quick_start", "fastai_recommendation.ipynb"
folder_notebooks, "00_quick_start", "fastai_movielens.ipynb"
),
"xdeepfm_quickstart": os.path.join(
folder_notebooks, "00_quick_start", "xdeepfm_synthetic.ipynb"
),
"dkn_quickstart": os.path.join(
folder_notebooks, "00_quick_start", "dkn_synthetic.ipynb"
),
"data_split": os.path.join(
folder_notebooks, "01_prepare_data", "data_split.ipynb"
@ -171,14 +174,13 @@ def notebooks():
"ncf_deep_dive": os.path.join(
folder_notebooks, "02_model", "ncf_deep_dive.ipynb"
),
"sar_deep_dive": os.path.join(
folder_notebooks, "02_model", "sar_deep_dive.ipynb"
),
"vowpal_wabbit_deep_dive": os.path.join(
folder_notebooks, "02_model", "vowpal_wabbit_deep_dive.ipynb"
),
"evaluation": os.path.join(folder_notebooks, "03_evaluate", "evaluation.ipynb"),
"fastai": os.path.join(
folder_notebooks, "00_quick_start", "fastai_recommendation.ipynb"
),
"xdeepfm_quickstart": os.path.join(
folder_notebooks, "00_quick_start", "xdeepfm.ipynb"
),
"dkn_quickstart": os.path.join(folder_notebooks, "00_quick_start", "dkn.ipynb"),
}
return paths

View File

@ -23,13 +23,14 @@ except ImportError:
@pytest.mark.parametrize(
"size, num_samples, num_movies, title_example, genres_example",
[
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
("1m", 1000209, 3883, "Toy Story (1995)", "Animation|Children's|Comedy"),
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
],
)
def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_example):
"""Test MovieLens dataset load into pd.DataFrame
"""
"""Test MovieLens dataset load into pd.DataFrame"""
df = movielens.load_pandas_df(size=size)
assert len(df) == num_samples
assert len(df.columns) == 4
@ -70,8 +71,9 @@ def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_exa
@pytest.mark.parametrize(
"size, num_samples, num_movies, title_example, genres_example",
[
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
("1m", 1000209, 3883, "Toy Story (1995)", "Animation|Children's|Comedy"),
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
],
)
def test_load_spark_df(size, num_samples, num_movies, title_example, genres_example):

View File

@ -82,3 +82,30 @@ def test_surprise_svd_integration(notebooks, size, expected_values):
for key, value in expected_values.items():
assert results[key] == pytest.approx(value, rel=TOL)
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
[
("1m", dict(rmse=0.9555,
mae=0.68493,
rsquared=0.26547,
exp_var=0.26615,
map=0.50635,
ndcg=0.99966,
precision=0.92684,
recall=0.50635)),
],
)
def test_vw_deep_dive_integration(notebooks, size, expected_values):
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(MOVIELENS_DATA_SIZE=size, TOP_K=10),
)
results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]
for key, value in expected_values.items():
assert results[key] == pytest.approx(value, rel=TOL)

View File

@ -4,12 +4,11 @@
import numpy as np
import pytest
from reco_utils.dataset.numpy_splitters import numpy_stratified_split
from reco_utils.dataset.python_splitters import numpy_stratified_split
@pytest.fixture(scope="module")
def test_specs():
return {
"users": 30,
"items": 53,
@ -22,20 +21,18 @@ def test_specs():
@pytest.fixture(scope="module")
def affinity_matrix(test_specs):
"""
Generate a random user/item affinity matrix. By increasing the likelihood of 0 elements we simulate
a typical recommending situation where the input matrix is highly sparse.
"""Generate a random user/item affinity matrix. By increasing the likelihood of 0 elements we simulate
a typical recommending situation where the input matrix is highly sparse.
Args:
users (int): number of users (rows).
items (int): number of items (columns).
ratings (int): rating scale, e.g. 5 meaning rates are from 1 to 5.
spars: probablity of obtaining zero. This roughly correponds to the sparseness.
spars: probability of obtaining zero. This roughly corresponds to the sparseness.
of the generated matrix. If spars = 0 then the affinity matrix is dense.
Returns:
X (np array, int): sparse user/affinity matrix
np.array: sparse user/affinity matrix of integers.
"""

View File

@ -6,10 +6,6 @@ import urllib.request
import csv
import codecs
import logging
log = logging.getLogger(__name__)
def _csv_reader_url(url, delimiter=",", encoding="utf-8"):
ftpstream = urllib.request.urlopen(url)

View File

@ -24,8 +24,7 @@ except ImportError:
@pytest.mark.parametrize(
"size, num_samples, num_movies, title_example, genres_example",
[
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
],
)
def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_example):
@ -72,8 +71,7 @@ def test_load_pandas_df(size, num_samples, num_movies, title_example, genres_exa
@pytest.mark.parametrize(
"size, num_samples, num_movies, title_example, genres_example",
[
("10m", 10000054, 10681, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("20m", 20000263, 27278, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
("100k", 100000, 1682, "Toy Story (1995)", "Animation|Children's|Comedy"),
],
)
def test_load_spark_df(size, num_samples, num_movies, title_example, genres_example):

View File

@ -8,6 +8,7 @@ from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
TOL = 0.05
@pytest.mark.smoke
def test_sar_single_node_smoke(notebooks):
notebook_path = notebooks["sar_single_node"]
@ -68,4 +69,24 @@ def test_surprise_svd_smoke(notebooks):
assert results["ndcg"] == pytest.approx(0.1, TOL)
assert results["precision"] == pytest.approx(0.095, TOL)
assert results["recall"] == pytest.approx(0.032, TOL)
def test_vw_deep_dive_smoke(notebooks):
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(MOVIELENS_DATA_SIZE="100k"),
)
results = pm.read_notebook(OUTPUT_NOTEBOOK).dataframe.set_index("name")["value"]
assert results["rmse"] == pytest.approx(0.99575, TOL)
assert results["mae"] == pytest.approx(0.72024, TOL)
assert results["rsquared"] == pytest.approx(0.22961, TOL)
assert results["exp_var"] == pytest.approx(0.22967, TOL)
assert results["map"] == pytest.approx(0.25684, TOL)
assert results["ndcg"] == pytest.approx(0.65339, TOL)
assert results["precision"] == pytest.approx(0.514738, TOL)
assert results["recall"] == pytest.approx(0.25684, TOL)

View File

@ -4,23 +4,15 @@
import os
import sys
import pytest
# TODO: better solution??
root = os.path.abspath(
os.path.join(os.path.dirname(__file__), os.path.pardir, os.path.pardir)
)
sys.path.append(root)
from reco_utils.dataset.url_utils import maybe_download
def test_maybe_download():
# TODO: change this file to the repo license when it is public
file_url = "https://raw.githubusercontent.com/Microsoft/vscode/master/LICENSE.txt"
file_url = "https://raw.githubusercontent.com/Microsoft/Recommenders/master/LICENSE"
filepath = "license.txt"
assert not os.path.exists(filepath)
filepath = maybe_download(file_url, "license.txt", expected_bytes=1110)
filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
assert os.path.exists(filepath)
# TODO: download again and test that the file is already there, grab the log??
os.remove(filepath)
with pytest.raises(IOError):
filepath = maybe_download(file_url, "license.txt", expected_bytes=0)

View File

@ -1,28 +1,32 @@
import pytest
import os
from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.xDeepFM import *
from reco_utils.recommender.deeprec.models.dkn import *
from reco_utils.recommender.deeprec.IO.iterator import *
from reco_utils.recommender.deeprec.IO.dkn_iterator import *
from reco_utils.recommender.deeprec.deeprec_utils import prepare_hparams, download_deeprec_resources
from reco_utils.recommender.deeprec.models.xDeepFM import XDeepFMModel
from reco_utils.recommender.deeprec.models.dkn import DKN
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator
from reco_utils.recommender.deeprec.IO.dkn_iterator import DKNTextIterator
@pytest.fixture
def resource_path():
return os.path.dirname(os.path.realpath(__file__))
@pytest.mark.gpu
@pytest.mark.deeprec
def test_xdeepfm_component_definition(resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"xdeepfmresources.zip",
)
hparams = prepare_hparams(yaml_file)
input_creator = FFMTextIterator
model = XDeepFMModel(hparams, input_creator)
model = XDeepFMModel(hparams, FFMTextIterator)
assert model.logit is not None
assert model.update is not None
@ -32,19 +36,27 @@ def test_xdeepfm_component_definition(resource_path):
@pytest.mark.gpu
@pytest.mark.deeprec
def test_dkn_component_definition(resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/dkn')
yaml_file = os.path.join(data_path, r'dkn.yaml')
wordEmb_file = os.path.join(data_path, r'word_embeddings_100.npy')
entityEmb_file = os.path.join(data_path, r'TransE_entity2vec_100.npy')
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "dkn")
yaml_file = os.path.join(data_path, "dkn.yaml")
wordEmb_file = os.path.join(data_path, "word_embeddings_100.npy")
entityEmb_file = os.path.join(data_path, "TransE_entity2vec_100.npy")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'dknresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"dknresources.zip",
)
hparams = prepare_hparams(yaml_file, wordEmb_file=wordEmb_file,
entityEmb_file=entityEmb_file, epochs=5, learning_rate=0.0001)
hparams = prepare_hparams(
yaml_file,
wordEmb_file=wordEmb_file,
entityEmb_file=entityEmb_file,
epochs=5,
learning_rate=0.0001,
)
assert hparams is not None
input_creator = DKNTextIterator
model = DKN(hparams, input_creator)
model = DKN(hparams, DKNTextIterator)
assert model.logit is not None
assert model.update is not None

View File

@ -1,54 +1,68 @@
import pytest
import os
from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.IO.iterator import *
from reco_utils.recommender.deeprec.IO.dkn_iterator import *
import tensorflow as tf
from reco_utils.recommender.deeprec.deeprec_utils import (
prepare_hparams,
download_deeprec_resources,
load_yaml_file
)
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator
from reco_utils.recommender.deeprec.IO.dkn_iterator import DKNTextIterator
@pytest.fixture
def resource_path():
return os.path.dirname(os.path.realpath(__file__))
@pytest.mark.parametrize("must_exist_attributes", [
"FEATURE_COUNT", "data_format", "dim"
])
@pytest.mark.parametrize(
"must_exist_attributes", ["FEATURE_COUNT", "data_format", "dim"]
)
@pytest.mark.gpu
@pytest.mark.deeprec
def test_prepare_hparams(must_exist_attributes,resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
def test_prepare_hparams(must_exist_attributes, resource_path):
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"xdeepfmresources.zip",
)
hparams = prepare_hparams(yaml_file)
assert hasattr(hparams, must_exist_attributes)
@pytest.mark.gpu
@pytest.mark.deeprec
def test_load_yaml_file(resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
'xdeepfmresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"xdeepfmresources.zip",
)
config = load_yaml_file(yaml_file)
assert config is not None
@pytest.mark.gpu
@pytest.mark.deeprec
def test_FFM_iterator(resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/xdeepfm')
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
data_file = os.path.join(data_path, r'sample_FFM_data.txt')
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "xdeepfm")
yaml_file = os.path.join(data_path, "xDeepFM.yaml")
data_file = os.path.join(data_path, "sample_FFM_data.txt")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
'xdeepfmresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"xdeepfmresources.zip",
)
hparams = prepare_hparams(yaml_file)
iterator = FFMTextIterator(hparams, tf.Graph())
@ -56,17 +70,21 @@ def test_FFM_iterator(resource_path):
for res in iterator.load_data_from_file(data_file):
assert isinstance(res, dict)
@pytest.mark.gpu
@pytest.mark.deeprec
def test_DKN_iterator(resource_path):
data_path = os.path.join(resource_path, '../resources/deeprec/dkn')
data_file = os.path.join(data_path, r'final_test_with_entity.txt')
yaml_file = os.path.join(data_path, r'dkn.yaml')
data_path = os.path.join(resource_path, "..", "resources", "deeprec", "dkn")
data_file = os.path.join(data_path, "final_test_with_entity.txt")
yaml_file = os.path.join(data_path, "dkn.yaml")
if not os.path.exists(yaml_file):
download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path,
'dknresources.zip')
download_deeprec_resources(
"https://recodatasets.blob.core.windows.net/deeprec/",
data_path,
"dknresources.zip",
)
hparams = prepare_hparams(yaml_file, wordEmb_file='', entityEmb_file='')
hparams = prepare_hparams(yaml_file, wordEmb_file="", entityEmb_file="")
iterator = DKNTextIterator(hparams, tf.Graph())
assert iterator is not None
for res in iterator.load_data_from_file(data_file):


@ -10,20 +10,17 @@ from reco_utils.common.constants import (
DEFAULT_RATING_COL,
DEFAULT_TIMESTAMP_COL,
)
from reco_utils.recommender.ncf.dataset import Dataset
from tests.ncf_common import python_dataset_ncf, test_specs_ncf
N_NEG = 5
N_NEG_TEST = 10
BATCH_SIZE = 32
def test_data_preprocessing(python_dataset_ncf):
# test dataset._data_preprocessing and dataset._reindex
def test_data_preprocessing(python_dataset_ncf):
train, test = python_dataset_ncf
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
# shape
@ -43,11 +40,9 @@ def test_data_preprocessing(python_dataset_ncf):
assert data_row[1][DEFAULT_ITEM_COL] == data.item2id[row[1][DEFAULT_ITEM_COL]]
assert row[1][DEFAULT_ITEM_COL] == data.id2item[data_row[1][DEFAULT_ITEM_COL]]
def test_train_loader(python_dataset_ncf):
# test dataset.train_loader()
def test_train_loader(python_dataset_ncf):
train, test = python_dataset_ncf
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
# collect positive user-item dict
@ -62,7 +57,6 @@ def test_train_loader(python_dataset_ncf):
assert len(user) == BATCH_SIZE
assert len(item) == BATCH_SIZE
assert len(labels) == BATCH_SIZE
assert max(labels) == min(labels)
# right labels
@ -73,12 +67,8 @@ def test_train_loader(python_dataset_ncf):
assert i not in positive_pool[u]
data.negative_sampling()
label_list = []
batches = []
for idx, batch in enumerate(data.train_loader(batch_size=1)):
user, item, labels = batch
assert len(user) == 1
@ -99,23 +89,18 @@ def test_train_loader(python_dataset_ncf):
def test_test_loader(python_dataset_ncf):
# test for dataset.test_loader()
train, test = python_dataset_ncf
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
# positive user-item dict, noting that the pool is train+test
positive_pool = {}
df = train.append(test)
for u in df[DEFAULT_USER_COL].unique():
for u in df[DEFAULT_USER_COL].unique():
positive_pool[u] = set(df[df[DEFAULT_USER_COL] == u][DEFAULT_ITEM_COL])
for batch in data.test_loader():
user, item, labels = batch
# shape
assert len(user) == N_NEG_TEST + 1
assert len(item) == N_NEG_TEST + 1
assert len(labels) == N_NEG_TEST + 1
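# A minimal sketch of the loader contract the assertions above rely on (train/test are
# the pandas DataFrames provided by the python_dataset_ncf fixture): train_loader yields
# (user, item, labels) batches of batch_size, negative_sampling() draws a fresh set of
# negatives for a training pass (as done above), and test_loader yields one positive
# plus n_neg_test sampled negatives per batch.
data = Dataset(train=train, test=test, n_neg=N_NEG, n_neg_test=N_NEG_TEST)
data.negative_sampling()
for user, item, labels in data.train_loader(batch_size=BATCH_SIZE):
    assert len(user) == len(item) == len(labels) == BATCH_SIZE
    break
for user, item, labels in data.test_loader():
    assert len(user) == len(item) == len(labels) == N_NEG_TEST + 1
    break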


@ -9,19 +9,19 @@ import os
import shutil
from reco_utils.recommender.ncf.ncf_singlenode import NCF
from reco_utils.recommender.ncf.dataset import Dataset
N_NEG = 5
N_NEG_TEST = 10
from reco_utils.common.constants import (
DEFAULT_USER_COL,
DEFAULT_ITEM_COL,
DEFAULT_RATING_COL,
DEFAULT_TIMESTAMP_COL,
)
from tests.ncf_common import python_dataset_ncf, test_specs_ncf
N_NEG = 5
N_NEG_TEST = 10
@pytest.mark.gpu
@pytest.mark.parametrize(
"model_type, n_users, n_items", [("NeuMF", 1, 1), ("GMF", 10, 10), ("MLP", 4, 8)]
@ -45,6 +45,7 @@ def test_init(model_type, n_users, n_items):
# TODO: more parameters
@pytest.mark.gpu
@pytest.mark.parametrize(
"model_type, n_users, n_items", [("NeuMF", 5, 5), ("GMF", 5, 5), ("MLP", 5, 5)]


@ -9,8 +9,6 @@ from reco_utils.common.notebook_utils import is_jupyter, is_databricks
@pytest.mark.notebooks
def test_is_jupyter():
"""Test if the module is running on Jupyter
"""
# Test on the terminal
assert is_jupyter() is False
assert is_databricks() is False


@ -28,6 +28,12 @@ def test_sar_single_node_runs(notebooks):
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
@pytest.mark.notebooks
def test_sar_deep_dive_runs(notebooks):
notebook_path = notebooks["sar_deep_dive"]
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
@pytest.mark.notebooks
def test_baseline_deep_dive_runs(notebooks):
notebook_path = notebooks["baseline_deep_dive"]
@ -38,3 +44,9 @@ def test_baseline_deep_dive_runs(notebooks):
def test_surprise_deep_dive_runs(notebooks):
notebook_path = notebooks["surprise_svd_deep_dive"]
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
@pytest.mark.notebooks
def test_vw_deep_dive_runs(notebooks):
notebook_path = notebooks["vowpal_wabbit_deep_dive"]
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
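# A sketch of how any of the smoke tests above could pass notebook parameters through
# papermill; MOVIELENS_DATA_SIZE is a hypothetical parameter name used only to illustrate
# the `parameters` argument, not something these tests currently set.
pm.execute_notebook(
    notebook_path,
    OUTPUT_NOTEBOOK,
    kernel_name=KERNEL_NAME,
    parameters=dict(MOVIELENS_DATA_SIZE="100k"),
)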


@ -1,159 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pandas as pd
import numpy as np
import pytest
from reco_utils.dataset.numpy_splitters import numpy_stratified_split
@pytest.fixture(scope="module")
def test_specs():
return {
"number_of_items": 50,
"number_of_users": 20,
"seed": 123,
"ratio": 0.6,
"tolerance": 0.01,
"fluctuation": 0.02,
}
@pytest.fixture(scope="module")
def python_int_dataset(test_specs):
"""Generate a test user/item affinity Matrix"""
# fix the random seed
np.random.seed(test_specs["seed"])
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
X = np.random.randint(
low=0,
high=6,
size=(test_specs["number_of_users"], test_specs["number_of_items"]),
)
return X
@pytest.fixture(scope="module")
def python_float_dataset(test_specs):
"""Generate a test user/item affinity Matrix"""
# fix the random seed
np.random.seed(test_specs["seed"])
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
X = (
np.random.random(
size=(test_specs["number_of_users"], test_specs["number_of_items"])
)
* 5
)
return X
def test_int_numpy_stratified_splitter(test_specs, python_int_dataset):
"""
Test the random stratified splitter.
"""
# generate a synthetic dataset
X = python_int_dataset
# the splitter returns the train and test user/item affinity matrices
Xtr, Xtst = numpy_stratified_split(
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
# Tests
# check that the generated matrices have the correct dimensions
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
# global split: check that the whole dataset is split in the correct ratio
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
test_specs["ratio"], test_specs["tolerance"]
)
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
1 - test_specs["ratio"], test_specs["tolerance"]
)
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
# that this more stringent condition also holds. Note that user-to-user fluctuations in the split ratio
# are stronger than for the entire dataset due to the random nature of the per-user splitting.
# For this reason we allow a slightly bigger tolerance, as specified in test_specs().
assert (
(Xtr_rated / X_rated <= test_specs["ratio"] + test_specs["fluctuation"]).all()
& (Xtr_rated / X_rated >= test_specs["ratio"] - test_specs["fluctuation"]).all()
)
assert (
(
Xtst_rated / X_rated
<= (1 - test_specs["ratio"]) + test_specs["fluctuation"]
).all()
& (
Xtst_rated / X_rated
>= (1 - test_specs["ratio"]) - test_specs["fluctuation"]
).all()
)
def test_float_numpy_stratified_splitter(test_specs, python_float_dataset):
"""
Test the random stratified splitter.
"""
# generate a synthetic dataset
X = python_float_dataset
# the splitter returns the train and test user/item affinity matrices
Xtr, Xtst = numpy_stratified_split(
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
# Tests
# check that the generated matrices have the correct dimensions
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
# global split: check that the whole dataset is split in the correct ratio
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
test_specs["ratio"], test_specs["tolerance"]
)
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
1 - test_specs["ratio"], test_specs["tolerance"]
)
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
# that this more stringent condition also holds. Note that user-to-user fluctuations in the split ratio
# are stronger than for the entire dataset due to the random nature of the per-user splitting.
# For this reason we allow a slightly bigger tolerance, as specified in test_specs().
assert Xtr_rated / X_rated == pytest.approx(
test_specs["ratio"], rel=test_specs["fluctuation"]
)
assert Xtst_rated / X_rated == pytest.approx(
(1 - test_specs["ratio"]), rel=test_specs["fluctuation"]
)


@ -18,7 +18,7 @@ TOL = 0.0001
@pytest.fixture
def target_metrics():
return {
"rmse": pytest.approx(7.254309, TOL),
"mae": pytest.approx(6.375, TOL),
@ -92,7 +92,6 @@ def python_data():
],
}
)
return rating_true, rating_pred, rating_nohit
@ -120,7 +119,6 @@ def test_python_rsquared(python_data, target_metrics):
assert rsquared(
rating_true=rating_true, rating_pred=rating_true, col_prediction="rating"
) == pytest.approx(1.0, TOL)
assert rsquared(rating_true, rating_pred) == target_metrics["rsquared"]
@ -130,7 +128,6 @@ def test_python_exp_var(python_data, target_metrics):
assert exp_var(
rating_true=rating_true, rating_pred=rating_true, col_prediction="rating"
) == pytest.approx(1.0, TOL)
assert exp_var(rating_true, rating_pred) == target_metrics["exp_var"]


@ -10,10 +10,12 @@ from reco_utils.dataset.split_utils import (
min_rating_filter_pandas,
split_pandas_data_with_ratios,
)
from reco_utils.dataset.python_splitters import (
python_chrono_split,
python_random_split,
python_stratified_split,
numpy_stratified_split,
)
from reco_utils.common.constants import (
@ -34,13 +36,14 @@ def test_specs():
"ratios": [0.2, 0.3, 0.5],
"split_numbers": [2, 3, 5],
"tolerance": 0.01,
"number_of_items": 50,
"number_of_users": 20,
"fluctuation": 0.02,
}
@pytest.fixture(scope="module")
def python_dataset(test_specs):
"""Get Python labels"""
def random_date_generator(start_date, range_in_days):
"""Helper function to generate random timestamps.
@ -59,13 +62,13 @@ def python_dataset(test_specs):
rating = pd.DataFrame(
{
DEFAULT_USER_COL: np.random.random_integers(
DEFAULT_USER_COL: np.random.randint(
1, 5, test_specs["number_of_rows"]
),
DEFAULT_ITEM_COL: np.random.random_integers(
DEFAULT_ITEM_COL: np.random.randint(
1, 15, test_specs["number_of_rows"]
),
DEFAULT_RATING_COL: np.random.random_integers(
DEFAULT_RATING_COL: np.random.randint(
1, 5, test_specs["number_of_rows"]
),
DEFAULT_TIMESTAMP_COL: random_date_generator(
@ -73,32 +76,22 @@ def python_dataset(test_specs):
),
}
)
return rating
def test_split_pandas_data(pandas_dummy_timestamp):
"""Test split pandas data
"""
df_rating = pandas_dummy_timestamp
splits = split_pandas_data_with_ratios(df_rating, ratios=[0.5, 0.5])
splits = split_pandas_data_with_ratios(pandas_dummy_timestamp, ratios=[0.5, 0.5])
assert len(splits[0]) == 5
assert len(splits[1]) == 5
splits = split_pandas_data_with_ratios(df_rating, ratios=[0.12, 0.36, 0.52])
assert len(splits[0]) == round(df_rating.shape[0] * 0.12)
assert len(splits[1]) == round(df_rating.shape[0] * 0.36)
assert len(splits[2]) == round(df_rating.shape[0] * 0.52)
splits = split_pandas_data_with_ratios(pandas_dummy_timestamp, ratios=[0.12, 0.36, 0.52])
shape = pandas_dummy_timestamp.shape[0]
assert len(splits[0]) == round(shape * 0.12)
assert len(splits[1]) == round(shape * 0.36)
assert len(splits[2]) == round(shape * 0.52)
def test_min_rating_filter(python_dataset):
"""Test min rating filter
"""
df_rating = python_dataset
def count_filtered_rows(data, filter_by="user"):
split_by_column = DEFAULT_USER_COL if filter_by == "user" else DEFAULT_ITEM_COL
data_grouped = data.groupby(split_by_column)
@ -110,9 +103,8 @@ def test_min_rating_filter(python_dataset):
return row_counts
df_user = min_rating_filter_pandas(df_rating, min_rating=5, filter_by="user")
df_item = min_rating_filter_pandas(df_rating, min_rating=5, filter_by="item")
df_user = min_rating_filter_pandas(python_dataset, min_rating=5, filter_by="user")
df_item = min_rating_filter_pandas(python_dataset, min_rating=5, filter_by="item")
user_rating_counts = count_filtered_rows(df_user, filter_by="user")
item_rating_counts = count_filtered_rows(df_item, filter_by="item")
@ -128,10 +120,8 @@ def test_random_splitter(test_specs, python_dataset):
the testing data. An approximate match with a certain level of tolerance is therefore used
instead for tests.
"""
df_rating = python_dataset
splits = python_random_split(
df_rating, ratio=test_specs["ratio"], seed=test_specs["seed"]
python_dataset, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
test_specs["ratio"], test_specs["tolerance"]
@ -141,7 +131,7 @@ def test_random_splitter(test_specs, python_dataset):
)
splits = python_random_split(
df_rating, ratio=test_specs["ratios"], seed=test_specs["seed"]
python_dataset, ratio=test_specs["ratios"], seed=test_specs["seed"]
)
assert len(splits) == 3
@ -156,7 +146,7 @@ def test_random_splitter(test_specs, python_dataset):
)
splits = python_random_split(
df_rating, ratio=test_specs["split_numbers"], seed=test_specs["seed"]
python_dataset, ratio=test_specs["split_numbers"], seed=test_specs["seed"]
)
assert len(splits) == 3
@ -172,12 +162,8 @@ def test_random_splitter(test_specs, python_dataset):
def test_chrono_splitter(test_specs, python_dataset):
"""Test chronological splitter for Spark dataframes.
"""
df_rating = python_dataset
splits = python_chrono_split(
df_rating, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
python_dataset, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
)
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
@ -187,27 +173,21 @@ def test_chrono_splitter(test_specs, python_dataset):
1 - test_specs["ratio"], test_specs["tolerance"]
)
# Test all time stamps in test are later than that in train for all users.
# This is for single-split case.
all_later = []
for user in test_specs["user_ids"]:
df_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
df_test = splits[1][splits[1][DEFAULT_USER_COL] == user]
p = product(df_train[DEFAULT_TIMESTAMP_COL], df_test[DEFAULT_TIMESTAMP_COL])
user_later = [a <= b for (a, b) in p]
all_later.append(user_later)
assert all(all_later)
# Test if both contain the same user list. This is because the chrono split is stratified.
users_train = splits[0][DEFAULT_USER_COL].unique()
users_test = splits[1][DEFAULT_USER_COL].unique()
assert set(users_train) == set(users_test)
# Test all time stamps in test are later than that in train for all users.
# This is for single-split case.
max_train_times = splits[0][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
min_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
check_times = max_train_times.join(min_test_times, lsuffix='_0', rsuffix='_1')
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_0'] < check_times[DEFAULT_TIMESTAMP_COL + '_1']).values)
# Test multi-split case
splits = python_chrono_split(
df_rating, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
python_dataset, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
)
assert len(splits) == 3
@ -221,30 +201,28 @@ def test_chrono_splitter(test_specs, python_dataset):
test_specs["ratios"][2], test_specs["tolerance"]
)
# Test if all splits contain the same user list. This is because chrono split is stratified.
users_train = splits[0][DEFAULT_USER_COL].unique()
users_test = splits[1][DEFAULT_USER_COL].unique()
users_val = splits[2][DEFAULT_USER_COL].unique()
assert set(users_train) == set(users_test)
assert set(users_train) == set(users_val)
# Test if timestamps are correctly split. This is for multi-split case.
all_later = []
for user in test_specs["user_ids"]:
df_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
df_valid = splits[1][splits[1][DEFAULT_USER_COL] == user]
df_test = splits[2][splits[2][DEFAULT_USER_COL] == user]
max_train_times = splits[0][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
min_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
check_times = max_train_times.join(min_test_times, lsuffix='_0', rsuffix='_1')
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_0'] < check_times[DEFAULT_TIMESTAMP_COL + '_1']).values)
p1 = product(df_train[DEFAULT_TIMESTAMP_COL], df_valid[DEFAULT_TIMESTAMP_COL])
p2 = product(df_valid[DEFAULT_TIMESTAMP_COL], df_test[DEFAULT_TIMESTAMP_COL])
user_later_1 = [a <= b for (a, b) in p1]
user_later_2 = [a <= b for (a, b) in p2]
all_later.append(user_later_1)
all_later.append(user_later_2)
assert all(all_later)
max_test_times = splits[1][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).max()
min_val_times = splits[2][[DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL]].groupby(DEFAULT_USER_COL).min()
check_times = max_test_times.join(min_val_times, lsuffix='_1', rsuffix='_2')
assert all((check_times[DEFAULT_TIMESTAMP_COL + '_1'] < check_times[DEFAULT_TIMESTAMP_COL + '_2']).values)
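# The ordering checks above read: for every user, the latest timestamp in the earlier
# split must come strictly before the earliest timestamp in the later split. A small
# helper expressing the same idea (a sketch, not part of this module) could be:
def _pandas_if_later(df_earlier, df_later):
    max_earlier = df_earlier.groupby(DEFAULT_USER_COL)[DEFAULT_TIMESTAMP_COL].max()
    min_later = df_later.groupby(DEFAULT_USER_COL)[DEFAULT_TIMESTAMP_COL].min()
    return bool((max_earlier < min_later.reindex(max_earlier.index)).all())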
def test_stratified_splitter(test_specs, python_dataset):
"""Test stratified splitter.
"""
df_rating = python_dataset
splits = python_stratified_split(
df_rating, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
python_dataset, ratio=test_specs["ratio"], min_rating=10, filter_by="user"
)
assert len(splits[0]) / test_specs["number_of_rows"] == pytest.approx(
@ -261,7 +239,7 @@ def test_stratified_splitter(test_specs, python_dataset):
assert set(users_train) == set(users_test)
splits = python_stratified_split(
df_rating, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
python_dataset, ratio=test_specs["ratios"], min_rating=10, filter_by="user"
)
assert len(splits) == 3
@ -275,3 +253,117 @@ def test_stratified_splitter(test_specs, python_dataset):
test_specs["ratios"][2], test_specs["tolerance"]
)
@pytest.fixture(scope="module")
def python_int_dataset(test_specs):
# fix the random seed
np.random.seed(test_specs["seed"])
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
return np.random.randint(
low=0,
high=6,
size=(test_specs["number_of_users"], test_specs["number_of_items"]),
)
@pytest.fixture(scope="module")
def python_float_dataset(test_specs):
# fix the random seed
np.random.seed(test_specs["seed"])
# generates the user/item affinity matrix. Ratings are from 1 to 5, with 0s denoting unrated items
return np.random.random(
size=(test_specs["number_of_users"], test_specs["number_of_items"])
) * 5
def test_int_numpy_stratified_splitter(test_specs, python_int_dataset):
# generate a synthetic dataset
X = python_int_dataset
# the splitter returns the train and test user/item affinity matrices
Xtr, Xtst = numpy_stratified_split(
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
# check that the generated matrices have the correct dimensions
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
# global split: check that the whole dataset is split in the correct ratio
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
test_specs["ratio"], test_specs["tolerance"]
)
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
1 - test_specs["ratio"], test_specs["tolerance"]
)
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
# that this more stringent condition also holds. Note that user-to-user fluctuations in the split ratio
# are stronger than for the entire dataset due to the random nature of the per-user splitting.
# For this reason we allow a slightly bigger tolerance, as specified in test_specs().
assert (
(Xtr_rated / X_rated <= test_specs["ratio"] + test_specs["fluctuation"]).all()
& (Xtr_rated / X_rated >= test_specs["ratio"] - test_specs["fluctuation"]).all()
)
assert (
(
Xtst_rated / X_rated
<= (1 - test_specs["ratio"]) + test_specs["fluctuation"]
).all()
& (
Xtst_rated / X_rated
>= (1 - test_specs["ratio"]) - test_specs["fluctuation"]
).all()
)
def test_float_numpy_stratified_splitter(test_specs, python_float_dataset):
# generate a synthetic dataset
X = python_float_dataset
# the splitter returns the train and test user/item affinity matrices
Xtr, Xtst = numpy_stratified_split(
X, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
# Tests
# check that the generated matrices have the correct dimensions
assert (Xtr.shape[0] == X.shape[0]) & (Xtr.shape[1] == X.shape[1])
assert (Xtst.shape[0] == X.shape[0]) & (Xtst.shape[1] == X.shape[1])
X_rated = np.sum(X != 0, axis=1) # number of total rated items per user
Xtr_rated = np.sum(Xtr != 0, axis=1) # number of rated items in the train set
Xtst_rated = np.sum(Xtst != 0, axis=1) # number of rated items in the test set
# global split: check that the whole dataset is split in the correct ratio
assert Xtr_rated.sum() / (X_rated.sum()) == pytest.approx(
test_specs["ratio"], test_specs["tolerance"]
)
assert Xtst_rated.sum() / (X_rated.sum()) == pytest.approx(
1 - test_specs["ratio"], test_specs["tolerance"]
)
# This implementation of the stratified splitter performs a random split at the single user level. Here we check
# that this more stringent condition also holds. Note that user-to-user fluctuations in the split ratio
# are stronger than for the entire dataset due to the random nature of the per-user splitting.
# For this reason we allow a slightly bigger tolerance, as specified in test_specs().
assert Xtr_rated / X_rated == pytest.approx(
test_specs["ratio"], rel=test_specs["fluctuation"]
)
assert Xtst_rated / X_rated == pytest.approx(
(1 - test_specs["ratio"]), rel=test_specs["fluctuation"]
)
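# A minimal end-to-end sketch of numpy_stratified_split on a toy affinity matrix (the
# sizes and seed are illustrative): the split preserves the matrix shape and moves
# roughly `ratio` of each user's non-zero ratings into the train matrix, zeroing the
# held-out entries.
X_toy = np.random.randint(low=0, high=6, size=(10, 25))
Xtr_toy, Xtst_toy = numpy_stratified_split(X_toy, ratio=0.75, seed=42)
assert Xtr_toy.shape == X_toy.shape and Xtst_toy.shape == X_toy.shape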


@ -7,8 +7,8 @@ Test common python utils
import numpy as np
import pytest
from scipy.sparse import csc, csc_matrix
from reco_utils.common.python_utils import (
exponential_decay,
jaccard,
lift
)
@ -17,21 +17,21 @@ TOL = 0.0001
@pytest.fixture
def target_matrices():
J1 = np.mat('1.0, 0.0, 0.5; '
'0.0, 1.0, 0.33333; '
'0.5, 0.33333, 1.0')
J2 = np.mat('1.0, 0.0, 0.0, 0.2; '
'0.0, 1.0, 0.0, 0.0; '
'0.0, 0.0, 1.0, 0.5; '
'0.2, 0.0, 0.5, 1.0')
L1 = np.mat('1.0, 0.0, 0.5; '
'0.0, 0.5, 0.25; '
'0.5, 0.25, 0.5')
L2 = np.mat('0.5, 0.0, 0.0, 0.125; '
'0.0, 0.33333, 0.0, 0.0; '
'0.0, 0.0, 0.5, 0.25; '
'0.125, 0.0, 0.25, 0.25')
def target_matrices():
J1 = np.array([[1.0, 0.0, 0.5],
[0.0, 1.0, 0.33333],
[0.5, 0.33333, 1.0]])
J2 = np.array([[1.0, 0.0, 0.0, 0.2],
[0.0, 1.0, 0.0, 0.0],
[0.0, 0.0, 1.0, 0.5],
[0.2, 0.0, 0.5, 1.0]])
L1 = np.array([[1.0, 0.0, 0.5],
[0.0, 0.5, 0.25],
[0.5, 0.25, 0.5]])
L2 = np.array([[0.5, 0.0, 0.0, 0.125],
[0.0, 0.33333, 0.0, 0.0],
[0.0, 0.0, 0.5, 0.25],
[0.125, 0.0, 0.25, 0.25]])
return {
"jaccard1": pytest.approx(J1, TOL),
"jaccard2": pytest.approx(J2, TOL),
@ -42,36 +42,40 @@ def target_matrices():
@pytest.fixture(scope="module")
def python_data():
D1 = np.mat('1.0, 0.0, 1.0; '
'0.0, 2.0, 1.0; '
'1.0, 1.0, 2.0')
cooccurrence1 = csc_matrix(D1)
D2 = np.mat('2.0, 0.0, 0.0, 1.0; '
'0.0, 3.0, 0.0, 0.0; '
'0.0, 0.0, 2.0, 2.0; '
'1.0, 0.0, 2.0, 4.0')
cooccurrence2 = csc_matrix(D2)
cooccurrence1 = np.array([[1.0, 0.0, 1.0],
[0.0, 2.0, 1.0],
[1.0, 1.0, 2.0]])
cooccurrence2 = np.array([[2.0, 0.0, 0.0, 1.0],
[0.0, 3.0, 0.0, 0.0],
[0.0, 0.0, 2.0, 2.0],
[1.0, 0.0, 2.0, 4.0]])
return cooccurrence1, cooccurrence2
def test_python_jaccard(python_data, target_matrices):
cooccurrence1, cooccurrence2 = python_data
J1 = jaccard(cooccurrence1)
assert type(J1) == csc.csc_matrix
assert J1.todense() == target_matrices["jaccard1"]
assert type(J1) == np.ndarray
assert J1 == target_matrices["jaccard1"]
J2 = jaccard(cooccurrence2)
assert type(J2) == csc.csc_matrix
assert J2.todense() == target_matrices["jaccard2"]
assert type(J2) == np.ndarray
assert J2 == target_matrices["jaccard2"]
def test_python_lift(python_data, target_matrices):
cooccurrence1, cooccurrence2 = python_data
L1 = lift(cooccurrence1)
assert type(L1) == csc.csc_matrix
assert L1.todense() == target_matrices["lift1"]
assert type(L1) == np.ndarray
assert L1 == target_matrices["lift1"]
L2 = lift(cooccurrence2)
assert type(L2) == csc.csc_matrix
assert L2.todense() == target_matrices["lift2"]
assert type(L2) == np.ndarray
assert L2 == target_matrices["lift2"]
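# For reference, the expected matrices above follow from the co-occurrence counts C as
# jaccard(i, j) = C[i, j] / (C[i, i] + C[j, j] - C[i, j]) and
# lift(i, j) = C[i, j] / (C[i, i] * C[j, j]). A quick manual check against the first
# fixture (a sketch, not an additional test in this commit):
C = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 1.0], [1.0, 1.0, 2.0]])
diag = np.diag(C)
jaccard_manual = C / (diag[:, None] + diag[None, :] - C)
lift_manual = C / (diag[:, None] * diag[None, :])
# e.g. jaccard_manual[1, 2] == 1 / (2 + 2 - 1) == 0.33333... and lift_manual[1, 2] == 0.25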
def test_exponential_decay():
values = np.array([1, 2, 3, 4, 5, 6])
expected = np.array([0.25, 0.35355339, 0.5, 0.70710678, 1., 1.])
actual = exponential_decay(value=values, max_val=5, half_life=2)
assert np.allclose(actual, expected, atol=TOL)
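# The expected values above are consistent with exponential decay at the given half-life,
# clipped at 1: decay(v) = min(1, 2 ** (-(max_val - v) / half_life)). A standalone check
# of that closed form (a sketch, not an additional assertion in this module):
vals = np.array([1, 2, 3, 4, 5, 6])
manual = np.minimum(1.0, np.power(2.0, -(5.0 - vals) / 2.0))
# manual == [0.25, 0.35355339, 0.5, 0.70710678, 1.0, 1.0], matching `expected` above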


@ -3,14 +3,12 @@
import pytest
import numpy as np
from reco_utils.recommender.rbm.rbm import RBM
from tests.rbm_common import test_specs, affinity_matrix
@pytest.fixture(scope="module")
def init_rbm():
return {
"n_hidden": 100,
"epochs": 10,
@ -25,9 +23,6 @@ def init_rbm():
@pytest.mark.gpu
def test_class_init(init_rbm):
"""
Test the init of the model class
"""
model = RBM(
hidden_units=init_rbm["n_hidden"],
training_epoch=init_rbm["epochs"],
@ -59,9 +54,6 @@ def test_class_init(init_rbm):
@pytest.mark.gpu
def test_train_param_init(init_rbm, affinity_matrix):
"""
Test the dimension of the learning parameters
"""
# obtain the train/test set matrices
Xtr, Xtst = affinity_matrix
@ -86,9 +78,6 @@ def test_train_param_init(init_rbm, affinity_matrix):
@pytest.mark.gpu
def test_sampling_funct(init_rbm, affinity_matrix):
"""
Test the sampling functions
"""
# obtain the train/test set matrices
Xtr, Xtst = affinity_matrix


@ -22,40 +22,6 @@ def _rearrange_to_test(array, row_ids, col_ids, row_map, col_map):
return array
def _apply_sar_hash_index(model, train, test, header, pandas_new=False):
# TODO: review this function
# index all users and items which SAR will compute scores for
# bugfix to get around different pandas versions in build servers
if test is not None:
if pandas_new:
df_all = pd.concat([train, test], sort=False)
else:
df_all = pd.concat([train, test])
else:
df_all = train
# hash SAR
# Obtain all the users and items from both training and test data
unique_users = df_all[header["col_user"]].unique()
unique_items = df_all[header["col_item"]].unique()
# Hash users and items to smaller continuous space.
# Actually, this is an ordered set - it's discrete.
# This helps keep the matrices we keep in memory as small as possible.
enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))
enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))
item_map_dict = {x: i for i, x in enumerate_items_1}
user_map_dict = {x: i for i, x in enumerate_users_1}
# the reverse of the dictionary above - array index to actual ID
index2user = dict(enumerate_users_2)
index2item = dict(enumerate_items_2)
model.set_index(
unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item
)
def test_init(header):
model = SARSingleNode(
remove_seen=True, similarity_type="jaccard", **header
@ -78,8 +44,6 @@ def test_fit(similarity_type, timedecay_formula, train_test_dummy_timestamp, hea
**header
)
trainset, testset = train_test_dummy_timestamp
_apply_sar_hash_index(model, trainset, testset, header)
model.fit(trainset)
@ -96,9 +60,6 @@ def test_predict(
**header
)
trainset, testset = train_test_dummy_timestamp
_apply_sar_hash_index(model, trainset, testset, header)
model.fit(trainset)
preds = model.predict(testset)
@ -109,11 +70,6 @@ def test_predict(
assert preds[PREDICTION_COL].dtype == trainset[header["col_rating"]].dtype
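# With _apply_sar_hash_index gone, the flow inside these tests reduces to the sketch
# below: the tests now rely on fit() building the user/item index mappings (user2index,
# item2index) internally, so callers only pass the raw interaction dataframe plus the
# column-name header. Illustrative only; it mirrors the calls already made above.
model = SARSingleNode(remove_seen=True, similarity_type="jaccard", **header)
model.fit(trainset)
preds = model.predict(testset)
# internal lookups such as model.item2index[item_id] replace the old item_map_dict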
"""
Main SAR tests are below - load test files which are used for both Scala SAR and Python reference implementations
"""
# Tests 1-6
@pytest.mark.parametrize(
"threshold,similarity_type,file",
[
@ -139,8 +95,6 @@ def test_sar_item_similarity(
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_item_similarity, row_ids, col_ids = read_matrix(
@ -152,8 +106,8 @@ def test_sar_item_similarity(
model.item_similarity.todense(),
row_ids,
col_ids,
model.item_map_dict,
model.item_map_dict,
model.item2index,
model.item2index,
)
assert np.array_equal(
true_item_similarity.astype(test_item_similarity.dtype),
@ -161,11 +115,11 @@ def test_sar_item_similarity(
)
else:
test_item_similarity = _rearrange_to_test(
model.item_similarity.toarray(),
model.item_similarity,
row_ids,
col_ids,
model.item_map_dict,
model.item_map_dict,
model.item2index,
model.item2index,
)
assert np.allclose(
true_item_similarity.astype(test_item_similarity.dtype),
@ -174,7 +128,6 @@ def test_sar_item_similarity(
)
# Test 7
def test_user_affinity(demo_usage_data, sar_settings, header):
time_now = demo_usage_data[header["col_timestamp"]].max()
model = SARSingleNode(
@ -185,15 +138,14 @@ def test_user_affinity(demo_usage_data, sar_settings, header):
time_now=time_now,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_user_affinity, items = load_affinity(sar_settings["FILE_DIR"] + "user_aff.csv")
user_index = model.user_map_dict[sar_settings["TEST_USER_ID"]]
user_index = model.user2index[sar_settings["TEST_USER_ID"]]
test_user_affinity = np.reshape(
np.array(
_rearrange_to_test(
model.user_affinity, None, items, None, model.item_map_dict
model.user_affinity, None, items, None, model.item2index
)[user_index,].todense()
),
-1,
@ -205,7 +157,6 @@ def test_user_affinity(demo_usage_data, sar_settings, header):
)
# Tests 8-10
@pytest.mark.parametrize(
"threshold,similarity_type,file",
[(3, "cooccurrence", "count"), (3, "jaccard", "jac"), (3, "lift", "lift")],
@ -223,7 +174,6 @@ def test_userpred(
threshold=threshold,
**header
)
_apply_sar_hash_index(model, demo_usage_data, None, header)
model.fit(demo_usage_data)
true_items, true_scores = load_userpred(


@ -19,6 +19,7 @@ try:
except ImportError:
pass # skip this import if we are in pure python environment
TOL = 0.0001
@ -93,7 +94,6 @@ def test_init_spark(spark):
def test_init_spark_rating_eval(spark_data):
df_true, df_pred = spark_data
evaluator = SparkRatingEvaluation(df_true, df_pred)
assert evaluator is not None
@ -230,7 +230,6 @@ def test_spark_map(spark_data, target_metrics):
@pytest.mark.spark
def test_spark_python_match(python_data, spark):
# Test on the original data with k = 10.
df_true, df_pred = python_data
dfs_true = spark.createDataFrame(df_true)
@ -247,11 +246,9 @@ def test_spark_python_match(python_data, spark):
== pytest.approx(eval_spark1.ndcg_at_k(), TOL),
map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark1.map_at_k(), TOL),
]
assert all(match1)
# Test on the original data with k = 3.
dfs_true = spark.createDataFrame(df_true)
dfs_pred = spark.createDataFrame(df_pred)
@ -265,7 +262,6 @@ def test_spark_python_match(python_data, spark):
ndcg_at_k(df_true, df_pred, k=3) == pytest.approx(eval_spark2.ndcg_at_k(), TOL),
map_at_k(df_true, df_pred, k=3) == pytest.approx(eval_spark2.map_at_k(), TOL),
]
assert all(match2)
# Remove the first row from the original data.
@ -285,11 +281,9 @@ def test_spark_python_match(python_data, spark):
== pytest.approx(eval_spark3.ndcg_at_k(), TOL),
map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark3.map_at_k(), TOL),
]
assert all(match3)
# Test with one user
df_pred = df_pred.loc[df_pred["userID"] == 3]
df_true = df_true.loc[df_true["userID"] == 3]
@ -307,5 +301,4 @@ def test_spark_python_match(python_data, spark):
== pytest.approx(eval_spark4.ndcg_at_k(), TOL),
map_at_k(df_true, df_pred, k=10) == pytest.approx(eval_spark4.map_at_k(), TOL),
]
assert all(match4)


@ -3,7 +3,6 @@
import pandas as pd
import numpy as np
from itertools import product
import pytest
from reco_utils.dataset.split_utils import min_rating_filter_spark
from reco_utils.common.constants import (
@ -14,6 +13,8 @@ from reco_utils.common.constants import (
)
try:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from reco_utils.common.spark_utils import start_or_get_spark
from reco_utils.dataset.spark_splitters import (
spark_chrono_split,
@ -44,9 +45,7 @@ def python_data(test_specs):
def random_date_generator(start_date, range_in_days):
"""Helper function to generate random timestamps.
Reference: https://stackoverflow.com/questions/41006182/generate-random-dates-within-a
-range-in-numpy
Reference: https://stackoverflow.com/questions/41006182/generate-random-dates-within-a-range-in-numpy
"""
days_to_add = np.arange(0, range_in_days)
random_dates = []
@ -58,13 +57,13 @@ def python_data(test_specs):
rating = pd.DataFrame(
{
DEFAULT_USER_COL: np.random.random_integers(
DEFAULT_USER_COL: np.random.randint(
1, 5, test_specs["number_of_rows"]
),
DEFAULT_ITEM_COL: np.random.random_integers(
DEFAULT_ITEM_COL: np.random.randint(
1, 15, test_specs["number_of_rows"]
),
DEFAULT_RATING_COL: np.random.random_integers(
DEFAULT_RATING_COL: np.random.randint(
1, 5, test_specs["number_of_rows"]
),
DEFAULT_TIMESTAMP_COL: random_date_generator(
@ -78,22 +77,14 @@ def python_data(test_specs):
@pytest.fixture(scope="module")
def spark_dataset(python_data):
"""Get Python labels"""
rating = python_data
spark = start_or_get_spark("SplitterTesting")
df_rating = spark.createDataFrame(rating)
return df_rating
return spark.createDataFrame(python_data)
@pytest.mark.spark
def test_min_rating_filter(spark_dataset):
"""Test min rating filter
"""
dfs_rating = spark_dataset
dfs_user = min_rating_filter_spark(dfs_rating, min_rating=5, filter_by="user")
dfs_item = min_rating_filter_spark(dfs_rating, min_rating=5, filter_by="item")
dfs_user = min_rating_filter_spark(spark_dataset, min_rating=5, filter_by="user")
dfs_item = min_rating_filter_spark(spark_dataset, min_rating=5, filter_by="item")
user_rating_counts = [
x["count"] >= 5 for x in dfs_user.groupBy(DEFAULT_USER_COL).count().collect()
@ -115,10 +106,8 @@ def test_random_splitter(test_specs, spark_dataset):
the testing data. An approximate match with a certain level of tolerance is therefore used
instead for tests.
"""
df_rating = spark_dataset
splits = spark_random_split(
df_rating, ratio=test_specs["ratio"], seed=test_specs["seed"]
spark_dataset, ratio=test_specs["ratio"], seed=test_specs["seed"]
)
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
@ -129,7 +118,7 @@ def test_random_splitter(test_specs, spark_dataset):
)
splits = spark_random_split(
df_rating, ratio=test_specs["ratios"], seed=test_specs["seed"]
spark_dataset, ratio=test_specs["ratios"], seed=test_specs["seed"]
)
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
@ -145,11 +134,8 @@ def test_random_splitter(test_specs, spark_dataset):
@pytest.mark.spark
def test_chrono_splitter(test_specs, spark_dataset):
"""Test chronological splitter for Spark dataframes"""
dfs_rating = spark_dataset
splits = spark_chrono_split(
dfs_rating, ratio=test_specs["ratio"], filter_by="user", min_rating=10
spark_dataset, ratio=test_specs["ratio"], filter_by="user", min_rating=10
)
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
@ -169,18 +155,9 @@ def test_chrono_splitter(test_specs, spark_dataset):
assert set(users_train) == set(users_test)
# Test all time stamps in test are later than that in train for all users.
all_later = []
for user in test_specs["user_ids"]:
dfs_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
dfs_test = splits[1][splits[1][DEFAULT_USER_COL] == user]
assert _if_later(splits[0], splits[1])
user_later = _if_later(dfs_train, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)
all_later.append(user_later)
assert all(all_later)
splits = spark_chrono_split(dfs_rating, ratio=test_specs["ratios"])
splits = spark_chrono_split(spark_dataset, ratio=test_specs["ratios"])
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
test_specs["ratios"][0], test_specs["tolerance"]
@ -192,28 +169,14 @@ def test_chrono_splitter(test_specs, spark_dataset):
test_specs["ratios"][2], test_specs["tolerance"]
)
# Test if timestamps are correctly split. This is for multi-split case.
all_later = []
for user in test_specs["user_ids"]:
dfs_train = splits[0][splits[0][DEFAULT_USER_COL] == user]
dfs_valid = splits[1][splits[1][DEFAULT_USER_COL] == user]
dfs_test = splits[2][splits[2][DEFAULT_USER_COL] == user]
user_later_1 = _if_later(dfs_train, dfs_valid, col_timestamp=DEFAULT_TIMESTAMP_COL)
user_later_2 = _if_later(dfs_valid, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)
all_later.append(user_later_1)
all_later.append(user_later_2)
assert all(all_later)
assert _if_later(splits[0], splits[1])
assert _if_later(splits[1], splits[2])
@pytest.mark.spark
def test_stratified_splitter(test_specs, spark_dataset):
"""Test stratified splitter for Spark dataframes"""
dfs_rating = spark_dataset
splits = spark_stratified_split(
dfs_rating, ratio=test_specs["ratio"], filter_by="user", min_rating=10
spark_dataset, ratio=test_specs["ratio"], filter_by="user", min_rating=10
)
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
@ -233,7 +196,7 @@ def test_stratified_splitter(test_specs, spark_dataset):
assert set(users_train) == set(users_test)
splits = spark_stratified_split(dfs_rating, ratio=test_specs["ratios"])
splits = spark_stratified_split(spark_dataset, ratio=test_specs["ratios"])
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
test_specs["ratios"][0], test_specs["tolerance"]
@ -248,11 +211,7 @@ def test_stratified_splitter(test_specs, spark_dataset):
@pytest.mark.spark
def test_timestamp_splitter(test_specs, spark_dataset):
"""Test timestamp splitter for Spark dataframes"""
from pyspark.sql.functions import col
dfs_rating = spark_dataset
dfs_rating = dfs_rating.withColumn(DEFAULT_TIMESTAMP_COL, col(DEFAULT_TIMESTAMP_COL).cast("float"))
dfs_rating = spark_dataset.withColumn(DEFAULT_TIMESTAMP_COL, col(DEFAULT_TIMESTAMP_COL).cast("float"))
splits = spark_timestamp_split(
dfs_rating, ratio=test_specs["ratio"], col_timestamp=DEFAULT_TIMESTAMP_COL
@ -265,8 +224,12 @@ def test_timestamp_splitter(test_specs, spark_dataset):
1 - test_specs["ratio"], test_specs["tolerance"]
)
max_split0 = splits[0].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
min_split1 = splits[1].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
assert(max_split0 <= min_split1)
# Test multi split
splits = spark_stratified_split(dfs_rating, ratio=test_specs["ratios"])
splits = spark_timestamp_split(dfs_rating, ratio=test_specs["ratios"])
assert splits[0].count() / test_specs["number_of_rows"] == pytest.approx(
test_specs["ratios"][0], test_specs["tolerance"]
@ -278,37 +241,34 @@ def test_timestamp_splitter(test_specs, spark_dataset):
test_specs["ratios"][2], test_specs["tolerance"]
)
dfs_train = splits[0]
dfs_valid = splits[1]
dfs_test = splits[2]
max_split0 = splits[0].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
min_split1 = splits[1].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
assert(max_split0 <= min_split1)
# if valid is later than train.
all_later_1 = _if_later(dfs_train, dfs_valid, col_timestamp=DEFAULT_TIMESTAMP_COL)
assert all_later_1
# if test is later than valid.
all_later_2 = _if_later(dfs_valid, dfs_test, col_timestamp=DEFAULT_TIMESTAMP_COL)
assert all_later_2
max_split1 = splits[1].agg(F.max(DEFAULT_TIMESTAMP_COL)).first()[0]
min_split2 = splits[2].agg(F.min(DEFAULT_TIMESTAMP_COL)).first()[0]
assert(max_split1 <= min_split2)
def _if_later(data1, data2, col_timestamp=DEFAULT_TIMESTAMP_COL):
'''Helper function to test if records in data1 are later than that in data2.
def _if_later(data1, data2):
"""Helper function to test if records in data1 are earlier than that in data2.
Returns:
bool: True or False indicating if data1 is earlier than data2.
"""
x = (data1.select(DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL)
.groupBy(DEFAULT_USER_COL)
.agg(F.max(DEFAULT_TIMESTAMP_COL).cast('long').alias('max'))
.collect())
max_times = {row[DEFAULT_USER_COL]: row['max'] for row in x}
Return:
True or False indicating if data1 is later than data2.
'''
p = product(
[
x[col_timestamp]
for x in data1.select(col_timestamp).collect()
],
[
x[col_timestamp]
for x in data2.select(col_timestamp).collect()
],
)
y = (data2.select(DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL)
.groupBy(DEFAULT_USER_COL)
.agg(F.min(DEFAULT_TIMESTAMP_COL).cast('long').alias('min'))
.collect())
min_times = {row[DEFAULT_USER_COL]: row['min'] for row in y}
if_late = [a <= b for (a, b) in p]
return if_late
result = True
for user, max_time in max_times.items():
result = result and min_times[user] >= max_time
return result
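# A minimal usage sketch for the helper above: inside a test, every ordered pair of
# splits produced by the chrono (or timestamp) splitter should satisfy _if_later, which
# is exactly what the assertions earlier in this module check.
splits = spark_chrono_split(spark_dataset, ratio=test_specs["ratios"])
assert _if_later(splits[0], splits[1])
assert _if_later(splits[1], splits[2])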


@ -4,10 +4,8 @@
import pandas as pd
import numpy as np
import pytest
from sklearn.utils import shuffle
from reco_utils.dataset.sparse import AffinityMatrix
from reco_utils.common.constants import (
DEFAULT_USER_COL,
DEFAULT_ITEM_COL,
@ -18,7 +16,6 @@ from reco_utils.common.constants import (
@pytest.fixture(scope="module")
def test_specs():
return {"number_of_items": 50, "number_of_users": 20, "seed": 123}
@ -40,7 +37,6 @@ def python_dataset(test_specs):
random_dates = []
for i in range(range_in_days):
random_date = np.datetime64(start_date) + np.random.choice(days_to_add)
random_dates.append(random_date)
@ -87,10 +83,6 @@ def python_dataset(test_specs):
def test_df_to_sparse(test_specs, python_dataset):
# generate a synthetic dataset
df_rating = python_dataset
# initialize the splitter
header = {
"col_user": DEFAULT_USER_COL,
@ -99,25 +91,18 @@ def test_df_to_sparse(test_specs, python_dataset):
}
# instantiate the affinity matrix
am = AffinityMatrix(DF=df_rating, **header)
am = AffinityMatrix(DF=python_dataset, **header)
# obtain the sparse matrix representation of the input dataframe
X = am.gen_affinity_matrix()
# Tests
# check that the generated matrix has the correct dimensions
assert (X.shape[0] == df_rating.userID.unique().shape[0]) & (
X.shape[1] == df_rating.itemID.unique().shape[0]
assert (X.shape[0] == python_dataset.userID.unique().shape[0]) & (
X.shape[1] == python_dataset.itemID.unique().shape[0]
)
# Test inverse mapping: from sparse matrix to dataframe
def test_sparse_to_df(test_specs, python_dataset):
df_rating = python_dataset
# initialize the splitter
header = {
"col_user": DEFAULT_USER_COL,
@ -126,7 +111,7 @@ def test_sparse_to_df(test_specs, python_dataset):
}
# instantiate the affinity matrix
am = AffinityMatrix(DF=df_rating, **header)
am = AffinityMatrix(DF=python_dataset, **header)
# generate the sparse matrix representation
X = am.gen_affinity_matrix()
@ -137,15 +122,15 @@ def test_sparse_to_df(test_specs, python_dataset):
# tests: check that the two dataframes have the same elements in the same positions.
assert (
DF.userID.values.all()
== df_rating.sort_values(by=["userID"]).userID.values.all()
== python_dataset.sort_values(by=["userID"]).userID.values.all()
)
assert (
DF.itemID.values.all()
== df_rating.sort_values(by=["userID"]).itemID.values.all()
== python_dataset.sort_values(by=["userID"]).itemID.values.all()
)
assert (
DF.rating.values.all()
== df_rating.sort_values(by=["userID"]).rating.values.all()
== python_dataset.sort_values(by=["userID"]).rating.values.all()
)
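# A minimal sketch of the round trip these two tests cover (header keys mirror the ones
# defined in the tests above): AffinityMatrix turns a long-format ratings dataframe into
# a user-by-item affinity matrix, and the inverse mapping exercised by test_sparse_to_df
# recovers the same user/item/rating triples.
header = {
    "col_user": DEFAULT_USER_COL,
    "col_item": DEFAULT_ITEM_COL,
    "col_rating": DEFAULT_RATING_COL,
}
am = AffinityMatrix(DF=python_dataset, **header)
X = am.gen_affinity_matrix()
# X.shape == (number of unique users, number of unique items)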


@ -3,6 +3,7 @@ import pytest
from reco_utils.evaluation.parameter_sweep import generate_param_grid
@pytest.fixture(scope="module")
def parameter_dictionary():
params = {
@ -13,6 +14,7 @@ def parameter_dictionary():
return params
def test_param_sweep(parameter_dictionary):
params_grid = generate_param_grid(parameter_dictionary)
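# generate_param_grid presumably expands a dict of candidate values into the list of all
# hyperparameter combinations (the cross product). A sketch of that expected behaviour
# with hypothetical parameter names, assuming cross-product semantics:
candidates = {"rank": [10, 20], "regularization": [0.01, 0.1]}
grid = generate_param_grid(candidates)
# expected: four combinations such as {"rank": 10, "regularization": 0.01}, ...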