Scott Graham 2019-01-02 09:30:24 -05:00
Parent 086ee4c002
Commit d03148092c
1 changed file with 9 additions and 6 deletions


## Sub-Modules
### [Common](./common)
This submodule contains high-level utilities for defining constants used in most algorithms, as well as helper functions for managing aspects of different frameworks: GPU, Spark, and Jupyter notebooks.
### [Dataset](./dataset)
Dataset includes helper functions for interacting with Azure Cosmos DB, for pulling different sizes of the MovieLens dataset and formatting them appropriately, as well as utilities for splitting data into training and test sets.
#### Data Loading
The movielens module lets you load the MovieLens dataset as a pandas or Spark DataFrame, in sizes of 100k, 1M, 10M, or 20M ratings, to test algorithms and evaluate performance benchmarks.
```python
from reco_utils.dataset import movielens

# Load the 100k MovieLens ratings as a pandas DataFrame.
df = movielens.load_pandas_df(size="100k")
```
#### Splitting Techniques
Currently three methods are available for splitting datasets. All of them support splitting by user or item and filtering out minimal samples (for instance, users that have not rated enough items, or items that have not been rated by enough users).
- Chronological: this uses provided timestamps to order the data and selects a cut-off time such that the desired ratio of data falls into the training set before that time and into the test set after it
- Stratified: this is similar to random sampling, but the splits are stratified; for example, if the dataset is split by user, the splitting approach attempts to keep the same set of items in both the training and test splits. The converse holds when splitting by item.
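The chronological strategy can be sketched with plain pandas. This is an illustrative sketch, not the library's implementation (the actual splitters live in `reco_utils.dataset`); column names here are assumptions:

```python
import pandas as pd

def chrono_split(df, ratio=0.75, col_user="userID", col_timestamp="timestamp"):
    """Per-user chronological split: the earliest `ratio` fraction of each
    user's interactions goes to train, the remainder to test."""
    df = df.sort_values([col_user, col_timestamp])
    # Rank each user's interactions by time and normalize to (0, 1].
    rank = df.groupby(col_user).cumcount() + 1
    size = df.groupby(col_user)[col_timestamp].transform("size")
    train_mask = rank / size <= ratio
    return df[train_mask], df[~train_mask]

ratings = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 2, 2],
    "itemID":    [10, 11, 12, 13, 10, 12],
    "rating":    [4, 3, 5, 2, 1, 5],
    "timestamp": [100, 200, 300, 400, 150, 250],
})
train, test = chrono_split(ratings, ratio=0.75)
```

Each user's latest interactions end up in the test set, which mimics the real-world setting of predicting future behavior from past behavior.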
### [Evaluation](./evaluation)
The evaluation submodule includes functionality for performing hyperparameter sweeps as well as for calculating common recommender metrics directly in Python or in a Spark environment using PySpark.
Currently available metrics include:
- Root Mean Squared Error
- Mean Absolute Error
- Normalized Discounted Cumulative Gain at K
- Mean Average Precision at K
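The rating metrics at the top of the list are straightforward to compute. As a minimal sketch of the first two (the library's own implementations take DataFrames and column-name arguments rather than raw arrays):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

truth = [4.0, 3.0, 5.0, 1.0]
preds = [3.5, 3.0, 4.0, 2.0]
```

RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported together.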
### [Recommender](./recommender)
The recommender submodule contains implementations of various algorithms that can be used, in addition to external packages, to evaluate and develop new recommender system approaches.
Currently the Simple Algorithm for Recommendation (SAR) is implemented in Python for running on a single node.
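At its core, SAR scores items by combining a user's interaction history with an item-to-item co-occurrence similarity matrix. The full implementation adds refinements such as time-decayed affinities and similarity normalization; the following is only a minimal sketch of the co-occurrence idea, not the library's SAR class:

```python
import numpy as np

# Binary user-item interaction matrix (rows: users, cols: items).
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Item-item co-occurrence: how often two items co-appear in a user's history.
cooccurrence = interactions.T @ interactions

# Score each item for each user by summing similarities to seen items.
scores = interactions @ cooccurrence

# Mask already-seen items so recommendations are novel, then rank.
scores[interactions > 0] = -np.inf
top_item_per_user = scores.argmax(axis=1)
```

Because scoring is just two matrix products over the interaction history, this approach parallelizes well on a single node and needs no gradient-based training.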