.. _dataset:

Dataset module
##############

Recommendation datasets and related utilities

Recommendation datasets
***********************

Amazon Reviews
==============
`Amazon Reviews dataset <https://snap.stanford.edu/data/web-Amazon.html>`_ consists of reviews from Amazon.
The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user
information, ratings, and a plaintext review.

:Citation:

    J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text",
    RecSys, 2013.

.. automodule:: recommenders.datasets.amazon_reviews
    :members:

CORD-19
=======
`COVID-19 Open Research Dataset (CORD-19) <https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/>`_ is a full-text
and metadata dataset of COVID-19 and coronavirus-related scholarly articles, optimized
for machine readability and made available to the global research community.
In response to the COVID-19 pandemic, the Allen Institute for AI partnered with leading research groups
to prepare and distribute CORD-19, a free resource of
over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the
coronavirus family of viruses.
The dataset is intended to mobilize researchers to apply recent advances in natural language processing
to generate new insights in support of the fight against this infectious disease.

:Citation:

    Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D.,
    Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. "Cord-19: The COVID-19 Open Research Dataset.", 2020.

.. automodule:: recommenders.datasets.covid_utils
    :members:

Criteo
======
`Criteo dataset <https://www.kaggle.com/c/criteo-display-ad-challenge/overview>`_, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback
for millions of display ads. Every ad has 40 attributes. The first attribute is the label, where a value of 1 means
the ad was clicked and 0 means it was not. The remaining attributes consist of 13 integer columns and
26 categorical columns.
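
The column layout described above can be sketched in plain Python. This is a minimal illustration only: the column names (``label``, ``intNN``, ``catNN``), the tab separator, and the ``parse_record`` helper are assumptions made for the example and are not taken from the dataset description.

```python
# Column layout from the description: 1 label + 13 integer + 26 categorical = 40.
# The names below are illustrative assumptions, not part of the dataset itself.
LABEL = ["label"]
INT_COLS = [f"int{i:02d}" for i in range(13)]
CAT_COLS = [f"cat{i:02d}" for i in range(26)]
HEADER = LABEL + INT_COLS + CAT_COLS


def parse_record(line, sep="\t"):
    """Split one raw record into a dict keyed by HEADER.

    Hypothetical helper: assumes `sep`-separated fields, with an empty
    string standing in for a missing value.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(HEADER):
        raise ValueError(f"expected {len(HEADER)} fields, got {len(fields)}")
    record = dict(zip(HEADER, fields))
    record["label"] = int(record["label"])
    for col in INT_COLS:
        # Missing integer features become None instead of failing on int("").
        record[col] = int(record[col]) if record[col] else None
    return record


# Synthetic record: a clicked ad, integer features 0..12, placeholder categories.
sample = "\t".join(["1"] + [str(i) for i in range(13)] + ["abc"] * 26)
row = parse_record(sample)
```

Keeping the three column groups as separate lists makes it easy to apply different preprocessing to the integer and the categorical features later on.
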

.. automodule:: recommenders.datasets.criteo
    :members:

MIND
====
`MIcrosoft News Dataset (MIND) <https://msnews.github.io/>`_ is a large-scale dataset for news recommendation research. It was collected from
anonymized behavior logs of the Microsoft News website.
MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.
Every news article contains rich textual content including title, abstract, body, category and entities.
Each impression log contains the click events, non-clicked events, and historical news click behaviors of the user before
that impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.
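
Separating the clicked from the non-clicked events of one impression can be sketched as follows. The encoding used here, space-separated ``<news_id>-<label>`` tokens with label ``1`` for a click and ``0`` for a non-click, is an assumption made for this example, and ``split_impressions`` is a hypothetical helper, not part of the library.

```python
def split_impressions(impressions):
    """Split an impression string into clicked and non-clicked news IDs.

    Assumed format: space-separated "<news_id>-<label>" tokens,
    where label "1" marks a click and "0" a non-click.
    """
    clicked, not_clicked = [], []
    for token in impressions.split():
        # rpartition splits on the last "-" so IDs containing "-" stay intact.
        news_id, _, label = token.rpartition("-")
        if label == "1":
            clicked.append(news_id)
        elif label == "0":
            not_clicked.append(news_id)
        else:
            raise ValueError(f"unexpected impression token: {token!r}")
    return clicked, not_clicked


# Synthetic impression with one click among three displayed articles.
clicked, not_clicked = split_impressions("N101-0 N102-1 N103-0")
```

The non-clicked list is what a news recommender would typically sample negatives from when training on such logs.
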

:Citation:

    Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu
    and Ming Zhou, "MIND: A Large-scale Dataset for News Recommendation", ACL, 2020.

.. automodule:: recommenders.datasets.mind
    :members:

MovieLens
=========
The `MovieLens datasets <https://grouplens.org/datasets/movielens/>`_, first released in 1998,
describe people's expressed preferences
for movies. These preferences take the form of `<user, item, rating, timestamp>` tuples,
each the result of a person expressing a preference (a 0-5 star rating) for a movie
at a particular time.

The datasets come in several sizes:

* MovieLens 100k: 100,000 ratings from 1,000 users on 1,700 movies.
* MovieLens 1M: 1 million ratings from 6,000 users on 4,000 movies.
* MovieLens 10M: 10 million ratings from 72,000 users on 10,000 movies.
* MovieLens 20M: 20 million ratings from 138,000 users on 27,000 movies.
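
The `<user, item, rating, timestamp>` tuple form described above can be modeled directly in Python. The sketch below is illustrative only: the tab-separated line layout, the interpretation of the timestamp as seconds since the Unix epoch, and the ``Rating`` and ``parse_rating`` names are assumptions made for the example, and the three records are synthetic.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    """One <user, item, rating, timestamp> tuple."""

    user: int
    item: int
    rating: float
    timestamp: int  # assumed: seconds since the Unix epoch


def parse_rating(line, sep="\t"):
    """Parse one `sep`-separated line into a Rating (hypothetical helper)."""
    user, item, rating, timestamp = line.strip().split(sep)
    return Rating(int(user), int(item), float(rating), int(timestamp))


# Synthetic records, not taken from the real dataset.
lines = ["1\t10\t4\t0", "1\t20\t3\t86400", "2\t10\t5\t172800"]
ratings = [parse_rating(line) for line in lines]

# Mean rating per item: accumulate (sum, count) pairs, then divide.
totals = {}
for r in ratings:
    running_sum, count = totals.get(r.item, (0.0, 0))
    totals[r.item] = (running_sum + r.rating, count + 1)
mean_by_item = {item: s / n for item, (s, n) in totals.items()}

# Convert a raw timestamp into a timezone-aware datetime.
first_rated_at = datetime.fromtimestamp(ratings[0].timestamp, tz=timezone.utc)
```

Keeping the timestamp as a raw integer and converting only when needed keeps chronological sorting and splitting cheap.
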

:Citation:

    F. M. Harper and J. A. Konstan. "The MovieLens Datasets: History and Context".
    ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19,
    DOI=http://dx.doi.org/10.1145/2827872, 2015.

.. automodule:: recommenders.datasets.movielens
    :members:

Download utilities
******************

.. automodule:: recommenders.datasets.download_utils
    :members:

Pandas dataframe utilities
***************************

.. automodule:: recommenders.datasets.pandas_df_utils
    :members:

Splitter utilities
******************

Python splitters
================

.. automodule:: recommenders.datasets.python_splitters
    :members:

PySpark splitters
=================

.. automodule:: recommenders.datasets.spark_splitters
    :members:

Other splitter utilities
========================

.. automodule:: recommenders.datasets.split_utils
    :members:

Sparse utilities
****************

.. automodule:: recommenders.datasets.sparse
    :members:

Knowledge graph utilities
*************************

.. automodule:: recommenders.datasets.wikidata
    :members: