.. _dataset:

Dataset module
##############

Recommendation datasets and related utilities

Recommendation datasets
***********************

Amazon Reviews
==============

`Amazon Reviews dataset <https://snap.stanford.edu/data/web-Amazon.html>`_ consists of reviews from Amazon.
The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user
information, ratings, and a plaintext review.

:Citation:

    J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text",
    RecSys, 2013.

.. automodule:: recommenders.datasets.amazon_reviews
    :members:

CORD-19
=======

`COVID-19 Open Research Dataset (CORD-19) <https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/>`_ is a full-text
and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized
for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups
to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of
over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the
coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing
to generate new insights in support of the fight against this infectious disease.

:Citation:

    Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D.,
    Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. "Cord-19: The COVID-19 Open Research Dataset.", 2020.

.. automodule:: recommenders.datasets.covid_utils
    :members:

Criteo
======

The `Criteo dataset <https://www.kaggle.com/c/criteo-display-ad-challenge/overview>`_, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback
for millions of display ads. Every ad has 40 attributes. The first attribute is the label, where a value of 1 means
that the ad was clicked on and a value of 0 means that it was not. The remaining 39 attributes consist of 13 integer columns and
26 categorical columns.

.. automodule:: recommenders.datasets.criteo
    :members:

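
The 40-attribute layout described for Criteo above can be illustrated with a short, library-independent sketch.
It assumes that each record is one tab-separated line and that a missing value appears as an empty field.
The column names (``label``, ``int00`` to ``int12``, ``cat00`` to ``cat25``) and the ``parse_record`` helper are
illustrative choices made for this example, and the sample record is made up.

.. code-block:: python

    # Illustrative column names (an assumption made for this sketch):
    # 1 label + 13 integer columns + 26 categorical columns = 40 attributes.
    HEADER = (
        ["label"]
        + [f"int{i:02d}" for i in range(13)]
        + [f"cat{i:02d}" for i in range(26)]
    )


    def parse_record(line):
        """Split one tab-separated record into a dict keyed by column name."""
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(fields)}")
        record = {}
        for name, value in zip(HEADER, fields):
            if name == "label" or name.startswith("int"):
                # The label and the integer features are parsed as int;
                # an empty field is treated as a missing value.
                record[name] = int(value) if value else None
            else:
                # Categorical features are kept as opaque strings.
                record[name] = value or None
        return record


    # A made-up record: a clicked ad (label 1), integer features 0..12,
    # and 26 hexadecimal-looking categorical values.
    sample = "\t".join(
        ["1"] + [str(i) for i in range(13)] + [f"{i:08x}" for i in range(26)]
    )
    row = parse_record(sample)
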
MIND
====

The `MIcrosoft News Dataset (MIND) <https://msnews.github.io/>`_ is a large-scale dataset for news recommendation research. It was collected from
anonymized behavior logs of the Microsoft News website.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.
Every news article contains rich textual content, including title, abstract, body, category and entities.
Each impression log contains the click events, the non-clicked events and the user's historical news click behaviors before
that impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.

:Citation:

    Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu
    and Ming Zhou, "MIND: A Large-scale Dataset for News Recommendation", ACL, 2020.

.. automodule:: recommenders.datasets.mind
    :members:

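
To make the structure of an impression log described above more concrete, here is a minimal sketch that splits one
log line into the user's click history, the clicked news and the non-clicked news. The layout is an assumption
made for this example: five tab-separated fields (impression ID, user ID, time, space-separated history,
space-separated candidates), with each candidate written as ``<news_id>-<label>`` where label ``1`` means clicked.
The ``Impression`` type, the ``parse_impression`` helper and the sample line are hypothetical.

.. code-block:: python

    from typing import NamedTuple


    class Impression(NamedTuple):
        impression_id: str
        user_id: str
        time: str
        history: list      # news IDs the user clicked before this impression
        clicked: list      # news IDs clicked in this impression
        non_clicked: list  # news IDs shown but not clicked


    def parse_impression(line):
        """Parse one tab-separated impression log line (assumed layout)."""
        impression_id, user_id, time, history, candidates = (
            line.rstrip("\n").split("\t")
        )
        clicked, non_clicked = [], []
        for candidate in candidates.split():
            # Each candidate is assumed to look like "<news_id>-<label>".
            news_id, label = candidate.rsplit("-", 1)
            (clicked if label == "1" else non_clicked).append(news_id)
        return Impression(
            impression_id, user_id, time, history.split(), clicked, non_clicked
        )


    # A made-up log line with anonymized-looking IDs.
    sample = "1\tU100\t11/11/2019 9:05:58 AM\tN1 N2 N3\tN4-1 N5-0 N6-0"
    log = parse_impression(sample)
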
MovieLens
=========

The `MovieLens datasets <https://grouplens.org/datasets/movielens/>`_, first released in 1998,
describe people's expressed preferences
for movies. These preferences take the form of ``<user, item, rating, timestamp>`` tuples,
each the result of a person expressing a preference (a 0-5 star rating) for a movie
at a particular time.

The dataset comes in several sizes:

* MovieLens 100k: 100,000 ratings from 1,000 users on 1,700 movies.
* MovieLens 1M: 1 million ratings from 6,000 users on 4,000 movies.
* MovieLens 10M: 10 million ratings from 72,000 users on 10,000 movies.
* MovieLens 20M: 20 million ratings from 138,000 users on 27,000 movies.

:Citation:

    F. M. Harper and J. A. Konstan. "The MovieLens Datasets: History and Context".
    ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19,
    DOI=http://dx.doi.org/10.1145/2827872, 2015.

.. automodule:: recommenders.datasets.movielens
    :members:

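
The ``<user, item, rating, timestamp>`` tuples described above map naturally onto a small record type. The sketch
below is only an illustration: it assumes tab-separated lines with integer user and item IDs and a Unix timestamp
in seconds, which may not match every MovieLens size. The ``Rating`` type, the ``parse_rating`` helper and the
sample lines are hypothetical. It parses a few ratings and computes the mean rating per user.

.. code-block:: python

    from collections import defaultdict, namedtuple
    from datetime import datetime, timezone

    Rating = namedtuple("Rating", ["user", "item", "rating", "timestamp"])


    def parse_rating(line, sep="\t"):
        """Turn one '<user><sep><item><sep><rating><sep><timestamp>' line into a Rating."""
        user, item, rating, timestamp = line.strip().split(sep)
        return Rating(
            int(user),
            int(item),
            float(rating),
            # The timestamp is assumed to be seconds since the Unix epoch (UTC).
            datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
        )


    # Made-up sample lines in the assumed format.
    lines = [
        "196\t242\t3\t881250949",
        "186\t302\t3\t891717742",
        "196\t51\t5\t881250949",
    ]
    ratings = [parse_rating(line) for line in lines]

    # Group the ratings by user and average them.
    ratings_by_user = defaultdict(list)
    for r in ratings:
        ratings_by_user[r.user].append(r.rating)
    mean_by_user = {
        user: sum(values) / len(values) for user, values in ratings_by_user.items()
    }
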
Download utilities
******************

.. automodule:: recommenders.datasets.download_utils
    :members:

Pandas dataframe utilities
**************************

.. automodule:: recommenders.datasets.pandas_df_utils
    :members:

Splitter utilities
******************

Python splitters
================

.. automodule:: recommenders.datasets.python_splitters
    :members:

PySpark splitters
=================

.. automodule:: recommenders.datasets.spark_splitters
    :members:

Other splitter utilities
========================

.. automodule:: recommenders.datasets.split_utils
    :members:

Sparse utilities
****************

.. automodule:: recommenders.datasets.sparse
    :members:

Knowledge graph utilities
*************************

.. automodule:: recommenders.datasets.wikidata
    :members: