Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Перейти к файлу

dvir e10ddb034e readme		2021-05-30 10:55:28 +03:00
data	SDR	2021-05-30 10:43:09 +03:00
instructions	readme	2021-05-30 10:55:28 +03:00
models	SDR	2021-05-30 10:43:09 +03:00
utils	SDR	2021-05-30 10:43:09 +03:00
.gitignore	SDR	2021-05-30 10:43:09 +03:00
README.md	readme	2021-05-30 10:55:28 +03:00
train_doc_sim.py	SDR	2021-05-30 10:43:09 +03:00

README.md

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repo is the implementation for SDR.

Tested environment

Python 3.7
PyTorch 1.7
CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

Installation
Datasets
Train with our datasets
Hierarchical Inference
Cite

License, Security, support and code of conduct specifications are under the Instructions directory.

Installation

Run

bash instructions/installation.sh

Datasets

The published datasets are:

Video games
- 21,935 articles
- Expert annotated test set. 90 articles with 12 ground-truth recommendations.
- Examples:
  - Grand Theft Auto - Mafia
  - Burnout Paradise - Forza Horizon 3
Wines
- 1635 articles
- Crafted by a human sommelier, 92 articles with ~10 ground-truth recommendations.
- Examples:
  - Pinot Meunier - Chardonnay
  - Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.

Training

The training process downloads the datasets automatically.

python train_doc_sim.py --dataset_name video_games

The code is based on PyTorch-Lightning, all PL hyperparameters are supported. (limit_train/val/test_batches, check_val_every_n_epoch etc.)

Tensorboard support

All metrics are being logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tesnroboard --logdir=<path> to see the the logs.

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

python train_doc_sim.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication -

SDR: Self-Supervised Document-to-Document Similarity Ranking viaContextualized Language Models and Hierarchical Inference

 Link will be uploaded soon!

Contact person: Dvir Ginzburg