Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference
Перейти к файлу
microsoft-github-policy-service[bot] c913e591e8
Auto merge mandatory file pr
This pr is auto merged as it contains a mandatory file and is opened for more than 10 days.
2022-11-28 19:08:28 +00:00
data SDR 2021-05-30 10:43:09 +03:00
instructions readme 2021-05-30 10:55:28 +03:00
models Tidy 2021-06-01 09:25:34 +03:00
utils Tidy 2021-06-01 09:25:34 +03:00
.gitignore SDR 2021-05-30 10:43:09 +03:00
README.md Update README.md 2021-06-23 11:15:36 +03:00
SECURITY.md Microsoft mandatory file 2022-05-18 22:04:05 +00:00
__init__.py Tidy 2021-06-01 09:25:34 +03:00
sdr_main.py Rename train_doc_sim.py to sdr_main.py 2021-06-23 11:14:24 +03:00

README.md

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repo is the implementation for SDR.

 

Tested environment

  • Python 3.7
  • PyTorch 1.7
  • CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

 

Contents

License, Security, support and code of conduct specifications are under the Instructions directory.  

Installation

Run

bash instructions/installation.sh 

 

Datasets

The published datasets are:

  • Video games
    • 21,935 articles
    • Expert annotated test set. 90 articles with 12 ground-truth recommendations.
    • Examples:
      • Grand Theft Auto - Mafia
      • Burnout Paradise - Forza Horizon 3
  • Wines
    • 1635 articles
    • Crafted by a human sommelier, 92 articles with ~10 ground-truth recommendations.
    • Examples:
      • Pinot Meunier - Chardonnay
      • Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.

 

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch-Lightning, all PL hyperparameters are supported. (limit_train/val/test_batches, check_val_every_n_epoch etc.)

Tensorboard support

All metrics are being logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tesnroboard --logdir=<path> to see the the logs.

 

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

 

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication -

SDR: Self-Supervised Document-to-Document Similarity Ranking viaContextualized Language Models and Hierarchical Inference

 @misc{ginzburg2021selfsupervised,
     title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference}, 
     author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
     year={2021},
     eprint={2106.01186},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.