ProphetNet/JGR/data
shenwzh3 99348f5a98 upload jgr 2023-05-23 15:21:21 +08:00

Data acquisition for JGR

1. CNN/Dailymail

We use the non-anonymized version of CNN/Dailymail. First download the url_lists of the train/dev/test sets, then download and unzip the stories directories from here for both CNN and Daily Mail. Then put the unzipped directories in /data/cnndm/raw_data.

Then run the following commands:

cd cnndm
python generate_data.py

These commands generate train/dev/test_data.json, which contain the training/dev/test samples of CNN/Dailymail.
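As a quick sanity check after this step, the sketch below reports which of the expected output files are missing. It is not part of the JGR code. It assumes that "train/dev/test_data.json" expands to train_data.json, dev_data.json, and test_data.json, and that generate_data.py writes them into the dataset directory. The helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# Assumption: "train/dev/test_data.json" expands to these three file names.
EXPECTED_FILES = ("train_data.json", "dev_data.json", "test_data.json")


def missing_outputs(data_dir):
    """Return the expected output files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the files land in the dataset directory itself ("cnndm");
    # adjust the path if the script writes them elsewhere.
    missing = missing_outputs("cnndm")
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        print("All expected output files are present.")
```

The same check can be pointed at any of the other dataset directories by changing the path passed to missing_outputs.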

2. SAMSum

First download and unzip the data files of SAMSum from here, then put them in /data/samsum/raw_data.

Then run:

cd samsum
python generate_data.py

These commands generate train/dev/test_data.json, which contain the training/dev/test samples of SAMSum.

3. Squadqg & Personachat

We use the preprocessed versions of squadqg and personachat from GLGE. First download the training/dev sets of squadqg/personachat from here and the test sets from here. Then put the org_data directory in the corresponding folder and run:

python prepro.py --dataset_name squadqg # squadqg
python prepro.py --dataset_name personachat # personachat

These commands generate train/dev/test_data.json, which contain the training/dev/test samples of Squadqg/Personachat.
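To confirm that a generated file is readable and non-empty, a minimal sketch like the following can count its samples. This README does not specify the on-disk layout, so the code assumes the file is either a single JSON array or JSON Lines (one object per line). The function name count_samples is hypothetical and not part of the JGR code.

```python
import json
import sys
from pathlib import Path


def count_samples(path):
    """Count samples in a file stored as a JSON array or as JSON Lines."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON value per non-empty line.
        count = 0
        for line in text.splitlines():
            if line.strip():
                json.loads(line)  # raises if a line is not valid JSON
                count += 1
        return count
    # Assumption: a single top-level object is treated as one sample.
    return len(data) if isinstance(data, list) else 1


if __name__ == "__main__":
    # Pass one or more generated files as command-line arguments.
    for file_path in sys.argv[1:]:
        print(file_path, count_samples(file_path))
```

If the script is saved under a name of your choice, passing the path of a generated train_data.json as an argument prints the number of samples found in it.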