Data acquisition for JGR
1. CNN/Dailymail
We use the non-anonymized version of CNN/Dailymail. First download the url_lists for the train/dev/test sets, then download and unzip the stories directories from here for both CNN and Daily Mail. Put the unzipped directories in /data/cnndm/raw_data.
Then run the following commands:
cd cnndm
python generate_data.py
These commands generate train/dev/test_data.json, which contain the training/dev/test samples of CNN/Dailymail.
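Before moving on, it can help to confirm that all three split files were actually written. The following is a minimal sketch, not part of this repository: it assumes the outputs are named train_data.json, dev_data.json and test_data.json, and the directory passed in is a hypothetical location that should be changed to wherever generate_data.py writes its output. It checks only that each file exists and is non-empty, and makes no assumption about the JSON layout inside.

```python
from pathlib import Path

# Assumed file names, read from "train/dev/test_data.json" in the text above.
SPLIT_FILES = ("train_data.json", "dev_data.json", "test_data.json")


def missing_splits(data_dir):
    """Return the split files that are absent or empty under data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for name in SPLIT_FILES:
        path = data_dir / name
        # A zero-byte file usually means generation was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever generate_data.py writes its output.
    problems = missing_splits("cnndm")
    print("all splits present" if not problems else f"missing or empty: {problems}")
```

The same check can be reused for the other datasets below by passing a different directory.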
2. SAMSum
First download and unzip the SAMSum data files from here, then put them in /data/samsum/raw_data.
Then run:
cd samsum
python generate_data.py
These commands generate train/dev/test_data.json, which contain the training/dev/test samples of SAMSum.
3. Squadqg & Personachat
We use the preprocessed versions of squadqg and personachat from GLGE. First download the training/dev sets of squadqg/personachat from here and the test sets from here. Then put the org_data directory in the corresponding folder and run:
python prepro.py --dataset_name squadqg # squadqg
python prepro.py --dataset_name personachat # personachat
These commands generate train/dev/test_data.json, which contain the training/dev/test samples of Squadqg/Personachat.
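If you preprocess both datasets regularly, the two prepro.py calls above can be wrapped in a small driver. This is only a sketch under stated assumptions: it must be run from the directory that contains prepro.py, it uses only the --dataset_name flag shown above, and the helper names prepro_command and run_all are hypothetical, not part of this repository.

```python
import subprocess
import sys

# The two dataset names accepted by --dataset_name in the commands above.
DATASETS = ("squadqg", "personachat")


def prepro_command(dataset_name):
    """Build the prepro.py invocation shown above for one dataset."""
    # sys.executable reuses the interpreter that is running this script.
    return [sys.executable, "prepro.py", "--dataset_name", dataset_name]


def run_all(datasets=DATASETS):
    """Run prepro.py once per dataset, in order."""
    for name in datasets:
        # check=True raises CalledProcessError at the first failing dataset
        # instead of continuing silently.
        subprocess.run(prepro_command(name), check=True)


if __name__ == "__main__":
    run_all()
```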