Unicoder

This repo provides the code for reproducing the experiments in XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation.

Installation and Dependencies

This repo is based on Transformers. It has been tested with Python 3.6+ and PyTorch 1.4.0.

Installing this repo will replace any Transformers installation you already have, so we recommend installing it in a separate virtual environment or conda environment.
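
For example, a minimal conda setup might look like this (the environment name unicoder is arbitrary):

conda create -n unicoder python=3.6
conda activate unicoder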

You can then install this repo with pip:

git clone https://github.com/microsoft/Unicoder
cd Unicoder/understanding
pip install . 

You can speed up all experiments by installing apex and passing the --fp16 flag.
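
For reference, installing apex from source typically looks roughly like the following; this is only a sketch, and the apex README should be checked for the currently recommended build options:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./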

Pre-trained model

The pre-trained model used in the paper is available here.

Fine-tuning experiments

XGLUE dataset

You can download the XGLUE dataset from the XGLUE homepage.
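
Based on the --data_dir paths used in the commands below, the extracted dataset is expected to contain one sub-folder per task, roughly like this (the exact file layout inside each folder is not shown here):

$DATA_DIR/
  XNLI/
  QADSM/
  QAM/
  PAWSX/
  NC/
  WPR/
  MLQA/
  NER/   (including labels.txt)
  POS/   (including a labels file)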

Fine-tuning

We used a single V100 GPU with 32GB memory to run all the experiments. If you are using a GPU with 16GB memory or less, you can decrease per_gpu_train_batch_size and increase gradient_accumulation_steps.
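
For example, for the XNLI command below you could replace --per_gpu_train_batch_size 32 with the following pair of flags, which keeps the effective batch size at 32 while reducing peak GPU memory (the values here are only illustrative):

--per_gpu_train_batch_size 8 --gradient_accumulation_steps 4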

In our experiments, we used FP16 to speed up training. Note that using FP32, or K80/M60 GPUs, may lead to slightly different performance.

You can run xglue.sh to reproduce all the experiments.

xglue.sh takes three parameters (the per-task commands below use the same variables; see the example after this list):

DATA_DIR: path to the downloaded XGLUE dataset
MODEL_DIR: path to the pre-trained model from the "Pre-trained model" section
OUTPUT_DIR: any output folder
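
For example (hypothetical local paths), you can export the three variables once before running any of the commands below:

export DATA_DIR=/path/to/XGLUE
export MODEL_DIR=/path/to/unicoder_model
export OUTPUT_DIR=/path/to/output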

XNLI:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/XNLI \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/XNLI \
--task_name xnli \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

QADSM:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/QADSM \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/QADSM \
--task_name ads \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

QAM:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/QAM \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/QAM \
--task_name qam \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

PAWSX:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/PAWSX \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/PAWSX \
--task_name pawsx \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

NC:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr,ru \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/NC \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/NC \
--task_name news \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

WPR:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr,it,pt,zh \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/WPR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/WPR \
--task_name rel \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

MLQA:

CUDA_VISIBLE_DEVICES=0 python examples/run_xmrc.py --model_type xlmr  \
--model_name_or_path $MODEL_DIR \
--do_train \
--do_eval \
--do_lower_case \
--language en,es,de,ar,hi,vi,zh \
--train_language en \
--data_dir $DATA_DIR/MLQA \
--per_gpu_train_batch_size 12 \
--per_gpu_eval_batch_size 128 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--save_steps 0 \
--logging_each_epoch \
--max_seq_length 384 \
--doc_stride 128  \
--output_dir $OUTPUT_DIR/MLQA \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training

NER:

python examples/ner/run_ner.py --data_dir $DATA_DIR/NER \
--model_type xlmroberta \
--labels $DATA_DIR/NER/labels.txt \
--model_name_or_path $MODEL_DIR \
--gpu_id 0 \
--output_dir $OUTPUT_DIR/NER \
--language de,en,es,nl \
--train_language en \
--max_seq_length 256 \
--num_train_epochs 20 \
--per_gpu_train_batch_size 32 \
--save_steps 1500 \
--seed 42 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache \
--logging_each_epoch \
--evaluate_during_training \
--learning_rate 5e-6 \
--task_name ner

POS:

python examples/ner/run_ner.py --data_dir $DATA_DIR/POS \
--model_type xlmroberta \
--labels $DATA_DIR/POS/labels \
--model_name_or_path $MODEL_DIR \
--gpu_id 0 \
--output_dir $OUTPUT_DIR/POS \
--language en,ar,bg,de,el,es,fr,hi,it,nl,pl,pt,ru,th,tr,ur,vi,zh \
--train_language en \
--max_seq_length 128 \
--num_train_epochs 20 \
--per_gpu_train_batch_size 32 \
--save_steps 1500 \
--seed 42 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache \
--logging_each_epoch \
--evaluate_during_training \
--learning_rate 2e-5 \
--task_name pos

The languages used for fine-tuning and testing don't need to be the same: you can fine-tune only on English and test on other languages. Based on our experiments, translating the English training data into other languages and fine-tuning on the translated data can further improve performance.

Notes and Acknowledgments

This code base is built on top of Transformers.

Revised Files

examples/run_xglue.py

examples/run_xglue_ft.py

examples/run_xmrc.py

examples/ner/run_ner.py

src/transformers/data/processors/xglue.py

src/transformers/data/metrics/__init__.py

Added Files

unicoder/xglue.sh

How to cite

If you extend or use this work, please cite our papers:

@inproceedings{huang2019unicoder,
  title={Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks},
  author={Huang, Haoyang and Liang, Yaobo and Duan, Nan and Gong, Ming and Shou, Linjun and Jiang, Daxin and Zhou, Ming},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={2485--2494},
  year={2019}
}

@article{Liang2020XGLUEAN,
  title={XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation},
  author={Yaobo Liang and Nan Duan and Yeyun Gong and Ning Wu and Fenfei Guo and Weizhen Qi and Ming Gong and Linjun Shou and Daxin Jiang and Guihong Cao and Xiaodong Fan and Ruofei Zhang and Rahul Agrawal and Edward Cui and Sining Wei and Taroon Bharti and Ying Qiao and Jiun-Hung Chen and Winnie Wu and Shuguang Liu and Fan Yang and Daniel Campos and Rangan Majumder and Ming Zhou},
  journal={arXiv},
  year={2020},
  volume={abs/2004.01401}
}