Unicoder

This repo provides the code for reproducing the experiments in XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation.

Installation and Dependencies

This repo is based on Transformers. It has been tested with Python 3.6+ and PyTorch 1.4.0.

Installing this repo will replace any Transformers installation you already have, so we recommend installing it in a separate virtual environment or conda environment.
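
For example, a minimal conda setup might look like this (the environment name unicoder is arbitrary):

conda create -n unicoder python=3.6
conda activate unicoder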

You can then install this repo with pip:

git clone https://github.com/microsoft/Unicoder
cd Unicoder/understanding
pip install . 

You can speed up all experiments by installing apex and passing the --fp16 flag.
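
For reference, installing apex from source typically looks roughly like the following; this is only a sketch, and the apex README should be checked for the currently recommended build options:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./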

Pre-trained model

The pre-trained model used in the paper is available here.

Fine-tuning experiments

XGLUE dataset

You can download the XGLUE dataset from the XGLUE homepage.
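
Based on the --data_dir paths used in the commands below, the extracted dataset is expected to contain one sub-folder per task, roughly like this (the exact file layout inside each folder is not shown here):

$DATA_DIR/
  XNLI/
  QADSM/
  QAM/
  PAWSX/
  NC/
  WPR/
  MLQA/
  NER/   (including labels.txt)
  POS/   (including a labels file)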

Fine-tuning

We used a single V100 GPU with 32GB memory to run all the experiments. If you are using a GPU with 16GB memory or less, you can decrease per_gpu_train_batch_size and increase gradient_accumulation_steps.
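
For example, for the XNLI command below you could replace --per_gpu_train_batch_size 32 with the following pair of flags, which keeps the effective batch size at 32 while reducing peak GPU memory (the values here are only illustrative):

--per_gpu_train_batch_size 8 --gradient_accumulation_steps 4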

In our experiments, we used FP16 to speed up training. Note that using FP32, or K80/M60 GPUs, may lead to slightly different performance.

You can run xglue.sh to reproduce all the experiments.

xglue.sh takes three parameters (the per-task commands below use the same variables; see the example after this list):

DATA_DIR: path to the downloaded XGLUE dataset
MODEL_DIR: path to the pre-trained model from the "Pre-trained model" section
OUTPUT_DIR: any output folder
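
For example (hypothetical local paths), you can export the three variables once before running any of the commands below:

export DATA_DIR=/path/to/XGLUE
export MODEL_DIR=/path/to/unicoder_model
export OUTPUT_DIR=/path/to/output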

XNLI:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/XNLI \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/XNLI \
--task_name xnli \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

QADSM:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/QADSM \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/QADSM \
--task_name ads \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

QAM:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/QAM \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/QAM \
--task_name qam \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

PAWSX:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/PAWSX \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/PAWSX \
--task_name pawsx \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

NC:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr,ru \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/NC \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/NC \
--task_name news \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

WPR:

python examples/run_xglue.py --model_type xlmr \
--model_name_or_path $MODEL_DIR \
--language de,en,es,fr,it,pt,zh \
--train_language en \
--do_train \
--do_eval \
--do_predict \
--data_dir $DATA_DIR/WPR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--max_seq_length 256 \
--output_dir $OUTPUT_DIR/WPR \
--task_name rel \
--save_steps -1 \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training \
--logging_steps -1 \
--logging_steps_in_sample -1 \
--logging_each_epoch \
--gpu_id 0

MLQA:

CUDA_VISIBLE_DEVICES=0 python examples/run_xmrc.py --model_type xlmr  \
--model_name_or_path $MODEL_DIR \
--do_train \
--do_eval \
--do_lower_case \
--language en,es,de,ar,hi,vi,zh \
--train_language en \
--data_dir $DATA_DIR/MLQA \
--per_gpu_train_batch_size 12 \
--per_gpu_eval_batch_size 128 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--save_steps 0 \
--logging_each_epoch \
--max_seq_length 384 \
--doc_stride 128  \
--output_dir $OUTPUT_DIR/MLQA \
--overwrite_output_dir \
--overwrite_cache \
--evaluate_during_training

NER:

python examples/ner/run_ner.py --data_dir $DATA_DIR/NER \
--model_type xlmroberta \
--labels $DATA_DIR/NER/labels.txt \
--model_name_or_path $MODEL_DIR \
--gpu_id 0 \
--output_dir $OUTPUT_DIR/NER \
--language de,en,es,nl \
--train_language en \
--max_seq_length 256 \
--num_train_epochs 20 \
--per_gpu_train_batch_size 32 \
--save_steps 1500 \
--seed 42 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache \
--logging_each_epoch \
--evaluate_during_training \
--learning_rate 5e-6 \
--task_name ner

POS:

python examples/ner/run_ner.py --data_dir $DATA_DIR/POS \
--model_type xlmroberta \
--labels $DATA_DIR/POS/labels \
--model_name_or_path $MODEL_DIR \
--gpu_id 0 \
--output_dir $OUTPUT_DIR/POS \
--language en,ar,bg,de,el,es,fr,hi,it,nl,pl,pt,ru,th,tr,ur,vi,zh \
--train_language en \
--max_seq_length 128 \
--num_train_epochs 20 \
--per_gpu_train_batch_size 32 \
--save_steps 1500 \
--seed 42 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache \
--logging_each_epoch \
--evaluate_during_training \
--learning_rate 2e-5 \
--task_name pos

The languages used for fine-tuning and testing don't need to be the same: you can fine-tune only on English and test on other languages. Based on our experiments, translating the English training data into other languages and fine-tuning on the translated data can further improve performance.

Notes and Acknowledgments

This code base is built on top of Transformers.

Revised Files

examples/run_xglue.py

examples/run_xglue_ft.py

examples/run_xmrc.py

examples/ner/run_ner.py

src/transformers/data/processors/xglue.py

src/transformers/data/metrics/__init__.py

Added Files

unicoder/xglue.sh

How to cite

If you extend or use this work, please cite our papers:

@inproceedings{huang2019unicoder,
  title={Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks},
  author={Huang, Haoyang and Liang, Yaobo and Duan, Nan and Gong, Ming and Shou, Linjun and Jiang, Daxin and Zhou, Ming},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={2485--2494},
  year={2019}
}

@article{Liang2020XGLUEAN,
  title={XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation},
  author={Yaobo Liang and Nan Duan and Yeyun Gong and Ning Wu and Fenfei Guo and Weizhen Qi and Ming Gong and Linjun Shou and Daxin Jiang and Guihong Cao and Xiaodong Fan and Ruofei Zhang and Rahul Agrawal and Edward Cui and Sining Wei and Taroon Bharti and Ying Qiao and Jiun-Hung Chen and Winnie Wu and Shuguang Liu and Fan Yang and Daniel Campos and Rangan Majumder and Ming Zhou},
  journal={arXiv},
  year={2020},
  volume={abs/2004.01401}
}