# SQuAD

Based on the script `run_squad.py`.
## Fine-tuning BERT on SQuAD1.0

This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single Tesla V100 16GB. The SQuAD data should be saved in a `$SQUAD_DIR` directory: for SQuAD1.0 you need `train-v1.1.json` and `dev-v1.1.json`, and for SQuAD2.0 you need `train-v2.0.json` and `dev-v2.0.json`, all available from the official SQuAD website.
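As a convenience, here is a minimal download sketch in Python. It assumes the dataset files hosted at rajpurkar.github.io (the official SQuAD website) are still available at these URLs and that `SQUAD_DIR` is set as in the commands below:

```python
import os
import urllib.request

# Download the SQuAD1.0 and SQuAD2.0 train/dev files into $SQUAD_DIR.
# NOTE: the base URL is an assumption based on the official SQuAD website.
squad_dir = os.environ.get("SQUAD_DIR", "./SQUAD")
os.makedirs(squad_dir, exist_ok=True)

base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset"
for filename in ("train-v1.1.json", "dev-v1.1.json", "train-v2.0.json", "dev-v2.0.json"):
    urllib.request.urlretrieve(f"{base_url}/{filename}", os.path.join(squad_dir, filename))
```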
```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```
f1 = 88.52
exact_match = 81.22
```
### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-large-uncased-whole-word-masking \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
  --per_gpu_eval_batch_size=3 \
  --per_gpu_train_batch_size=3
```
Training with the previously defined hyper-parameters yields the following results:
```
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference `bert-large-uncased-whole-word-masking-finetuned-squad`.
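As a quick sanity check, the checkpoint can be loaded directly with the `question-answering` pipeline. A minimal sketch (the question and context strings are made up for illustration):

```python
from transformers import pipeline

# Load the released SQuAD-finetuned checkpoint into a question-answering pipeline.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

# Toy example: extract an answer span from a context paragraph.
result = qa(
    question="On which dataset was the model fine-tuned?",
    context="The checkpoint was fine-tuned on the SQuAD1.1 dataset using distributed training on 8 V100 GPUs.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'SQuAD1.1'}
```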
## Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD1.0 and SQuAD2.0 datasets. See above for how to download the SQuAD data.
Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=4 \
  --per_gpu_train_batch_size=4 \
  --save_steps 5000
```
Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type xlnet \
  --model_name_or_path xlnet-large-cased \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --train_file $SQUAD_DIR/train-v2.0.json \
  --predict_file $SQUAD_DIR/dev-v2.0.json \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./wwm_cased_finetuned_squad/ \
  --per_gpu_eval_batch_size=2 \
  --per_gpu_train_batch_size=2 \
  --save_steps 5000
```
A larger batch size may improve performance at the cost of more memory.
Results for SQuAD1.0 with the previously defined hyper-parameters:
```
{
    "exact": 85.45884578997162,
    "f1": 92.5974600601065,
    "total": 10570,
    "HasAns_exact": 85.45884578997162,
    "HasAns_f1": 92.59746006010651,
    "HasAns_total": 10570
}
```
Results for SQuAD2.0 with the previously defined hyper-parameters:
```
{
    "exact": 80.4177545691906,
    "f1": 84.07154997729623,
    "total": 11873,
    "HasAns_exact": 76.73751686909581,
    "HasAns_f1": 84.05558584352873,
    "HasAns_total": 5928,
    "NoAns_exact": 84.0874684608915,
    "NoAns_f1": 84.0874684608915,
    "NoAns_total": 5945
}
```
## Fine-tuning BERT on SQuAD1.0 with relative position embeddings

The following examples show how to fine-tune BERT models with different relative position embeddings. The BERT model
`bert-base-uncased` was pre-trained with default absolute position embeddings. We provide the following pre-trained
models, which were pre-trained on the same training data (BooksCorpus and English Wikipedia) as in BERT
pre-training, but with different relative position embeddings (see the configuration sketch after the list):

* `zhiheng-huang/bert-base-uncased-embedding-relative-key`, trained from scratch with the relative embedding proposed by Shaw et al., [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
* `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative embedding method 4 in Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
* `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from the model `bert-large-uncased-whole-word-masking` for 3 additional epochs with relative embedding method 4 in Huang et al., [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
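For reference, the position embedding scheme used by a BERT checkpoint is exposed through the `position_embedding_type` field of `BertConfig`. A minimal sketch (assuming the checkpoint names above) of inspecting this field and of selecting the scheme when building a model from scratch:

```python
from transformers import BertConfig, BertModel

# Inspect the position embedding scheme used by one of the checkpoints above.
config = BertConfig.from_pretrained("zhiheng-huang/bert-base-uncased-embedding-relative-key-query")
print(config.position_embedding_type)  # expected: "relative_key_query"

# When training from scratch, the scheme can be selected explicitly
# ("absolute", "relative_key", or "relative_key_query").
scratch_config = BertConfig(position_embedding_type="relative_key_query")
model = BertModel(scratch_config)
```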
### Base models fine-tuning
```bash
export SQUAD_DIR=/path/to/SQUAD
output_dir=relative_squad
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir ${output_dir} \
  --per_gpu_eval_batch_size=60 \
  --per_gpu_train_batch_size=6
```
Training with the above command leads to the following results. It boosts the default BERT f1 score from 88.52 to 90.54.

```
'exact': 83.6802270577105, 'f1': 90.54772098174814
```
Changing `max_seq_length` from 512 to 384 in the above command leads to an f1 score of 90.34. Replacing the model
`zhiheng-huang/bert-base-uncased-embedding-relative-key-query` with
`zhiheng-huang/bert-base-uncased-embedding-relative-key` leads to an f1 score of 89.51. Training on a single GPU
instead of 8 GPUs leads to an f1 score of 90.71.
### Large models fine-tuning
```bash
export SQUAD_DIR=/path/to/SQUAD
output_dir=relative_squad
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --output_dir ${output_dir} \
  --per_gpu_eval_batch_size=6 \
  --per_gpu_train_batch_size=2 \
  --gradient_accumulation_steps 3
```
Training with the above command leads to an f1 score of 93.52, which is slightly better than the f1 score of 93.15 for `bert-large-uncased-whole-word-masking`.
## SQuAD with the Tensorflow Trainer
```bash
python run_tf_squad.py \
  --model_name_or_path bert-base-uncased \
  --output_dir model \
  --max_seq_length 384 \
  --num_train_epochs 2 \
  --per_gpu_train_batch_size 8 \
  --per_gpu_eval_batch_size 16 \
  --do_train \
  --logging_dir logs \
  --logging_steps 10 \
  --learning_rate 3e-5 \
  --doc_stride 128
```
For the moment, evaluation is not available in the Tensorflow Trainer, only training.
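If you need SQuAD metrics for a model trained this way, one option is to compute them yourself from your own predictions using the `squad` metric from the `datasets` library. A minimal sketch with made-up placeholder data:

```python
from datasets import load_metric

# Compute exact match / F1 with the official SQuAD metric on your own predictions.
# The ids, texts, and answer_start below are made-up placeholders.
squad_metric = load_metric("squad")

predictions = [{"id": "example-0", "prediction_text": "Denver Broncos"}]
references = [
    {"id": "example-0", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}
]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```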