# POET

This is the official repo for the paper *Reasoning Like Program Executors*.

## Pre-training Corpus

You can find the pre-training SQL corpus here and the pre-training Math corpus here.

The pre-training SQL corpus can be synthesized by following the same procedure as in TAPEX, combined with the `expand_numbers_in_text` function below:

```python
import re


def expand_numbers_in_text(text, delim=" ", ignore_chars=[","], reverse_num=False):
    # Matches integers, decimals, comma-grouped numbers, scientific notation and percentages.
    number_pattern = r"[-+]?[.]?[\d]+(,\d+)*[\.]?\d*(?:[eE][-+]?\d+)?%?"
    num_char_spans = [(m.start(0), m.end(0)) for m in re.finditer(number_pattern, text)]
    if len(num_char_spans) == 0:
        return text
    out_text = ""
    last_e = -1
    for i, (s, e) in enumerate(num_char_spans):
        # Copy the text between the previous number and the current one.
        out_text += text[:s] if i == 0 else text[last_e:s]
        # Split the number into single characters, dropping ignored characters such as ",".
        num_str = delim.join([c for c in list(text[s:e]) if c not in ignore_chars])
        out_text += num_str if not reverse_num else num_str[::-1]
        last_e = e
    out_text += text[last_e:]  # append the rest of the text
    return out_text
```
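
A quick usage example (not part of the original scripts, just for illustration) showing how numbers are split into space-separated characters:

```python
text = "The company was founded in 2008 and earned 12,000.5 dollars."
print(expand_numbers_in_text(text))
# -> The company was founded in 2 0 0 8 and earned 1 2 0 0 0 . 5 dollars.
```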

The pre-training Math corpus can be synthesized with the script `synthesize_math_corpus.py`, and the pre-training Logic corpus with the script `synthesize_logic_corpus.py`.
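
Assuming the two scripts can be run directly (please check each script for its actual command-line arguments and output paths, which may differ), synthesizing these corpora is a matter of:

```bash
python synthesize_math_corpus.py
python synthesize_logic_corpus.py
```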

For all BART-based experiments, we use the fairseq implementation, which means that you should prepare the dataset in the following format:

```
|- dataset
    |- train.src
    |- train.tgt
    |- valid.src
    |- valid.tgt
```
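
Before training, the `.src`/`.tgt` files have to be BPE-encoded and binarized. Below is a minimal sketch that mirrors fairseq's standard BART fine-tuning recipe; it assumes you run it from the fairseq repository root and have downloaded `encoder.json`, `vocab.bpe` and `dict.txt` from the BART release. The paths and worker counts are placeholders, not necessarily the authors' settings:

```bash
# BPE-encode the raw text with the GPT-2 BPE used by BART (run from the fairseq repo root).
for SPLIT in train valid; do
  for LANG in src tgt; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
      --encoder-json encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs dataset/$SPLIT.$LANG \
      --outputs dataset/$SPLIT.bpe.$LANG \
      --workers 20 \
      --keep-empty
  done
done

# Binarize with the BART dictionary so fairseq-train can read dataset/bin/.
fairseq-preprocess \
  --source-lang src --target-lang tgt \
  --trainpref dataset/train.bpe \
  --validpref dataset/valid.bpe \
  --destdir dataset/bin \
  --srcdict dict.txt --tgtdict dict.txt \
  --workers 20
```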

After the necessary preprocessing (you can follow the official guide for the fairseq machine translation task), you can use the following script to train the model:

```bash
fairseq-train dataset/bin/ \
    --save-dir models \
    --tensorboard-logdir tensorboard_logs \
    --restore-file BART-large/model.pt \
    --arch bart_large \
    --task translation \
    --maximize-best-checkpoint-metric \
    --criterion label_smoothed_cross_entropy  \
    --source-lang src  \
    --target-lang tgt  \
    --label-smoothing 0.1  \
    --max-tokens 1536 \
    --validate-interval 50 \
    --save-interval 50 \
    --save-interval-updates 3001 \
    --validate-interval-updates 3001 \
    --keep-interval-updates 5 \
    --update-freq 16 \
    --warmup-updates 500  \
    --max-update 20000  \
    --total-num-update 20000  \
    --required-batch-size-multiple 1  \
    --dropout 0.1  \
    --attention-dropout 0.1  \
    --relu-dropout 0.0  \
    --weight-decay 0.01  \
    --optimizer adam  \
    --adam-betas "(0.9, 0.999)"  \
    --adam-eps 1e-08  \
    --clip-norm 0.1  \
    --lr-scheduler polynomial_decay  \
    --lr 3e-5  \
    --ddp-backend no_c10d  \
    --num-workers 1  \
    --reset-meters  \
    --reset-optimizer \
    --reset-dataloader \
    --share-all-embeddings \
    --layernorm-embedding \
    --share-decoder-input-output-embed  \
    --skip-invalid-size-inputs-valid-test  \
    --log-format json  \
    --log-interval 10  \
    --patience 10 \
    --keep-best-checkpoints 1 \
    --report-accuracy \
    --no-epoch-checkpoints \
    --no-last-checkpoints \
    --no-save-optimizer-state
```

## Pre-trained Model Weights

You can find all the available POET model weights on the Hugging Face Hub. You can fine-tune these models in the same way as the corresponding vanilla models. They are pre-trained on the following format of sentence and natural context:

```
[sentence] col : [natural context]
```

where `[sentence]` is usually the question of the task and `[natural context]` is usually the passage. Please refer to our paper for more details.
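
As a sketch of how such a checkpoint could be used with the `transformers` library, assuming a BART-based, seq2seq POET checkpoint (the model identifier below is a placeholder, not a real Hub id; replace it with one of the released POET weights):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder id: substitute the actual POET checkpoint downloaded from the Hub.
MODEL_NAME = "your-poet-bart-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Illustrative example: the question acts as [sentence], the passage as [natural context].
sentence = "How many touchdowns were scored in the first quarter?"
context = "The Raiders scored two touchdowns in the first quarter and one more after halftime."

# Concatenate them in the pre-training input format: "[sentence] col : [natural context]".
inputs = tokenizer(f"{sentence} col : {context}", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In practice you would fine-tune such a checkpoint on your downstream task with inputs built in this format, rather than relying on zero-shot generations from the pre-trained weights alone.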