Data-to-Text Generation

We explore four baseline models to generate meaningful text from hierarchical tables in HiTab. Three of them are transformer-based models: T5, BART, and BERT-to-BERT. The fourth is a Pointer-Generator Network based on the LSTM architecture.

[0] Preliminaries

To start with, make sure to install the following requirements:

pip install openpyxl
pip install datasets 
pip install transformers 

[1] Data Pre-processing

Read in train_samples.jsonl, dev_samples.jsonl, and test_samples.jsonl from the ./data/ directory.

Process each sample with: (1) the highlighted/linked table cells, and (2) additional operations and answer(s).

  • The generation target label is the annotated sub_sentence.
  • To create a serialized table data input, we: (1) find all linked entity/quantity cells, (2) find all of their ancestors, then linearize their cell contents in a top-down, left-to-right order. If extra operational information is required, we then append the answer formula and answer string to the source as the final model input (see the sketch below).
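
For illustration, here is a minimal sketch of this serialization step. The field names used (linked_cells, parent, text, answer_formula, answer) are hypothetical placeholders and only approximate the actual sample/table schema handled by do_preprocess.py.

import json

def load_samples(path):
    # Read one JSON object per line from *_samples.jsonl.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def linearize(sample, table, add_aggr=False):
    # Sketch: build a (source, target) pair from the linked cells and their ancestors.
    pieces = []
    for cell_id in sample["linked_cells"]:           # linked entity/quantity cells
        chain, node = [], table[cell_id]
        while node is not None:                      # collect the cell and all of its ancestors
            chain.append(node["text"])
            node = table.get(node.get("parent"))
        pieces.append(" | ".join(reversed(chain)))   # top-down order within each header chain
    source = " ; ".join(pieces)                      # left-to-right order across cells
    if add_aggr:                                     # cell & calculation setting
        source += " | " + sample["answer_formula"] + " = " + str(sample["answer"])
    return source, sample["sub_sentence"]            # the target is the annotated sub_sentence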

This process creates source-target pairs for the train/dev/test sets. To perform data pre-processing for the cell highlight setting, simply run:

python do_preprocess.py

Or, to enable the cell & calculation setting, specify the additional argument:

python do_preprocess.py --add_aggr 

Both will load the data from the hitab/data/ directory and generate a processed version in hitab/data2text/data/.

Note that the input samples require another layer of tokenization, using hitab/data2text/experiment/pointer_generator/parse_sample.py.

[2] Experiment: Training and Evaluation

The experiment directory contains the code for training (train_d2t.py) and evaluation (eval_d2t.py). T5, BART, and BERT-to-BERT directly call the training process from the installed transformers library. The Pointer-Generator Network (PGN) requires additional code modules, located in the pointer_generator directory.
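
As a rough illustration of what the transformers-based branch does (not the repository's exact code), fine-tuning a seq2seq model on the serialized source/target pairs looks roughly as follows; the data file names, column names, and hyper-parameters here are assumptions:

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Illustrative only: file names, columns, and hyper-parameters are assumptions.
model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data = load_dataset("json", data_files={"train": "data/train.jsonl", "validation": "data/dev.jsonl"})

def preprocess(batch):
    # Tokenize the serialized table (source) and the sub_sentence (target).
    enc = tokenizer(batch["source"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["target"], truncation=True, max_length=128)["input_ids"]
    return enc

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="ckpt_bart", num_train_epochs=10,
                                  per_device_train_batch_size=8, predict_with_generate=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()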

To follow the training pipeline, taking BART as an example, run:

python run_experiment.py --expr_name bart --do_train --do_eval --do_test 

Change the expr_name argument to one of t5/bart/b2b/pgn to explore different models.
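
For example, to train and evaluate the Pointer-Generator Network instead, run:

python run_experiment.py --expr_name pgn --do_train --do_eval --do_test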