update the pre-trained models corpus and weights.

This commit is contained in:
SivilTaram 2021-08-27 22:09:07 +08:00
Parent 48714e91dc
Commit 94abe26c8c
2 changed files: 41 additions and 3 deletions

View file

@@ -55,7 +55,7 @@ This project contains two parts, `tapex` library and `examples` to employ it on
# ⚡️ Quickstart
-## Environment
+## Prepare Environment
First, you should set up a Python environment. This code base has been tested under Python 3.x, and we officially support Python 3.8.
@@ -80,10 +80,47 @@ $ pip install --editable ./
> The `--editable` flag is important if you plan to make follow-up modifications to the `tapex` library. The command not only installs the dependencies, but also installs `tapex` as a library, which can be imported easily.
-## Get Started
+## Use TAPEX
Once `tapex` is successfully installed, you can go into [examples](examples) to enjoy fine-tuning TAPEX models and applying them to different applications!
# 🏰 Resource
## Pre-training Corpus
Our synthetic pre-training corpus, which includes nearly **5,000,000** tuples of (*SQL query*, *flattened table*, *SQL execution result*), can be downloaded from [here](https://github.com/microsoft/Table-Pretraining/releases/download/v1.1/tapex_pretrain.tar.gz). You can use it for research purposes, but you should be careful about the [data license](LICENSE-Data).
Below is an example from the pre-training corpus:
- The SQL query plus the flattened table as **INPUT**:
```
select vote where passed = 'may 6, 1861' col : state | passed | referendum | vote
row 1 : s. carolina | december 20, 1860 | none | none
row 2 : mississippi | january 9, 1861 | none | none
row 3 : florida | january 10, 1861 | none | none
row 4 : alabama | january 11, 1861 | none | none
row 5 : georgia | january 19, 1861 | none | none
row 6 : louisiana | january 26, 1861 | none | none
row 7 : texas | february 1, 1861 | february 23 | 46,153-14,747
row 8 : virginia | april 17, 1861 | may 23 | 132,201-37,451
row 9 : arkansas | may 6, 1861 | none | none
row 10 : tennessee | may 6, 1861 | june 8 | 104,471-47,183
row 11 : n. carolina | may 20, 1861 | none | none
row 12 : missouri | october 31, 1861 | none | none
row 13 : kentucky | november 20, 1861 | none | none
```
- The SQL Execution Result as **OUTPUT**:
```
104471
```
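The flattened-table format above (a `col :` header section followed by `row N :` sections, with cells joined by ` | `) can be sketched with a few lines of Python. This is an illustrative sketch, not the official `tapex` preprocessing code; the function name `flatten_table` is our own.

```python
def flatten_table(headers, rows):
    # Linearize a table into the "col : ... row 1 : ... row 2 : ..." format
    # used by the pre-training corpus (cells joined by " | ").
    parts = ["col : " + " | ".join(headers)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(row))
    return " ".join(parts)

headers = ["state", "passed", "referendum", "vote"]
rows = [
    ["s. carolina", "december 20, 1860", "none", "none"],
    ["mississippi", "january 9, 1861", "none", "none"],
]
print(flatten_table(headers, rows))
# col : state | passed | referendum | vote row 1 : s. carolina | december 20, 1860 | none | none row 2 : mississippi | january 9, 1861 | none | none
```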
Here we want to acknowledge the huge effort behind the paper [On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries](https://arxiv.org/pdf/2010.11246.pdf), which provides the rich SQL template resources that we used to synthesize the pre-training corpus.
If you are interested, please give a STAR to their [repo](https://github.com/tzshi/squall).
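Tuples of (*SQL query*, *flattened table*, *SQL execution result*) like the example above can be obtained by loading a table into a database and executing the query against it. The sketch below uses Python's standard-library `sqlite3` for illustration only; it is an assumption about the general approach, not the authors' actual synthesis pipeline.

```python
import sqlite3

# Load a small table into an in-memory database, execute a SQL query,
# and collect the execution result (one illustrative corpus tuple).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (state TEXT, passed TEXT, referendum TEXT, vote TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("arkansas", "may 6, 1861", "none", "none"),
        ("tennessee", "may 6, 1861", "june 8", "104,471-47,183"),
    ],
)

sql = "SELECT vote FROM t WHERE passed = 'may 6, 1861'"
result = [row[0] for row in conn.execute(sql)]
print(result)  # ['none', '104,471-47,183']
```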
## Pre-trained models
Model | Description | # params | Download
---|---|---|---
`tapex.base` | 6 encoder and decoder layers | 140M | [tapex.base.tar.gz](https://github.com/microsoft/Table-Pretraining/releases/download/v1.0/tapex.base.tar.gz)
`tapex.large` | 12 encoder and decoder layers | 400M | [tapex.large.tar.gz](https://github.com/microsoft/Table-Pretraining/releases/download/v1.0/tapex.large.tar.gz)
> More pre-trained models will be uploaded soon!
# 💬 Citation
If our work is useful to you, please consider citing our paper:

View file

@@ -25,6 +25,7 @@ setuptools.setup(
'transformers>=4.6.0',
'numpy==1.20.3',
"fairseq@git+git://github.com/pytorch/fairseq@801a64683164680562c77b688d9ca77fc3e0cea7",
"records"
"records",
"pandas"
],
)