This commit is contained in:
Michele Tufano 2021-05-19 18:52:12 +00:00
Родитель d412c24e7d
Коммит 135674a491
1 изменённых файлов: 33 добавлений и 0 удалений

33
corpus/README.md Normal file
Просмотреть файл

@ -0,0 +1,33 @@
# Corpus
This folder contains the corpus extracted from the dataset of mapped test cases.
The corpus is organized in different levels of focal context, incorporating information from the focal method and class within the *input sentence*, which can inform the model when generating test cases. The different levels of focal contexts are the following:
- *FM*: focal method
- *FM_FC*: focal method + focal class name
- *FM_FC_CO*: focal method + focal class name + constructor signatures
- *FM_FC_MS*: focal method + focal class name + constructor signatures + public method signatures
- *FM_FC_MS_FF*: focal method + focal class name + constructor signatures + public method signatures + public fields
## JSON
The `json/` folder contains the target test case as well as all the five variations of focal context input.
```yaml
target: <TEST_CASE>
src_fm: <FOCAL_METHOD>
src_fm_fc: <FOCAL_CLASS_NAME> <FOCAL_METHOD>
src_fm_fc_co: <FOCAL_CLASS_NAME> <FOCAL_METHOD> <CONTRSUCTORS>
src_fm_fc_ms: <FOCAL_CLASS_NAME> <FOCAL_METHOD> <CONTRSUCTORS> <METHOD_SIGNATURES>
src_fm_fc_ms_ff: <FOCAL_CLASS_NAME> <FOCAL_METHOD> <CONTRSUCTORS> <METHOD_SIGNATURES> <FIELDS>
```
## Raw
The `raw/` folder contains the raw text parallel corpus of focal methods and test cases.
The corpus does not contain duplicate pairs, and is organized in train, validation, and test set.
## Tokenized
The `tokenized/` folder contains the parallel corpus tokenized using a Byte-Level BPE Tokenizer. Specifically, we use the ByteLevelBPETokenizer from [huggingface/tokenizers](https://github.com/huggingface/tokenizers "text") with a Roberta-based vocabulary available in `preprocessed/dict.tar.bz2`
## Preprocessed
The `preprocessed/` folder contains the corpus preprocessed using fairseq as well as the dictionary. Specifically, we use the [fairseq-preprocess](https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-preprocess "fairseq-preprocess") command that build vocabularies and binarize the data, to be used during training.