Merge pull request #2 from yongbowin/patch-2

Update README.md
spacemanidol 2019-06-11 11:47:02 -07:00 committed by GitHub
Parents 87f84b4914 f52d4ec1a8
Commit f5f423e30b
No key matching this signature was found
GPG key ID: 4AEE18F83AFDEB23
1 file changed: 11 additions and 11 deletions

@@ -48,16 +48,16 @@ The MSMARCO dataset is generated by a well oiled pipeline optimized for the high
### Data Format
Much like the v2.0 release, the v2.1 release is provided as a json file for easy exploration, debugging, and loading. Based on feedback from our community, the v2.1 dataset now has utilities for easy conversion to the [JSONL](http://jsonlines.org/) format. Official downloads from the website come as one large json object, but you can use the tojson.py or tojsonl.py utilities to switch easily between file formats.
Each line/entry contains the following fields, described below: query_id, query_type, query, passages, answers, and wellFormedAnswers.
For the QA task the target output is in 'answers'. For the NLGen task the target output is in 'wellFormedAnswers'.
1. query_id: A unique id for each query that is used in evaluation
2. query: A unique query based on initial Bing usage
3. passages: A set of 10 passages, their URLs, and an annotation of whether they were used to formulate the answer (is_selected:1). Two passages may come from the same URL, and these passages have been retrieved by Bing as the most relevant. If a passage is marked as is_selected:1 it means the judge used that passage to formulate their answer. If a passage is marked as is_selected:0 it means the judge did not use that passage to generate their answer. Questions that have the answer 'No Answer Present.' will have all passages marked as is_selected: 0.
4. query_type: A basic division of queries based on a trained classifier. Categories are: {LOCATION, NUMERIC, PERSON, DESCRIPTION, ENTITY} and can be used to debug model performance or make smaller, more focused datasets.
5. answers: An array of answers produced by human judges; most contain a single answer but ~1% contain more than one answer (average of ~2 answers when there are multiple). These answers were generated by real people in their own words instead of selecting a span of text. The language used in their answer may be similar to, or match, the language in any of the passages.
6. wellFormedAnswers: An array of rewritten answers; most contain a single answer but ~1% contain more than one answer (average of ~5 answers when there are multiple). These answers were generated by having a new judge read the answer and the query and rewrite the answer if it (i) did not use proper grammar to make it a full sentence, (ii) did not make sense without the context of either the query or the passage, or (iii) had a high overlap with exact portions of one of the context passages. This ensures that well formed answers are true natural language and not just span selection. Well Formed Answers are a more difficult form of question answering because they contain words that may not be present in either the question or any of the context passages.
example
~~~
@@ -77,7 +77,7 @@ example
}
~~~
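A minimal loading sketch based on the field descriptions above. It assumes you have already converted the download to JSONL with tojsonl.py; the filename and the per-passage keys (is_selected, passage_text) are assumptions, not part of the official documentation.
~~~
import json

# Hypothetical path; point this at the output of tojsonl.py.
INPUT = "dev_v2.1.jsonl"

with open(INPUT, "r", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # Fields described above: query_id, query_type, query,
        # passages, answers, and wellFormedAnswers.
        print(entry["query_id"], entry["query_type"], entry["query"])
        print("answers:", entry["answers"])
        # Keep only passages the judge used (is_selected == 1).
        # The per-passage keys here are assumed.
        used = [p.get("passage_text") for p in entry["passages"]
                if p.get("is_selected") == 1]
        print("passages used to answer:", len(used))
        break  # inspect the first entry only
~~~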
## Utilities, Stats and Related Content
Besides the main files containing judgments, we are releasing various utilities to help people explore the data and optimize it for their needs. They have only been tested with python 3.5 and are provided as is. Usage is noted below. If you write any utils you feel the community could use and enjoy, please submit them with a pull request.
### File Conversion
Our community told us that they liked being able to have the data in both json format for easy exploration and [JSONL](http://jsonlines.org/) format to make running models easier. To ease the transition from one file format to the other we have included tojson.py and tojsonl.py.
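For illustration only (the provided tojson.py and tojsonl.py remain the authoritative utilities), here is a rough sketch of this kind of conversion, under the assumption that the official json download is column-oriented, i.e. each top-level field maps an entry index to its value:
~~~
import json
import sys

def to_jsonl(in_path, out_path):
    """Flatten one large json object into one entry per line (assumed layout)."""
    with open(in_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    fields = list(data.keys())              # e.g. query, query_id, passages, ...
    indices = list(data[fields[0]].keys())  # entry indices assumed shared across fields
    with open(out_path, "w", encoding="utf-8") as out:
        for i in indices:
            entry = {field: data[field][i] for field in fields}
            out.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    to_jsonl(sys.argv[1], sys.argv[2])
~~~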
@@ -99,7 +99,7 @@ python3 converttowellformed.py <your_input_file(json)> <target_json_filename>
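The converttowellformed.py usage shown above suggests a utility for working with the wellFormedAnswers subset. As an unofficial illustration of that kind of filtering (not the script itself), assuming a JSONL input, hypothetical paths, and that wellFormedAnswers is a possibly empty list:
~~~
import json

def keep_well_formed(in_path, out_path):
    """Keep only entries that have at least one rewritten, well-formed answer."""
    with open(in_path, "r", encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            entry = json.loads(line)
            if entry.get("wellFormedAnswers"):  # assumed to be a list; empty means none
                out.write(line)

keep_well_formed("train_v2.1.jsonl", "train_v2.1_wellformed.jsonl")  # hypothetical paths
~~~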
### Dataset Statistics
Statistics about the dataset were generated with the exploredata.py file. They can be found in the Stats folder.
You can use exploredata.py to generate similar statistics on any slice of the dataset you create.
~~~
python3 exploredata.py <your_input_file(json)> <-p if you are using a dataslice without answers>
~~~
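As a rough illustration of the kind of statistics you might compute on a slice yourself (this is not the official exploredata.py), assuming a JSONL slice and a hypothetical filename:
~~~
import json
from collections import Counter

def query_type_counts(path):
    """Tally the query_type field across a JSONL slice of the dataset."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["query_type"]] += 1
    return counts

if __name__ == "__main__":
    for qtype, n in query_type_counts("my_slice.jsonl").most_common():
        print(qtype, n)
~~~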
@@ -112,7 +112,7 @@ We have made the official evaluation script along with a sample output file on t
~~~
./run.sh <path to reference json file> <path to candidate json file>
~~~
### Leaderboard Results
To help teams iterate, we are making the results of official submissions on our evaluation script (the scores, not the full submissions) available. We will update these files as we update metrics and as new submissions come in. They can be found in the [Leaderboard Results](https://github.com/dfcf93/MSMARCOV2/tree/master/Q%2BA/Leaderboard%20Results) folder.
### Submissions
Once you have built a model that meets your expectations on evaluation with the dev set, you can submit your test results to get official evaluation on the test set. To ensure the integrity of the official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:
@@ -123,12 +123,12 @@ Individual/Team Institution: Name of the institution of the individual or the te
Model information: Name of the model/technique to appear in the leaderboard [Required]
Paper Information: Name, Citation, URL of the paper if model is from a published work to appear in the leaderboard [Optional]
Please submit your results in either json or jsonl format and ensure that each answer you are providing has its reference query_id and query_text. If your output does not include query_id and query_text, it is difficult or impossible to evaluate the submission.
To avoid "P-hacking" we discourage too many submissions from the same group in a short period of time. Because submissions don't require the final trained model, we also retain the right to request a model to validate the results being submitted.
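A minimal sketch of writing predictions in the shape the submission guidance above asks for, where every answer carries its reference query_id and query_text; the filename, the answers key, and the predictions structure are placeholders, not an official format specification.
~~~
import json

def write_submission(predictions, path="submission.jsonl"):
    """predictions: dict mapping query_id -> (query_text, answer string)."""
    with open(path, "w", encoding="utf-8") as out:
        for query_id, (query_text, answer) in predictions.items():
            record = {
                "query_id": query_id,
                "query_text": query_text,
                "answers": [answer],  # key name assumed for illustration
            }
            out.write(json.dumps(record) + "\n")

write_submission({101: ("what is a corporation", "A corporation is a legal entity.")})
~~~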
### Run baseline systems on multilingual datasets
To encourage competitors to generate performant systems regardless of the language, we recommend teams also test their systems on datasets in other languages, such as Baidu's DuReader.
[DuReader](https://ai.baidu.com/broad/subordinate?dataset=dureader) is a Chinese dataset focused on machine reading comprehension and question answering. Its design and area of focus are very similar to those of MSMARCO. The DuReader team has created scripts to allow DuReader systems to use MSMARCO data, and we have created scripts to allow MSMARCO teams to use DuReader data. We strongly recommend training and testing your system with both datasets. We are in the process of creating an analysis tool that would take results from both systems and debug the wins/losses.
To download the DuReader data, navigate to their [Git Repo](https://github.com/baidu/DuReader) and follow their instructions to download the data. After you have downloaded and processed the data, you can run our converter scripts to turn the data into MSMARCO format as below.
~~~