Merge branch 'staging' into courtney-tests
This commit is contained in:
Commit 366bad1538

README.md (43 changed lines)

@@ -1,14 +1,45 @@
| Branch | Status | | Branch | Status |
| ------ | ------ | --- | ------- | ------ |
| master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-master?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=22&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-staging?branchName=staging)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=21&branchName=staging) |
# NLP Best Practices

This repository contains examples and best practices for building NLP systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

![](https://nlpbp.blob.core.windows.net/images/cognitive_services.PNG)

## Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.
The content is based on our past and potential future engagements with customers, as well as collaboration with partners, researchers, and the open source community.

We hope that these tools will significantly reduce the time from a business problem, or a research idea, to the full implementation of a working system. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks and can be used in a number of applications ranging from simple text classification to sophisticated intelligent chatbots.

> [*GLUE Leaderboard*](https://gluebenchmark.com/leaderboard)
> [*SQuAD Leaderboard*](https://rajpurkar.github.io/SQuAD-explorer/)
## Content

The following is a summary of the scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.

| Scenario | Applications | Models |
| --- | --- | --- |
|[Text Classification](scenarios/text_classification) |Topic Classification|BERT|
|[Named Entity Recognition](scenarios/named_entity_recognition) |Wikipedia NER |BERT|
|[Entailment](scenarios/entailment)|XNLI Natural Language Inference|BERT|
|[Question Answering](scenarios/question_answering) |SQuAD |BiDAF|
|[Sentence Similarity](scenarios/sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
|[Embeddings](scenarios/embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
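The scenario notebooks sit on top of the shared `utils_nlp` package. As a rough, hedged sketch of that pattern (the `snli` loader and `Split` enum shown here are taken from the test and notebook changes later in this commit, so exact signatures may differ):

```python
# Sketch only: illustrates how a scenario notebook calls into utils_nlp.
from utils_nlp.dataset import snli, Split

# Download (if needed) and load SNLI splits as pandas DataFrames.
train = snli.load_pandas_df("data", file_split=Split.TRAIN, nrows=1000)
dev = snli.load_pandas_df("data", file_split=Split.DEV, nrows=1000)

print(train.head())
```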
## Getting Started

To get started, navigate to the [Setup Guide](SETUP.md), where you'll find instructions on how to set up your environment and dependencies.

## Contributing

This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).

## Build Status

| Build Type | Branch | Status |
| --- | --- | --- |
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/cpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=50&branchName=master) |
| **Linux GPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/gpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=51&branchName=master) |
@@ -1,19 +1,14 @@
# NLP Scenarios

This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for different scenarios.
This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for the following scenarios.

## Summary

The following is a summary of the scenarios covered in the best practice notebooks. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.

| Scenario | Applications | Models |
| --- | --- | --- |
|[Text Classification](text_classification) |Topic Classification|BERT|
|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER |BERT|
|[Entailment](./entailment)|XNLI Natural Language Inference|BERT|
|[Question Answering](question_answering) |SQuAD |BiDAF|
|[Sentence Similarity](sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
|[Embeddings](embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
- [Text Classification](text_classification)
- [Named Entity Recognition](named_entity_recognition)
- [Entailment](entailment)
- [Question Answering](question_answering)
- [Sentence Similarity](sentence_similarity)
- [Embeddings](embeddings)

## Azure-enhanced notebooks
@@ -80,13 +80,13 @@
"\n",
"import os\n",
"import papermill as pm\n",
"import scrapbook as sb\n",
"\n",
"from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens\n",
"from utils_nlp.dataset import snli, preprocess\n",
"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
"import scrapbook as sb\n",
"\n",
"from utils_nlp.dataset import Split\n",
"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"\n",
"print(\"System version: {}\".format(sys.version))"
]
@@ -335,9 +335,9 @@
}
],
"source": [
"train = snli.load_pandas_df(base_data_path, file_split=\"train\", nrows=nrows)\n",
"dev = snli.load_pandas_df(base_data_path, file_split=\"dev\", nrows=nrows)\n",
"test = snli.load_pandas_df(base_data_path, file_split=\"test\", nrows=nrows)\n",
"train = snli.load_pandas_df(base_data_path, file_split=Split.TRAIN, nrows=nrows)\n",
"dev = snli.load_pandas_df(base_data_path, file_split=Split.DEV, nrows=nrows)\n",
"test = snli.load_pandas_df(base_data_path, file_split=Split.TEST, nrows=nrows)\n",
"\n",
"train.head()"
]
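The cell above replaces raw split strings with members of a `Split` enum imported from `utils_nlp.dataset`. The enum's definition is not part of this diff; a minimal sketch of what such an enum typically looks like, offered only as an assumption:

```python
from enum import Enum


class Split(str, Enum):
    """Hypothetical sketch of the dataset split enum; not the repository's actual code."""
    TRAIN = "train"
    DEV = "dev"
    TEST = "test"


# Mixing in str keeps members compatible with code that still expects the old
# string values, e.g. Split.TRAIN == "train" evaluates to True.
```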
@@ -0,0 +1,61 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# NOTE: this is commented out since, as of July 2019, DevOps has a bug in the scheduler
#schedules:
#- cron: "56 22 * * *"
#  displayName: Daily computation of nightly builds
#  branches:
#    include:
#    - master
#  always: true

# Pull requests against these branches will trigger this build
pr:
- master

trigger: none

jobs:
- job: nightly
  displayName: 'Nightly tests'
  timeoutInMinutes: 180 # how long to run the job before automatically cancelling
  pool:
    name: nlpagentpool

  steps:
  - bash: |
      echo "##vso[task.prependpath]/data/anaconda/bin"
      conda env list
    displayName: 'Add Conda to PATH'

  # Conda creation can take around 10min
  - bash: |
      python tools/generate_conda_file.py
      conda env create -n integration_cpu -f nlp_cpu.yaml
    displayName: 'Creating Conda Environment with dependencies'

  - bash: |
      source activate integration_cpu
      pytest --durations=0 tests/smoke -m "smoke and not gpu and not azureml" --junitxml=junit/test-smoke-test.xml
    displayName: 'Run smoke tests'

  - bash: |
      source activate integration_cpu
      pytest --durations=0 tests/integration -m "integration and not gpu and not azureml" --junitxml=junit/test-integration-test.xml
    displayName: 'Run integration tests'

  - bash: |
      echo Remove Conda Environment
      conda remove -n integration_cpu --all -q --force -y
      echo Done Cleanup
    displayName: 'Cleanup Task'
    condition: always()

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: '**/test-*-test.xml'
      testRunTitle: 'Test results for PyTest'
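The `-m "smoke and not gpu and not azureml"` and `-m "integration and not gpu and not azureml"` expressions select tests by the pytest markers applied in the test files added by this commit. For reference, a minimal sketch of how such markers can be registered so pytest recognizes them; this conftest.py content is an assumption, not part of the diff:

```python
# conftest.py (sketch; assumes markers are registered here rather than in pytest.ini)


def pytest_configure(config):
    # Declare the custom markers used by the nightly pipelines so that
    # expressions like -m "smoke and not gpu and not azureml" filter cleanly
    # and pytest does not emit unknown-marker warnings.
    for marker in ("smoke", "integration", "gpu", "azureml", "notebooks"):
        config.addinivalue_line("markers", "{}: custom marker".format(marker))
```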
@@ -0,0 +1,61 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# NOTE: this is commented out since, as of July 2019, DevOps has a bug in the scheduler
#schedules:
#- cron: "56 11 * * *"
#  displayName: Daily computation of nightly builds
#  branches:
#    include:
#    - master
#  always: true

# Pull requests against these branches will trigger this build
pr:
- master

trigger: none

jobs:
- job: nightly
  displayName: 'Nightly tests'
  timeoutInMinutes: 180 # how long to run the job before automatically cancelling
  pool:
    name: nlpagentpool

  steps:
  - bash: |
      echo "##vso[task.prependpath]/data/anaconda/bin"
      conda env list
    displayName: 'Add Conda to PATH'

  # Conda creation can take around 10min
  - bash: |
      python tools/generate_conda_file.py --gpu
      conda env create -n integration_gpu -f nlp_gpu.yaml
    displayName: 'Creating Conda Environment with dependencies'

  - bash: |
      source activate integration_gpu
      pytest --durations=0 tests/smoke -m "smoke and gpu and not azureml" --junitxml=junit/test-smoke-test.xml
    displayName: 'Run smoke tests'

  - bash: |
      source activate integration_gpu
      pytest --durations=0 tests/integration -m "integration and gpu and not azureml" --junitxml=junit/test-integration-test.xml
    displayName: 'Run integration tests'

  - bash: |
      echo Remove Conda Environment
      conda remove -n integration_gpu --all -q --force -y
      echo Done Cleanup
    displayName: 'Cleanup Task'
    condition: always()

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: '**/test-*-test.xml'
      testRunTitle: 'Test results for PyTest'
@@ -0,0 +1,11 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import pytest
import torch


@pytest.mark.gpu
@pytest.mark.integration
def test_machine_is_gpu_machine():
    assert torch.cuda.is_available() is True
@@ -7,13 +7,11 @@ import papermill as pm
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME


@pytest.mark.notebooks
@pytest.mark.skip(reason="no way of running this programmatically")
@pytest.mark.integration
def test_embedding_trainer_runs(notebooks):
    notebook_path = notebooks["embedding_trainer"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        kernel_name=KERNEL_NAME,
        notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME
    )
@@ -5,12 +5,8 @@ import sys
import pytest
import papermill as pm
import scrapbook as sb
from azureml.core import Experiment
from azureml.core.run import Run
from utils_nlp.azureml.azureml_utils import get_or_create_workspace
from tests.notebooks_common import OUTPUT_NOTEBOOK
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME

sys.path.append("../../")
ABS_TOL = 0.2
ABS_TOL_PEARSONS = 0.05
@@ -40,7 +36,7 @@ def baseline_results():
@pytest.mark.azureml
def test_similarity_embeddings_baseline_runs(notebooks, baseline_results):
    notebook_path = notebooks["similarity_embeddings_baseline"]
    pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK)
    pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
    results = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
    for key, value in baseline_results.items():
        assert results[key] == pytest.approx(value, abs=ABS_TOL)
@@ -68,58 +64,18 @@ def test_automl_local_runs(notebooks,
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["pearson_correlation"]
    assert result == pytest.approx(0.5, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.gpu
def test_similarity_senteval_local_runs(notebooks, gensen_senteval_results):
    notebook_path = notebooks["senteval_local"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        parameters=dict(
            PATH_TO_SENTEVAL="../SentEval", PATH_TO_GENSEN="../gensen"
        ),
    )
    out = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
    for key, val in gensen_senteval_results.items():
        for task, result in val.items():
            assert out[key][task] == result


@pytest.mark.notebooks
@pytest.mark.azureml
def test_similarity_senteval_azureml_runs(notebooks, gensen_senteval_results):
    notebook_path = notebooks["senteval_azureml"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        parameters=dict(
            PATH_TO_SENTEVAL="../SentEval",
            PATH_TO_GENSEN="../gensen",
            PATH_TO_SER="utils_nlp/eval/senteval.py",
            AZUREML_VERBOSE=False,
            config_path="tests/ci",
        ),
    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
    ws = get_or_create_workspace(config_path="tests/ci")
    experiment = Experiment(ws, name=result["experiment_name"])
    run = Run(experiment, result["run_id"])
    assert run.get_metrics()["STSBenchmark::pearson"] == pytest.approx(
        gensen_senteval_results["pearson"]["STSBenchmark"], abs=ABS_TOL
    )


@pytest.mark.notebooks
@pytest.mark.gpu
@pytest.mark.integration
def test_gensen_local(notebooks):
    notebook_path = notebooks["gensen_local"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        kernel_name=KERNEL_NAME,
        parameters=dict(
            max_epoch=1,
            config_filepath="../../scenarios/sentence_similarity/gensen_config.json",
            base_data_path="../../data",
            config_filepath="scenarios/sentence_similarity/gensen_config.json",
            base_data_path="data",
        ),
    )
@@ -0,0 +1,31 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import os
import pytest

from utils_nlp.dataset import msrpc
from utils_nlp.dataset import xnli


@pytest.mark.smoke
def test_msrpc_download(tmp_path):
    filepath = msrpc.download_msrpc(tmp_path)
    statinfo = os.stat(filepath)
    assert statinfo.st_size == 1359872


@pytest.mark.skip(reason="Can't test it programmatically, needs input")
@pytest.mark.smoke
def test_msrpc_load_df(tmp_path):
    df_train = msrpc.load_pandas_df(
        local_cache_path=tmp_path, dataset_type="train"
    )


@pytest.mark.smoke
def test_xnli(tmp_path):
    df_train = xnli.load_pandas_df(
        local_cache_path=tmp_path, file_split="train"
    )
    assert df_train.shape == (392702, 2)
@@ -0,0 +1,12 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import pytest
import torch


@pytest.mark.smoke
@pytest.mark.gpu
def test_machine_is_gpu_machine():
    assert torch.cuda.is_available() is True
@@ -1,14 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import os
import pytest

from utils_nlp.dataset import msrpc


@pytest.mark.smoke
def test_download_msrpc(tmp_path):
    filepath = msrpc.download_msrpc(tmp_path)
    statinfo = os.stat(filepath)
    assert statinfo.st_size == 1359872
@@ -5,51 +5,14 @@ import os
import pytest

from utils_nlp.dataset.url_utils import maybe_download
from utils_nlp.dataset.msrpc import load_pandas_df
import utils_nlp.dataset.wikigold as wg
import utils_nlp.dataset.xnli as xnli
from utils_nlp.dataset import msrpc
from utils_nlp.dataset import wikigold
from utils_nlp.dataset import xnli
from utils_nlp.dataset import snli
from utils_nlp.dataset import Split
from utils_nlp.dataset.ner_utils import preprocess_conll


def test_maybe_download():
    # ToDo: Change this url when repo goes public.
    file_url = (
        "https://raw.githubusercontent.com/Microsoft/Recommenders/"
        "master/LICENSE"
    )
    filepath = "license.txt"
    assert not os.path.exists(filepath)
    filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
    assert os.path.exists(filepath)
    os.remove(filepath)
    with pytest.raises(IOError):
        filepath = maybe_download(file_url, "license.txt", expected_bytes=0)


def test_load_pandas_df_msrpc():
    with pytest.raises(Exception):
        load_pandas_df(dataset_type="Dummy")


def test_wikigold(tmp_path):
    wg_sentence_count = 1841
    wg_test_percentage = 0.5
    wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
    wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count

    downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
    assert not os.path.exists(downloaded_file)

    train_df, test_df = wg.load_train_test_dfs(
        tmp_path, test_percentage=wg_test_percentage
    )

    assert os.path.exists(downloaded_file)

    assert train_df.shape == (wg_train_sentence_count, 2)
    assert test_df.shape == (wg_test_sentence_count, 2)


@pytest.fixture
def ner_utils_test_data(scope="module"):
    return {
@@ -115,6 +78,45 @@ def ner_utils_test_data(scope="module"):
    }


def test_maybe_download():
    # ToDo: Change this url when repo goes public.
    file_url = (
        "https://raw.githubusercontent.com/Microsoft/Recommenders/"
        "master/LICENSE"
    )
    filepath = "license.txt"
    assert not os.path.exists(filepath)
    filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
    assert os.path.exists(filepath)
    os.remove(filepath)
    with pytest.raises(IOError):
        filepath = maybe_download(file_url, "license.txt", expected_bytes=0)


def test_msrpc():
    with pytest.raises(Exception):
        msrpc.load_pandas_df(dataset_type="Dummy")


def test_wikigold(tmp_path):
    wg_sentence_count = 1841
    wg_test_percentage = 0.5
    wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
    wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count

    downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
    assert not os.path.exists(downloaded_file)

    train_df, test_df = wikigold.load_train_test_dfs(
        tmp_path, test_percentage=wg_test_percentage
    )

    assert os.path.exists(downloaded_file)

    assert train_df.shape == (wg_train_sentence_count, 2)
    assert test_df.shape == (wg_test_sentence_count, 2)


def test_ner_utils(ner_utils_test_data):
    output = preprocess_conll(ner_utils_test_data["input"])
    assert output == ner_utils_test_data["expected_output"]
@@ -123,5 +125,21 @@ def test_ner_utils(ner_utils_test_data):
def test_xnli(tmp_path):
    # only test for the dev df as the train dataset takes several
    # minutes to download
    dev_df = xnli.load_pandas_df(local_cache_path=tmp_path)
    dev_df = xnli.load_pandas_df(local_cache_path=tmp_path, file_split="dev")
    assert dev_df.shape == (2490, 2)


def test_snli(tmp_path):
    df_train = snli.load_pandas_df(
        local_cache_path=tmp_path, file_split=Split.TRAIN
    )
    assert df_train.shape == (550152, 14)
    df_test = snli.load_pandas_df(
        local_cache_path=tmp_path, file_split=Split.TEST
    )
    assert df_test.shape == (10000, 14)
    df_dev = snli.load_pandas_df(
        local_cache_path=tmp_path, file_split=Split.DEV
    )
    assert df_dev.shape == (10000, 14)
@@ -1,12 +1,8 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

"""XNLI dataset utils
https://www.nyu.edu/projects/bowman/xnli/
"""

import os

import pandas as pd

from utils_nlp.dataset.url_utils import extract_zip, maybe_download
@@ -16,9 +12,11 @@ URL_XNLI = "https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip"
URL_XNLI_MT = "https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip"


def load_pandas_df(local_cache_path="./", file_split="dev", language="zh"):
def load_pandas_df(local_cache_path=".", file_split="dev", language="zh"):
    """Downloads and extracts the dataset files.

    Utilities information can be found `on this link <https://www.nyu.edu/projects/bowman/xnli/>`_.

    Args:
        local_cache_path (str, optional): Path to store the data.
            Defaults to "./".