Merge branch 'staging' into courtney-tests

This commit is contained in:
cocochrane 2019-07-30 11:23:19 -04:00 committed by GitHub
Parents 77e8684252 01b3996d82
Commit 366bad1538
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
13 changed files with 298 additions and 140 deletions

View file

@@ -1,14 +1,45 @@
| Branch | Status | | Branch | Status |
| ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-master?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=22&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-staging?branchName=staging)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=21&branchName=staging) |
# NLP Best Practices
This repository contains examples and best practices for building NLP systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
![](https://nlpbp.blob.core.windows.net/images/cognitive_services.PNG)
## Overview
The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.
The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.
We're hoping that the tools will significantly reduce the time from a business problem, or a research idea, to the full implementation of a system. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools.
In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks and can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.
> [*GLUE Leaderboard*](https://gluebenchmark.com/leaderboard)
> [*SQuAD Leaderboard*](https://rajpurkar.github.io/SQuAD-explorer/)
## Content
The following is a summary of the scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.
| Scenario | Applications | Models |
|---| ------------------------ | ------------------- |
|[Text Classification](scenarios/text_classification) |Topic Classification|BERT|
|[Named Entity Recognition](scenarios/named_entity_recognition) |Wikipedia NER |BERT|
|[Entailment](scenarios/entailment)|XNLI Natural Language Inference|BERT|
|[Question Answering](scenarios/question_answering) |SQuAD | BiDAF|
|[Sentence Similarity](scenarios/sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
|[Embeddings](scenarios/embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
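Most dataset modules in [utils_nlp](utils_nlp) expose a `load_pandas_df`-style loader that downloads and caches the data and returns a pandas DataFrame. The sketch below is only an illustration of that pattern (the SNLI loader and the `Split` enum appear elsewhere in this commit; argument names may differ between datasets):

```python
# Illustrative sketch, not an official example: load a benchmark dataset
# through a utils_nlp loader and inspect it as a pandas DataFrame.
from utils_nlp.dataset import snli, Split

# Downloads SNLI on first use and caches it under local_cache_path.
# nrows (optional) limits how many rows are read, which keeps quick experiments cheap.
train_df = snli.load_pandas_df(local_cache_path="data", file_split=Split.TRAIN, nrows=1000)
print(train_df.shape)
print(train_df.head())
```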
## Getting Started
To get started, navigate to the [Setup Guide](SETUP.md), where you'll find instructions on how to set up your environment and dependencies.
## Contributing
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
## Build Status
| Build Type | Branch | Status |
| --- | --- | --- |
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/cpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=50&branchName=master) |
| **Linux GPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/gpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=51&branchName=master) |

View file

@@ -1,19 +1,14 @@
# NLP Scenarios
This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for different scenarios.
This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for the following scenarios.
## Summary
The following is a summary of the scenarios covered in the best practice notebooks. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.
| Scenario | Applications | Models |
|---| ------------------------ | ------------------- |
|[Text Classification](text_classification) |Topic Classification|BERT|
|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER |BERT|
|[Entailment](./entailment)|XNLI Natural Language Inference|BERT|
|[Question Answering](question_answering) |SQuAD | BiDAF|
|[Sentence Similarity](sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
|[Embeddings](embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
- [Text Classification](text_classification)
- [Named Entity Recognition](named_entity_recognition)
- [Entailment](entailment)
- [Question Answering](question_answering)
- [Sentence Similarity](sentence_similarity)
- [Embeddings](embeddings)
## Azure-enhanced notebooks

View file

@@ -80,13 +80,13 @@
"\n",
"import os\n",
"import papermill as pm\n",
"import scrapbook as sb\n",
"\n",
"from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens\n",
"from utils_nlp.dataset import snli, preprocess\n",
"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
"import scrapbook as sb\n",
"\n",
"from utils_nlp.dataset import Split\n",
"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"\n",
"print(\"System version: {}\".format(sys.version))"
]
@@ -335,9 +335,9 @@
}
],
"source": [
"train = snli.load_pandas_df(base_data_path, file_split=\"train\", nrows=nrows)\n",
"dev = snli.load_pandas_df(base_data_path, file_split=\"dev\", nrows=nrows)\n",
"test = snli.load_pandas_df(base_data_path, file_split=\"test\", nrows=nrows)\n",
"train = snli.load_pandas_df(base_data_path, file_split=Split.TRAIN, nrows=nrows)\n",
"dev = snli.load_pandas_df(base_data_path, file_split=Split.DEV, nrows=nrows)\n",
"test = snli.load_pandas_df(base_data_path, file_split=Split.TEST, nrows=nrows)\n",
"\n",
"train.head()"
]

View file

@@ -0,0 +1,61 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# NOTE: the schedule below is commented out because, as of July 2019, DevOps has a bug in the scheduler
#schedules:
#- cron: "56 22 * * *"
# displayName: Daily computation of nightly builds
# branches:
# include:
# - master
# always: true
# Pull requests against these branches will trigger this build
pr:
- master
trigger: none
jobs:
- job: nightly
displayName : 'Nightly tests'
timeoutInMinutes: 180 # how long to run the job before automatically cancelling
pool:
name: nlpagentpool
steps:
- bash: |
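# "##vso[task.prependpath]" is an Azure Pipelines logging command: it prepends /data/anaconda/bin to PATH for the remaining steps of the job.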
echo "##vso[task.prependpath]/data/anaconda/bin"
conda env list
displayName: 'Add Conda to PATH'
# Conda creation can take around 10min
- bash: |
python tools/generate_conda_file.py
conda env create -n integration_cpu -f nlp_cpu.yaml
displayName: 'Creating Conda Environment with dependencies'
- bash: |
source activate integration_cpu
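# -m keeps only tests marked "smoke" that need neither a GPU nor an AzureML workspace; --junitxml writes results for the PublishTestResults task below.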
pytest --durations=0 tests/smoke -m "smoke and not gpu and not azureml" --junitxml=junit/test-smoke-test.xml
displayName: 'Run smoke tests'
- bash: |
source activate integration_cpu
pytest --durations=0 tests/integration -m "integration and not gpu and not azureml" --junitxml=junit/test-integration-test.xml
displayName: 'Run integration tests'
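# The cleanup step uses condition: always() so the Conda environment is removed even if the test steps above fail.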
- bash: |
echo Remove Conda Environment
conda remove -n integration_cpu --all -q --force -y
echo Done Cleanup
displayName: 'Cleanup Task'
condition: always()
- task: PublishTestResults@2
inputs:
testResultsFiles: '**/test-*-test.xml'
testRunTitle: 'Test results for PyTest'

View file

@@ -0,0 +1,61 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# NOTE: the schedule below is commented out because, as of July 2019, DevOps has a bug in the scheduler
#schedules:
#- cron: "56 11 * * *"
# displayName: Daily computation of nightly builds
# branches:
# include:
# - master
# always: true
# Pull requests against these branches will trigger this build
pr:
- master
trigger: none
jobs:
- job: nightly
displayName : 'Nightly tests'
timeoutInMinutes: 180 # how long to run the job before automatically cancelling
pool:
name: nlpagentpool
steps:
- bash: |
echo "##vso[task.prependpath]/data/anaconda/bin"
conda env list
displayName: 'Add Conda to PATH'
# Conda creation can take around 10min
- bash: |
python tools/generate_conda_file.py --gpu
conda env create -n integration_gpu -f nlp_gpu.yaml
displayName: 'Creating Conda Environment with dependencies'
- bash: |
source activate integration_gpu
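# The "gpu" marker keeps only GPU tests, so this job must run on a CUDA-capable agent in nlpagentpool (see the torch.cuda.is_available() checks added in this commit).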
pytest --durations=0 tests/smoke -m "smoke and gpu and not azureml" --junitxml=junit/test-smoke-test.xml
displayName: 'Run smoke tests'
- bash: |
source activate integration_gpu
pytest --durations=0 tests/integration -m "integration and gpu and not azureml" --junitxml=junit/test-integration-test.xml
displayName: 'Run integration tests'
- bash: |
echo Remove Conda Environment
conda remove -n integration_gpu --all -q --force -y
echo Done Cleanup
displayName: 'Cleanup Task'
condition: always()
- task: PublishTestResults@2
inputs:
testResultsFiles: '**/test-*-test.xml'
testRunTitle: 'Test results for PyTest'

View file

@@ -0,0 +1,11 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
import torch
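# Sanity check for the GPU pipelines: fail fast if the build agent does not expose CUDA to PyTorch.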
@pytest.mark.gpu
@pytest.mark.integration
def test_machine_is_gpu_machine():
assert torch.cuda.is_available() is True

View file

@@ -7,13 +7,11 @@ import papermill as pm
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
@pytest.mark.notebooks
@pytest.mark.skip(reason="no way of running this programmatically")
@pytest.mark.integration
def test_embedding_trainer_runs(notebooks):
notebook_path = notebooks["embedding_trainer"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME
)

View file

@@ -5,12 +5,8 @@ import sys
import pytest
import papermill as pm
import scrapbook as sb
from azureml.core import Experiment
from azureml.core.run import Run
from utils_nlp.azureml.azureml_utils import get_or_create_workspace
from tests.notebooks_common import OUTPUT_NOTEBOOK
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
sys.path.append("../../")
ABS_TOL = 0.2
ABS_TOL_PEARSONS = 0.05
@@ -40,7 +36,7 @@ def baseline_results():
@pytest.mark.azureml
def test_similarity_embeddings_baseline_runs(notebooks, baseline_results):
notebook_path = notebooks["similarity_embeddings_baseline"]
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK)
pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
results = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
for key, value in baseline_results.items():
assert results[key] == pytest.approx(value, abs=ABS_TOL)
@@ -68,58 +64,18 @@ def test_automl_local_runs(notebooks,
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["pearson_correlation"]
assert result == pytest.approx(0.5, abs=ABS_TOL)
@pytest.mark.notebooks
@pytest.mark.gpu
def test_similarity_senteval_local_runs(notebooks, gensen_senteval_results):
notebook_path = notebooks["senteval_local"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(
PATH_TO_SENTEVAL="../SentEval", PATH_TO_GENSEN="../gensen"
),
)
out = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
for key, val in gensen_senteval_results.items():
for task, result in val.items():
assert out[key][task] == result
@pytest.mark.notebooks
@pytest.mark.azureml
def test_similarity_senteval_azureml_runs(notebooks, gensen_senteval_results):
notebook_path = notebooks["senteval_azureml"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(
PATH_TO_SENTEVAL="../SentEval",
PATH_TO_GENSEN="../gensen",
PATH_TO_SER="utils_nlp/eval/senteval.py",
AZUREML_VERBOSE=False,
config_path="tests/ci",
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
ws = get_or_create_workspace(config_path="tests/ci")
experiment = Experiment(ws, name=result["experiment_name"])
run = Run(experiment, result["run_id"])
assert run.get_metrics()["STSBenchmark::pearson"] == pytest.approx(
gensen_senteval_results["pearson"]["STSBenchmark"], abs=ABS_TOL
)
@pytest.mark.notebooks
@pytest.mark.gpu
@pytest.mark.integration
def test_gensen_local(notebooks):
notebook_path = notebooks["gensen_local"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
max_epoch=1,
config_filepath="../../scenarios/sentence_similarity/gensen_config.json",
base_data_path="../../data",
config_filepath="scenarios/sentence_similarity/gensen_config.json",
base_data_path="data",
),
)

View file

@@ -0,0 +1,31 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
import pytest
from utils_nlp.dataset import msrpc
from utils_nlp.dataset import xnli
@pytest.mark.smoke
def test_msrpc_download(tmp_path):
filepath = msrpc.download_msrpc(tmp_path)
statinfo = os.stat(filepath)
assert statinfo.st_size == 1359872
@pytest.mark.skip(reason="Can't test it programmatically, needs input")
@pytest.mark.smoke
def test_msrpc_load_df(tmp_path):
df_train = msrpc.load_pandas_df(
local_cache_path=tmp_path, dataset_type="train"
)
@pytest.mark.smoke
def test_xnli(tmp_path):
df_train = xnli.load_pandas_df(
local_cache_path=tmp_path, file_split="train"
)
assert df_train.shape == (392702, 2)

View file

@@ -0,0 +1,12 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
import torch
@pytest.mark.smoke
@pytest.mark.gpu
def test_machine_is_gpu_machine():
assert torch.cuda.is_available() is True

View file

@@ -1,14 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
import pytest
from utils_nlp.dataset import msrpc
@pytest.mark.smoke
def test_download_msrpc(tmp_path):
filepath = msrpc.download_msrpc(tmp_path)
statinfo = os.stat(filepath)
assert statinfo.st_size == 1359872

View file

@@ -5,51 +5,14 @@ import os
import pytest
from utils_nlp.dataset.url_utils import maybe_download
from utils_nlp.dataset.msrpc import load_pandas_df
import utils_nlp.dataset.wikigold as wg
import utils_nlp.dataset.xnli as xnli
from utils_nlp.dataset import msrpc
from utils_nlp.dataset import wikigold
from utils_nlp.dataset import xnli
from utils_nlp.dataset import snli
from utils_nlp.dataset import Split
from utils_nlp.dataset.ner_utils import preprocess_conll
def test_maybe_download():
# ToDo: Change this url when repo goes public.
file_url = (
"https://raw.githubusercontent.com/Microsoft/Recommenders/"
"master/LICENSE"
)
filepath = "license.txt"
assert not os.path.exists(filepath)
filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
assert os.path.exists(filepath)
os.remove(filepath)
with pytest.raises(IOError):
filepath = maybe_download(file_url, "license.txt", expected_bytes=0)
def test_load_pandas_df_msrpc():
with pytest.raises(Exception):
load_pandas_df(dataset_type="Dummy")
def test_wikigold(tmp_path):
wg_sentence_count = 1841
wg_test_percentage = 0.5
wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count
downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
assert not os.path.exists(downloaded_file)
train_df, test_df = wg.load_train_test_dfs(
tmp_path, test_percentage=wg_test_percentage
)
assert os.path.exists(downloaded_file)
assert train_df.shape == (wg_train_sentence_count, 2)
assert test_df.shape == (wg_test_sentence_count, 2)
@pytest.fixture
def ner_utils_test_data(scope="module"):
return {
@@ -115,6 +78,45 @@ def ner_utils_test_data(scope="module"):
}
def test_maybe_download():
# ToDo: Change this url when repo goes public.
file_url = (
"https://raw.githubusercontent.com/Microsoft/Recommenders/"
"master/LICENSE"
)
filepath = "license.txt"
assert not os.path.exists(filepath)
filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
assert os.path.exists(filepath)
os.remove(filepath)
with pytest.raises(IOError):
filepath = maybe_download(file_url, "license.txt", expected_bytes=0)
def test_msrpc():
with pytest.raises(Exception):
msrpc.load_pandas_df(dataset_type="Dummy")
def test_wikigold(tmp_path):
wg_sentence_count = 1841
wg_test_percentage = 0.5
wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count
downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
assert not os.path.exists(downloaded_file)
train_df, test_df = wikigold.load_train_test_dfs(
tmp_path, test_percentage=wg_test_percentage
)
assert os.path.exists(downloaded_file)
assert train_df.shape == (wg_train_sentence_count, 2)
assert test_df.shape == (wg_test_sentence_count, 2)
def test_ner_utils(ner_utils_test_data):
output = preprocess_conll(ner_utils_test_data["input"])
assert output == ner_utils_test_data["expected_output"]
@@ -123,5 +125,21 @@ def test_ner_utils(ner_utils_test_data):
def test_xnli(tmp_path):
# only test for the dev df as the train dataset takes several
# minutes to download
dev_df = xnli.load_pandas_df(local_cache_path=tmp_path)
dev_df = xnli.load_pandas_df(local_cache_path=tmp_path, file_split="dev")
assert dev_df.shape == (2490, 2)
def test_snli(tmp_path):
df_train = snli.load_pandas_df(
local_cache_path=tmp_path, file_split=Split.TRAIN
)
assert df_train.shape == (550152, 14)
df_test = snli.load_pandas_df(
local_cache_path=tmp_path, file_split=Split.TEST
)
assert df_test.shape == (10000, 14)
df_dev = snli.load_pandas_df(
local_cache_path=tmp_path, file_split=Split.DEV
)
assert df_dev.shape == (10000, 14)

View file

@@ -1,12 +1,8 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""XNLI dataset utils
https://www.nyu.edu/projects/bowman/xnli/
"""
import os
import pandas as pd
from utils_nlp.dataset.url_utils import extract_zip, maybe_download
@@ -16,9 +12,11 @@ URL_XNLI = "https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip"
URL_XNLI_MT = "https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip"
def load_pandas_df(local_cache_path="./", file_split="dev", language="zh"):
def load_pandas_df(local_cache_path=".", file_split="dev", language="zh"):
"""Downloads and extracts the dataset files.
More information on the dataset can be found `here <https://www.nyu.edu/projects/bowman/xnli/>`_.
Args:
local_cache_path (str, optional): Path to store the data.
Defaults to ".".