diff --git a/README.md b/README.md
index 9c89107..a7a6edf 100755
--- a/README.md
+++ b/README.md
@@ -1,14 +1,45 @@
-
-| Branch | Status |  | Branch | Status |
-| ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-master?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=22&branchName=master) |  | staging | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-staging?branchName=staging)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=21&branchName=staging) |
-
 # NLP Best Practices
 
-This repository contains examples and best practices for building NLP systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
+This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
+
+![](https://nlpbp.blob.core.windows.net/images/cognitive_services.PNG)
+## Overview
+
+The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.
+The content is based on our past and potential future engagements with customers, as well as collaboration with partners, researchers, and the open source community.
+
+We hope that these tools significantly reduce the time it takes to go from a business problem, or a research idea, to a fully implemented system. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools.
+
+In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks and can be used in a number of applications ranging from simple text classification to sophisticated intelligent chatbots.
+
+> [*GLUE Leaderboard*](https://gluebenchmark.com/leaderboard)
+> [*SQuAD Leaderboard*](https://rajpurkar.github.io/SQuAD-explorer/)
+
+## Content
+
+The following is a summary of the scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.
+
+| Scenario | Applications | Models |
+|---| ------------------------ | ------------------- |
+|[Text Classification](scenarios/text_classification) |Topic Classification|BERT|
+|[Named Entity Recognition](scenarios/named_entity_recognition) |Wikipedia NER |BERT|
+|[Entailment](scenarios/entailment)|XNLI Natural Language Inference|BERT|
+|[Question Answering](scenarios/question_answering) |SQuAD | BiDAF|
+|[Sentence Similarity](scenarios/sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
+|[Embeddings](scenarios/embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
+
+
 
 ## Getting Started
 To get started, navigate to the [Setup Guide](SETUP.md), where you'll find instructions on how to setup your environment and dependencies.
 
 ## Contributing
 This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
+
+
+## Build Status
+
+| Build Type | Branch | Status |
+| --- | --- | --- |
+| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/cpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=50&branchName=master) |
+| **Linux GPU** | master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/gpu_integration_tests_linux?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=51&branchName=master) |
diff --git a/scenarios/README.md b/scenarios/README.md
index 5bcd879..eb8eba4 100644
--- a/scenarios/README.md
+++ b/scenarios/README.md
@@ -1,19 +1,14 @@
 # NLP Scenarios
 
-This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for different scenarios.
+This folder contains examples and best practices, written in Jupyter notebooks, for building Natural Language Processing systems for the following scenarios.
 
-## Summary
-The following is a summary of the scenarios covered in the best practice notebooks. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and utilities.
-
-| Scenario | Applications | Models |
-|---| ------------------------ | ------------------- |
-|[Text Classification](text_classification) |Topic Classification|BERT|
-|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER |BERT|
-|[Entailment](./entailment)|XNLI Natural Language Inference|BERT|
-|[Question Answering](question_answering) |SQuAD | BiDAF|
-|[Sentence Similarity](sentence_similarity) |STS Benchmark |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance|
-|[Embeddings](embeddings)| Custom Embeddings Training|Word2Vec<br>fastText<br>GloVe|
+- [Text Classification](text_classification)
+- [Named Entity Recognition](named_entity_recognition)
+- [Entailment](entailment)
+- [Question Answering](question_answering)
+- [Sentence Similarity](sentence_similarity)
+- [Embeddings](embeddings)
 
 ## Azure-enhanced notebooks
diff --git a/scenarios/sentence_similarity/gensen_local.ipynb b/scenarios/sentence_similarity/gensen_local.ipynb
index 9d7a420..a45dcc7 100644
--- a/scenarios/sentence_similarity/gensen_local.ipynb
+++ b/scenarios/sentence_similarity/gensen_local.ipynb
@@ -80,13 +80,13 @@
     "\n",
     "import os\n",
     "import papermill as pm\n",
+    "import scrapbook as sb\n",
     "\n",
     "from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens\n",
     "from utils_nlp.dataset import snli, preprocess\n",
-    "from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
     "from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
-    "import scrapbook as sb\n",
-    "\n",
+    "from utils_nlp.dataset import Split\n",
+    "from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
     "\n",
     "print(\"System version: {}\".format(sys.version))"
    ]
@@ -335,9 +335,9 @@
     }
    ],
    "source": [
-    "train = snli.load_pandas_df(base_data_path, file_split=\"train\", nrows=nrows)\n",
-    "dev = snli.load_pandas_df(base_data_path, file_split=\"dev\", nrows=nrows)\n",
-    "test = snli.load_pandas_df(base_data_path, file_split=\"test\", nrows=nrows)\n",
+    "train = snli.load_pandas_df(base_data_path, file_split=Split.TRAIN, nrows=nrows)\n",
+    "dev = snli.load_pandas_df(base_data_path, file_split=Split.DEV, nrows=nrows)\n",
+    "test = snli.load_pandas_df(base_data_path, file_split=Split.TEST, nrows=nrows)\n",
     "\n",
     "train.head()"
   ]
diff --git a/tests/ci/cpu_integration_tests_linux.yml b/tests/ci/cpu_integration_tests_linux.yml
new file mode 100644
index 0000000..a815703
--- /dev/null
+++ b/tests/ci/cpu_integration_tests_linux.yml
@@ -0,0 +1,61 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+
+# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
+# NOTE: this is commented out since, as of July 2019, DevOps has a bug in the scheduler
+#schedules:
+#- cron: "56 22 * * *"
+#  displayName: Daily computation of nightly builds
+#  branches:
+#    include:
+#    - master
+#  always: true
+
+
+# Pull request against these branches will trigger this build
+pr:
+- master
+
+trigger: none
+
+jobs:
+- job: nightly
+  displayName : 'Nightly tests'
+  timeoutInMinutes: 180 # how long to run the job before automatically cancelling
+  pool:
+    name: nlpagentpool
+
+  steps:
+  - bash: |
+      echo "##vso[task.prependpath]/data/anaconda/bin"
+      conda env list
+    displayName: 'Add Conda to PATH'
+
+  # Conda creation can take around 10min
+  - bash: |
+      python tools/generate_conda_file.py
+      conda env create -n integration_cpu -f nlp_cpu.yaml
+    displayName: 'Creating Conda Environment with dependencies'
+
+  - bash: |
+      source activate integration_cpu
+      pytest --durations=0 tests/smoke -m "smoke and not gpu and not azureml" --junitxml=junit/test-smoke-test.xml
+    displayName: 'Run smoke tests'
+
+  - bash: |
+      source activate integration_cpu
+      pytest --durations=0 tests/integration -m "integration and not gpu and not azureml" --junitxml=junit/test-integration-test.xml
+    displayName: 'Run integration tests'
+
+  - bash: |
+      echo Remove Conda Environment
+      conda remove -n integration_cpu --all -q --force -y
+      echo Done Cleanup
+    displayName: 'Cleanup Task'
+    condition: always()
+
+  - task: PublishTestResults@2
+    inputs:
+      testResultsFiles: '**/test-*-test.xml'
+      testRunTitle: 'Test results for PyTest'
\ No newline at end of file
diff --git a/tests/ci/gpu_integration_tests_linux.yml b/tests/ci/gpu_integration_tests_linux.yml
new file mode 100644
index 0000000..9fce9bb
--- /dev/null
+++ b/tests/ci/gpu_integration_tests_linux.yml
@@ -0,0 +1,61 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+
+# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
+# NOTE: this is commented out since, as of July 2019, DevOps has a bug in the scheduler
+#schedules:
+#- cron: "56 11 * * *"
+#  displayName: Daily computation of nightly builds
+#  branches:
+#    include:
+#    - master
+#  always: true
+
+
+# Pull request against these branches will trigger this build
+pr:
+- master
+
+trigger: none
+
+jobs:
+- job: nightly
+  displayName : 'Nightly tests'
+  timeoutInMinutes: 180 # how long to run the job before automatically cancelling
+  pool:
+    name: nlpagentpool
+
+  steps:
+  - bash: |
+      echo "##vso[task.prependpath]/data/anaconda/bin"
+      conda env list
+    displayName: 'Add Conda to PATH'
+
+  # Conda creation can take around 10min
+  - bash: |
+      python tools/generate_conda_file.py --gpu
+      conda env create -n integration_gpu -f nlp_gpu.yaml
+    displayName: 'Creating Conda Environment with dependencies'
+
+  - bash: |
+      source activate integration_gpu
+      pytest --durations=0 tests/smoke -m "smoke and gpu and not azureml" --junitxml=junit/test-smoke-test.xml
+    displayName: 'Run smoke tests'
+
+  - bash: |
+      source activate integration_gpu
+      pytest --durations=0 tests/integration -m "integration and gpu and not azureml" --junitxml=junit/test-integration-test.xml
+    displayName: 'Run integration tests'
+
+  - bash: |
+      echo Remove Conda Environment
+      conda remove -n integration_gpu --all -q --force -y
+      echo Done Cleanup
+    displayName: 'Cleanup Task'
+    condition: always()
+
+  - task: PublishTestResults@2
+    inputs:
+      testResultsFiles: '**/test-*-test.xml'
+      testRunTitle: 'Test results for PyTest'
\ No newline at end of file
diff --git a/tests/integration/test_gpu_utils.py b/tests/integration/test_gpu_utils.py
new file mode 100644
index 0000000..9adc3a9
--- /dev/null
+++ b/tests/integration/test_gpu_utils.py
@@ -0,0 +1,11 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import pytest
+import torch
+
+
+@pytest.mark.gpu
+@pytest.mark.integration
+def test_machine_is_gpu_machine():
+    assert torch.cuda.is_available() is True
diff --git a/tests/integration/test_notebooks_embeddings.py b/tests/integration/test_notebooks_embeddings.py
index 5e856b2..1aa3153 100644
--- a/tests/integration/test_notebooks_embeddings.py
+++ b/tests/integration/test_notebooks_embeddings.py
@@ -7,13 +7,11 @@ import papermill as pm
 
 from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
 
 
-@pytest.mark.notebooks
 @pytest.mark.skip(reason="no way of running this programmatically")
+@pytest.mark.integration
 def test_embedding_trainer_runs(notebooks):
     notebook_path = notebooks["embedding_trainer"]
     pm.execute_notebook(
-        notebook_path,
-        OUTPUT_NOTEBOOK,
-        kernel_name=KERNEL_NAME,
+        notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME
     )
diff --git a/tests/integration/test_notebooks_sentence_similarity.py b/tests/integration/test_notebooks_sentence_similarity.py
index c83ed9f..15941d2 100644
--- a/tests/integration/test_notebooks_sentence_similarity.py
+++ b/tests/integration/test_notebooks_sentence_similarity.py
@@ -5,12 +5,8 @@ import sys
 import pytest
 import papermill as pm
 import scrapbook as sb
-from azureml.core import Experiment
-from azureml.core.run import Run
-from utils_nlp.azureml.azureml_utils import get_or_create_workspace
-from tests.notebooks_common import OUTPUT_NOTEBOOK
+from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
 
-sys.path.append("../../")
 
 ABS_TOL = 0.2
 ABS_TOL_PEARSONS = 0.05
@@ -40,7 +36,7 @@ def baseline_results():
 @pytest.mark.azureml
 def test_similarity_embeddings_baseline_runs(notebooks, baseline_results):
     notebook_path = notebooks["similarity_embeddings_baseline"]
-    pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK)
+    pm.execute_notebook(notebook_path, OUTPUT_NOTEBOOK, kernel_name=KERNEL_NAME)
     results = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
     for key, value in baseline_results.items():
         assert results[key] == pytest.approx(value, abs=ABS_TOL)
@@ -68,58 +64,18 @@ def test_automl_local_runs(notebooks,
     result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["pearson_correlation"]
     assert result == pytest.approx(0.5, abs=ABS_TOL)
 
 
-@pytest.mark.notebooks
-@pytest.mark.gpu
-def test_similarity_senteval_local_runs(notebooks, gensen_senteval_results):
-    notebook_path = notebooks["senteval_local"]
-    pm.execute_notebook(
-        notebook_path,
-        OUTPUT_NOTEBOOK,
-        parameters=dict(
-            PATH_TO_SENTEVAL="../SentEval", PATH_TO_GENSEN="../gensen"
-        ),
-    )
-    out = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
-    for key, val in gensen_senteval_results.items():
-        for task, result in val.items():
-            assert out[key][task] == result
-
-
-@pytest.mark.notebooks
-@pytest.mark.azureml
-def test_similarity_senteval_azureml_runs(notebooks, gensen_senteval_results):
-    notebook_path = notebooks["senteval_azureml"]
-    pm.execute_notebook(
-        notebook_path,
-        OUTPUT_NOTEBOOK,
-        parameters=dict(
-            PATH_TO_SENTEVAL="../SentEval",
-            PATH_TO_GENSEN="../gensen",
-            PATH_TO_SER="utils_nlp/eval/senteval.py",
-            AZUREML_VERBOSE=False,
-            config_path="tests/ci",
-        ),
-    )
-    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
-    ws = get_or_create_workspace(config_path="tests/ci")
-    experiment = Experiment(ws, name=result["experiment_name"])
-    run = Run(experiment, result["run_id"])
-    assert run.get_metrics()["STSBenchmark::pearson"] == pytest.approx(
-        gensen_senteval_results["pearson"]["STSBenchmark"], abs=ABS_TOL
-    )
-
-
-@pytest.mark.notebooks
 @pytest.mark.gpu
+@pytest.mark.integration
 def test_gensen_local(notebooks):
     notebook_path = notebooks["gensen_local"]
     pm.execute_notebook(
         notebook_path,
         OUTPUT_NOTEBOOK,
+        kernel_name=KERNEL_NAME,
         parameters=dict(
             max_epoch=1,
-            config_filepath="../../scenarios/sentence_similarity/gensen_config.json",
-            base_data_path="../../data",
+            config_filepath="scenarios/sentence_similarity/gensen_config.json",
+            base_data_path="data",
         ),
     )
diff --git a/tests/smoke/test_dataset.py b/tests/smoke/test_dataset.py
new file mode 100644
index 0000000..a25ea60
--- /dev/null
+++ b/tests/smoke/test_dataset.py
@@ -0,0 +1,31 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import os
+import pytest
+
+from utils_nlp.dataset import msrpc
+from utils_nlp.dataset import xnli
+
+
+@pytest.mark.smoke
+def test_msrpc_download(tmp_path):
+    filepath = msrpc.download_msrpc(tmp_path)
+    statinfo = os.stat(filepath)
+    assert statinfo.st_size == 1359872
+
+
+@pytest.mark.skip(reason="Can't test it programmatically, needs input")
+@pytest.mark.smoke
+def test_msrpc_load_df(tmp_path):
+    df_train = msrpc.load_pandas_df(
+        local_cache_path=tmp_path, dataset_type="train"
+    )
+
+
+@pytest.mark.smoke
+def test_xnli(tmp_path):
+    df_train = xnli.load_pandas_df(
+        local_cache_path=tmp_path, file_split="train"
+    )
+    assert df_train.shape == (392702, 2)
diff --git a/tests/smoke/test_gpu_utils.py b/tests/smoke/test_gpu_utils.py
new file mode 100644
index 0000000..11418ad
--- /dev/null
+++ b/tests/smoke/test_gpu_utils.py
@@ -0,0 +1,12 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import pytest
+import torch
+
+
+@pytest.mark.smoke
+@pytest.mark.gpu
+def test_machine_is_gpu_machine():
+    assert torch.cuda.is_available() is True
+
diff --git a/tests/smoke/test_msrpc.py b/tests/smoke/test_msrpc.py
deleted file mode 100644
index 0aa75b9..0000000
--- a/tests/smoke/test_msrpc.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) Microsoft Corporation. All rights reserved.
-# Licensed under the MIT License.
-
-import os
-import pytest
-
-from utils_nlp.dataset import msrpc
-
-
-@pytest.mark.smoke
-def test_download_msrpc(tmp_path):
-    filepath = msrpc.download_msrpc(tmp_path)
-    statinfo = os.stat(filepath)
-    assert statinfo.st_size == 1359872
diff --git a/tests/unit/test_dataset.py b/tests/unit/test_dataset.py
index 0da29a6..5cb1ae0 100755
--- a/tests/unit/test_dataset.py
+++ b/tests/unit/test_dataset.py
@@ -5,51 +5,14 @@ import os
 import pytest
 
 from utils_nlp.dataset.url_utils import maybe_download
-from utils_nlp.dataset.msrpc import load_pandas_df
-import utils_nlp.dataset.wikigold as wg
-import utils_nlp.dataset.xnli as xnli
+from utils_nlp.dataset import msrpc
+from utils_nlp.dataset import wikigold
+from utils_nlp.dataset import xnli
+from utils_nlp.dataset import snli
+from utils_nlp.dataset import Split
 from utils_nlp.dataset.ner_utils import preprocess_conll
 
 
-def test_maybe_download():
-    # ToDo: Change this url when repo goes public.
-    file_url = (
-        "https://raw.githubusercontent.com/Microsoft/Recommenders/"
-        "master/LICENSE"
-    )
-    filepath = "license.txt"
-    assert not os.path.exists(filepath)
-    filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
-    assert os.path.exists(filepath)
-    os.remove(filepath)
-    with pytest.raises(IOError):
-        filepath = maybe_download(file_url, "license.txt", expected_bytes=0)
-
-
-def test_load_pandas_df_msrpc():
-    with pytest.raises(Exception):
-        load_pandas_df(dataset_type="Dummy")
-
-
-def test_wikigold(tmp_path):
-    wg_sentence_count = 1841
-    wg_test_percentage = 0.5
-    wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
-    wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count
-
-    downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
-    assert not os.path.exists(downloaded_file)
-
-    train_df, test_df = wg.load_train_test_dfs(
-        tmp_path, test_percentage=wg_test_percentage
-    )
-
-    assert os.path.exists(downloaded_file)
-
-    assert train_df.shape == (wg_train_sentence_count, 2)
-    assert test_df.shape == (wg_test_sentence_count, 2)
-
-
 @pytest.fixture
 def ner_utils_test_data(scope="module"):
     return {
@@ -115,6 +78,45 @@ def ner_utils_test_data(scope="module"):
     }
 
 
+def test_maybe_download():
+    # ToDo: Change this url when repo goes public.
+    file_url = (
+        "https://raw.githubusercontent.com/Microsoft/Recommenders/"
+        "master/LICENSE"
+    )
+    filepath = "license.txt"
+    assert not os.path.exists(filepath)
+    filepath = maybe_download(file_url, "license.txt", expected_bytes=1162)
+    assert os.path.exists(filepath)
+    os.remove(filepath)
+    with pytest.raises(IOError):
+        filepath = maybe_download(file_url, "license.txt", expected_bytes=0)
+
+
+def test_msrpc():
+    with pytest.raises(Exception):
+        msrpc.load_pandas_df(dataset_type="Dummy")
+
+
+def test_wikigold(tmp_path):
+    wg_sentence_count = 1841
+    wg_test_percentage = 0.5
+    wg_test_sentence_count = round(wg_sentence_count * wg_test_percentage)
+    wg_train_sentence_count = wg_sentence_count - wg_test_sentence_count
+
+    downloaded_file = os.path.join(tmp_path, "wikigold.conll.txt")
+    assert not os.path.exists(downloaded_file)
+
+    train_df, test_df = wikigold.load_train_test_dfs(
+        tmp_path, test_percentage=wg_test_percentage
+    )
+
+    assert os.path.exists(downloaded_file)
+
+    assert train_df.shape == (wg_train_sentence_count, 2)
+    assert test_df.shape == (wg_test_sentence_count, 2)
+
+
 def test_ner_utils(ner_utils_test_data):
     output = preprocess_conll(ner_utils_test_data["input"])
     assert output == ner_utils_test_data["expected_output"]
@@ -123,5 +125,21 @@ def test_ner_utils(ner_utils_test_data):
 def test_xnli(tmp_path):
     # only test for the dev df as the train dataset takes several
     # minutes to download
-    dev_df = xnli.load_pandas_df(local_cache_path=tmp_path)
+    dev_df = xnli.load_pandas_df(local_cache_path=tmp_path, file_split="dev")
     assert dev_df.shape == (2490, 2)
+
+
+def test_snli(tmp_path):
+    df_train = snli.load_pandas_df(
+        local_cache_path=tmp_path, file_split=Split.TRAIN
+    )
+    assert df_train.shape == (550152, 14)
+    df_test = snli.load_pandas_df(
+        local_cache_path=tmp_path, file_split=Split.TEST
+    )
+    assert df_test.shape == (10000, 14)
+    df_dev = snli.load_pandas_df(
+        local_cache_path=tmp_path, file_split=Split.DEV
+    )
+    assert df_dev.shape == (10000, 14)
+
diff --git a/utils_nlp/dataset/xnli.py b/utils_nlp/dataset/xnli.py
index 866c0eb..dc9bbac 100644
--- a/utils_nlp/dataset/xnli.py
+++ b/utils_nlp/dataset/xnli.py
@@ -1,12 +1,8 @@
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # Licensed under the MIT License.
 
-"""XNLI dataset utils
-https://www.nyu.edu/projects/bowman/xnli/
-"""
 import os
-
 import pandas as pd
 
 from utils_nlp.dataset.url_utils import extract_zip, maybe_download
 
@@ -16,9 +12,11 @@ URL_XNLI = "https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip"
 URL_XNLI_MT = "https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip"
 
 
-def load_pandas_df(local_cache_path="./", file_split="dev", language="zh"):
+def load_pandas_df(local_cache_path=".", file_split="dev", language="zh"):
     """Downloads and extracts the dataset files.
 
+    More information can be found `on this link <https://www.nyu.edu/projects/bowman/xnli/>`_.
+
     Args:
         local_cache_path (str, optional): Path to store the data.
            Defaults to "./".