saidbleik 2019-08-19 14:51:35 +00:00
Parent 2b92e135ac
Commit 37f804bc3c
12 changed files: 95 additions and 104 deletions

View file

@@ -1,2 +1,2 @@
data/
-scenarios/
+examples/

View file

@@ -2,7 +2,7 @@
In recent years, Natural Language Processing has seen quick growth in quality and usability, and this has helped to drive business adoption of Artificial Intelligence solutions. In the last few years, researchers have been applying newer deep learning methods to natural language processing. Data Scientists started moving from traditional methods to state-of-the-art DNN algorithms which allow them to use language models pretrained on large text corpora.
-This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
+This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](examples) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
## Overview

View file

@@ -0,0 +1,30 @@
# Word Embedding

This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch.
There are three typical ways to train word embeddings:
[Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
[GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [fastText](https://arxiv.org/abs/1607.01759).
All three methods provide pretrained models ([pretrained model with
Word2Vec](https://code.google.com/archive/p/word2vec/), [pretrained model with
GloVe](https://github.com/stanfordnlp/GloVe), [pretrained model with
fastText](https://fasttext.cc/docs/en/crawl-vectors.html)).
These pretrained models are trained on general corpora such as Wikipedia and Common Crawl, and may not serve well in situations
where you have a domain-specific problem or where there is no pretrained model for the
language you need to work with. In this folder, we provide examples of how to apply each of the
three methods to train your own word embeddings.

## What is Word Embedding?

Word embedding is a technique to map words or phrases from a vocabulary to vectors of real numbers.
The learned vector representations of words capture syntactic and semantic word relationships and
therefore can be very useful for tasks like sentence similarity, text classification, etc.

## Summary

|Notebook|Environment|Description|Dataset|
|---|---|---|---|
|[Developing Word Embeddings](embedding_trainer.ipynb)|Local|A notebook that shows how to learn word representations with Word2Vec, fastText, and GloVe|[STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset)|
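For a quick orientation before opening the notebook, the snippet below is a minimal sketch of the Word2Vec case using [gensim](https://radimrehurek.com/gensim/). The corpus file name and all hyperparameters are illustrative assumptions; the notebook above remains the authoritative walkthrough.

```python
# Minimal sketch: train Word2Vec embeddings on a custom corpus with gensim.
# Assumes "my_corpus.txt" holds one whitespace-tokenized sentence per line.
from gensim.models import Word2Vec

with open("my_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    size=100,      # vector dimensionality (named vector_size in gensim >= 4.0)
    window=5,      # context window
    min_count=2,   # ignore rare words
    sg=1,          # 1 = skip-gram, 0 = CBOW
    workers=4,
)

# Look up the learned embedding and nearest neighbours of a word from the corpus.
vector = model.wv["language"]
print(model.wv.most_similar("language", topn=5))
```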

View file

@@ -176,7 +176,7 @@
"metadata": {},
"source": [
"This step downloads the pre-trained [AllenNLP](https://allennlp.org/models) pretrained model and registers the model in our Workspace. The pre-trained AllenNLP model we use is called Bidirectional Attention Flow for Machine Comprehension ([BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
-")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp/blob/courtney-bidaf/scenarios/question_answering/bidaf_deep_dive.ipynb\n",
+")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp/examples/question_answering/bidaf_deep_dive.ipynb\n",
") for more information on this algorithm and AllenNLP implementation."
]
},

View file

@@ -94,7 +94,7 @@
"from utils_nlp.dataset import snli, preprocess\n",
"from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
"from utils_nlp.dataset import Split\n",
-"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
+"from examples.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"\n",
"print(\"System version: {}\".format(sys.version))"
]
@@ -602,7 +602,7 @@
"text": [
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1\n",
" \"num_layers={}\".format(dropout, num_layers))\n",
-"../../scenarios/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+"../../examples/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
" torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
"../../utils_nlp/models/gensen/utils.py:364: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
" Variable(torch.LongTensor(sorted_src_lens), volatile=True)\n",
@@ -610,13 +610,13 @@
" warnings.warn(\"nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.\")\n",
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.\n",
" warnings.warn(\"nn.functional.tanh is deprecated. Use torch.tanh instead.\")\n",
-"../../scenarios/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+"../../examples/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
" torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/horovod/torch/__init__.py:163: UserWarning: optimizer.step(synchronize=True) called after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.step(synchronize=False) if you use optimizer.synchronize() in your code.\n",
" warnings.warn(\"optimizer.step(synchronize=True) called after \"\n",
-"../../scenarios/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+"../../examples/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n",
-"../../scenarios/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+"../../examples/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n"
]
},

View file

@@ -3,7 +3,7 @@
import json
import os

-from scenarios.sentence_similarity.gensen_train import train
+from examples.sentence_similarity.gensen_train import train
from utils_nlp.eval.classification import compute_correlation_coefficients
from utils_nlp.models.gensen.create_gensen_model import (
    create_multiseq2seq_model,

View file

@@ -1,30 +0,0 @@
# Word Embedding
This folder contains examples and best practices, written in Jupyter notebooks, for training word embedding on custom data from scratch.
There are
three typical ways for training word embedding:
[Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
[GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [fastText](https://arxiv.org/abs/1607.01759).
All of the three methods provide pretrained models ([pretrained model with
Word2Vec](https://code.google.com/archive/p/word2vec/), [pretrained model with
Glove](https://github.com/stanfordnlp/GloVe), [pretrained model with
fastText](https://fasttext.cc/docs/en/crawl-vectors.html)).
These pretrained models are trained with
general corpus like Wikipedia data, Common Crawl data, etc., and may not serve well for situations
where you have a domain-specific language learning problem or there is no pretrained model for the
language you need to work with. In this folder, we provide examples of how to apply each of the
three methods to train your own word embeddings.
# What is Word Embedding?
Word embedding is a technique to map words or phrases from a vocabulary to vectors or real numbers.
The learned vector representations of words capture syntactic and semantic word relationships and
therefore can be very useful for tasks like sentence similary, text classifcation, etc.
## Summary
|Notebook|Environment|Description|Dataset|
|---|---|---|---|
|[Developing Word Embeddings](embedding_trainer.ipynb)|Local| A notebook shows how to learn word representation with Word2Vec, fastText and Glove|[STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset) |

View file

@@ -6,34 +6,29 @@ from __future__ import print_function
import io
import re
-from glob import glob
-from os.path import basename, dirname, join, splitext
+from os.path import dirname, join
-from setuptools import find_packages, setup
+from setuptools import setup
from setuptools_scm import get_version

# Determine semantic versioning automatically
# from git commits
__version__ = get_version()


def read(*names, **kwargs):
-    with io.open(
-        join(dirname(__file__), *names),
-        encoding=kwargs.get("encoding", "utf8"),
-    ) as fh:
+    with io.open(join(dirname(__file__), *names), encoding=kwargs.get("encoding", "utf8")) as fh:
        return fh.read()


setup(
    name="utils_nlp",
-    version = __version__,
+    version=__version__,
    license="MIT License",
    description="NLP Utility functions that are used for best practices in building state-of-the-art NLP methods and scenarios. Developed by Microsoft AI CAT",
    long_description="%s\n%s"
    % (
-        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub(
-            "", read("README.md")
-        ),
+        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub("", read("README.md")),
        re.sub(":[a-z]+:`~?(.*?)`", r"``\1``", read("CONTRIBUTING.md")),
    ),
    author="AI CAT",
@@ -68,16 +63,11 @@ setup(
        "Documentation": "https://github.com/microsoft/nlp/",
        "Issue Tracker": "https://github.com/microsoft/nlp/issues",
    },
-    keywords=[
-        "Microsoft NLP",
-        "Natural Language Processing",
-        "Text Processing",
-        "Word Embedding",
-    ],
+    keywords=["Microsoft NLP", "Natural Language Processing", "Text Processing", "Word Embedding"],
    python_requires=">=3.6",
-    install_requires=['setuptools_scm>=3.2.0',],
+    install_requires=["setuptools_scm>=3.2.0"],
    dependency_links=[],
    extras_require={},
-    use_scm_version = {"root": ".", "relative_to": __file__},
-    setup_requires=['setuptools_scm'],
+    use_scm_version={"root": ".", "relative_to": __file__},
+    setup_requires=["setuptools_scm"],
)

View file

@@ -11,24 +11,26 @@ ABS_TOL = 0.2
@pytest.mark.integration
@pytest.mark.azureml
-def test_bidaf_deep_dive(notebooks,
-                         subscription_id,
-                         resource_group,
-                         workspace_name,
-                         workspace_region):
+def test_bidaf_deep_dive(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
    notebook_path = notebooks["bidaf_deep_dive"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'NUM_EPOCHS':2,
-                                      'config_path': "tests/ci",
-                                      'PROJECT_FOLDER': "scenarios/question_answering/bidaf-question-answering",
-                                      'SQUAD_FOLDER': "scenarios/question_answering/squad",
-                                      'LOGS_FOLDER': "scenarios/question_answering/",
-                                      'BIDAF_CONFIG_PATH': "scenarios/question_answering/",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "NUM_EPOCHS": 2,
+            "config_path": "tests/ci",
+            "PROJECT_FOLDER": "examples/question_answering/bidaf-question-answering",
+            "SQUAD_FOLDER": "examples/question_answering/squad",
+            "LOGS_FOLDER": "examples/question_answering/",
+            "BIDAF_CONFIG_PATH": "examples/question_answering/",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+        },
+    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["validation_EM"]
    assert result == pytest.approx(0.5, abs=ABS_TOL)
@@ -36,20 +38,22 @@ def test_bidaf_deep_dive(notebooks,
@pytest.mark.usefixtures("teardown_service")
@pytest.mark.integration
@pytest.mark.azureml
-def test_bidaf_quickstart(notebooks,
-                          subscription_id,
-                          resource_group,
-                          workspace_name,
-                          workspace_region):
+def test_bidaf_quickstart(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
    notebook_path = notebooks["bidaf_quickstart"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'config_path': "tests/ci",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region,
-                                      'webservice_name': "aci-test-service"})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "config_path": "tests/ci",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+            "webservice_name": "aci-test-service",
+        },
+    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["answer"]
    assert result == "Bi-Directional Attention Flow"
@@ -64,12 +68,12 @@ def test_bert_qa_runs(notebooks):
        OUTPUT_NOTEBOOK,
        parameters=dict(
            AZUREML_CONFIG_PATH="./tests/integration/.azureml",
-            DATA_FOLDER='./tests/integration/squad',
-            PROJECT_FOLDER='./tests/integration/pytorch-transformers',
-            EXPERIMENT_NAME='NLP-QA-BERT-deepdive',
-            BERT_UTIL_PATH='./utils_nlp/azureml/azureml_bert_util.py',
-            EVALUATE_SQAD_PATH = './utils_nlp/eval/evaluate_squad.py',
-            TRAIN_SCRIPT_PATH="./scenarios/question_answering/bert_run_squad_azureml.py",
+            DATA_FOLDER="./tests/integration/squad",
+            PROJECT_FOLDER="./tests/integration/pytorch-transformers",
+            EXPERIMENT_NAME="NLP-QA-BERT-deepdive",
+            BERT_UTIL_PATH="./utils_nlp/azureml/azureml_bert_util.py",
+            EVALUATE_SQAD_PATH="./utils_nlp/eval/evaluate_squad.py",
+            TRAIN_SCRIPT_PATH="./examples/question_answering/bert_run_squad_azureml.py",
            BERT_MODEL="bert-base-uncased",
            NUM_TRAIN_EPOCHS=1.0,
            NODE_COUNT=1,

View file

@@ -43,7 +43,7 @@ def test_gensen_local(notebooks):
        kernel_name=KERNEL_NAME,
        parameters=dict(
            max_epoch=1,
-            config_filepath="scenarios/sentence_similarity/gensen_config.json",
+            config_filepath="examples/sentence_similarity/gensen_config.json",
            base_data_path="data",
        ),
    )
@@ -143,8 +143,8 @@ def test_similarity_gensen_azureml_runs(notebooks):
            AZUREML_CONFIG_PATH="./tests/integration/.azureml",
            UTIL_NLP_PATH="./utils_nlp",
            MAX_EPOCH=1,
-            TRAIN_SCRIPT="./scenarios/sentence_similarity/gensen_train.py",
-            CONFIG_PATH="./scenarios/sentence_similarity/gensen_config.json",
+            TRAIN_SCRIPT="./examples/sentence_similarity/gensen_train.py",
+            CONFIG_PATH="./examples/sentence_similarity/gensen_config.json",
            MAX_TOTAL_RUNS=1,
            MAX_CONCURRENT_RUNS=1,
        ),

View file

@@ -11,7 +11,4 @@ OUTPUT_NOTEBOOK = "output.ipynb"
def path_notebooks():
    """Returns the path of the notebooks folder"""
-    return os.path.abspath(
-        os.path.join(os.path.dirname(__file__), os.path.pardir, "scenarios")
-    )
+    return os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir, "examples"))

View file

@@ -5,7 +5,7 @@ This submodule contains a tool for explaining hidden states of models. It is an
## How to use

-We provide a notebook tutorial [here](../../scenarios/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ to a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose the $\Phi$, the input $\bf x$ and the input words are defined as:
+We provide a notebook tutorial [here](../../examples/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ to a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose the $\Phi$, the input $\bf x$ and the input words are defined as:
```
import torch
@@ -63,5 +63,5 @@ which means that the second and forth words are most important to $\Phi$, which
## Explain a certain layer in any saved pytorch model

-We provide an example on how to use our method to explain a saved pytorch model (*pre-trained BERT model in our case*) [here](../../scenarios/interpret_NLP_models/understand_models.ipynb).
+We provide an example on how to use our method to explain a saved pytorch model (*pre-trained BERT model in our case*) [here](../../examples/interpret_NLP_models/understand_models.ipynb).

> NOTE: This result may not be consistent with the result in the paper because we use the pre-trained BERT model directly for simplicity, while the BERT model we use in paper is fine-tuned on a specific dataset like SST-2.
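For readers who want a concrete picture of the forward function $\Phi$ mentioned above, the sketch below maps a sequence of word embeddings to a hidden state with a small GRU. It is an illustration only, not taken from Interpreter.py or the notebook; the dimensions, the choice of a GRU, and the example words are all assumptions.

```python
# Illustrative sketch of a forward function Phi: word embeddings x -> hidden state s.
# The GRU, the dimensions, and the example words are assumptions for demonstration only.
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 100, 50
rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

def Phi(x):
    # x has shape (sequence_length, embedding_dim); return the final hidden state s.
    _, h = rnn(x.unsqueeze(0))      # add a batch dimension for the GRU
    return h.squeeze(0).squeeze(0)  # shape (hidden_dim,)

words = ["rectify", "the", "tv", "set"]     # hypothetical input words
x = torch.randn(len(words), embedding_dim)  # stand-in word embeddings
s = Phi(x)                                  # hidden state to be explained
print(s.shape)                              # torch.Size([50])
```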