update examples folder refs
Parent: 2b92e135ac
Commit: 37f804bc3c
@@ -1,2 +1,2 @@
data/
-scenarios/
+examples/
@@ -2,7 +2,7 @@

In recent years, Natural Language Processing has seen rapid growth in quality and usability, and this has helped drive business adoption of Artificial Intelligence solutions. Researchers have been applying newer deep learning methods to natural language processing, and data scientists have started moving from traditional methods to state-of-the-art DNN algorithms that let them use language models pretrained on large text corpora.

-This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](scenarios) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
+This repository contains examples and best practices for building natural language processing (NLP) systems, provided as [Jupyter notebooks](examples) and [utility functions](utils_nlp). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

## Overview
@@ -0,0 +1,30 @@
+# Word Embedding
+
+This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch.
+There are
+three typical methods for training word embeddings:
+[Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
+[GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [fastText](https://arxiv.org/abs/1607.01759).
+All three methods provide pretrained models ([pretrained model with
+Word2Vec](https://code.google.com/archive/p/word2vec/), [pretrained model with
+GloVe](https://github.com/stanfordnlp/GloVe), [pretrained model with
+fastText](https://fasttext.cc/docs/en/crawl-vectors.html)).
+These pretrained models are trained on
+general corpora such as Wikipedia and Common Crawl, and may not serve well in situations
+where you have a domain-specific problem or there is no pretrained model for the
+language you need to work with. In this folder, we provide examples of how to apply each of the
+three methods to train your own word embeddings.
+
+# What is Word Embedding?
+
+Word embedding is a technique for mapping words or phrases from a vocabulary to vectors of real numbers.
+The learned vector representations of words capture syntactic and semantic word relationships and
+therefore can be very useful for tasks like sentence similarity, text classification, etc.
+
+
+## Summary
+
+
+|Notebook|Environment|Description|Dataset|
+|---|---|---|---|
+|[Developing Word Embeddings](embedding_trainer.ipynb)|Local|A notebook that shows how to learn word representations with Word2Vec, fastText and GloVe|[STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset)|
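The embeddings README added above describes three ways to train word embeddings on a custom corpus. As a rough illustration of the Word2Vec and fastText cases (GloVe training normally relies on the separate Stanford tooling), here is a minimal sketch using gensim; the toy corpus, parameter values, and gensim 4.x argument names such as `vector_size` are assumptions for illustration only and are not part of this commit — the repository's own walkthrough is the `embedding_trainer.ipynb` notebook listed in the summary table.

```python
# Minimal sketch: training word embeddings on a small custom corpus with gensim.
# Assumes gensim 4.x (older releases use `size` instead of `vector_size`).
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus standing in for your domain-specific text.
corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["fasttext", "uses", "subword", "information"],
]

# Train Word2Vec (skip-gram) and fastText on the same corpus.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=10)
ft = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

# The learned vectors capture word relationships; nearest neighbours illustrate this.
print(w2v.wv["language"].shape)           # (100,)
print(w2v.wv.most_similar("language"))    # words closest in the embedding space
print(ft.wv.similarity("word", "words"))  # fastText handles morphological variants via subwords
```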
@@ -176,7 +176,7 @@
"metadata": {},
"source": [
"This step downloads the pre-trained [AllenNLP](https://allennlp.org/models) model and registers it in our Workspace. The pre-trained AllenNLP model we use is called Bidirectional Attention Flow for Machine Comprehension ([BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
-")). It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BiDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BiDAF deep dive notebook](https://github.com/microsoft/nlp/blob/courtney-bidaf/scenarios/question_answering/bidaf_deep_dive.ipynb\n",
+")). It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BiDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BiDAF deep dive notebook](https://github.com/microsoft/nlp/examples/question_answering/bidaf_deep_dive.ipynb\n",
") for more information on this algorithm and the AllenNLP implementation."
]
},
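The notebook cell above registers AllenNLP's pre-trained BiDAF model for question answering. For context, querying that model locally looks roughly like the following sketch; the archive URL and the `Predictor` API reflect AllenNLP 0.x and are assumptions for illustration, not code from this commit.

```python
# Minimal sketch: querying AllenNLP's pre-trained BiDAF model locally.
# Assumes allennlp 0.x; the model archive URL is illustrative and may have moved.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)

result = predictor.predict(
    passage="Bi-Directional Attention Flow (BiDAF) is a machine comprehension model "
            "that achieved state-of-the-art results on SQuAD in 2017.",
    question="What does BiDAF stand for?",
)
print(result["best_span_str"])  # extracted answer span, e.g. "Bi-Directional Attention Flow"
```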
@@ -94,7 +94,7 @@
"from utils_nlp.dataset import snli, preprocess\n",
"from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
"from utils_nlp.dataset import Split\n",
-"from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
+"from examples.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
"\n",
"print(\"System version: {}\".format(sys.version))"
]
@@ -602,7 +602,7 @@
"text": [
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1\n",
" \"num_layers={}\".format(dropout, num_layers))\n",
-"../../scenarios/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+"../../examples/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
" torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
"../../utils_nlp/models/gensen/utils.py:364: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
" Variable(torch.LongTensor(sorted_src_lens), volatile=True)\n",
@@ -610,13 +610,13 @@
" warnings.warn(\"nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.\")\n",
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.\n",
" warnings.warn(\"nn.functional.tanh is deprecated. Use torch.tanh instead.\")\n",
-"../../scenarios/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+"../../examples/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
" torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
"/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/horovod/torch/__init__.py:163: UserWarning: optimizer.step(synchronize=True) called after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.step(synchronize=False) if you use optimizer.synchronize() in your code.\n",
" warnings.warn(\"optimizer.step(synchronize=True) called after \"\n",
-"../../scenarios/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+"../../examples/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n",
-"../../scenarios/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+"../../examples/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n"
]
},
@@ -3,7 +3,7 @@
import json
import os

-from scenarios.sentence_similarity.gensen_train import train
+from examples.sentence_similarity.gensen_train import train
from utils_nlp.eval.classification import compute_correlation_coefficients
from utils_nlp.models.gensen.create_gensen_model import (
    create_multiseq2seq_model,
@@ -1,30 +0,0 @@
-# Word Embedding
-
-This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch.
-There are
-three typical methods for training word embeddings:
-[Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
-[GloVe](https://nlp.stanford.edu/pubs/glove.pdf), and [fastText](https://arxiv.org/abs/1607.01759).
-All three methods provide pretrained models ([pretrained model with
-Word2Vec](https://code.google.com/archive/p/word2vec/), [pretrained model with
-GloVe](https://github.com/stanfordnlp/GloVe), [pretrained model with
-fastText](https://fasttext.cc/docs/en/crawl-vectors.html)).
-These pretrained models are trained on
-general corpora such as Wikipedia and Common Crawl, and may not serve well in situations
-where you have a domain-specific problem or there is no pretrained model for the
-language you need to work with. In this folder, we provide examples of how to apply each of the
-three methods to train your own word embeddings.
-
-# What is Word Embedding?
-
-Word embedding is a technique for mapping words or phrases from a vocabulary to vectors of real numbers.
-The learned vector representations of words capture syntactic and semantic word relationships and
-therefore can be very useful for tasks like sentence similarity, text classification, etc.
-
-
-## Summary
-
-
-|Notebook|Environment|Description|Dataset|
-|---|---|---|---|
-|[Developing Word Embeddings](embedding_trainer.ipynb)|Local|A notebook that shows how to learn word representations with Word2Vec, fastText and GloVe|[STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset)|
setup.py (30 changed lines)
@@ -6,34 +6,29 @@ from __future__ import print_function
import io

import re
from glob import glob
-from os.path import basename, dirname, join, splitext
+from os.path import dirname, join

-from setuptools import find_packages, setup
+from setuptools import setup
from setuptools_scm import get_version

# Determine semantic versioning automatically
# from git commits
__version__ = get_version()


def read(*names, **kwargs):
-    with io.open(
-        join(dirname(__file__), *names),
-        encoding=kwargs.get("encoding", "utf8"),
-    ) as fh:
+    with io.open(join(dirname(__file__), *names), encoding=kwargs.get("encoding", "utf8")) as fh:
        return fh.read()


setup(
    name="utils_nlp",
-    version = __version__,
+    version=__version__,
    license="MIT License",
    description="NLP Utility functions that are used for best practices in building state-of-the-art NLP methods and scenarios. Developed by Microsoft AI CAT",
    long_description="%s\n%s"
    % (
-        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub(
-            "", read("README.md")
-        ),
+        re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub("", read("README.md")),
        re.sub(":[a-z]+:`~?(.*?)`", r"``\1``", read("CONTRIBUTING.md")),
    ),
    author="AI CAT",
@@ -68,16 +63,11 @@ setup(
        "Documentation": "https://github.com/microsoft/nlp/",
        "Issue Tracker": "https://github.com/microsoft/nlp/issues",
    },
-    keywords=[
-        "Microsoft NLP",
-        "Natural Language Processing",
-        "Text Processing",
-        "Word Embedding",
-    ],
+    keywords=["Microsoft NLP", "Natural Language Processing", "Text Processing", "Word Embedding"],
    python_requires=">=3.6",
-    install_requires=['setuptools_scm>=3.2.0',],
+    install_requires=["setuptools_scm>=3.2.0"],
    dependency_links=[],
    extras_require={},
-    use_scm_version = {"root": ".", "relative_to": __file__},
-    setup_requires=['setuptools_scm'],
+    use_scm_version={"root": ".", "relative_to": __file__},
+    setup_requires=["setuptools_scm"],
)
@@ -11,24 +11,26 @@ ABS_TOL = 0.2

@pytest.mark.integration
@pytest.mark.azureml
-def test_bidaf_deep_dive(notebooks,
-                         subscription_id,
-                         resource_group,
-                         workspace_name,
-                         workspace_region):
+def test_bidaf_deep_dive(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
    notebook_path = notebooks["bidaf_deep_dive"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'NUM_EPOCHS':2,
-                                      'config_path': "tests/ci",
-                                      'PROJECT_FOLDER': "scenarios/question_answering/bidaf-question-answering",
-                                      'SQUAD_FOLDER': "scenarios/question_answering/squad",
-                                      'LOGS_FOLDER': "scenarios/question_answering/",
-                                      'BIDAF_CONFIG_PATH': "scenarios/question_answering/",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "NUM_EPOCHS": 2,
+            "config_path": "tests/ci",
+            "PROJECT_FOLDER": "examples/question_answering/bidaf-question-answering",
+            "SQUAD_FOLDER": "examples/question_answering/squad",
+            "LOGS_FOLDER": "examples/question_answering/",
+            "BIDAF_CONFIG_PATH": "examples/question_answering/",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+        },
+    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["validation_EM"]
    assert result == pytest.approx(0.5, abs=ABS_TOL)
@@ -36,20 +38,22 @@ def test_bidaf_deep_dive(notebooks,
@pytest.mark.usefixtures("teardown_service")
@pytest.mark.integration
@pytest.mark.azureml
-def test_bidaf_quickstart(notebooks,
-                          subscription_id,
-                          resource_group,
-                          workspace_name,
-                          workspace_region):
+def test_bidaf_quickstart(
+    notebooks, subscription_id, resource_group, workspace_name, workspace_region
+):
    notebook_path = notebooks["bidaf_quickstart"]
-    pm.execute_notebook(notebook_path,
-                        OUTPUT_NOTEBOOK,
-                        parameters = {'config_path': "tests/ci",
-                                      'subscription_id': subscription_id,
-                                      'resource_group': resource_group,
-                                      'workspace_name': workspace_name,
-                                      'workspace_region': workspace_region,
-                                      'webservice_name': "aci-test-service"})
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters={
+            "config_path": "tests/ci",
+            "subscription_id": subscription_id,
+            "resource_group": resource_group,
+            "workspace_name": workspace_name,
+            "workspace_region": workspace_region,
+            "webservice_name": "aci-test-service",
+        },
+    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["answer"]
    assert result == "Bi-Directional Attention Flow"
@@ -64,12 +68,12 @@ def test_bert_qa_runs(notebooks):
        OUTPUT_NOTEBOOK,
        parameters=dict(
            AZUREML_CONFIG_PATH="./tests/integration/.azureml",
-            DATA_FOLDER='./tests/integration/squad',
-            PROJECT_FOLDER='./tests/integration/pytorch-transformers',
-            EXPERIMENT_NAME='NLP-QA-BERT-deepdive',
-            BERT_UTIL_PATH='./utils_nlp/azureml/azureml_bert_util.py',
-            EVALUATE_SQAD_PATH = './utils_nlp/eval/evaluate_squad.py',
-            TRAIN_SCRIPT_PATH="./scenarios/question_answering/bert_run_squad_azureml.py",
+            DATA_FOLDER="./tests/integration/squad",
+            PROJECT_FOLDER="./tests/integration/pytorch-transformers",
+            EXPERIMENT_NAME="NLP-QA-BERT-deepdive",
+            BERT_UTIL_PATH="./utils_nlp/azureml/azureml_bert_util.py",
+            EVALUATE_SQAD_PATH="./utils_nlp/eval/evaluate_squad.py",
+            TRAIN_SCRIPT_PATH="./examples/question_answering/bert_run_squad_azureml.py",
            BERT_MODEL="bert-base-uncased",
            NUM_TRAIN_EPOCHS=1.0,
            NODE_COUNT=1,
@@ -43,7 +43,7 @@ def test_gensen_local(notebooks):
        kernel_name=KERNEL_NAME,
        parameters=dict(
            max_epoch=1,
-            config_filepath="scenarios/sentence_similarity/gensen_config.json",
+            config_filepath="examples/sentence_similarity/gensen_config.json",
            base_data_path="data",
        ),
    )
@@ -143,8 +143,8 @@ def test_similarity_gensen_azureml_runs(notebooks):
            AZUREML_CONFIG_PATH="./tests/integration/.azureml",
            UTIL_NLP_PATH="./utils_nlp",
            MAX_EPOCH=1,
-            TRAIN_SCRIPT="./scenarios/sentence_similarity/gensen_train.py",
-            CONFIG_PATH="./scenarios/sentence_similarity/gensen_config.json",
+            TRAIN_SCRIPT="./examples/sentence_similarity/gensen_train.py",
+            CONFIG_PATH="./examples/sentence_similarity/gensen_config.json",
            MAX_TOTAL_RUNS=1,
            MAX_CONCURRENT_RUNS=1,
        ),
@@ -11,7 +11,4 @@ OUTPUT_NOTEBOOK = "output.ipynb"

def path_notebooks():
    """Returns the path of the notebooks folder"""
-    return os.path.abspath(
-        os.path.join(os.path.dirname(__file__), os.path.pardir, "scenarios")
-    )
+    return os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir, "examples"))
@@ -5,7 +5,7 @@ This submodule contains a tool for explaining hidden states of models. It is an

## How to use

-We provide a notebook tutorial [here](../../scenarios/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ into a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose $\Phi$, the input $\bf x$, and the input words are defined as:
+We provide a notebook tutorial [here](../../examples/interpret_NLP_models/understand_models.ipynb) to help you get started quickly. The main class needed is the `Interpreter` in [Interpreter.py](Interpreter.py). Given any input word embeddings and a forward function $\Phi$ that transforms the word embeddings $\bf x$ into a hidden state $\bf s$, the Interpreter helps understand how much each input word contributes to the hidden state. Suppose $\Phi$, the input $\bf x$, and the input words are defined as:
```
import torch
@@ -63,5 +63,5 @@ which means that the second and fourth words are most important to $\Phi$, which

## Explain a certain layer in any saved pytorch model

-We provide an example of how to use our method to explain a saved PyTorch model (*a pre-trained BERT model in our case*) [here](../../scenarios/interpret_NLP_models/understand_models.ipynb).
+We provide an example of how to use our method to explain a saved PyTorch model (*a pre-trained BERT model in our case*) [here](../../examples/interpret_NLP_models/understand_models.ipynb).
> NOTE: This result may not be consistent with the result in the paper because we use the pre-trained BERT model directly for simplicity, while the BERT model used in the paper is fine-tuned on a specific dataset such as SST-2.
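The Interpreter README changed above asks the user to define a forward function $\Phi$, input word embeddings $\bf x$, and the input words before using the `Interpreter` class. A minimal sketch of such inputs, assuming only PyTorch, is shown below; the shapes, the toy sentence, and the trivial $\Phi$ are illustrative assumptions, and the repository's actual usage lives in `understand_models.ipynb`.

```python
# Illustrative inputs of the kind the Interpreter README describes (assumed shapes).
import torch

words = ["rates", "are", "going", "up"]   # toy input sentence
x = torch.randn(len(words), 10)           # one 10-dimensional embedding per word

def Phi(x):
    # A stand-in forward function mapping word embeddings to a hidden state s;
    # in practice this would be a layer of the model being explained.
    return x.sum(dim=0)

s = Phi(x)
print(s.shape)  # torch.Size([10])
```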