Natural Language Processing Best Practices & Examples
Перейти к файлу
Said Bleik 7db6d204e5
Merge pull request #612 from microsoft/staging
Staging
2021-01-06 16:17:23 -05:00
.github Update PULL_REQUEST_TEMPLATE.md 2020-01-23 10:48:05 -05:00
docker Switching to miniconda 2020-01-27 18:27:04 +00:00
docs Staging to master to add the latest fixes (#503) 2019-11-29 19:53:57 -05:00
examples Update absa_azureml.ipynb 2021-01-06 16:14:30 -05:00
tests Merge pull request #586 from microsoft/bleik/add-models 2020-05-13 00:20:56 -04:00
tools Update remove_pixelserver.py 2020-06-16 18:59:59 +01:00
utils_nlp reverted changes to dataset 2020-06-23 14:45:52 +00:00
.amlignore update examples folder refs 2019-08-19 14:51:35 +00:00
.bumpversion.cfg feat: Configure NLP Utils Semantic Versioning with setuptools_scm and bumpversion 2019-08-09 14:56:28 +00:00
.flake8 Merge branch 'staging' into hlu/unilm_abstractive_summarization 2020-02-13 18:52:24 -05:00
.gitignore Add text classification notebook with MTDNN on MNLI dataset 2020-06-28 23:38:31 +00:00
.pre-commit-config.yaml Changed python version in pre-commit-config back to 3.6 2019-06-13 14:46:57 -04:00
CONTRIBUTING.md Rijai reposetup (#1) 2019-04-05 19:01:56 -04:00
DatasetReferences.md add BERTSUM CNN/DM Dataset 2020-01-03 16:21:40 +00:00
LICENSE Intial commit to put the receipe template in 2019-04-05 13:55:58 -04:00
MANIFEST.in Make utils pip installable using setup.py 2019-07-24 14:22:22 -04:00
NLP-Logo.png Adding a logo file 2019-10-01 15:20:05 -04:00
NOTICE.txt Merge pull request #600 from shanepeckham/patch-2 2020-06-17 00:46:56 -04:00
README.md Merge pull request #595 from microsoft/sharatsc-patch-4 2020-06-17 00:48:48 -04:00
SETUP.md Add instructions for choosing cudatoolkit version and upgrading cuda driver. 2020-02-28 00:00:49 +00:00
VERSIONING.md Update to install pip editable in dev mode and remove dependency on setuptools_scm. 2019-09-18 14:34:45 +00:00
_config.yml Set theme jekyll-theme-cayman 2019-09-13 22:31:09 -04:00
cgmanifest.json added xlm 2020-06-01 12:56:00 -04:00
pyproject.toml Update pyproject.toml 2020-02-12 11:33:51 -05:00
setup.py Staging to master to add the latest fixes (#503) 2019-11-29 19:53:57 -05:00

README.md

NLP Best Practices

In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that the tools can significantly reduce the “time to market” by simplifying the experience from defining the business problem to development of solution by orders of magnitude. In addition, the example notebooks would serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like GLUE and SQuAD leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.

Note that for certain kind of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating if these can sufficiently solve your problem. If these solutions are not applicable, or the accuracy of these solutions is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics are a set of pre-trained REST APIs which can be called for Sentiment Analysis, Key phrase extraction, Language detection and Named Entity Detection and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You could do Intent Classification as well as Named Entity Extraction by performing simple steps of providing example utterances and labelling them. It supports Active Learning, so your model always keeps learning and improving.

Target Audience

For this repository our target audience includes data scientists and machine learning engineers with varying levels of NLP knowledge as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition etc.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face which allows users to easily load pretrained models and fine-tune them for different tasks.

Languages

We strongly subscribe to the multi-language principles laid down by "Emily Bender"

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository such as BERT, FastText support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

Scenario Models Description Languages
Text Classification BERT, DistillBERT, XLNet, RoBERTa, ALBERT, XLM Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content. English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch
Named Entity Recognition BERT Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. English
Text Summarization BERTSumExt
BERTSumAbs
UniLM (s2s-ft)
MiniLM
Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text. English
Entailment BERT, XLNet, RoBERTa Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine if the text agrees with the hypothesis or not. English
Question Answering BiDAF, BERT, XLNet Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. English
Sentence Similarity BERT, GenSen Sentence similarity is the process of computing a similarity score given a pair of text documents. English
Embeddings Word2Vec
fastText
GloVe
Embedding is the process of converting a word or a piece of text to a continuous vector space of real number, usually, in low dimension. English
Sentiment Analysis Dependency Parser
GloVe
Provides an example of train and use Aspect Based Sentiment Analysis with Azure ML and Intel NLP Architect . English

Getting Started

While solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When the needs are beyond the bounds of the prebuilt cognitive service and when you want to search for custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to setup your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is presented in notebooks across different scenarios to enhance the efficiency of developing Natural Language systems at scale and for various AI model development related tasks like:

To successfully run these notebooks, you will need an Azure subscription or can try Azure for free. There may be other Azure services or products used in the notebooks. Introduction and/or reference of those will be provided in the notebooks themselves.

Contributing

We hope that the open source community would contribute to the content and bring in the latest SOTA algorithm. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Blog Posts

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

Repository Description
Transformers A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort.
Azure Machine Learning Notebooks ML and deep learning examples with Azure Machine Learning.
AzureML-BERT End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service.
MASS MASS: Masked Sequence to Sequence Pre-training for Language Generation.
MT-DNN Multi-Task Deep Neural Networks for Natural Language Understanding.
UniLM Unified Language Model Pre-training.
DialoGPT DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

Build Status

Build Branch Status
Linux CPU master Build Status
Linux CPU staging Build Status
Linux GPU master Build Status
Linux GPU staging Build Status