From 33817ffdc858f1931ae587dee70b7e658e8b7e10 Mon Sep 17 00:00:00 2001
From: nonstoptimm
Date: Thu, 9 Sep 2021 18:37:51 +0200
Subject: [PATCH] documentation updates

---
 README.md                              |  6 ----
 docs/README.md                         | 17 ++++++-----
 docs/setup/03 - Project Setup.md       | 41 +++++++++++++++++++++++++-
 docs/setup/04 - Data Cleaning Steps.md | 22 +++++++-------
 docs/setup/05 - Training.md            |  7 +++++
 5 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index 9d89be1..6e960aa 100644
--- a/README.md
+++ b/README.md
@@ -61,12 +61,6 @@ Repository Structure
 Secrets in production should be stored in the Azure KeyVault
 
 --------
-## Naming
-### Assets
-`\(-\)-\(-\)`
-- where step in [source, train, deploy], for data assets.
-- where task is an int, referring to the parameters, for models.
-
 ## Acknowledgements
 Verseagility is built in part using the following frameworks:
 - [Transformers](https://github.com/huggingface/pytorch-transformers) by HuggingFace
diff --git a/docs/README.md b/docs/README.md
index 713a628..bf58deb 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -14,19 +14,21 @@ This documentation helps you to get started with the toolkit from infrastructure
 1. [Verseagility-Setup](#verseagility-setup)
 1. [Demo-Setup](#demo-setup)
 1. [Code Repository](#code-repository)
-1. [Questions / FAQ / Problems / Contact](#questions-/-faq-/-problems-/-contact)
+1. [Questions / FAQ / Problems / Contact](#questions--faq--problems--contact)
 
 ## Requirements
 [**Verseagility**](https://github.com/microsoft/verseagility) targets (product and field) data scientists, program managers and architects working within the field of NLP. Nevertheless, everyone who is interested in the field of NLP and can be deployed with no changes in the code. You will enjoy the most if you already bring some previous knowledge:
 * [x] Foundational proficiency in Python
 * [x] Understand the key principles of NLP tasks
-* [x] Experience in Azure Machine Learning
+* [x] Experience in [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/#product-overview)
 
 Following technical framework is recommended and will help you to succeed:
-* [x] an [Azure](https://portal.azure.com) subscription (including GPU-quota if you want to train your own models)
-* [x] Python 3.7 and Virtual Environments
+* [x] An [Azure](https://portal.azure.com) subscription (including GPU quota if you want to train your own models)
+* [x] Python 3.7 (64-bit) and Virtual Environments
+  - **We recommend using Python 3.7.9**, as the solution has been fully tested on this version, both on Windows and Linux. You can download Python 3.7.9 [here](https://www.python.org/downloads/release/python-379/). On top of the Python base installation, some further packages are required to serve the purpose of the API collection. These are listed in `requirements.txt` with their respective version numbers. When the service is deployed, this file is used automatically to transfer and install them.
+* [x] [Azure Command Line Interface (CLI)](https://docs.microsoft.com/de-de/cli/azure/install-azure-cli), command line tools for Azure using PowerShell
 * [ ] [VSCode](https://code.visualstudio.com/docs/?dv=win) (recommended)
-  - alternatively, you can also run the scripts using PowerShell
+  - Alternatively, you can also run the scripts using PowerShell, Windows CMD or Bash
 
 ## Software Architecture
 The toolkit is based on the following architecture. Basically, it is possible to run the toolkit locally to train models and create assets, yet we highly recommend you to use the Microsoft Azure ecosystem to leverage the full functionality and provisioning as a cloud service.
@@ -44,9 +46,6 @@ The intended purpose is illustrated using [Microsoft Forum](https://answers.micr
 
 Verseagility also allows you to set up your own personal demo WebApp on Azure. This can be done with your logo/that of your customer or end-to-end with your own data. The different approaches are described [here](demo-environment/README.md).
 
-## Code Repository
-The code and detailed instructions can be found on the [Verseagility](https://github.com/microsoft/verseagility) GitHub repository. In case you cannot access the repository, you first have to join the Microsoft GitHub organization. You will find instructions on the [FAQ page](FAQ.md). See the repository for latest features, work in progress and open todo's. **Feel free to contribute!**
-
 ## Questions / FAQ / Problems / Contact
 Feel free to create an issue or pull request in GitHub if you find a bug, have a feature request or a questions.
 Alternatively, feel free to reach out to the main contributors to this project:
@@ -55,3 +54,5 @@ Alternatively, feel free to reach out to the main contributors to this project:
 - [Christian Vorhemus](mailto:christian.vorhemus@microsoft.com)
 
 Also, see our [FAQ page](FAQ.md) which is going to be continuously extended.
+
+**Feel free to contribute!**
\ No newline at end of file
diff --git a/docs/setup/03 - Project Setup.md b/docs/setup/03 - Project Setup.md
index 1f91b79..0a76629 100644
--- a/docs/setup/03 - Project Setup.md
+++ b/docs/setup/03 - Project Setup.md
@@ -56,6 +56,45 @@ As described before, currently the following tasks are supported:
 - Question Answering
 The section below covers briefly what they consist of, which dependencies they have and how you can customize them further.
 
+### **Classification**
+This section describes the models used for classification tasks. Both multi-class and multi-label approaches are supported and facilitated by the [FARM](https://github.com/deepset-ai/FARM) framework. We primarily use so-called Transformer models to train classification assets.
+
+#### **Transformers**
+- Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc. in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
+- Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, then share them with the community on the HuggingFace model hub. At the same time, each Python module defining an architecture can be used standalone and modified to enable quick research experiments.
+- Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one and then load them for inference with the other. In this setup, we use PyTorch.
+
+The models used are defined in `src/helper.py`, and the dictionary below can be extended with further model names and languages. A list of pretrained models for many purposes can be found on [HuggingFace](https://huggingface.co/transformers/pretrained_models.html).
+```python
+farm_model_lookup = {
+    'bert': {
+        'xx':'bert-base-multilingual-cased',
+        'en':'bert-base-cased',
+        'de':'bert-base-german-cased',
+        'fr':'camembert-base',
+        'cn':'bert-base-chinese'
+    },
+    'roberta' : {
+        'en' : 'roberta-base',
+        'de' : 'roberta-base',
+        'fr' : 'roberta-base',
+        'es' : 'roberta-base',
+        'it' : 'roberta-base'
+        # All languages for roberta because of multi_classification
+    },
+    'xlm-roberta' : {
+        'xx' : 'xlm-roberta-base'
+    },
+    'albert' : {
+        'en' : 'albert-base-v2'
+    },
+    'distilbert' : {
+        'xx' : 'distilbert-base-multilingual-cased',
+        'de' : 'distilbert-base-german-cased'
+    }
+}
+```
+
 ### **Named Entity Recognition**
 The toolkit supports and includes different approaches and frameworks for recognizing relevant entities in text paragraphs, called _Named Entity Recognition_, short _NER_:
 - [Azure Text Analytics API](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking?tabs=version-3)
@@ -88,7 +127,7 @@ The most basic approach of Named Entity Recognition in text files is to take use
 4. Make sure the values are all lower-case, but the keys should be properly formatted
 
 ### **Question Answering**
-This page is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are being ranked during runtime. Please keep in mind that this component of the toolkit requires a large amount of potential answers for each text that has been trained along with the input texts in order to
+This section is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are ranked at runtime. Please keep in mind that this component of the toolkit requires a large amount of potential answers for each text that has been trained along with the input texts in order to
 
 #### **Ranking Algorithm**
 The current version of Verseagility supports the Okapi BM25 information retrieval algorithm to sort historical question answer pairs by relevance. BM25 is a ranking approach used by search engines to estimate the relevance of a document to a given search query, such as a text or document. This is implemented using the [gensim library](https://radimrehurek.com/gensim/summarization/bm25.html). The ranking framework is accessed by the file `code/rank.py`.
diff --git a/docs/setup/04 - Data Cleaning Steps.md b/docs/setup/04 - Data Cleaning Steps.md
index 7d25aee..4b4daea 100644
--- a/docs/setup/04 - Data Cleaning Steps.md
+++ b/docs/setup/04 - Data Cleaning Steps.md
@@ -2,17 +2,17 @@
 The following section covers aspects of data pre-processing used for different tasks. Some pre-processing steps are universal, while others are task specific. They are split accordingly. You may edit/comment some steps out, or even add further ones depending on your needs in the file `src/prepare.py`. This section for example covers data cleaning steps exclusively for German emails and support tickets, with typical phrases occuring in these kinds of documents. You may add further phrases for the language code needed and they will be considered in the data preparation.
 
 ```python
-        #DE
-        if self.language == 'de':
-            line = re.sub(r'\b(mit )?(beste|viele|liebe|freundlich\w+)? (gr[u,ü][ß,ss].*)', '', line, flags=re.I)
-            line = re.sub(r'\b(besten|herzlichen|lieben) dank.*', '', line, flags=re.I)
-            line = re.sub(r'\bvielen dank für ihr verständnis.*', '', line, flags=re.I)
-            line = re.sub(r'\bvielen dank im voraus.*', '', line, flags=re.I)
-            line = re.sub(r'\b(mfg|m\.f\.g) .*','', line, flags=re.I)
-            line = re.sub(r'\b(lg) .*','',line, flags=re.I)
-            line = re.sub(r'\b(meinem iPhone gesendet) .*','',line, flags=re.I)
-            line = re.sub(r'\b(Gesendet mit der (WEB|GMX)) .*','',line, flags=re.I)
-            line = re.sub(r'\b(Diese E-Mail wurde von Avast) .*','',line, flags=re.I)
+# DE
+if self.language == 'de':
+    line = re.sub(r'\b(mit )?(beste|viele|liebe|freundlich\w+)? (gr[u,ü][ß,ss].*)', '', line, flags=re.I)
+    line = re.sub(r'\b(besten|herzlichen|lieben) dank.*', '', line, flags=re.I)
+    line = re.sub(r'\bvielen dank für ihr verständnis.*', '', line, flags=re.I)
+    line = re.sub(r'\bvielen dank im voraus.*', '', line, flags=re.I)
+    line = re.sub(r'\b(mfg|m\.f\.g) .*','', line, flags=re.I)
+    line = re.sub(r'\b(lg) .*','',line, flags=re.I)
+    line = re.sub(r'\b(meinem iPhone gesendet) .*','',line, flags=re.I)
+    line = re.sub(r'\b(Gesendet mit der (WEB|GMX)) .*','',line, flags=re.I)
+    line = re.sub(r'\b(Diese E-Mail wurde von Avast) .*','',line, flags=re.I)
 ```
 
 ## Universal Steps
diff --git a/docs/setup/05 - Training.md b/docs/setup/05 - Training.md
index 2d24dab..d20f089 100644
--- a/docs/setup/05 - Training.md
+++ b/docs/setup/05 - Training.md
@@ -1,6 +1,13 @@
 # Training
 This part of the documentation serves as guideline for the model training process. The tasks being submitted for training depend on which tasks you have defined in your config files.
 
+## Naming
+The naming of the experiments in Azure Machine Learning is structured as follows:
+- `\(-\)-\(-\)`
+    - Where step in [source, train, deploy], for data assets.
+    - Where task is an int, referring to the parameters, for models.
+    - Example: `msforum_
+
 ## Initiation of the Training Process
 After setting up your projects in the previous pages, you are now ready to train your models. This training step incorporates the classification, named entity recognition and question/answering models all in one.
 1. Open your command line in VSCode, PowerShell or bash.