This commit is contained in:
nonstoptimm 2021-09-09 18:37:51 +02:00
Parent 33bb31f504
Commit 33817ffdc8
5 changed files with 67 additions and 26 deletions

View File

@ -61,12 +61,6 @@ Repository Structure
Secrets in production should be stored in the Azure KeyVault
--------
## Naming
### Assets
`\<project name\>(-\<task\>)-\<step\>(-\<environment\>)`
- where step in [source, train, deploy], for data assets.
- where task is an int, referring to the parameters, for models.
## Acknowledgements
Verseagility is built in part using the following frameworks:
- [Transformers](https://github.com/huggingface/pytorch-transformers) by HuggingFace

View File

@ -14,19 +14,21 @@ This documentation helps you to get started with the toolkit from infrastructure
1. [Verseagility-Setup](#verseagility-setup)
1. [Demo-Setup](#demo-setup)
1. [Code Repository](#code-repository)
1. [Questions / FAQ / Problems / Contact](#questions-/-faq-/-problems-/-contact)
1. [Questions / FAQ / Problems / Contact](#questions%20-%20/%20-%20faq%20-%20/%20-%20problems%20-%20/%20-%20contact)
## Requirements
[**Verseagility**](https://github.com/microsoft/verseagility) targets (product and field) data scientists, program managers and architects working within the field of NLP. Nevertheless, it is open to everyone who is interested in NLP, and it can be deployed with no changes to the code. You will get the most out of it if you already bring some prior knowledge:
* [x] Foundational proficiency in Python
* [x] Understand the key principles of NLP tasks
* [x] Experience in Azure Machine Learning
* [x] Experience in [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/#product-overview)
The following technical setup is recommended and will help you succeed:
* [x] an [Azure](https://portal.azure.com) subscription (including GPU-quota if you want to train your own models)
* [x] Python 3.7 and Virtual Environments
* [x] An [Azure](https://portal.azure.com) subscription (including GPU-quota if you want to train your own models)
* [x] Python 3.7 (64bit) and Virtual Environments
- **We recommend using Python 3.7.9**, as the solution has been fully tested on this version, both on Windows and Linux. You can download Python 3.7.9 [here](https://www.python.org/downloads/release/python-379/). On top of the Python base installation, some further packages are required to serve the purpose of the API collection. These are listed in `requirements.txt` with their respective version numbers. When the service is deployed, this file is transferred and used automatically to install them.
* [x] [Azure Command Line Interface (CLI)](https://docs.microsoft.com/de-de/cli/azure/install-azure-cli), command line tools for Azure using PowerShell
* [ ] [VSCode](https://code.visualstudio.com/docs/?dv=win) (recommended)
- alternatively, you can also run the scripts using PowerShell
- Alternatively, you can also run the scripts using PowerShell, Windows CMD or Bash
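To check the interpreter recommendation above before installing anything, a small script along these lines may help. It is only a sketch: the function name `interpreter_warnings` and the wording of the messages are made up for illustration and are not part of the toolkit.

```python
import sys


def interpreter_warnings(version_info=sys.version_info, maxsize=sys.maxsize):
    """Hypothetical helper: list deviations from the recommended Python setup."""
    warnings = []
    # The documentation recommends Python 3.7 (tested on 3.7.9).
    if tuple(version_info[:2]) != (3, 7):
        warnings.append(
            "Python 3.7 is recommended, found %d.%d" % tuple(version_info[:2])
        )
    # On a 64-bit build, sys.maxsize is larger than 2**32.
    if maxsize <= 2 ** 32:
        warnings.append("A 64-bit Python build is recommended")
    return warnings


if __name__ == "__main__":
    for message in interpreter_warnings():
        print("Warning:", message)
```

An empty result means the running interpreter matches the recommendation; the package versions themselves are still governed by `requirements.txt`.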
## Software Architecture
The toolkit is based on the following architecture. It is possible to run the toolkit locally to train models and create assets, but we highly recommend using the Microsoft Azure ecosystem to leverage the full functionality and provision it as a cloud service.
@ -44,9 +46,6 @@ The intended purpose is illustrated using [Microsoft Forum](https://answers.micr
Verseagility also allows you to set up your own personal demo WebApp on Azure. This can be done with your logo/that of your customer or end-to-end with your own data. The different approaches are described [here](demo-environment/README.md).
## Code Repository
The code and detailed instructions can be found on the [Verseagility](https://github.com/microsoft/verseagility) GitHub repository. In case you cannot access the repository, you first have to join the Microsoft GitHub organization. You will find instructions on the [FAQ page](FAQ.md). See the repository for the latest features, work in progress and open to-dos. **Feel free to contribute!**
## Questions / FAQ / Problems / Contact
Feel free to create an issue or pull request in GitHub if you find a bug, have a feature request or a question.
Alternatively, feel free to reach out to the main contributors to this project:
@ -55,3 +54,5 @@ Alternatively, feel free to reach out to the main contributors to this project:
- [Christian Vorhemus](mailto:christian.vorhemus@microsoft.com)
Also, see our [FAQ page](FAQ.md) which is going to be continuously extended.
**Feel free to contribute!**

View File

@ -56,6 +56,45 @@ As described before, currently the following tasks are supported:
- Question Answering
The section below covers briefly what they consist of, which dependencies they have and how you can customize them further.
### **Classification**
This section describes which models are used to train classification models. Both multi-class and multi-label approaches are supported and facilitated by the [FARM](https://github.com/deepset-ai/FARM) framework. We primarily use so-called Transformer models to train classification assets.
#### **Transformers**
- Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
- Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on the HuggingFace model hub. At the same time, each Python module defining an architecture can be used standalone and modified to enable quick research experiments.
- Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with a seamless integration between them, allowing you to train your models with one and then load them for inference with the other. In this setup, we use PyTorch.
The models used are defined in `src/helper.py` and the dictionary below can be extended by other model names and languages. The list of pretrained models for many purposes can be found on [HuggingFace](https://huggingface.co/transformers/pretrained_models.html).
```python
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
        'fr': 'camembert-base',
        'cn': 'bert-base-chinese'
    },
    'roberta': {
        'en': 'roberta-base',
        'de': 'roberta-base',
        'fr': 'roberta-base',
        'es': 'roberta-base',
        'it': 'roberta-base'
        # All languages for roberta because of multi_classificaiton
    },
    'xlm-roberta': {
        'xx': 'xlm-roberta-base'
    },
    'albert': {
        'en': 'albert-base-v2'
    },
    'distilbert': {
        'xx': 'distilbert-base-multilingual-cased',
        'de': 'distilbert-base-german-cased'
    }
}
```
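To illustrate how such a dictionary can be queried, here is a minimal sketch. The helper `resolve_model_name` is hypothetical and not taken from `src/helper.py`. It also assumes that the `'xx'` key marks a multilingual model that can serve as a fallback, which the model names suggest but the documentation does not state.

```python
# Shortened excerpt of the lookup table, repeated so the example runs on its own.
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
    },
    'xlm-roberta': {
        'xx': 'xlm-roberta-base',
    },
}


def resolve_model_name(model_type, language, lookup=farm_model_lookup):
    """Hypothetical helper: pick a pretrained model for a type and language code."""
    if model_type not in lookup:
        raise KeyError("Unknown model type: %r" % model_type)
    languages = lookup[model_type]
    if language in languages:
        return languages[language]
    # Assumption: 'xx' denotes a multilingual model usable as a fallback.
    if 'xx' in languages:
        return languages['xx']
    raise KeyError("No %r model for language %r" % (model_type, language))


print(resolve_model_name('bert', 'de'))
print(resolve_model_name('bert', 'es'))
```

Adding a language then only requires a new key-value pair under the respective model type.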
### **Named Entity Recognition**
The toolkit supports and includes different approaches and frameworks for recognizing relevant entities in text paragraphs, called _Named Entity Recognition_, or _NER_ for short:
- [Azure Text Analytics API](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking?tabs=version-3)
@ -88,7 +127,7 @@ The most basic approach of Named Entity Recognition in text files is to take use
4. Make sure the values are all lower-case, but the keys should be properly formatted
### **Question Answering**
This page is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are being ranked during runtime. Please keep in mind that this component of the toolkit requires a large amount of potential answers for each text that has been trained along with the input texts in order to
This section is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are being ranked during runtime. Please keep in mind that this component of the toolkit requires a large amount of potential answers for each text that has been trained along with the input texts in order to
#### **Ranking Algorithm**
The current version of Verseagility supports the Okapi BM25 information retrieval algorithm to sort historical question-answer pairs by relevance. BM25 is a ranking approach used by search engines to estimate the relevance of a document to a given search query, such as a text or document. This is implemented using the [gensim library](https://radimrehurek.com/gensim/summarization/bm25.html). The ranking framework is accessed through the file `code/rank.py`.
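To give an intuition for how BM25 scores documents, the following is a self-contained sketch in plain Python. It is not the gensim implementation and not the code in `code/rank.py`. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant and the sample questions are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in a non-empty corpus against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (the "+ 1" inside the logarithm).
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            # Term frequency saturation with document length normalization.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# Made-up historical questions, tokenized by whitespace.
historical_questions = [
    "how do i reset my password".split(),
    "windows update fails with error".split(),
    "cannot install office on my laptop".split(),
]
query = "password reset not working".split()
scores = bm25_scores(query, historical_questions)
# Indices of the historical questions, most relevant first.
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(ranking[0], scores)
```

In the toolkit, the answer attached to the best-ranked historical question would then be offered as a suggestion; the sketch only covers the scoring step.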

View File

@ -2,17 +2,17 @@
The following section covers aspects of data pre-processing used for different tasks. Some pre-processing steps are universal, while others are task-specific. They are split accordingly. You may edit or comment out some steps, or even add further ones depending on your needs, in the file `src/prepare.py`. The snippet below, for example, covers data cleaning steps exclusively for German emails and support tickets, with typical phrases occurring in these kinds of documents. You may add further phrases for the language code needed and they will be considered in the data preparation.
```python
#DE
if self.language == 'de':
    line = re.sub(r'\b(mit )?(beste|viele|liebe|freundlich\w+)? (gr[u,ü][ß,ss].*)', '', line, flags=re.I)
    line = re.sub(r'\b(besten|herzlichen|lieben) dank.*', '', line, flags=re.I)
    line = re.sub(r'\bvielen dank für ihr verständnis.*', '', line, flags=re.I)
    line = re.sub(r'\bvielen dank im voraus.*', '', line, flags=re.I)
    line = re.sub(r'\b(mfg|m\.f\.g) .*','', line, flags=re.I)
    line = re.sub(r'\b(lg) .*','',line, flags=re.I)
    line = re.sub(r'\b(meinem iPhone gesendet) .*','',line, flags=re.I)
    line = re.sub(r'\b(Gesendet mit der (WEB|GMX)) .*','',line, flags=re.I)
    line = re.sub(r'\b(Diese E-Mail wurde von Avast) .*','',line, flags=re.I)
# DE
if self.language == 'de':
    line = re.sub(r'\b(mit )?(beste|viele|liebe|freundlich\w+)? (gr[u,ü][ß,ss].*)', '', line, flags=re.I)
    line = re.sub(r'\b(besten|herzlichen|lieben) dank.*', '', line, flags=re.I)
    line = re.sub(r'\bvielen dank für ihr verständnis.*', '', line, flags=re.I)
    line = re.sub(r'\bvielen dank im voraus.*', '', line, flags=re.I)
    line = re.sub(r'\b(mfg|m\.f\.g) .*','', line, flags=re.I)
    line = re.sub(r'\b(lg) .*','',line, flags=re.I)
    line = re.sub(r'\b(meinem iPhone gesendet) .*','',line, flags=re.I)
    line = re.sub(r'\b(Gesendet mit der (WEB|GMX)) .*','',line, flags=re.I)
    line = re.sub(r'\b(Diese E-Mail wurde von Avast) .*','',line, flags=re.I)
```
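As a rough idea of how phrases for another language could be added, the sketch below moves the patterns into a dictionary keyed by language code. This is not how `src/prepare.py` is structured. The names `CLOSING_PATTERNS` and `strip_closing_phrases` and the English patterns are hypothetical and only illustrate the approach.

```python
import re

# Hypothetical pattern table: two German patterns from the snippet above plus
# made-up English examples. Everything from the match to the end of the line is removed.
CLOSING_PATTERNS = {
    "de": [
        r"\bvielen dank im voraus.*",
        r"\b(mfg|m\.f\.g) .*",
    ],
    "en": [
        r"\b(best|kind) regards.*",
        r"\bthanks in advance.*",
    ],
}


def strip_closing_phrases(line, language):
    """Remove typical closing phrases for the given language code from one line."""
    for pattern in CLOSING_PATTERNS.get(language, []):
        line = re.sub(pattern, "", line, flags=re.I)
    return line.strip()


print(strip_closing_phrases("My printer is broken. Kind regards, Alex", "en"))
print(strip_closing_phrases("Der Drucker ist defekt. Vielen Dank im Voraus!", "de"))
```

Lines in a language without an entry in the table are returned unchanged apart from surrounding whitespace.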
## Universal Steps

View File

@ -1,6 +1,13 @@
# Training
This part of the documentation serves as guideline for the model training process. The tasks being submitted for training depend on which tasks you have defined in your config files.
## Naming
The naming of the experiments in Azure Machine Learning is structured as follows:
- `\<project name\>(-\<task\>)-\<step\>(-\<environment\>)`
- Where step is one of [source, train, deploy], for data assets.
- Where task is an int, referring to the parameters, for models.
- Example: `msforum_
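The naming pattern can be sketched as a small function. The helper `build_asset_name` and the sample values `myproject` and `dev` are hypothetical; only the hyphen-separated pattern and the allowed steps come from the list above.

```python
VALID_STEPS = ("source", "train", "deploy")


def build_asset_name(project, step, task=None, environment=None):
    """Hypothetical helper: <project name>(-<task>)-<step>(-<environment>)."""
    if step not in VALID_STEPS:
        raise ValueError("step must be one of %s" % (VALID_STEPS,))
    parts = [project]
    # The task is an int referring to the parameters; it is optional.
    if task is not None:
        parts.append(str(int(task)))
    parts.append(step)
    # The environment suffix is optional as well.
    if environment:
        parts.append(environment)
    return "-".join(parts)


print(build_asset_name("myproject", "train", task=1))
print(build_asset_name("myproject", "source", environment="dev"))
```

Optional parts are simply left out, so a name always contains at least the project and the step.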
## Initiation of the Training Process
After setting up your projects in the previous pages, you are now ready to train your models. This training step incorporates the classification, named entity recognition and question/answering models all in one.
1. Open your command line in VSCode, PowerShell or bash.