This commit is contained in:
Martin Kayser 2020-08-24 16:24:50 +02:00
Parent 6aa3b9bcb1
Commit 80d2a27e98
15 changed files: 148 additions and 52 deletions

View file

@@ -5,46 +5,64 @@ Verseagility is a Python-based toolkit for your custom natural language processing
See the [wiki](https://dev.azure.com/DAISolutions/KnowledgeMining/_wiki/wikis) for detailed documentation on how to get started with the toolkit.
## Supported Use cases
## Supported Use Cases
- Binary, multi-class & multi-label classification
- Named entity recognition
- Question answering
- Text summarization
## Live Demo
A live demo of models built with Verseagility is hosted at MTC Germany:
> https://aka.ms/nlp-demo
Repository Structure
------------
├── README.md          <- The top-level README for developers using this project.
├── assets             <- Version-controlled assets, such as stopword lists. Max size
│                         per file: 10 MB. Training data should be stored in a local
│                         data directory, outside of the repository or covered by .gitignore.
├── demo               <- Demo environment that can be deployed as is, or customized.
├── notebooks          <- Jupyter notebooks. Naming convention is <[Task]-[Short Description]>,
│                         for example: 'Data - Exploration.ipynb'
├── pipeline           <- Document processing pipeline components, including the document cracker.
├── scraper            <- Website scraper used to fetch sample data.
│                         Can be reused for similarly structured forum websites.
├── src                <- Source code for use in this project.
│   ├── infer.py       <- Inference file, for scoring the model
│   ├── data.py        <- Use-case-agnostic utils file, for data management incl. upload/download
│   └── helper.py      <- Use-case-agnostic utils file, with common functions incl. secret handling
├── deploy             <- Scripts used for deploying training or test services
│   ├── training.py    <- Deploy your training to a remote compute instance, via AML
│   ├── hyperdrive.py  <- Deploy a hyperparameter sweep on a remote compute instance, via AML
│   └── service.py     <- Deploy a service (endpoint) to ACI or AKS, via AML
├── tests              <- Unit tests (using pytest)
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│                         Can be generated using `pip freeze > requirements.txt`
└── config.ini         <- Configuration and secrets used while developing locally.
                          Secrets in production should be stored in Azure KeyVault.
--------
## Naming
### Assets
> \<project name\>(-\<task\>)-\<step\>(-\<environment\>)
- where *step* is one of [source, train, deploy], for data assets.
- where *task* is an integer referring to the parameters, for models.
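For illustration (the project name `msforum` is hypothetical, not from the source): a data asset might be named `msforum-source` or `msforum-train-dev`, and a model `msforum-1-deploy`.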
## TODO
### Classification
- [ ] **(IP)** multi-label support
- [ ] integrate handling for long vs. short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support
### NER
- [ ] improve duplicate handling
### Question Answering
- [ ] Apply advanced IR methods
### Deployment
- [ ] Deploy service to Azure Function (without AzureML)
### Notebooks
- [x] review prepared data
- [ ] **(IP)** review model results (auto-generated after each training step)
- [ ] review model bias (auto-generated after each training step)
- [ ] **(IP)** benchmark of available models (incl. AutoML)
### Tests
- [ ] unit tests (pytest)
### New Features (TBD)
- [ ] **(IP)** Summarization
- [ ] Deployable feedback loop
- [ ] Integration with GitHub Actions
## Acknowledgements
Verseagility is built in part using the following:
- [Transformers](https://github.com/huggingface/pytorch-transformers) by HuggingFace
@@ -58,6 +76,29 @@ Maintainers:
- [Christian Vorhemus](mailto:christian.vorhemus@microsoft.com)
- [Martin Kayser](mailto:martin.kayser@microsoft.com)
## To-Dos
The following section contains a list of possible new features or enhancements. Feel free to contribute.
### Classification
- [x] multi-label support
- [ ] integrate handling for long vs. short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support
### NER
- [ ] improve duplicate handling
### Question Answering
- [ ] apply advanced IR methods
### Summarization
- [ ] **(IP)** full integration test
### Deployment
- [ ] deploy service to Azure Function (without AzureML)
- [ ] set up GitHub Actions
### Notebook Templates
- [ ] **(IP)** review model results (auto-generated after each training step)
- [ ] review model bias (auto-generated after each training step)
- [ ] **(IP)** benchmark of available models (incl. AutoML)
### Tests
- [ ] unit tests (pytest)
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
@@ -69,4 +110,4 @@ provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

View file

@@ -1,3 +1,5 @@
# Reuse this for local development.
# All keys used remotely should be stored in, and retrieved from, Azure KeyVault
[environ]
aml-ws-name=
aml-ws-sid=
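A minimal sketch of how these keys could be read during local development, using only the standard library; the section and key names come from the snippet above, and the values are placeholders:

```python
# Minimal local-dev sketch: read config.ini with the standard library.
# Section/key names are from the snippet above; values are placeholders.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")
ws_name = config["environ"]["aml-ws-name"]  # AML workspace name
ws_sid = config["environ"]["aml-ws-sid"]    # AML subscription id (assumed meaning)
print(ws_name, ws_sid)
```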

requirements.txt (new file, 43 additions)
View file

@@ -0,0 +1,43 @@
numpy>=1.18.1
pandas>=1.0.3
azure-storage-blob==2.1.0
azure-cosmos==3.1.2
azureml-sdk>=1.1.5
azureml-dataprep[pandas,fuse]>=1.3.5
# mlflow>=1.6.0 #NOT NEEDED?
# azureml-mlflow>=1.1.5 #NOT NEEDED?
azure-ai-textanalytics==5.0.0 #UPGRADED?!
gensim==3.8.0
spacy==2.3.2 #UPGRADED
transformers==3.0.2 #UPGRADED
farm==0.4.6 #UPGRADED
flair==0.6 #UPGRADED
selenium==3.141.0
bs4
boto3
scipy>=1.3.2
sklearn
seqeval
dotmap==1.3.0
pyyaml
opencensus==0.7
opencensus-ext-azure==1
##DEPLOY ONLY
azure-keyvault==1.1.0
azure-identity==1.3.1
azure-keyvault-secrets==4.1.0
##LOCAL ONLY
pylint>=2.4.4
pytest>=5.3.5
flake8>=3.7.9
ipykernel>=5.1.4
streamlit==0.65
tqdm
# - pytorch=1.4.0 # NOTE: UNCOMMENT THIS FOR LOCAL ENV INSTALL, BUT COMMENT IT AGAIN FOR TRAINING/DEPLOYMENT
# - cudatoolkit=10.1 #For GPU training only

View file

@@ -1,10 +1,15 @@
''' MICROSOFT FORUM PAGE SCRAPER '''
""" MICROSOFT FORUM PAGE SCRAPER
Website: answers.microsoft.com
Example:
> python 1_getsites.py --language de-de --product xbox
"""
import argparse
import re
import sys
import os
# Run arguments
# example: python 1_getsites.py --language de-de --product xbox
parser = argparse.ArgumentParser()
parser.add_argument("--language",
default="de-de",

View file

@@ -1,4 +1,10 @@
''' MICROSOFT FORUM TICKET SCRAPER '''
""" MICROSOFT FORUM TICKET SCRAPER
Website: answers.microsoft.com
Example:
> python 2_extract.py --language de-de --product windows
"""
import re
import urllib
import urllib.request
@@ -28,7 +34,6 @@ parser.add_argument('--product',
help="['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge','ie','musicandvideo']")
args = parser.parse_args()
# Example: python 2_extract.py --language de-de --product windows
# Set params
lang = args.language
@@ -184,22 +189,22 @@ def scrapeMe(url, product):
    file.write(content+",")
    print(f"[SUCCESS] - File {fileid}\n")
''' LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON '''
# Check mode
# LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON
## Check mode
if productsel == "list":
    products = ['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge', 'musicandvideo', 'msteams', 'microsoftedge']
else:
    products = [productsel]
# Loop through product
## Loop through product
for product in products:
    try:
        # Read File
        ### Read File
        docs = codecs.open(f"output-{product}-{lang}.txt", 'r', encoding='utf-8').read()
        # Prepare Links
        ### Prepare Links
        url_temp = re.findall(r'(https?://answers.microsoft.com/' + lang + r'/' + product + r'/forum/[^\s]+)', docs)
        url_temp2 = [s.strip('"') for s in url_temp]
        url_list = [x for x in url_temp2 if not x.endswith('LastReply')]
        # Drop duplicates
        ### Drop duplicates
        url_list = list(dict.fromkeys(url_list))
        failed_url = []
        for i, value in enumerate(url_list):
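The `list(dict.fromkeys(...))` idiom above removes duplicate URLs while keeping their first-seen order; a standalone sketch with illustrative data:

```python
# Order-preserving de-duplication (dicts keep insertion order since Python 3.7)
urls = ["https://a", "https://b", "https://a", "https://c"]
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['https://a', 'https://b', 'https://c']
```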

View file

@@ -23,7 +23,7 @@ from farm.modeling.language_model import LanguageModel, Roberta, Albert, DistilBert
from farm.modeling.prediction_head import TextClassificationHead, MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer, RobertaTokenizer, AlbertTokenizer
from farm.train import Trainer, EarlyStopping
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from farm.eval import Evaluator
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
f1_score, mean_squared_error, r2_score)
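The sklearn metrics imported above are presumably used during evaluation; a toy sketch of how they are typically called (labels are illustrative, not project data):

```python
# Toy example of the imported sklearn metrics; labels are illustrative only.
from sklearn.metrics import matthews_corrcoef, f1_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print(matthews_corrcoef(y_true, y_pred))          # correlation-style score in [-1, 1]
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
```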

View file

@@ -36,7 +36,7 @@ except Exception as e:
############################################
def get_logger(level='info', location = None, excl_az_storage=True):
    '''Get runtime logger'''
    """Get runtime logger"""
    global logger
    # Exceptions
@@ -65,7 +65,7 @@ def get_logger(level='info', location = None, excl_az_storage=True):
    return logger
def get_context():
    '''Get AML Run Context for Logging to AML Services'''
    """Get AML Run Context for Logging to AML Services"""
    try:
        run = Run.get_context()
    except Exception as e:
@@ -377,7 +377,7 @@ def append_ner(v, s, e, l, t=''):
############################################
# def decrypt(token, dataframe=False):
# ''' Decrypt symmetric object using Fernet '''
# """ Decrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# token = f.decrypt(token)
@@ -395,7 +395,7 @@ def append_ner(v, s, e, l, t=''):
# df.to_csv(fn.replace('.enc', '.txt'), sep='\t', encoding='utf-8', index=False) #TODO: match encrypt fn out
# def encrypt(token, dataframe=False):
# ''' Encrypt symmetric object using Fernet '''
# """ Encrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# if dataframe:
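The commented-out helpers wrap symmetric encryption with Fernet from the `cryptography` package; a minimal round-trip sketch, independent of the repo's `get_secret()` (which is assumed to return a Fernet key):

```python
# Minimal Fernet round-trip; in the helpers above the key would come from
# get_secret() (an assumption) -- here one is generated for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # urlsafe base64-encoded 32-byte key
f = Fernet(key)
token = f.encrypt(b"sensitive text")
assert f.decrypt(token) == b"sensitive text"
```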

View file

@@ -24,7 +24,7 @@ from farm.modeling.language_model import Roberta
from farm.modeling.prediction_head import MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
f1_score, mean_squared_error, r2_score)
from farm.evaluation.metrics import simple_accuracy, register_metrics

View file

@@ -19,7 +19,7 @@ from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TokenClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from azure.ai.textanalytics import TextAnalyticsClient, TextAnalyticsApiKeyCredential
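The new import pulls in the Azure Text Analytics client, presumably for NER pre-annotation. A rough usage sketch follows; the endpoint and key are placeholders, and `TextAnalyticsApiKeyCredential` belongs to early SDK releases, so treat the exact call shapes as assumptions:

```python
# Rough sketch only: credential class and parameter shapes varied across
# early azure-ai-textanalytics releases; endpoint/key are placeholders.
from azure.ai.textanalytics import TextAnalyticsClient, TextAnalyticsApiKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=TextAnalyticsApiKeyCredential("<your-key>"))
for doc in client.recognize_entities(["Satya Nadella is the CEO of Microsoft."]):
    for entity in doc.entities:
        print(entity.text, entity.category)
```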

View file

@@ -141,7 +141,7 @@ class Clean():
                rp_generic=False,
                rp_custom=False,
                rp_num=False):
        '''Replace text with type specific placeholders'''
        """Replace text with type specific placeholders"""
        # Customer placeholders
@@ -166,7 +166,7 @@ class Clean():
        return line
    def tokenize(self, line, lemmatize = False, rm_stopwords = False):
        '''Tokenizer for non DL tasks'''
        """Tokenizer for non DL tasks"""
        if not isinstance(line, str):
            line = str(line)
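Based only on the `tokenize` signature visible in this hunk, a hypothetical usage sketch; the `Clean()` constructor arguments are not shown in the diff, so calling it without arguments is an assumption:

```python
# Hypothetical usage of Clean based on the signature above;
# Clean() with no arguments is an assumption, as is the sample text.
cl = Clean()
tokens = cl.tokenize("The printer has stopped working.",
                     lemmatize=False, rm_stopwords=True)
print(tokens)
```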

View file

@@ -9,13 +9,13 @@ from gensim.summarization.summarizer import summarize
import networkx as nx
import numpy as np
''' BERTABS '''
""" BERTABS """
def summarizeText(text, minLength=60):
    result = model(text, min_length=minLength)
    full = ''.join(result)
    return full
''' SAMPLING '''
""" SAMPLING """
def sentencenize(text):
    sentences = []
    for sent in text:
@@ -36,11 +36,11 @@ def removeStopwords(sen, sw):
sentence = " ".join([i for i in sen if i not in sw])
return sentence
''' BERTABS '''
""" BERTABS """
model = Summarizer()
''' SAMPLING '''
""" SAMPLING """
clean_sentences = [removeStopwords(r.split(), sw) for r in clean_sentences]
''' GENSIM '''
""" GENSIM """
summarize()
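This hunk shows three summarization routes side by side: BERT extractive via `Summarizer()` (presumably the bert-extractive-summarizer package), a sampling/TextRank-style path, and gensim. As a standalone sketch, the gensim 3.8 call might look like this; the sample text is illustrative only:

```python
# gensim 3.8.x extractive summarization; needs multi-sentence input,
# and very short inputs may log warnings or yield empty output.
from gensim.summarization.summarizer import summarize

text = ("Verseagility is a toolkit for custom NLP tasks. "
        "It covers classification, NER, question answering and summarization. "
        "Models can be deployed to ACI or AKS via Azure Machine Learning. "
        "A demo environment can be deployed as is or customized.")
print(summarize(text, ratio=0.5))
```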