Update repo structure

Parent: 6aa3b9bcb1
Commit: 80d2a27e98

README.md (95 changes)

@@ -5,46 +5,64 @@ Verseagility is a Python-based toolkit for your custom natural language processing

See the [wiki](https://dev.azure.com/DAISolutions/KnowledgeMining/_wiki/wikis) for detailed documentation on how to get started with the toolkit.

## Supported Use cases
## Supported Use Cases
- Binary, multi-class & multi-label classification
- Named entity recognition
- Question answering
- Text summarization

## Live Demo
The live demo of models resulting from Verseagility is hosted at MTC Germany:

> https://aka.ms/nlp-demo

Repository Structure
------------

├── README.md          <- The top-level README for developers using this project.
│
├── assets             <- Version controlled assets, such as stopword lists. Max size
│                         per file: 10 MB. Training data should be stored in a local data
│                         directory, outside of the repository or within .gitignore.
│
├── demo               <- Demo environment that can be deployed as is, or customized.
│
├── notebooks          <- Jupyter notebooks. Naming convention is <[Task]-[Short Description]>,
│                         for example: 'Data - Exploration.ipynb'
│
├── pipeline           <- Document processing pipeline components, including the document cracker.
│
├── scraper            <- Website scraper used to fetch sample data.
│                         Can be reused for similarly structured forum websites.
│
├── src                <- Source code for use in this project.
│   ├── infer.py       <- Inference file, for scoring the model
│   ├── data.py        <- Use case agnostic utils file, for data management incl upload/download
│   └── helper.py      <- Use case agnostic utils file, with common functions incl secret handling
│
├── deploy             <- Scripts used for deploying training or test services
│   ├── training.py    <- Deploy your training to a remote compute instance, via AML
│   ├── hyperdrive.py  <- Deploy a hyperparameter sweep on a remote compute instance, via AML
│   └── service.py     <- Deploy a service (endpoint) to ACI or AKS, via AML
│
├── tests              <- Unit tests (using pytest)
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│                         Can be generated using `pip freeze > requirements.txt`
│
└── config.ini         <- Configuration and secrets used while developing locally.
                          Secrets in production should be stored in the Azure KeyVault.
--------

## Naming

### Assets

> \<project name\>(-\<task\>)-\<step\>(-\<environment\>)

- where step is one of [source, train, deploy], for data assets.
- where task is an int referring to the parameters, for models (see the sketch below).
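
As an illustration of the convention, the minimal sketch below composes an asset name; the project name, task id, and environment values are hypothetical, not taken from the repository:

```python
# Hypothetical example of the asset naming convention:
# <project name>(-<task>)-<step>(-<environment>)
project = "msforum"   # hypothetical project name
task = 1              # hypothetical task id (int referring to the parameters)
step = "train"        # one of: source, train, deploy
environment = "dev"   # optional environment suffix

asset_name = f"{project}-{task}-{step}-{environment}"
print(asset_name)  # -> msforum-1-train-dev
```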

## TODO

### Classification
- [ ] **(IP)** multi label support
- [ ] integrate handling for larger documents vs short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support

### NER
- [ ] improve duplicate handling

### Question Answering
- [ ] Apply advanced IR methods

### Deployment
- [ ] Deploy service to Azure Function (without AzureML)

### Notebooks
- [x] review prepared data
- [ ] **(IP)** review model results (auto generate after each training step)
- [ ] review model bias (auto generate after each training step)
- [ ] **(IP)** available models benchmark (incl AutoML)

### Tests
- [ ] unit tests (pytest)

### New Features (TBD)
- [ ] **(IP)** Summarization
- [ ] Deployable feedback loop
- [ ] Integration with GitHub Actions

## Acknowledgements
Verseagility is built in part using the following:
- [Transformers](https://github.com/huggingface/pytorch-transformers) by HuggingFace

@@ -58,6 +76,29 @@ Maintainers:
- [Christian Vorhemus](mailto:christian.vorhemus@microsoft.com)
- [Martin Kayser](mailto:martin.kayser@microsoft.com)

## To-Dos
The following section contains a list of possible new features or enhancements. Feel free to contribute.

### Classification
- [x] multi label support
- [ ] integrate handling for larger documents vs short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support

### NER
- [ ] improve duplicate handling

### Question Answering
- [ ] apply advanced IR methods

### Summarization
- [ ] **(IP)** full test of integration

### Deployment
- [ ] deploy service to Azure Function (without AzureML)
- [ ] set up GitHub Actions

### Notebook Templates
- [ ] **(IP)** review model results (auto generate after each training step)
- [ ] review model bias (auto generate after each training step)
- [ ] **(IP)** available models benchmark (incl AutoML)

### Tests
- [ ] unit tests (pytest)

## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us

@@ -69,4 +110,4 @@ provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

@@ -1,3 +1,5 @@
# Reuse this for local development.
# All keys used remotely should be stored in, and retrieved from, Azure KeyVault
[environ]
aml-ws-name=
aml-ws-sid=
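
As the comments note, config.ini is only for local development and production secrets belong in Azure KeyVault. A minimal sketch of reading the section above locally with Python's standard configparser:

```python
# Minimal sketch: read local development settings from config.ini.
# Assumes the [environ] section shown above; in production, secrets
# should be retrieved from Azure Key Vault instead of this file.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

aml_ws_name = config["environ"]["aml-ws-name"]
aml_ws_sid = config["environ"]["aml-ws-sid"]
print(aml_ws_name, aml_ws_sid)
```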

@@ -0,0 +1,43 @@
numpy>=1.18.1
pandas>=1.0.3

azure-storage-blob==2.1.0
azure-cosmos==3.1.2
azureml-sdk>=1.1.5
azureml-dataprep[pandas,fuse]>=1.3.5
# mlflow>=1.6.0 #NOT NEEDED?
# azureml-mlflow>=1.1.5 #NOT NEEDED?

azure-ai-textanalytics==5.0.0 #UPGRADED?!
gensim==3.8.0
spacy==2.3.2 #UPGRADED
transformers==3.0.2 #UPGRADED
farm==0.4.6 #UPGRADED
flair==0.6 #UPGRADED

selenium==3.141.0

bs4
boto3
scipy>=1.3.2
sklearn
seqeval
dotmap==1.3.0
pyyml

opencensus==0.7
opencensus-ext-azure==1
##DEPLOY ONLY
azure-keyvault==1.1.0
azure-identity==1.3.1
azure-keyvault-secrets==4.1.0
##LOCAL ONLY
pylint>=2.4.4
pytest>=5.3.5
flake8>=3.7.9
ipykernel>=5.1.4
streamlit==0.65
tqdm

# - pytorch=1.4.0 # NOTE: UNCOMMENT THIS FOR LOCAL ENV INSTALL, BUT COMMENT IT AGAIN FOR TRAINING/DEPLOYMENT
# - cudatoolkit=10.1 #For GPU training only
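
The two commented lines above note that pytorch (and cudatoolkit for GPUs) is only installed into the local environment, while training and deployment environments provide it separately. A minimal sketch to verify a local setup, assuming pytorch has been installed:

```python
# Minimal local-environment check (assumes pytorch was installed locally
# per the commented note above; training/deployment images provide it separately).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # True only with cudatoolkit and a GPU present
```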

@@ -1,10 +1,15 @@
''' MICROSOFT FORUM PAGE SCRAPER '''
""" MICROSOFT FORUM PAGE SCRAPER

Website: answers.microsoft.com
Example:
> python 1_getsites.py --language de-de --product xbox

"""
import argparse
import re
import sys
import os
# Run arguments
# example: python 1_getsites.py --language de-de --product xbox

parser = argparse.ArgumentParser()
parser.add_argument("--language",
                    default="de-de",
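
The hunk is cut off after the `--language` argument. Based on the usage example in the docstring, the script also takes `--product`; the sketch below is a hypothetical reconstruction of the argument parsing, and the `--product` default and help texts are assumptions rather than code from the file:

```python
# Hypothetical sketch of the full argument parsing for 1_getsites.py,
# based on the usage example in the docstring above. The --product
# default and help strings are assumptions, not taken from the file.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--language",
                    default="de-de",
                    help="Forum language code, e.g. de-de")
parser.add_argument("--product",
                    default="xbox",
                    help="Forum product, e.g. xbox or windows")
args = parser.parse_args()
```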

@@ -1,4 +1,10 @@
''' MICROSOFT FORUM TICKET SCRAPER '''
""" MICROSOFT FORUM TICKET SCRAPER

Website: answers.microsoft.com
Example:
> python 2_extract.py --language de-de --product windows

"""
import re
import urllib
import urllib.request

@@ -28,7 +34,6 @@ parser.add_argument('--product',
                    help="['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge','ie','musicandvideo']")
args = parser.parse_args()

# Example: python 2_extract.py --language de-de --product windows

# Set params
lang = args.language

@@ -184,22 +189,22 @@ def scrapeMe(url, product):
    file.write(content+",")
    print(f"[SUCCESS] - File {fileid}\n")

''' LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON '''
# Check mode
# LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON
## Check mode
if productsel == "list":
    products = ['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge', 'musicandvideo', 'msteams', 'microsoftedge']
else:
    products = [productsel]
# Loop through product
## Loop through product
for product in products:
    try:
        # Read File
        ### Read File
        docs = codecs.open(f"output-{product}-{lang}.txt", 'r', encoding='utf-8').read()
        # Prepare Links
        ### Prepare Links
        url_temp = re.findall(r'(https?://answers.microsoft.com/' + lang + r'/' + product + r'/forum/[^\s]+)', docs)
        url_temp2 = [s.strip('"') for s in url_temp]
        url_list = [x for x in url_temp2 if not x.endswith('LastReply')]
        # Drop duplicates
        ### Drop duplicates
        url_list = list(dict.fromkeys(url_list))
        failed_url = []
        for i, value in enumerate(url_list):
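
The de-duplication step above uses `list(dict.fromkeys(...))`, which drops repeated URLs while keeping their first-seen order. A tiny self-contained illustration with made-up sample URLs:

```python
# list(dict.fromkeys(...)) removes duplicates while preserving the order of
# first appearance (dicts keep insertion order in Python 3.7+).
urls = [
    "https://answers.microsoft.com/de-de/xbox/forum/thread-a",  # made-up sample
    "https://answers.microsoft.com/de-de/xbox/forum/thread-b",
    "https://answers.microsoft.com/de-de/xbox/forum/thread-a",  # duplicate
]
deduped = list(dict.fromkeys(urls))
print(deduped)  # ['...thread-a', '...thread-b']
```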

@@ -23,7 +23,7 @@ from farm.modeling.language_model import LanguageModel, Roberta, Albert, DistilB
from farm.modeling.prediction_head import TextClassificationHead, MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer, RobertaTokenizer, AlbertTokenizer
from farm.train import Trainer, EarlyStopping
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from farm.eval import Evaluator
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
                             f1_score, mean_squared_error, r2_score)

@@ -36,7 +36,7 @@ except Exception as e:
############################################

def get_logger(level='info', location = None, excl_az_storage=True):
    '''Get runtime logger'''
    """Get runtime logger"""
    global logger

    # Exceptions

@@ -65,7 +65,7 @@ def get_logger(level='info', location = None, excl_az_storage=True):
    return logger

def get_context():
    '''Get AML Run Context for Logging to AML Services'''
    """Get AML Run Context for Logging to AML Services"""
    try:
        run = Run.get_context()
    except Exception as e:
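
For orientation, a minimal usage sketch of the two helpers above; the `src.helper` import path is an assumption based on the repository structure in the README, and the logging call assumes `get_logger` returns a standard Python logger:

```python
# Minimal usage sketch of the helpers shown above. The import path is an
# assumption based on the README's repository structure (src/helper.py).
from src.helper import get_logger, get_context

logger = get_logger(level='info')  # runtime logger; Azure Storage logging presumably excluded by default
run = get_context()                # AML Run context, wrapped in try/except per the hunk above
logger.info("Starting training run ...")
```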

@@ -377,7 +377,7 @@ def append_ner(v, s, e, l, t=''):
############################################

# def decrypt(token, dataframe=False):
# ''' Decrypt symmetric object using Fernet '''
# """ Decrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# token = f.decrypt(token)

@@ -395,7 +395,7 @@ def append_ner(v, s, e, l, t=''):
# df.to_csv(fn.replace('.enc', '.txt'), sep='\t', encoding='utf-8', index=False) #TODO: match encrypt fn out

# def encrypt(token, dataframe=False):
# ''' Encrypt symmetric object using Fernet '''
# """ Encrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# if dataframe:
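
The commented-out decrypt/encrypt helpers wrap Fernet symmetric encryption from the `cryptography` package. A minimal round-trip sketch of that API, using a locally generated throwaway key instead of the project's `get_secret()`:

```python
# Minimal Fernet round trip (cryptography package). The commented-out helpers
# above derive the key from get_secret(); here a throwaway key is generated
# locally just to illustrate the API.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in the project, this value would come from Key Vault
f = Fernet(key)

token = f.encrypt(b"example payload")
print(f.decrypt(token))          # b'example payload'
```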

@@ -24,7 +24,7 @@ from farm.modeling.language_model import Roberta
from farm.modeling.prediction_head import MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
                             f1_score, mean_squared_error, r2_score)
from farm.evaluation.metrics import simple_accuracy, register_metrics

@@ -19,7 +19,7 @@ from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TokenClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings

from azure.ai.textanalytics import TextAnalyticsClient, TextAnalyticsApiKeyCredential

@@ -141,7 +141,7 @@ class Clean():
                rp_generic=False,
                rp_custom=False,
                rp_num=False):
        '''Replace text with type specific placeholders'''
        """Replace text with type specific placeholders"""


        # Customer placeholders

@@ -166,7 +166,7 @@ class Clean():
        return line

    def tokenize(self, line, lemmatize = False, rm_stopwords = False):
        '''Tokenizer for non DL tasks'''
        """Tokenizer for non DL tasks"""
        if not isinstance(line, str):
            line = str(line)

@@ -9,13 +9,13 @@ from gensim.summarization.summarizer import summarize
import networkx as nx
import numpy as np

''' BERTABS '''
""" BERTABS """
def summarizeText(text, minLength=60):
    result = model(text, min_length = minLength)
    full = ''.join(result)
    return full

''' SAMPLING '''
""" SAMPLING """
def sentencenize(text):
    sentences = []
    for sent in text:

@@ -36,11 +36,11 @@ def removeStopwords(sen, sw):
    sentence = " ".join([i for i in sen if i not in sw])
    return sentence

''' BERTABS '''
""" BERTABS """
model = Summarizer()

''' SAMPLING '''
""" SAMPLING """
clean_sentences = [removeStopwords(r.split(), sw) for r in clean_sentences]

''' GENSIM '''
""" GENSIM """
summarize()
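
The bare `summarize()` call at the end of the hunk appears without arguments; for context, a minimal sketch of gensim 3.8's extractive summarizer on a made-up sample text:

```python
# Minimal sketch of gensim 3.8's extractive summarizer. The text below is a
# made-up placeholder; gensim warns when the input has fewer than ~10 sentences,
# but still returns a result for short multi-sentence texts.
from gensim.summarization.summarizer import summarize

text = (
    "Verseagility is a toolkit for custom NLP use cases. "
    "It supports classification, named entity recognition, question answering and summarization. "
    "Models can be trained remotely on Azure Machine Learning. "
    "Trained models are deployed as services to ACI or AKS. "
    "A demo environment shows the resulting endpoints in action."
)

print(summarize(text, ratio=0.4))  # extractive summary: a subset of the original sentences
```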