This commit is contained in:
Martin Kayser 2020-08-24 16:24:50 +02:00
Parent 6aa3b9bcb1
Commit 80d2a27e98
15 changed files: 148 additions and 52 deletions

View file

@@ -5,46 +5,64 @@ Verseagility is a Python-based toolkit for your custom natural language processing
See the [wiki](https://dev.azure.com/DAISolutions/KnowledgeMining/_wiki/wikis) for detailed documentation on how to get started with the toolkit.
## Supported Use cases
## Supported Use Cases
- Binary, multi-class & multi-label classification
- Named entity recognition
- Question answering
- Text summarization
## Live Demo
A live demo of models built with Verseagility is hosted at MTC Germany:
> https://aka.ms/nlp-demo
Repository Structure
------------
├── README.md          <- The top-level README for developers using this project.
├── assets             <- Version-controlled assets, such as stopword lists. Max size
│                         per file: 10 MB. Training data should be stored in a local
│                         data directory, outside of the repository or covered by .gitignore.
├── demo               <- Demo environment that can be deployed as is, or customized.
├── notebooks          <- Jupyter notebooks. Naming convention is <[Task]-[Short Description]>,
│                         for example: 'Data - Exploration.ipynb'
├── pipeline           <- Document processing pipeline components, including the document cracker.
├── scraper            <- Website scraper used to fetch sample data.
│                         Can be reused for similarly structured forum websites.
├── src                <- Source code for use in this project.
│   ├── infer.py       <- Inference file, for scoring the model
│   ├── data.py        <- Use-case-agnostic utils file, for data management incl. upload/download
│   └── helper.py      <- Use-case-agnostic utils file, with common functions incl. secret handling
├── deploy             <- Scripts used for deploying training or test services
│   ├── training.py    <- Deploy your training to a remote compute instance, via AML
│   ├── hyperdrive.py  <- Deploy a hyperparameter sweep on a remote compute instance, via AML
│   └── service.py     <- Deploy a service (endpoint) to ACI or AKS, via AML
├── tests              <- Unit tests (using pytest)
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│                         Can be generated using `pip freeze > requirements.txt`
└── config.ini         <- Configuration and secrets used while developing locally.
                          Secrets in production should be stored in Azure KeyVault.
--------
## Naming
### Assets
> \<project name\>(-\<task\>)-\<step\>(-\<environment\>)
- where *step* is one of [source, train, deploy], for data assets.
- where *task* is an integer referring to the parameters, for models.
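For illustration (the project name `msforum` is hypothetical, not from the source): a data asset might be named `msforum-source` or `msforum-train-dev`, and a model `msforum-1-deploy`.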
## TODO
### Classification
- [ ] **(IP)** multi-label support
- [ ] integrate handling for long vs. short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support
### NER
- [ ] improve duplicate handling
### Question Answering
- [ ] Apply advanced IR methods
### Deployment
- [ ] Deploy service to Azure Function (without AzureML)
### Notebooks
- [x] review prepared data
- [ ] **(IP)** review model results (auto-generated after each training step)
- [ ] review model bias (auto-generated after each training step)
- [ ] **(IP)** benchmark of available models (incl. AutoML)
### Tests
- [ ] unit tests (pytest)
### New Features (TBD)
- [ ] **(IP)** Summarization
- [ ] Deployable feedback loop
- [ ] Integration with GitHub Actions
## Acknowledgements
Verseagility is built in part using the following:
- [Transformers](https://github.com/huggingface/pytorch-transformers) by HuggingFace
@@ -58,6 +76,29 @@ Maintainers:
- [Christian Vorhemus](mailto:christian.vorhemus@microsoft.com)
- [Martin Kayser](mailto:martin.kayser@microsoft.com)
## To-Dos
The following section contains a list of possible new features or enhancements. Feel free to contribute.
### Classification
- [x] multi-label support
- [ ] integrate handling for long vs. short documents
- [ ] integrate explicit handling for unbalanced datasets
- [ ] ONNX support
### NER
- [ ] improve duplicate handling
### Question Answering
- [ ] apply advanced IR methods
### Summarization
- [ ] **(IP)** full integration test
### Deployment
- [ ] deploy service to Azure Function (without AzureML)
- [ ] set up GitHub Actions
### Notebook Templates
- [ ] **(IP)** review model results (auto-generated after each training step)
- [ ] review model bias (auto-generated after each training step)
- [ ] **(IP)** benchmark of available models (incl. AutoML)
### Tests
- [ ] unit tests (pytest)
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
@@ -69,4 +110,4 @@ provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

View file

@@ -1,3 +1,5 @@
# Reuse this for local development.
# All keys used remotely should be stored in, and retrieved from, Azure KeyVault
[environ]
aml-ws-name=
aml-ws-sid=
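A minimal sketch of how these keys could be read during local development, using only the standard library; the section and key names come from the snippet above, and the values are placeholders:

```python
# Minimal local-dev sketch: read config.ini with the standard library.
# Section/key names are from the snippet above; values are placeholders.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")
ws_name = config["environ"]["aml-ws-name"]  # AML workspace name
ws_sid = config["environ"]["aml-ws-sid"]    # AML subscription id (assumed meaning)
print(ws_name, ws_sid)
```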

requirements.txt (new file, 43 additions)
View file

@@ -0,0 +1,43 @@
numpy>=1.18.1
pandas>=1.0.3
azure-storage-blob==2.1.0
azure-cosmos==3.1.2
azureml-sdk>=1.1.5
azureml-dataprep[pandas,fuse]>=1.3.5
# mlflow>=1.6.0 #NOT NEEDED?
# azureml-mlflow>=1.1.5 #NOT NEEDED?
azure-ai-textanalytics==5.0.0 #UPGRADED?!
gensim==3.8.0
spacy==2.3.2 #UPGRADED
transformers==3.0.2 #UPGRADED
farm==0.4.6 #UPGRADED
flair==0.6 #UPGRADED
selenium==3.141.0
bs4
boto3
scipy>=1.3.2
sklearn
seqeval
dotmap==1.3.0
pyyaml
opencensus==0.7
opencensus-ext-azure==1
##DEPLOY ONLY
azure-keyvault==1.1.0
azure-identity==1.3.1
azure-keyvault-secrets==4.1.0
##LOCAL ONLY
pylint>=2.4.4
pytest>=5.3.5
flake8>=3.7.9
ipykernel>=5.1.4
streamlit==0.65
tqdm
# - pytorch=1.4.0 # NOTE: UNCOMMENT THIS FOR LOCAL ENV INSTALL, BUT COMMENT IT AGAIN FOR TRAINING/DEPLOYMENT
# - cudatoolkit=10.1 #For GPU training only

View file

@@ -1,10 +1,15 @@
''' MICROSOFT FORUM PAGE SCRAPER '''
""" MICROSOFT FORUM PAGE SCRAPER
Website: answers.microsoft.com
Example:
> python 1_getsites.py --language de-de --product xbox
"""
import argparse
import re
import sys
import os
# Run arguments
# example: python 1_getsites.py --language de-de --product xbox
parser = argparse.ArgumentParser()
parser.add_argument("--language",
default="de-de",

View file

@@ -1,4 +1,10 @@
''' MICROSOFT FORUM TICKET SCRAPER '''
""" MICROSOFT FORUM TICKET SCRAPER
Website: answers.microsoft.com
Example:
> python 2_extract.py --language de-de --product windows
"""
import re
import urllib
import urllib.request
@@ -28,7 +34,6 @@ parser.add_argument('--product',
help="['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge','ie','musicandvideo']")
args = parser.parse_args()
# Example: python 2_extract.py --language de-de --product windows
# Set params
lang = args.language
@@ -184,22 +189,22 @@ def scrapeMe(url, product):
    file.write(content+",")
    print(f"[SUCCESS] - File {fileid}\n")
''' LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON '''
# Check mode
# LOOP THROUGH THE OUTPUT TEXT FILES AND CREATE JSON
## Check mode
if productsel == "list":
    products = ['windows', 'msoffice', 'xbox', 'outlook_com', 'skype', 'surface', 'protect', 'edge', 'musicandvideo', 'msteams', 'microsoftedge']
else:
    products = [productsel]
# Loop through product
## Loop through product
for product in products:
    try:
        # Read File
        ### Read File
        docs = codecs.open(f"output-{product}-{lang}.txt", 'r', encoding='utf-8').read()
        # Prepare Links
        ### Prepare Links
        url_temp = re.findall(r'(https?://answers.microsoft.com/' + lang + r'/' + product + r'/forum/[^\s]+)', docs)
        url_temp2 = [s.strip('"') for s in url_temp]
        url_list = [x for x in url_temp2 if not x.endswith('LastReply')]
        # Drop duplicates
        ### Drop duplicates
        url_list = list(dict.fromkeys(url_list))
        failed_url = []
        for i, value in enumerate(url_list):
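The `list(dict.fromkeys(...))` idiom above removes duplicate URLs while keeping their first-seen order; a standalone sketch with illustrative data:

```python
# Order-preserving de-duplication (dicts keep insertion order since Python 3.7)
urls = ["https://a", "https://b", "https://a", "https://c"]
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['https://a', 'https://b', 'https://c']
```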

View file

@@ -23,7 +23,7 @@ from farm.modeling.language_model import LanguageModel, Roberta, Albert, DistilBert
from farm.modeling.prediction_head import TextClassificationHead, MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer, RobertaTokenizer, AlbertTokenizer
from farm.train import Trainer, EarlyStopping
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from farm.eval import Evaluator
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
f1_score, mean_squared_error, r2_score)
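The sklearn metrics imported above are presumably used during evaluation; a toy sketch of how they are typically called (labels are illustrative, not project data):

```python
# Toy example of the imported sklearn metrics; labels are illustrative only.
from sklearn.metrics import matthews_corrcoef, f1_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print(matthews_corrcoef(y_true, y_pred))          # correlation-style score in [-1, 1]
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
```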

View file

@@ -36,7 +36,7 @@ except Exception as e:
############################################
def get_logger(level='info', location = None, excl_az_storage=True):
    '''Get runtime logger'''
    """Get runtime logger"""
    global logger
    # Exceptions
@@ -65,7 +65,7 @@ def get_logger(level='info', location = None, excl_az_storage=True):
    return logger
def get_context():
    '''Get AML Run Context for Logging to AML Services'''
    """Get AML Run Context for Logging to AML Services"""
    try:
        run = Run.get_context()
    except Exception as e:
@@ -377,7 +377,7 @@ def append_ner(v, s, e, l, t=''):
############################################
# def decrypt(token, dataframe=False):
# ''' Decrypt symmetric object using Fernet '''
# """ Decrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# token = f.decrypt(token)
@@ -395,7 +395,7 @@ def append_ner(v, s, e, l, t=''):
# df.to_csv(fn.replace('.enc', '.txt'), sep='\t', encoding='utf-8', index=False) #TODO: match encrypt fn out
# def encrypt(token, dataframe=False):
# ''' Encrypt symmetric object using Fernet '''
# """ Encrypt symmetric object using Fernet """
# secret = get_secret()
# f = Fernet(bytes(secret, encoding='utf-8'))
# if dataframe:
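The commented-out helpers wrap symmetric encryption with Fernet from the `cryptography` package; a minimal round-trip sketch, independent of the repo's `get_secret()` (which is assumed to return a Fernet key):

```python
# Minimal Fernet round-trip; in the helpers above the key would come from
# get_secret() (an assumption) -- here one is generated for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # urlsafe base64-encoded 32-byte key
f = Fernet(key)
token = f.encrypt(b"sensitive text")
assert f.decrypt(token) == b"sensitive text"
```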

View file

@@ -24,7 +24,7 @@ from farm.modeling.language_model import Roberta
from farm.modeling.prediction_head import MultiLabelTextClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from sklearn.metrics import (matthews_corrcoef, recall_score, precision_score,
f1_score, mean_squared_error, r2_score)
from farm.evaluation.metrics import simple_accuracy, register_metrics

View file

@@ -19,7 +19,7 @@ from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TokenClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import set_all_seeds, MLFlowLogger, initialize_device_settings
from farm.utils import set_all_seeds, initialize_device_settings
from azure.ai.textanalytics import TextAnalyticsClient, TextAnalyticsApiKeyCredential
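The new import pulls in the Azure Text Analytics client, presumably for NER pre-annotation. A rough usage sketch follows; the endpoint and key are placeholders, and `TextAnalyticsApiKeyCredential` belongs to early SDK releases, so treat the exact call shapes as assumptions:

```python
# Rough sketch only: credential class and parameter shapes varied across
# early azure-ai-textanalytics releases; endpoint/key are placeholders.
from azure.ai.textanalytics import TextAnalyticsClient, TextAnalyticsApiKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=TextAnalyticsApiKeyCredential("<your-key>"))
for doc in client.recognize_entities(["Satya Nadella is the CEO of Microsoft."]):
    for entity in doc.entities:
        print(entity.text, entity.category)
```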

View file

@@ -141,7 +141,7 @@ class Clean():
                rp_generic=False,
                rp_custom=False,
                rp_num=False):
        '''Replace text with type specific placeholders'''
        """Replace text with type specific placeholders"""
        # Customer placeholders
@@ -166,7 +166,7 @@ class Clean():
        return line
    def tokenize(self, line, lemmatize = False, rm_stopwords = False):
        '''Tokenizer for non DL tasks'''
        """Tokenizer for non DL tasks"""
        if not isinstance(line, str):
            line = str(line)
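Based only on the `tokenize` signature visible in this hunk, a hypothetical usage sketch; the `Clean()` constructor arguments are not shown in the diff, so calling it without arguments is an assumption:

```python
# Hypothetical usage of Clean based on the signature above;
# Clean() with no arguments is an assumption, as is the sample text.
cl = Clean()
tokens = cl.tokenize("The printer has stopped working.",
                     lemmatize=False, rm_stopwords=True)
print(tokens)
```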

View file

@@ -9,13 +9,13 @@ from gensim.summarization.summarizer import summarize
import networkx as nx
import numpy as np
''' BERTABS '''
""" BERTABS """
def summarizeText(text, minLength=60):
    result = model(text, min_length=minLength)
    full = ''.join(result)
    return full
''' SAMPLING '''
""" SAMPLING """
def sentencenize(text):
    sentences = []
    for sent in text:
@@ -36,11 +36,11 @@ def removeStopwords(sen, sw):
sentence = " ".join([i for i in sen if i not in sw])
return sentence
''' BERTABS '''
""" BERTABS """
model = Summarizer()
''' SAMPLING '''
""" SAMPLING """
clean_sentences = [removeStopwords(r.split(), sw) for r in clean_sentences]
''' GENSIM '''
""" GENSIM """
summarize()
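This hunk shows three summarization routes side by side: BERT extractive via `Summarizer()` (presumably the bert-extractive-summarizer package), a sampling/TextRank-style path, and gensim. As a standalone sketch, the gensim 3.8 call might look like this; the sample text is illustrative only:

```python
# gensim 3.8.x extractive summarization; needs multi-sentence input,
# and very short inputs may log warnings or yield empty output.
from gensim.summarization.summarizer import summarize

text = ("Verseagility is a toolkit for custom NLP tasks. "
        "It covers classification, NER, question answering and summarization. "
        "Models can be deployed to ACI or AKS via Azure Machine Learning. "
        "A demo environment can be deployed as is or customized.")
print(summarize(text, ratio=0.5))
```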