First version of solution accelerator

1. Reads data from the input container
2. Clusters pages into groups by template
3. Uploads results to the destination blob storage container
p-lekkala 2021-12-22 05:51:56 +00:00
Parent bfca77194e
Commit 29af4539ee
9 changed files: 1449 additions and 8 deletions

README.md

@@ -1,14 +1,73 @@
# Form-Recognizer-Accelerator

This solution accelerator helps cluster (segregate) documents into templates. Once the data is segregated into templates, it can be used for training Form Recognizer "Train with Labels" models.

## Input Data

- This solution expects the input data to be placed in Azure Blob Storage containers
- It accepts data in PDF, PNG, TIF and JPEG formats
## Pre-requisites
- **Azure subscription** (If you don't already have a subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin).
- **Azure Blob Storage account** (If you don't have an Azure Storage account, refer to the [quickstarts](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal) before you begin).
- **Azure Form Recognizer resource** (If you don't have a Form Recognizer resource, refer to the [quickstart](https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-sample-label-tool#create-a-form-recognizer-resource))
- **Azure Machine Learning service compute instance** or an **Azure Machine Learning Python SDK** installed on your local compute (refer to this link for the [Installation of Azure Machine Learning SDK for Python](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py)).
**_NOTE:_** If you are working on your local compute, please note that the following packages need to be installed:
- [nbformat](https://pypi.org/project/nbformat/) - Python package for the Jupyter Notebook format
```pip install nbformat```
- [pikepdf](https://pypi.org/project/pikepdf/) - Python library for reading and writing PDF files
```pip install pikepdf```
- [img2pdf](https://pypi.org/project/img2pdf/) - Converts images to PDF via direct JPEG inclusion
```pip install img2pdf```
## How to use the solution
1. Download the repo to your local compute or Azure ML compute instance.
2. Create two containers in your storage account, one for the "input" data and one for the "results" (refer to this link on how to [create a container](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal#create-a-container) in your storage account; a programmatic alternative is shown below).
   1. Upload your input data to the "input" container you created.
   2. Make sure the "results" container you created is still empty.
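If you prefer, the containers can also be created programmatically with the same azure-storage-blob SDK the notebooks use. A minimal sketch (the container names "input" and "results" are examples):

```python
from azure.storage.blob import BlobServiceClient

# Connection string of your storage account (see the NOTE below on where to find it)
storage_conn_string = "<your storage account connection string>"
blob_service_client = BlobServiceClient.from_connection_string(storage_conn_string)

for name in ("input", "results"):
    blob_service_client.create_container(name)  # raises ResourceExistsError if it already exists
```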
3. Update the following parameters in the "**main.ipynb**" notebook, as well as in "**AzureBlobStorageLib.ipynb**", under the "code" directory.
Populate the parameters **src_container** and **dst_container** with the names of the containers you created in the previous step (see the example cell below).
<img src="images/storage-account-params.jpg" width="70%">
**_NOTE:_** See below for more information on the parameters:
- **storage_conn_string** - the storage account connection string (refer to this [resource](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#view-account-access-keys) on how to find it)
- **src_container** - the container where your input data is stored
- **dst_container** - the container where results are uploaded
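For reference, the populated parameter cell looks like the snippet below (placeholder values shown; use your own connection string and container names):

```python
# Placeholder values - replace with your own storage account details
storage_conn_string = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
src_container = "input"    # container holding the input documents
dst_container = "results"  # container where results are uploaded
```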
4. Create a "**config.json**" file in "code" folder and update your Form Recognizer endpoint and key as shown below.
```
{
"endpoint": "YOUR_FORM_RECOGNIZER_ENDPOINT",
"apim-key": "YOUR_FORM_RECOGNIZER_API_KEY"
}
```
**_NOTE:_** You can check out this resource on how to [retrieve the Form Recognizer key and endpoint.](https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-sample-label-tool#retrieve-the-key-and-endpoint)
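The notebooks read this file at run time (see "FR-utils.ipynb"); the relevant code is roughly:

```python
import json

# Read the Form Recognizer endpoint and key from code/config.json
with open('config.json', 'r') as config_file:
    config = json.load(config_file)

endpoint = config['endpoint']
apim_key = config['apim-key']
```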
5. Run "**main.ipynb**" notebook.
The solution segregates the input data into templates. You can track progress through the printed output, and you will notice new sub-folders under the data directory where intermediate results are saved.
Results are uploaded to the destination blob container once the run completes.
## How does this work
![Architecture](images/architecture.gif)
- Data is sampled for training a Form Recognizer "Train without labels" model
- All documents in the input data are inferred using the trained model
- If a document is assigned a cluster ID (template ID), it is moved to the template location and removed from the original population
- The above three steps are repeated until one of the termination conditions is reached (see the sketch after this list)
- Termination conditions
  - All input data has been considered for training
  - The percentage of data segregated into templates falls below 5% of the population
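As a simplified sketch (directory setup and result uploads omitted), the loop in "main.ipynb" uses the helper functions defined in the other notebooks roughly as follows:

```python
# Simplified sketch of the iterative template-identification loop in main.ipynb
iteration = 1
while iteration:
    # 1. Sample a subset of the remaining single-page PDFs and upload it for training
    sample_training_data(src_fld=pdf_1p, dst_fld=train_dir)
    upload2blob(local_path=train_dir, container_path=blob_path)

    # 2. Train an unsupervised ("Train without labels") Form Recognizer model
    train_fr_model(sas_url=sas_url, folder_path=blob_path, model_file=model_file)

    # 3. Infer all remaining documents and move clustered files to template folders
    fr_model_inference(src_dir=pdf_1p, json_dir=infer_dir, model_file=model_file, thread_cnt=10)
    files_clustered = segregate_data(src_dir=pdf_1p, result_dir=infer_dir, cluster_dir=clust_dir,
                                     prefix=iter_fld, cluster_file=cluster_file)

    # 4. Stop when little data was segregated in this iteration (or the dataset is small)
    moved_percent = files_clustered * 100 / initial_file_cnt
    if moved_percent < 5 or initial_file_cnt < 500:
        iteration = 0
    else:
        iteration += 1
```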
## Contributing

code/AzureBlobStorageLib.ipynb (new file, 320 lines)

@@ -0,0 +1,320 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Install required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install azure-storage-blob\n",
"# or !pip install --upgrade --force-reinstall azure.storage.blob\n",
"# remember to restart kernel / pip list to confirm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Establish blob connection\n",
"\n",
"### The following code establishes connection with Azure Blob storage"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Azure Blob Storage v12.9.0\n"
]
}
],
"source": [
"# Check the version of the client library\n",
"import os, uuid\n",
"from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__\n",
"from datetime import datetime, timedelta\n",
"\n",
"print(\"Azure Blob Storage v\" + __version__) # should be at least Azure Blob Storage v12.9.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Please update these variables - \n",
"###### Only If you are running this notebook separately, otherwise these parameters are picked up from main file\n",
"\n",
"### storage_conn_string \"Storage account connection string\"\n",
"### src_container \"Container where data is stored\"\n",
"### dst_container \"Container where results should be uploaded\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# # set up connection\n",
"# storage_conn_string = \"\"\n",
"# src_container = \"\"\n",
"# dst_container = \"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#or on linux, do: export AZURE_STORAGE_CONNECTION_STRING=\"<yourconnectionstring>\" and then do storage_conn_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')\n",
"\n",
"# Create the BlobServiceClient object which will be used to create a container client\n",
"blob_service_client = BlobServiceClient.from_connection_string(storage_conn_string)\n",
"\n",
"# create a client for a specific container\n",
"srcblob_container_client = blob_service_client.get_container_client(src_container)\n",
"\n",
"# create a client for a specific container\n",
"dstblob_container_client = blob_service_client.get_container_client(dst_container)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Download data from Blob\n",
"\n",
"### The following function downloads data from blob storage to local folder\n",
"#### parameters\n",
"\n",
"local_path \"Local path where data should be stored\"\n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"def download2local(local_path, container_client = srcblob_container_client, container_name = src_container):\n",
"\n",
" # Create a local directory to hold blob data\n",
" Path(local_path).mkdir(parents=True, exist_ok=True)\n",
"\n",
" # download all the files to the local_path folder\n",
" blob_list = container_client.list_blobs() # need to run this again, no way to reset the iterator\n",
" for blob in blob_list:\n",
"\n",
" # note that this is called on the service_client and not the container_client\n",
" blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob)\n",
"\n",
" download_file_path = os.path.join(local_path, blob.name)\n",
" print(\"\\tDownloading \" + blob.name + \" to \" + download_file_path)\n",
"\n",
" basedir = os.path.split(download_file_path)[0]\n",
" Path(basedir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" with open(download_file_path, \"wb\") as download_file:\n",
" download_file.write(blob_client.download_blob().readall())\n",
"\n",
" print(\"Files Downloaded.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#local_path = \"../data/samples\"\n",
"#download2local(local_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Upload data to Blob container\n",
"\n",
"### The following function uploads data from local drive to azure blob storage container\n",
"#### parameters\n",
"\n",
"local_path \"Local path where data should be stored\" \n",
"container_path \"Path on container where data should be stored\"\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from os.path import relpath\n",
"import glob\n",
"\n",
"def upload2blob(local_path, container_path = \"\", container_client = dstblob_container_client, container_name = dst_container):\n",
"\n",
" # list the local files\n",
" files_to_upload = glob.glob(os.path.join(local_path,\"**/*\"), recursive=True)\n",
"\n",
" for file_to_upload in files_to_upload:\n",
" print(\"\\nUploading to Azure Storage as blob: \\t\" + file_to_upload)\n",
"\n",
" # Upload the file\n",
" #path_to_file = os.path.join(local_path, file_to_upload)\n",
" path_to_file = file_to_upload\n",
" file_path_on_azure = relpath(file_to_upload, local_path)\n",
" file_path_on_azure = os.path.join(container_path,file_path_on_azure)\n",
"\n",
" #print(\"local path:\", path_to_file, \"Azure path:\", file_path_on_azure)\n",
" # Create a blob client using the local file name as the name for the blob\n",
" blob_client = blob_service_client.get_blob_client(container=container_name, blob=file_path_on_azure)\n",
"\n",
" path = Path(path_to_file)\n",
" if path.is_file():\n",
" with open(path_to_file, \"rb\") as data:\n",
" blob_client.upload_blob(data, overwrite = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# local_path = \"../data/samples\"\n",
"# upload2blob(local_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Delete data in Blob container\n",
"\n",
"### The following function deletes all data in azure blob storage container\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from os.path import relpath\n",
"import glob\n",
"\n",
"def deleteContainerData(container_client = dstblob_container_client, container_name = dst_container):\n",
"\n",
" blob_list = container_client.list_blobs() # need to run this again, no way to reset the iterator\n",
" for blob in blob_list:\n",
" dstblob_container_client.delete_blob(blob=blob)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create SAS signature for blob container\n",
"\n",
"### This container generates SAS URL for a Blob storage container\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from azure.storage.blob import AccessPolicy, ContainerSasPermissions\n",
"from azure.storage.blob import generate_container_sas, generate_account_sas\n",
"\n",
"def fr_get_sas_url(dst_container = dst_container):\n",
" #permission = ContainerSasPermissions(read=True, write=True, delete=True, list=True)\n",
" permission = ContainerSasPermissions.from_string(storage_conn_string)\n",
" expiry=datetime.utcnow() + timedelta(hours=24)\n",
" start=datetime.utcnow() - timedelta(minutes=1)\n",
"\n",
" access_policy = AccessPolicy(permission, expiry, start)\n",
"\n",
" sas_token = generate_container_sas(\n",
" account_name = blob_service_client.account_name,\n",
" container_name = dst_container,\n",
" account_key = blob_service_client.credential.account_key,\n",
" permission = permission,\n",
" expiry = expiry,\n",
" start = start\n",
" )\n",
" \n",
" #return(sas_token)\n",
" \n",
" url = 'https://'+blob_service_client.account_name+'.blob.core.windows.net/'+dst_container+'?'+sas_token\n",
" return(url)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sas_token = fr_get_sas_url()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sas_token"
]
}
],
"metadata": {
"interpreter": {
"hash": "6506c5172df6811f7bf01b57d8469d1357a539c9dc33b21363eaf6ce598c5969"
},
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

code/Data-utils.ipynb (new file, 180 lines)

@@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dea0c864",
"metadata": {},
"source": [
"# This note contains file processing utility functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b81510a8",
"metadata": {},
"outputs": [],
"source": [
"from pikepdf import Pdf\n",
"# import cv2\n",
"import os\n",
"import img2pdf\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48aa9832",
"metadata": {},
"outputs": [],
"source": [
"import PIL\n",
"PIL.Image.MAX_IMAGE_PIXELS = None"
]
},
{
"cell_type": "markdown",
"id": "345dc68d",
"metadata": {},
"source": [
"### This function converts pdf, jpg, tif and png files in \"src\" to pdf files and places them in \"pdfdst\" directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8ecb76a",
"metadata": {},
"outputs": [],
"source": [
"def convert2pdf(src, pdfdst):\n",
" \n",
" a4inpt = (img2pdf.mm_to_pt(210),img2pdf.mm_to_pt(297))\n",
" layout_fun = img2pdf.get_layout_fun(a4inpt)\n",
"\n",
" #convert images to pdf\n",
" for r, _, f in os.walk(src):\n",
" for fname in f:\n",
" flname = fname.lower()\n",
" if flname.endswith(\".pdf\"):\n",
" copyfile(os.path.join(r, fname), os.path.join(pdfdst, flname))\n",
" print(\"copying file: %s\" % fname)\n",
" elif flname.endswith(\".tif\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".tif\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src, layout_fun=layout_fun))\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".tiff\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".tiff\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".jpg\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".jpg\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".png\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".png\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" else:\n",
" continue"
]
},
{
"cell_type": "markdown",
"id": "fb94d80c",
"metadata": {},
"source": [
"### This function converts multipage PDF file to single page PDF files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77686dd3",
"metadata": {},
"outputs": [],
"source": [
"def pdf_split(src,dst):\n",
"\n",
" fls = os.listdir(src)\n",
"\n",
" for fl in fls:\n",
" pdf = Pdf.open(os.path.join(src,fl))\n",
" for n, page in enumerate(pdf.pages):\n",
" dst_fl = Pdf.new()\n",
" dst_fl.pages.append(page)\n",
" \n",
" dst_fl_name = '%02d-'%n+fl\n",
" dst_fl.save(os.path.join(dst,dst_fl_name))\n"
]
},
{
"cell_type": "markdown",
"id": "29f8ae93",
"metadata": {},
"source": [
"#### This function splits a list into \"n\" parts and create a list of lists"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8af06032",
"metadata": {},
"outputs": [],
"source": [
"def chunkify(lst,n):\n",
" return [lst[i::n] for i in range(n)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00db6a6b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "08528281",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/FR-utils.ipynb (new file, 388 lines)

@@ -0,0 +1,388 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "062cd3e4",
"metadata": {},
"source": [
"# This file has Form Recognizer Model trainign and Inferencing code"
]
},
{
"cell_type": "markdown",
"id": "0e790e5a",
"metadata": {},
"source": [
"#### Read configuration file and get endpoint, key of the service"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95f9d84e",
"metadata": {},
"outputs": [],
"source": [
"########### Python Form Recognizer Labeled Async Train #############\n",
"import json\n",
"import time\n",
"from requests import get, post\n",
"\n",
"#read form recognizer service parameters\n",
"with open('config.json','r') as config_file:\n",
" config = json.load(config_file)\n",
"\n",
"# Endpoint URL\n",
"endpoint = config['endpoint']\n",
"post_url = endpoint + r\"/formrecognizer/v2.1/custom/models\"\n",
"apim_key = config['apim-key']\n",
"filetype = 'application/json'\n",
"\n",
"\n",
"headers = {\n",
" # Request headers\n",
" 'Content-Type': filetype,\n",
" 'Ocp-Apim-Subscription-Key': apim_key,\n",
"}\n",
"\n",
"body = {\n",
" \"source\": \"\",\n",
" \"sourceFilter\": {\n",
" \"prefix\": \"\",\n",
" \"includeSubFolders\": False\n",
" },\n",
" \"useLabelFile\": False\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "f111694c",
"metadata": {},
"source": [
"# Unsupervised training "
]
},
{
"cell_type": "markdown",
"id": "c0f43e3e",
"metadata": {},
"source": [
"### Function to train unsupervised Form Recognizer Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fbe1b96",
"metadata": {},
"outputs": [],
"source": [
"n_tries = 60\n",
"n_try = 0\n",
"wait_sec = 60\n",
"\n",
"\n",
"def train_fr_model(sas_url, folder_path, model_file):\n",
" \n",
" body['source'] = sas_url\n",
" body['sourceFilter']['prefix'] = folder_path\n",
" \n",
" # trigger training\n",
" try:\n",
" resp = post(url = post_url, json = body, headers = headers)\n",
" #print(body)\n",
" #print(headers)\n",
" if resp.status_code != 201:\n",
" print(\"Training model failed (%s):\\n%s\" % (resp.status_code, json.dumps(resp.json())))\n",
" return\n",
" print(\"Training Started:\\n%s\" % resp.headers)\n",
" get_url = resp.headers[\"location\"]\n",
" except Exception as e:\n",
" print(\"Error occurred when triggering training:\\n%s\" % str(e))\n",
" quit()\n",
" \n",
" n_try = 0\n",
" #wait for training to complete and save model to json file\n",
" while n_try < n_tries:\n",
" try:\n",
" resp = get(url = get_url, headers = headers)\n",
" resp_json = resp.json()\n",
" if resp.status_code != 200:\n",
" print(\"Model training failed (%s):\\n%s\" % (resp.status_code, json.dumps(resp_json)))\n",
" break\n",
" model_status = resp_json[\"modelInfo\"][\"status\"]\n",
" print(\"Model Status:\", model_status)\n",
" if model_status == \"ready\":\n",
" #print(\"Training succeeded:\\n%s\" % json.dumps(resp_json))\n",
" print(\"Training succeeded:\")\n",
" with open(model_file,\"w\") as f:\n",
" json.dump(resp_json, f)\n",
" break\n",
" if model_status == \"invalid\":\n",
" print(\"Training failed. Model is invalid:\\n%s\" % json.dumps(resp_json))\n",
" break\n",
" # Training still running. Wait and retry.\n",
" time.sleep(wait_sec)\n",
" n_try += 1\n",
" except Exception as e:\n",
" msg = \"Model training returned error:\\n%s\" % str(e)\n",
" print(msg)\n",
" break\n",
"\n",
" if resp.status_code != 200:\n",
" print(\"Train operation did not complete within the allocated time.\") "
]
},
{
"cell_type": "markdown",
"id": "313e93ea",
"metadata": {},
"source": [
"# FR Model Inferencing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af82e67f",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import glob\n",
"import os\n",
"import datetime\n",
"import tempfile\n",
"import pandas as pd\n",
"import shutil\n",
"\n",
"from concurrent.futures import ThreadPoolExecutor, as_completed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebd521e2",
"metadata": {},
"outputs": [],
"source": [
"\n",
"params_infer = {\n",
" \"includeTextDetails\": True\n",
"}\n",
"\n",
"headers_infer = {\n",
" # Request headers\n",
" 'Content-Type': 'application/pdf',\n",
" 'Ocp-Apim-Subscription-Key': apim_key,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "2b1026a9",
"metadata": {},
"source": [
"#### Form Recognizer inferencing function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d211010f",
"metadata": {},
"outputs": [],
"source": [
"#######################################################\n",
"# FR Inference function multithreading\n",
"#######################################################\n",
"\n",
"def fr_mt_inference(files, json_fld, model_id):\n",
" \n",
" post_url = endpoint + \"formrecognizer/v2.1/custom/models/%s/analyze\" % model_id\n",
"\n",
" \n",
" ###################\n",
" #send all requests in one go\n",
" ###################\n",
" session = requests.Session()\n",
" url_list=[]\n",
" for fl in files:\n",
" \n",
" #read file\n",
" fname = os.path.basename(fl)\n",
" #print(\"working on file: %s, %s\" %(fl, datetime.datetime.now()))\n",
" with open(fl, \"rb\") as f:\n",
" data_bytes = f.read()\n",
" \n",
" #set variables to default values\n",
" get_url = None\n",
" #st_time = datetime.datetime.now()\n",
" st_time = datetime.now()\n",
" gap_between_requests = 1 #in seconds\n",
" \n",
" try:\n",
" \n",
" #send post request (wait and send if overlaoded)\n",
" post_success = 0\n",
" while post_success == 0:\n",
" resp = session.post(url = post_url, data = data_bytes, headers = headers_infer, params = params_infer)\n",
" if resp.status_code != 429:\n",
" break\n",
" time.sleep(1) \n",
" \n",
" #print(fl, resp.status_code)\n",
" \n",
" if resp.status_code != 202:\n",
" print(\"POST analyze failed:\\n%s\" % json.dumps(resp.json()))\n",
"\n",
" #print(\"POST analyze succeeded:\\n%s\" % resp.headers)\n",
" #print(\"POST analyze succeeded for %s \\n\" % fl)\n",
" get_url = resp.headers[\"operation-location\"]\n",
" except Exception as e:\n",
" print(\"POST analyze failed 1:\\n%s\" % str(e))\n",
" \n",
" url_list.append((fl, fname, get_url))\n",
" end_time = datetime.now()\n",
" #end_time = datetime.datetime.now()\n",
" delta = end_time - st_time\n",
" delta = delta.total_seconds()\n",
" if delta < gap_between_requests:\n",
" time.sleep(gap_between_requests - delta)\n",
"\n",
" ####################################\n",
" # get all responses in one go\n",
" ####################################\n",
" n_tries = 15\n",
" wait_sec = 15\n",
"\n",
" for cnt in range(n_tries):\n",
" \n",
" #get results of requests sent\n",
" completed = []\n",
" for i in range(len(url_list)):\n",
"\n",
" fl, fname, get_url = url_list[i]\n",
" if get_url is not None:\n",
"\n",
" try:\n",
" resp = session.get(url = get_url, headers = {\"Ocp-Apim-Subscription-Key\": apim_key})\n",
" resp_json = resp.json()\n",
"\n",
" if resp.status_code != 200:\n",
" print(\"GET analyze results failed:%s \\n%s\" % fl, json.dumps(resp_json))\n",
" break\n",
"\n",
" status = resp_json[\"status\"]\n",
" if status == \"succeeded\":\n",
" print(\"Analysis succeeded for %s:\\n\" % fl)\n",
" with open(os.path.join(json_fld,fname.replace('.pdf','.json')), 'w') as outfile:\n",
" json.dump(resp_json, outfile)\n",
"\n",
" completed.append(i)\n",
"\n",
" if status == \"failed\":\n",
" print(\"Analysis failed:%s \\n%s\" % fl, json.dumps(resp_json))\n",
" break\n",
" except Exception as e:\n",
" msg = \"GET analyze results failed 2:\\n%s\" % str(e)\n",
" print(msg)\n",
" break\n",
"\n",
" # remove files where\n",
" completed.sort(reverse=True)\n",
" for i in completed:\n",
" url_list.pop(i)\n",
"\n",
" print(\"iteration\",cnt,\"complete. Still\",len(url_list), \" to infer\")\n",
" if len(url_list) == 0:\n",
" break\n",
" \n",
" time.sleep(wait_sec)\n",
" \n",
" ####################################\n",
" # retun files not inferred\n",
" ####################################\n",
" session.close()\n",
" \n",
" if len(url_list) == 0:\n",
" return(\"All files successfully inferred by FR\")\n",
" else:\n",
" return(url_list)\n"
]
},
{
"cell_type": "markdown",
"id": "e55580da",
"metadata": {},
"source": [
"#### Form Recognizer multi-threading inferencing function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18fb8475",
"metadata": {},
"outputs": [],
"source": [
"# Form Recognizer inference\n",
"\n",
"def fr_model_inference(src_dir, json_dir, model_file, thread_cnt):\n",
" \n",
" #read model details\n",
" with open(model_file,'r') as model_file:\n",
" model = json.load(model_file)\n",
"\n",
" if model['modelInfo']['modelId'] != None :\n",
" model_id = model['modelInfo']['modelId']\n",
" print(\"model id: %s\" % model_id)\n",
" else:\n",
" print(\"Model details not present, either model training is not performed or the file is missing\")\n",
" return\n",
" \n",
" #Read files and divide into chunks\n",
" fls = glob.glob(os.path.join(src_dir, \"*.pdf\"))\n",
" print(\"inferencing \", len(fls), \"files with\", thread_cnt, \"thread count\")\n",
" fchunk = chunkify(fls, 100)\n",
" \n",
" for chunk in fchunk:\n",
" \n",
" fr_threads = min(len(chunk),thread_cnt)\n",
"\n",
" flist = chunkify(chunk, fr_threads)\n",
"\n",
" #Call FR inference \n",
" threads= []\n",
" with ThreadPoolExecutor(max_workers=thread_cnt) as executor:\n",
" for files in flist:\n",
" threads.append(executor.submit(fr_mt_inference, files, json_dir, model_id))\n",
"\n",
" for task in as_completed(threads):\n",
" print(task.result()) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/file-utils.ipynb (new file, 172 lines)

@@ -0,0 +1,172 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "e59451e7",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import glob\n",
"import numpy as np\n",
"from shutil import copyfile\n",
"import random\n",
"import pandas as pd\n",
"import json\n",
"import shutil\n"
]
},
{
"cell_type": "markdown",
"id": "4e8e337e",
"metadata": {},
"source": [
"# Data sampling\n",
"\n",
"#### This function samples data required for Form Recognizer Training.\n",
"It sampels such that the file count is < file_limit and Total sample size is < size_limit (in MB)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dbf5331",
"metadata": {},
"outputs": [],
"source": [
"def sample_training_data(src_fld, dst_fld, file_limit=500, size_limit=50):\n",
"\n",
" fls = glob.glob(os.path.join(src_fld,\"*.pdf\"))\n",
"\n",
" #sample and copy files\n",
" if len(fls) > file_limit:\n",
" flist = random.sample(fls,file_limit)\n",
" print(\"sampling files\")\n",
" else:\n",
" flist = fls\n",
" print(\"file count <\",file_limit, \"considering all files\")\n",
"\n",
" for fl in flist:\n",
" flname = os.path.basename(fl)\n",
" copyfile(os.path.join(src_fld, flname), os.path.join(dst_fld, flname))\n",
"\n",
" #delete files if they exceed size limit\n",
" fls = os.listdir(dst_fld)\n",
" file_n_size = [(f,os.stat(os.path.join(dst_fld, f)).st_size) for f in fls]\n",
" #print(file_n_size)\n",
" \n",
" file_df = pd.DataFrame(file_n_size,columns=['file','size'])\n",
" file_df = file_df.sort_values(by=['size'])\n",
" file_df['total_size'] = file_df['size'].cumsum()\n",
" file_df['to_delete'] = file_df['total_size'] < size_limit*10^6\n",
" \n",
" del_df = file_df[file_df['to_delete'] == True]\n",
" if del_df.shape[0] > 0:\n",
" for idx, row in del_df.iterrows():\n",
" flname = row['file']\n",
" os.remove(os.path.join(dst_fld, flname))\n",
" \n",
" print(file_df.tail(5))\n",
" #return(file_df)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "341c27ad",
"metadata": {},
"outputs": [],
"source": [
"# %%time\n",
"\n",
"# sample_training_data(src_fld=\"../data/samples/trainset1\", dst_fld=\"../data/train_sample\")"
]
},
{
"cell_type": "markdown",
"id": "35f00bf5",
"metadata": {},
"source": [
"# Segregate data function\n",
"\n",
"#### This function segregates data into templates."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "57aa05f7",
"metadata": {},
"outputs": [],
"source": [
"# function to read cluster id and segregate the data\n",
"\n",
"def segregate_data(src_dir, result_dir, cluster_dir, prefix, cluster_file):\n",
" \n",
" fls = glob.glob(os.path.join(result_dir,\"*.json\"))\n",
"\n",
" cols = ['filename','clusterId']\n",
" clusters = pd.DataFrame(columns=cols)\n",
" moved_file_cnt = 0\n",
"\n",
" for fl in fls:\n",
"\n",
" f = open(fl)\n",
" dat = json.load(f)\n",
" if dat['status'] == \"succeeded\":\n",
" cid = dat['analyzeResult']['pageResults'][0]['clusterId']\n",
" flname = os.path.basename(fl)\n",
" tdf = pd.DataFrame([[flname, cid]], columns=cols)\n",
" clusters = clusters.append(tdf, ignore_index=True)\n",
" #print(\"clusterID for: \"+fl+\" is: \"+str(cid), not (cid is None))\n",
"\n",
" if not (cid is None):\n",
" #move files with cluster ID to cluster folders\n",
" pdf_name = flname.replace(\".json\",\".pdf\")\n",
" fld_name = prefix+\"-\"+str(cid)\n",
" fld_path = os.path.join(cluster_dir,fld_name)\n",
" if not os.path.isdir(fld_path):\n",
" os.makedirs(fld_path)\n",
" print(\"creating folder:\"+fld_path)\n",
" if os.path.isfile(os.path.join(src_dir,pdf_name)):\n",
" shutil.move(os.path.join(src_dir,pdf_name), os.path.join(fld_path,pdf_name))\n",
" #print(os.path.join(src_dir,pdf_name), os.path.join(fld_path,pdf_name))\n",
" moved_file_cnt = moved_file_cnt + 1\n",
"\n",
" clusters.to_csv(cluster_file)\n",
" \n",
" return(moved_file_cnt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b9c3d39",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/main.ipynb (new file, 322 lines)

@@ -0,0 +1,322 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "0f512263",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import glob\n",
"import tempfile\n",
"from pathlib import Path"
]
},
{
"cell_type": "markdown",
"id": "0dff2838",
"metadata": {},
"source": [
"#### Provide storage account parameters here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c846d47d",
"metadata": {},
"outputs": [],
"source": [
"storage_conn_string = \"\"\n",
"src_container = \"\"\n",
"dst_container = \"\""
]
},
{
"cell_type": "markdown",
"id": "6647b1f1",
"metadata": {},
"source": [
"# Import functions from other notebooks"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d780d572",
"metadata": {},
"outputs": [],
"source": [
"%run \"Data-utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf232a80",
"metadata": {},
"outputs": [],
"source": [
"%run \"FR-Utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc58d0b3",
"metadata": {},
"outputs": [],
"source": [
"%run \"file-utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e66e781c",
"metadata": {},
"outputs": [],
"source": [
"%run \"AzureBlobStorageLib.ipynb\" storage_conn_string src_container dst_container"
]
},
{
"cell_type": "markdown",
"id": "f49fe7ec",
"metadata": {},
"source": [
"# Data Preparation\n",
"\n",
"#### Steps include\n",
"\n",
"1. Downlaoding data from Blob Storage\n",
"2. Converting all format files to PDF files\n",
"3. Splitting multipage PDF to single page PDF files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1598f440",
"metadata": {},
"outputs": [],
"source": [
"# Create temporary directory and download files\n",
"\n",
"#temp_dir = tempfile.TemporaryDirectory()\n",
"#data_dir = temp_dir.name\n",
"\n",
"data_dir = \"../data/\"\n",
"if os.path.exists(data_dir) :\n",
" shutil.rmtree(data_dir)\n",
"Path(data_dir).mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24a666dd",
"metadata": {},
"outputs": [],
"source": [
"raw_files = os.path.join(data_dir,\"rawFiles\")\n",
"Path(raw_files).mkdir(parents=True, exist_ok=True)\n",
"\n",
"download2local(raw_files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e84907a9",
"metadata": {},
"outputs": [],
"source": [
"# convert all file types to pdf and then split to 1-p docs\n",
"raw_pdf = os.path.join(data_dir,\"allPdf\")\n",
"Path(raw_pdf).mkdir(parents=True, exist_ok=True)\n",
"convert2pdf(src = raw_files, pdfdst = raw_pdf)\n",
"print(\"Input files are stored at:\", raw_pdf)\n",
"\n",
"pdf_1p = os.path.join(data_dir,\"1p-pdf\")\n",
"Path(pdf_1p).mkdir(parents=True, exist_ok=True)\n",
"pdf_split(src = raw_pdf, dst = pdf_1p)\n",
"print(\"Processed files are stored at:\", pdf_1p)"
]
},
{
"cell_type": "markdown",
"id": "c7219171",
"metadata": {},
"source": [
"#### Get initial Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7ab8343",
"metadata": {},
"outputs": [],
"source": [
"#get initial file count\n",
"fls = glob.glob(os.path.join(pdf_1p,\"*.pdf\"))\n",
"initial_file_cnt = len(fls)\n",
"print(initial_file_cnt)\n",
"\n",
"#results directory\n",
"results_dir = os.path.join(data_dir,\"Results\")\n",
"Path(results_dir).mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fce047f7",
"metadata": {},
"outputs": [],
"source": [
"# generate SAS signature for storage container\n",
"sas_url = fr_get_sas_url(dst_container)\n",
"#sas_url = \"\"\n",
"sas_url "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d6bd362",
"metadata": {},
"outputs": [],
"source": [
"# Call below function, if you want to clean up the destination blob container\n",
"#deleteContainerData()"
]
},
{
"cell_type": "markdown",
"id": "91cbc126",
"metadata": {},
"source": [
"# FR Template identification process"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3027cb3",
"metadata": {},
"outputs": [],
"source": [
"iteration = 1\n",
"\n",
"while(iteration):\n",
" \n",
" ########################################################\n",
" # create directory for current iteration\n",
" ########################################################\n",
" iter_fld = \"I\"+str(iteration)\n",
" iter_dir = os.path.join(results_dir,iter_fld)\n",
" Path(iter_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" ########################################################\n",
" # sample files for training\n",
" ########################################################\n",
" train_fld = \"trainset\"\n",
" train_dir = os.path.join(iter_dir,train_fld)\n",
" Path(train_dir).mkdir(parents=True, exist_ok=True)\n",
" sample_training_data(src_fld = pdf_1p, dst_fld = train_dir)\n",
"\n",
" ########################################################\n",
" # upload the files to blob\n",
" ########################################################\n",
" blob_path = os.path.join(iter_fld,train_fld)\n",
" upload2blob(local_path = train_dir, container_path = blob_path)\n",
" \n",
" ########################################################\n",
" #train FR unsupervised model\n",
" ########################################################\n",
" model_file = iter_fld+\"-model-details.json\"\n",
" train_fr_model(sas_url = sas_url, folder_path = blob_path.replace(\"\\\\\", \"/\"), model_file = model_file)\n",
"\n",
" ########################################################\n",
" #if model is created infer using the model\n",
" ########################################################\n",
" iter_model_file = os.path.join(iter_dir, model_file)\n",
" if os.path.exists(model_file):\n",
" shutil.copyfile(model_file, iter_model_file)\n",
"\n",
" infer_fld = \"fr-json\"\n",
" infer_dir = os.path.join(iter_dir,infer_fld)\n",
" Path(infer_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" #Start FR inferencing\n",
" fr_model_inference(src_dir = pdf_1p, json_dir = infer_dir, model_file = iter_model_file, thread_cnt = 10)\n",
"\n",
" ########################################################\n",
" # Segregate files to clusters\n",
" ########################################################\n",
" clust_dir = os.path.join(results_dir,\"clusters\")\n",
" Path(clust_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" cluster_file = os.path.join(iter_dir, iter_fld+\"-clusters.csv\")\n",
"\n",
" files_clustered = segregate_data(src_dir = pdf_1p, result_dir= infer_dir, cluster_dir = clust_dir, \n",
" prefix = iter_fld, cluster_file = cluster_file)\n",
"\n",
" print(\"Identified clusters for:\", files_clustered, \"files\")\n",
"\n",
" ########################################################\n",
" # Upload iteration results to blob storage\n",
" ########################################################\n",
" \n",
" upload2blob(local_path = iter_dir, container_path = iter_fld) #train data, model details and clusters\n",
" upload2blob(local_path = clust_dir, container_path = \"clusters\") #Files segregated into clusters\n",
" \n",
" ########################################################\n",
" # decide on next iteration\n",
" ########################################################\n",
" moved_percent = files_clustered * 100 / initial_file_cnt\n",
"\n",
" if (moved_percent < 5) | (initial_file_cnt < 500):\n",
" iteration = 0\n",
" else:\n",
" iteration = iteration + 1\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d7eacb0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "6506c5172df6811f7bf01b57d8469d1357a539c9dc33b21363eaf6ce598c5969"
},
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

images/architecture.gif (new binary file, 69 KiB; binary file not shown)

images/fr-config-file.jpg (new binary file, 14 KiB; binary file not shown)

images/storage-account-params.jpg (new binary file, 20 KiB; binary file not shown)