First version of solution accelerator

1. Reads data from the input container
2. Clusters pages into groups by template
3. Uploads results to the destination blob storage container
p-lekkala 2021-12-22 05:51:56 +00:00
Parent bfca77194e
Commit 29af4539ee
9 changed files: 1449 additions and 8 deletions

README.md

@@ -1,14 +1,73 @@
# Form-Recognizer-Accelerator

This solution accelerator helps cluster (segregate) documents into templates. Once the data is segregated into templates, it can be used for training Form Recognizer "Train with Labels" models.

## Input Data

- This solution expects the input data to be placed in Azure Blob Storage containers
- It accepts data in PDF, PNG, TIF and JPEG formats
## Pre-requisites
- **Azure subscription** (If you don't already have a subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin).
- **Azure Blob Storage account** (If you don't have an Azure Storage account, refer to the [quickstarts](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal) before you begin).
- **Azure Form Recognizer resource** (If you don't have a Form Recognizer resource, refer to the [quickstart](https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-sample-label-tool#create-a-form-recognizer-resource))
- **Azure Machine Learning service compute instance** or an **Azure Machine Learning Python SDK** installed on your local compute (refer to this link for the [Installation of Azure Machine Learning SDK for Python](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py)).
**_NOTE:_** If you are working on your local compute, please note that the following packages need to be installed:
- [nbformat](https://pypi.org/project/nbformat/) - Python package for the Jupyter Notebook format
```pip install nbformat```
- [pikepdf](https://pypi.org/project/pikepdf/) - Python library for reading and writing PDF files
```pip install pikepdf```
- [img2pdf](https://pypi.org/project/img2pdf/) - Converts images to PDF via direct JPEG inclusion
```pip install img2pdf```
## How to use the solution
1. Download the repo to your local compute or Azure ML compute instance.
2. Create two containers in your storage account, one for the "input" data and one for the "results" (refer to this link on how to [create a container](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal#create-a-container) in your storage account; a programmatic alternative is shown below).
   1. Upload your input data to the "input" container you created.
   2. Make sure the "results" container you created is still empty.
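If you prefer, the containers can also be created programmatically with the same azure-storage-blob SDK the notebooks use. A minimal sketch (the container names "input" and "results" are examples):

```python
from azure.storage.blob import BlobServiceClient

# Connection string of your storage account (see the NOTE below on where to find it)
storage_conn_string = "<your storage account connection string>"
blob_service_client = BlobServiceClient.from_connection_string(storage_conn_string)

for name in ("input", "results"):
    blob_service_client.create_container(name)  # raises ResourceExistsError if it already exists
```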
3. Update the following parameters in the "**main.ipynb**" notebook, as well as in "**AzureBlobStorageLib.ipynb**", under the "code" directory.
Populate the parameters **src_container** and **dst_container** with the names of the containers you created in the previous step (see the example cell below).
<img src="images/storage-account-params.jpg" width="70%">
**_NOTE:_** See below for more information on the parameters:
- **storage_conn_string** - the storage account connection string (refer to this [resource](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#view-account-access-keys) on how to find it)
- **src_container** - the container where your input data is stored
- **dst_container** - the container where results are uploaded
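For reference, the populated parameter cell looks like the snippet below (placeholder values shown; use your own connection string and container names):

```python
# Placeholder values - replace with your own storage account details
storage_conn_string = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
src_container = "input"    # container holding the input documents
dst_container = "results"  # container where results are uploaded
```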
4. Create a "**config.json**" file in "code" folder and update your Form Recognizer endpoint and key as shown below.
```
{
"endpoint": "YOUR_FORM_RECOGNIZER_ENDPOINT",
"apim-key": "YOUR_FORM_RECOGNIZER_API_KEY"
}
```
**_NOTE:_** You can check out this resource on how to [retrieve the Form Recognizer key and endpoint.](https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-sample-label-tool#retrieve-the-key-and-endpoint)
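The notebooks read this file at run time (see "FR-utils.ipynb"); the relevant code is roughly:

```python
import json

# Read the Form Recognizer endpoint and key from code/config.json
with open('config.json', 'r') as config_file:
    config = json.load(config_file)

endpoint = config['endpoint']
apim_key = config['apim-key']
```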
5. Run "**main.ipynb**" notebook.
The solution segregates the input data into templates. You can track progress through the printed output, and you will notice new sub-folders under the data directory where intermediate results are saved.
Results are uploaded to the destination blob container once the run completes.
## How does this work
![Architecture](images/architecture.gif)
- Data is sampled for training a Form Recognizer "Train without labels" model
- All documents in the input data are inferred using the trained model
- If a document is assigned a cluster ID (template ID), it is moved to the template location and removed from the original population
- The above three steps are repeated until one of the termination conditions is reached (see the sketch after this list)
- Termination conditions
  - All input data has been considered for training
  - The percentage of data segregated into templates falls below 5% of the population
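As a simplified sketch (directory setup and result uploads omitted), the loop in "main.ipynb" uses the helper functions defined in the other notebooks roughly as follows:

```python
# Simplified sketch of the iterative template-identification loop in main.ipynb
iteration = 1
while iteration:
    # 1. Sample a subset of the remaining single-page PDFs and upload it for training
    sample_training_data(src_fld=pdf_1p, dst_fld=train_dir)
    upload2blob(local_path=train_dir, container_path=blob_path)

    # 2. Train an unsupervised ("Train without labels") Form Recognizer model
    train_fr_model(sas_url=sas_url, folder_path=blob_path, model_file=model_file)

    # 3. Infer all remaining documents and move clustered files to template folders
    fr_model_inference(src_dir=pdf_1p, json_dir=infer_dir, model_file=model_file, thread_cnt=10)
    files_clustered = segregate_data(src_dir=pdf_1p, result_dir=infer_dir, cluster_dir=clust_dir,
                                     prefix=iter_fld, cluster_file=cluster_file)

    # 4. Stop when little data was segregated in this iteration (or the dataset is small)
    moved_percent = files_clustered * 100 / initial_file_cnt
    if moved_percent < 5 or initial_file_cnt < 500:
        iteration = 0
    else:
        iteration += 1
```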
## Contributing

code/AzureBlobStorageLib.ipynb (new file, 320 lines)

@@ -0,0 +1,320 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Install required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install azure-storage-blob\n",
"# or !pip install --upgrade --force-reinstall azure.storage.blob\n",
"# remember to restart kernel / pip list to confirm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Establish blob connection\n",
"\n",
"### The following code establishes connection with Azure Blob storage"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Azure Blob Storage v12.9.0\n"
]
}
],
"source": [
"# Check the version of the client library\n",
"import os, uuid\n",
"from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__\n",
"from datetime import datetime, timedelta\n",
"\n",
"print(\"Azure Blob Storage v\" + __version__) # should be at least Azure Blob Storage v12.9.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Please update these variables - \n",
"###### Only If you are running this notebook separately, otherwise these parameters are picked up from main file\n",
"\n",
"### storage_conn_string \"Storage account connection string\"\n",
"### src_container \"Container where data is stored\"\n",
"### dst_container \"Container where results should be uploaded\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# # set up connection\n",
"# storage_conn_string = \"\"\n",
"# src_container = \"\"\n",
"# dst_container = \"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#or on linux, do: export AZURE_STORAGE_CONNECTION_STRING=\"<yourconnectionstring>\" and then do storage_conn_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')\n",
"\n",
"# Create the BlobServiceClient object which will be used to create a container client\n",
"blob_service_client = BlobServiceClient.from_connection_string(storage_conn_string)\n",
"\n",
"# create a client for a specific container\n",
"srcblob_container_client = blob_service_client.get_container_client(src_container)\n",
"\n",
"# create a client for a specific container\n",
"dstblob_container_client = blob_service_client.get_container_client(dst_container)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Download data from Blob\n",
"\n",
"### The following function downloads data from blob storage to local folder\n",
"#### parameters\n",
"\n",
"local_path \"Local path where data should be stored\"\n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"def download2local(local_path, container_client = srcblob_container_client, container_name = src_container):\n",
"\n",
" # Create a local directory to hold blob data\n",
" Path(local_path).mkdir(parents=True, exist_ok=True)\n",
"\n",
" # download all the files to the local_path folder\n",
" blob_list = container_client.list_blobs() # need to run this again, no way to reset the iterator\n",
" for blob in blob_list:\n",
"\n",
" # note that this is called on the service_client and not the container_client\n",
" blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob)\n",
"\n",
" download_file_path = os.path.join(local_path, blob.name)\n",
" print(\"\\tDownloading \" + blob.name + \" to \" + download_file_path)\n",
"\n",
" basedir = os.path.split(download_file_path)[0]\n",
" Path(basedir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" with open(download_file_path, \"wb\") as download_file:\n",
" download_file.write(blob_client.download_blob().readall())\n",
"\n",
" print(\"Files Downloaded.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#local_path = \"../data/samples\"\n",
"#download2local(local_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Upload data to Blob container\n",
"\n",
"### The following function uploads data from local drive to azure blob storage container\n",
"#### parameters\n",
"\n",
"local_path \"Local path where data should be stored\" \n",
"container_path \"Path on container where data should be stored\"\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from os.path import relpath\n",
"import glob\n",
"\n",
"def upload2blob(local_path, container_path = \"\", container_client = dstblob_container_client, container_name = dst_container):\n",
"\n",
" # list the local files\n",
" files_to_upload = glob.glob(os.path.join(local_path,\"**/*\"), recursive=True)\n",
"\n",
" for file_to_upload in files_to_upload:\n",
" print(\"\\nUploading to Azure Storage as blob: \\t\" + file_to_upload)\n",
"\n",
" # Upload the file\n",
" #path_to_file = os.path.join(local_path, file_to_upload)\n",
" path_to_file = file_to_upload\n",
" file_path_on_azure = relpath(file_to_upload, local_path)\n",
" file_path_on_azure = os.path.join(container_path,file_path_on_azure)\n",
"\n",
" #print(\"local path:\", path_to_file, \"Azure path:\", file_path_on_azure)\n",
" # Create a blob client using the local file name as the name for the blob\n",
" blob_client = blob_service_client.get_blob_client(container=container_name, blob=file_path_on_azure)\n",
"\n",
" path = Path(path_to_file)\n",
" if path.is_file():\n",
" with open(path_to_file, \"rb\") as data:\n",
" blob_client.upload_blob(data, overwrite = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# local_path = \"../data/samples\"\n",
"# upload2blob(local_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Delete data in Blob container\n",
"\n",
"### The following function deletes all data in azure blob storage container\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from os.path import relpath\n",
"import glob\n",
"\n",
"def deleteContainerData(container_client = dstblob_container_client, container_name = dst_container):\n",
"\n",
" blob_list = container_client.list_blobs() # need to run this again, no way to reset the iterator\n",
" for blob in blob_list:\n",
" dstblob_container_client.delete_blob(blob=blob)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create SAS signature for blob container\n",
"\n",
"### This container generates SAS URL for a Blob storage container\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from azure.storage.blob import AccessPolicy, ContainerSasPermissions\n",
"from azure.storage.blob import generate_container_sas, generate_account_sas\n",
"\n",
"def fr_get_sas_url(dst_container = dst_container):\n",
" #permission = ContainerSasPermissions(read=True, write=True, delete=True, list=True)\n",
" permission = ContainerSasPermissions.from_string(storage_conn_string)\n",
" expiry=datetime.utcnow() + timedelta(hours=24)\n",
" start=datetime.utcnow() - timedelta(minutes=1)\n",
"\n",
" access_policy = AccessPolicy(permission, expiry, start)\n",
"\n",
" sas_token = generate_container_sas(\n",
" account_name = blob_service_client.account_name,\n",
" container_name = dst_container,\n",
" account_key = blob_service_client.credential.account_key,\n",
" permission = permission,\n",
" expiry = expiry,\n",
" start = start\n",
" )\n",
" \n",
" #return(sas_token)\n",
" \n",
" url = 'https://'+blob_service_client.account_name+'.blob.core.windows.net/'+dst_container+'?'+sas_token\n",
" return(url)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sas_token = fr_get_sas_url()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sas_token"
]
}
],
"metadata": {
"interpreter": {
"hash": "6506c5172df6811f7bf01b57d8469d1357a539c9dc33b21363eaf6ce598c5969"
},
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

code/Data-utils.ipynb (new file, 180 lines)

@@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dea0c864",
"metadata": {},
"source": [
"# This note contains file processing utility functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b81510a8",
"metadata": {},
"outputs": [],
"source": [
"from pikepdf import Pdf\n",
"# import cv2\n",
"import os\n",
"import img2pdf\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48aa9832",
"metadata": {},
"outputs": [],
"source": [
"import PIL\n",
"PIL.Image.MAX_IMAGE_PIXELS = None"
]
},
{
"cell_type": "markdown",
"id": "345dc68d",
"metadata": {},
"source": [
"### This function converts pdf, jpg, tif and png files in \"src\" to pdf files and places them in \"pdfdst\" directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8ecb76a",
"metadata": {},
"outputs": [],
"source": [
"def convert2pdf(src, pdfdst):\n",
" \n",
" a4inpt = (img2pdf.mm_to_pt(210),img2pdf.mm_to_pt(297))\n",
" layout_fun = img2pdf.get_layout_fun(a4inpt)\n",
"\n",
" #convert images to pdf\n",
" for r, _, f in os.walk(src):\n",
" for fname in f:\n",
" flname = fname.lower()\n",
" if flname.endswith(\".pdf\"):\n",
" copyfile(os.path.join(r, fname), os.path.join(pdfdst, flname))\n",
" print(\"copying file: %s\" % fname)\n",
" elif flname.endswith(\".tif\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".tif\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src, layout_fun=layout_fun))\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".tiff\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".tiff\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".jpg\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".jpg\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" elif flname.endswith(\".png\"):\n",
" src = os.path.join(r, fname)\n",
" dst = os.path.join(pdfdst, flname.replace(\".png\",\".pdf\"))\n",
" print(\"converting file: %s\" % src)\n",
" with open(dst,\"wb\") as f:\n",
" f.write(img2pdf.convert(src), layout_fun=layout_fun)\n",
" #f.write(img2pdf.convert(src))\n",
" else:\n",
" continue"
]
},
{
"cell_type": "markdown",
"id": "fb94d80c",
"metadata": {},
"source": [
"### This function converts multipage PDF file to single page PDF files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77686dd3",
"metadata": {},
"outputs": [],
"source": [
"def pdf_split(src,dst):\n",
"\n",
" fls = os.listdir(src)\n",
"\n",
" for fl in fls:\n",
" pdf = Pdf.open(os.path.join(src,fl))\n",
" for n, page in enumerate(pdf.pages):\n",
" dst_fl = Pdf.new()\n",
" dst_fl.pages.append(page)\n",
" \n",
" dst_fl_name = '%02d-'%n+fl\n",
" dst_fl.save(os.path.join(dst,dst_fl_name))\n"
]
},
{
"cell_type": "markdown",
"id": "29f8ae93",
"metadata": {},
"source": [
"#### This function splits a list into \"n\" parts and create a list of lists"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8af06032",
"metadata": {},
"outputs": [],
"source": [
"def chunkify(lst,n):\n",
" return [lst[i::n] for i in range(n)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00db6a6b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "08528281",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/FR-utils.ipynb (new file, 388 lines)

@@ -0,0 +1,388 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "062cd3e4",
"metadata": {},
"source": [
"# This file has Form Recognizer Model trainign and Inferencing code"
]
},
{
"cell_type": "markdown",
"id": "0e790e5a",
"metadata": {},
"source": [
"#### Read configuration file and get endpoint, key of the service"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95f9d84e",
"metadata": {},
"outputs": [],
"source": [
"########### Python Form Recognizer Labeled Async Train #############\n",
"import json\n",
"import time\n",
"from requests import get, post\n",
"\n",
"#read form recognizer service parameters\n",
"with open('config.json','r') as config_file:\n",
" config = json.load(config_file)\n",
"\n",
"# Endpoint URL\n",
"endpoint = config['endpoint']\n",
"post_url = endpoint + r\"/formrecognizer/v2.1/custom/models\"\n",
"apim_key = config['apim-key']\n",
"filetype = 'application/json'\n",
"\n",
"\n",
"headers = {\n",
" # Request headers\n",
" 'Content-Type': filetype,\n",
" 'Ocp-Apim-Subscription-Key': apim_key,\n",
"}\n",
"\n",
"body = {\n",
" \"source\": \"\",\n",
" \"sourceFilter\": {\n",
" \"prefix\": \"\",\n",
" \"includeSubFolders\": False\n",
" },\n",
" \"useLabelFile\": False\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "f111694c",
"metadata": {},
"source": [
"# Unsupervised training "
]
},
{
"cell_type": "markdown",
"id": "c0f43e3e",
"metadata": {},
"source": [
"### Function to train unsupervised Form Recognizer Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fbe1b96",
"metadata": {},
"outputs": [],
"source": [
"n_tries = 60\n",
"n_try = 0\n",
"wait_sec = 60\n",
"\n",
"\n",
"def train_fr_model(sas_url, folder_path, model_file):\n",
" \n",
" body['source'] = sas_url\n",
" body['sourceFilter']['prefix'] = folder_path\n",
" \n",
" # trigger training\n",
" try:\n",
" resp = post(url = post_url, json = body, headers = headers)\n",
" #print(body)\n",
" #print(headers)\n",
" if resp.status_code != 201:\n",
" print(\"Training model failed (%s):\\n%s\" % (resp.status_code, json.dumps(resp.json())))\n",
" return\n",
" print(\"Training Started:\\n%s\" % resp.headers)\n",
" get_url = resp.headers[\"location\"]\n",
" except Exception as e:\n",
" print(\"Error occurred when triggering training:\\n%s\" % str(e))\n",
" quit()\n",
" \n",
" n_try = 0\n",
" #wait for training to complete and save model to json file\n",
" while n_try < n_tries:\n",
" try:\n",
" resp = get(url = get_url, headers = headers)\n",
" resp_json = resp.json()\n",
" if resp.status_code != 200:\n",
" print(\"Model training failed (%s):\\n%s\" % (resp.status_code, json.dumps(resp_json)))\n",
" break\n",
" model_status = resp_json[\"modelInfo\"][\"status\"]\n",
" print(\"Model Status:\", model_status)\n",
" if model_status == \"ready\":\n",
" #print(\"Training succeeded:\\n%s\" % json.dumps(resp_json))\n",
" print(\"Training succeeded:\")\n",
" with open(model_file,\"w\") as f:\n",
" json.dump(resp_json, f)\n",
" break\n",
" if model_status == \"invalid\":\n",
" print(\"Training failed. Model is invalid:\\n%s\" % json.dumps(resp_json))\n",
" break\n",
" # Training still running. Wait and retry.\n",
" time.sleep(wait_sec)\n",
" n_try += 1\n",
" except Exception as e:\n",
" msg = \"Model training returned error:\\n%s\" % str(e)\n",
" print(msg)\n",
" break\n",
"\n",
" if resp.status_code != 200:\n",
" print(\"Train operation did not complete within the allocated time.\") "
]
},
{
"cell_type": "markdown",
"id": "313e93ea",
"metadata": {},
"source": [
"# FR Model Inferencing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af82e67f",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import glob\n",
"import os\n",
"import datetime\n",
"import tempfile\n",
"import pandas as pd\n",
"import shutil\n",
"\n",
"from concurrent.futures import ThreadPoolExecutor, as_completed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebd521e2",
"metadata": {},
"outputs": [],
"source": [
"\n",
"params_infer = {\n",
" \"includeTextDetails\": True\n",
"}\n",
"\n",
"headers_infer = {\n",
" # Request headers\n",
" 'Content-Type': 'application/pdf',\n",
" 'Ocp-Apim-Subscription-Key': apim_key,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "2b1026a9",
"metadata": {},
"source": [
"#### Form Recognizer inferencing function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d211010f",
"metadata": {},
"outputs": [],
"source": [
"#######################################################\n",
"# FR Inference function multithreading\n",
"#######################################################\n",
"\n",
"def fr_mt_inference(files, json_fld, model_id):\n",
" \n",
" post_url = endpoint + \"formrecognizer/v2.1/custom/models/%s/analyze\" % model_id\n",
"\n",
" \n",
" ###################\n",
" #send all requests in one go\n",
" ###################\n",
" session = requests.Session()\n",
" url_list=[]\n",
" for fl in files:\n",
" \n",
" #read file\n",
" fname = os.path.basename(fl)\n",
" #print(\"working on file: %s, %s\" %(fl, datetime.datetime.now()))\n",
" with open(fl, \"rb\") as f:\n",
" data_bytes = f.read()\n",
" \n",
" #set variables to default values\n",
" get_url = None\n",
" #st_time = datetime.datetime.now()\n",
" st_time = datetime.now()\n",
" gap_between_requests = 1 #in seconds\n",
" \n",
" try:\n",
" \n",
" #send post request (wait and send if overlaoded)\n",
" post_success = 0\n",
" while post_success == 0:\n",
" resp = session.post(url = post_url, data = data_bytes, headers = headers_infer, params = params_infer)\n",
" if resp.status_code != 429:\n",
" break\n",
" time.sleep(1) \n",
" \n",
" #print(fl, resp.status_code)\n",
" \n",
" if resp.status_code != 202:\n",
" print(\"POST analyze failed:\\n%s\" % json.dumps(resp.json()))\n",
"\n",
" #print(\"POST analyze succeeded:\\n%s\" % resp.headers)\n",
" #print(\"POST analyze succeeded for %s \\n\" % fl)\n",
" get_url = resp.headers[\"operation-location\"]\n",
" except Exception as e:\n",
" print(\"POST analyze failed 1:\\n%s\" % str(e))\n",
" \n",
" url_list.append((fl, fname, get_url))\n",
" end_time = datetime.now()\n",
" #end_time = datetime.datetime.now()\n",
" delta = end_time - st_time\n",
" delta = delta.total_seconds()\n",
" if delta < gap_between_requests:\n",
" time.sleep(gap_between_requests - delta)\n",
"\n",
" ####################################\n",
" # get all responses in one go\n",
" ####################################\n",
" n_tries = 15\n",
" wait_sec = 15\n",
"\n",
" for cnt in range(n_tries):\n",
" \n",
" #get results of requests sent\n",
" completed = []\n",
" for i in range(len(url_list)):\n",
"\n",
" fl, fname, get_url = url_list[i]\n",
" if get_url is not None:\n",
"\n",
" try:\n",
" resp = session.get(url = get_url, headers = {\"Ocp-Apim-Subscription-Key\": apim_key})\n",
" resp_json = resp.json()\n",
"\n",
" if resp.status_code != 200:\n",
" print(\"GET analyze results failed:%s \\n%s\" % fl, json.dumps(resp_json))\n",
" break\n",
"\n",
" status = resp_json[\"status\"]\n",
" if status == \"succeeded\":\n",
" print(\"Analysis succeeded for %s:\\n\" % fl)\n",
" with open(os.path.join(json_fld,fname.replace('.pdf','.json')), 'w') as outfile:\n",
" json.dump(resp_json, outfile)\n",
"\n",
" completed.append(i)\n",
"\n",
" if status == \"failed\":\n",
" print(\"Analysis failed:%s \\n%s\" % fl, json.dumps(resp_json))\n",
" break\n",
" except Exception as e:\n",
" msg = \"GET analyze results failed 2:\\n%s\" % str(e)\n",
" print(msg)\n",
" break\n",
"\n",
" # remove files where\n",
" completed.sort(reverse=True)\n",
" for i in completed:\n",
" url_list.pop(i)\n",
"\n",
" print(\"iteration\",cnt,\"complete. Still\",len(url_list), \" to infer\")\n",
" if len(url_list) == 0:\n",
" break\n",
" \n",
" time.sleep(wait_sec)\n",
" \n",
" ####################################\n",
" # retun files not inferred\n",
" ####################################\n",
" session.close()\n",
" \n",
" if len(url_list) == 0:\n",
" return(\"All files successfully inferred by FR\")\n",
" else:\n",
" return(url_list)\n"
]
},
{
"cell_type": "markdown",
"id": "e55580da",
"metadata": {},
"source": [
"#### Form Recognizer multi-threading inferencing function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18fb8475",
"metadata": {},
"outputs": [],
"source": [
"# Form Recognizer inference\n",
"\n",
"def fr_model_inference(src_dir, json_dir, model_file, thread_cnt):\n",
" \n",
" #read model details\n",
" with open(model_file,'r') as model_file:\n",
" model = json.load(model_file)\n",
"\n",
" if model['modelInfo']['modelId'] != None :\n",
" model_id = model['modelInfo']['modelId']\n",
" print(\"model id: %s\" % model_id)\n",
" else:\n",
" print(\"Model details not present, either model training is not performed or the file is missing\")\n",
" return\n",
" \n",
" #Read files and divide into chunks\n",
" fls = glob.glob(os.path.join(src_dir, \"*.pdf\"))\n",
" print(\"inferencing \", len(fls), \"files with\", thread_cnt, \"thread count\")\n",
" fchunk = chunkify(fls, 100)\n",
" \n",
" for chunk in fchunk:\n",
" \n",
" fr_threads = min(len(chunk),thread_cnt)\n",
"\n",
" flist = chunkify(chunk, fr_threads)\n",
"\n",
" #Call FR inference \n",
" threads= []\n",
" with ThreadPoolExecutor(max_workers=thread_cnt) as executor:\n",
" for files in flist:\n",
" threads.append(executor.submit(fr_mt_inference, files, json_dir, model_id))\n",
"\n",
" for task in as_completed(threads):\n",
" print(task.result()) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/file-utils.ipynb (new file, 172 lines)

@@ -0,0 +1,172 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "e59451e7",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import glob\n",
"import numpy as np\n",
"from shutil import copyfile\n",
"import random\n",
"import pandas as pd\n",
"import json\n",
"import shutil\n"
]
},
{
"cell_type": "markdown",
"id": "4e8e337e",
"metadata": {},
"source": [
"# Data sampling\n",
"\n",
"#### This function samples data required for Form Recognizer Training.\n",
"It sampels such that the file count is < file_limit and Total sample size is < size_limit (in MB)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dbf5331",
"metadata": {},
"outputs": [],
"source": [
"def sample_training_data(src_fld, dst_fld, file_limit=500, size_limit=50):\n",
"\n",
" fls = glob.glob(os.path.join(src_fld,\"*.pdf\"))\n",
"\n",
" #sample and copy files\n",
" if len(fls) > file_limit:\n",
" flist = random.sample(fls,file_limit)\n",
" print(\"sampling files\")\n",
" else:\n",
" flist = fls\n",
" print(\"file count <\",file_limit, \"considering all files\")\n",
"\n",
" for fl in flist:\n",
" flname = os.path.basename(fl)\n",
" copyfile(os.path.join(src_fld, flname), os.path.join(dst_fld, flname))\n",
"\n",
" #delete files if they exceed size limit\n",
" fls = os.listdir(dst_fld)\n",
" file_n_size = [(f,os.stat(os.path.join(dst_fld, f)).st_size) for f in fls]\n",
" #print(file_n_size)\n",
" \n",
" file_df = pd.DataFrame(file_n_size,columns=['file','size'])\n",
" file_df = file_df.sort_values(by=['size'])\n",
" file_df['total_size'] = file_df['size'].cumsum()\n",
" file_df['to_delete'] = file_df['total_size'] < size_limit*10^6\n",
" \n",
" del_df = file_df[file_df['to_delete'] == True]\n",
" if del_df.shape[0] > 0:\n",
" for idx, row in del_df.iterrows():\n",
" flname = row['file']\n",
" os.remove(os.path.join(dst_fld, flname))\n",
" \n",
" print(file_df.tail(5))\n",
" #return(file_df)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "341c27ad",
"metadata": {},
"outputs": [],
"source": [
"# %%time\n",
"\n",
"# sample_training_data(src_fld=\"../data/samples/trainset1\", dst_fld=\"../data/train_sample\")"
]
},
{
"cell_type": "markdown",
"id": "35f00bf5",
"metadata": {},
"source": [
"# Segregate data function\n",
"\n",
"#### This function segregates data into templates."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "57aa05f7",
"metadata": {},
"outputs": [],
"source": [
"# function to read cluster id and segregate the data\n",
"\n",
"def segregate_data(src_dir, result_dir, cluster_dir, prefix, cluster_file):\n",
" \n",
" fls = glob.glob(os.path.join(result_dir,\"*.json\"))\n",
"\n",
" cols = ['filename','clusterId']\n",
" clusters = pd.DataFrame(columns=cols)\n",
" moved_file_cnt = 0\n",
"\n",
" for fl in fls:\n",
"\n",
" f = open(fl)\n",
" dat = json.load(f)\n",
" if dat['status'] == \"succeeded\":\n",
" cid = dat['analyzeResult']['pageResults'][0]['clusterId']\n",
" flname = os.path.basename(fl)\n",
" tdf = pd.DataFrame([[flname, cid]], columns=cols)\n",
" clusters = clusters.append(tdf, ignore_index=True)\n",
" #print(\"clusterID for: \"+fl+\" is: \"+str(cid), not (cid is None))\n",
"\n",
" if not (cid is None):\n",
" #move files with cluster ID to cluster folders\n",
" pdf_name = flname.replace(\".json\",\".pdf\")\n",
" fld_name = prefix+\"-\"+str(cid)\n",
" fld_path = os.path.join(cluster_dir,fld_name)\n",
" if not os.path.isdir(fld_path):\n",
" os.makedirs(fld_path)\n",
" print(\"creating folder:\"+fld_path)\n",
" if os.path.isfile(os.path.join(src_dir,pdf_name)):\n",
" shutil.move(os.path.join(src_dir,pdf_name), os.path.join(fld_path,pdf_name))\n",
" #print(os.path.join(src_dir,pdf_name), os.path.join(fld_path,pdf_name))\n",
" moved_file_cnt = moved_file_cnt + 1\n",
"\n",
" clusters.to_csv(cluster_file)\n",
" \n",
" return(moved_file_cnt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b9c3d39",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

code/main.ipynb (new file, 322 lines)

@@ -0,0 +1,322 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "0f512263",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import glob\n",
"import tempfile\n",
"from pathlib import Path"
]
},
{
"cell_type": "markdown",
"id": "0dff2838",
"metadata": {},
"source": [
"#### Provide storage account parameters here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c846d47d",
"metadata": {},
"outputs": [],
"source": [
"storage_conn_string = \"\"\n",
"src_container = \"\"\n",
"dst_container = \"\""
]
},
{
"cell_type": "markdown",
"id": "6647b1f1",
"metadata": {},
"source": [
"# Import functions from other notebooks"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d780d572",
"metadata": {},
"outputs": [],
"source": [
"%run \"Data-utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf232a80",
"metadata": {},
"outputs": [],
"source": [
"%run \"FR-Utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc58d0b3",
"metadata": {},
"outputs": [],
"source": [
"%run \"file-utils.ipynb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e66e781c",
"metadata": {},
"outputs": [],
"source": [
"%run \"AzureBlobStorageLib.ipynb\" storage_conn_string src_container dst_container"
]
},
{
"cell_type": "markdown",
"id": "f49fe7ec",
"metadata": {},
"source": [
"# Data Preparation\n",
"\n",
"#### Steps include\n",
"\n",
"1. Downlaoding data from Blob Storage\n",
"2. Converting all format files to PDF files\n",
"3. Splitting multipage PDF to single page PDF files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1598f440",
"metadata": {},
"outputs": [],
"source": [
"# Create temporary directory and download files\n",
"\n",
"#temp_dir = tempfile.TemporaryDirectory()\n",
"#data_dir = temp_dir.name\n",
"\n",
"data_dir = \"../data/\"\n",
"if os.path.exists(data_dir) :\n",
" shutil.rmtree(data_dir)\n",
"Path(data_dir).mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24a666dd",
"metadata": {},
"outputs": [],
"source": [
"raw_files = os.path.join(data_dir,\"rawFiles\")\n",
"Path(raw_files).mkdir(parents=True, exist_ok=True)\n",
"\n",
"download2local(raw_files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e84907a9",
"metadata": {},
"outputs": [],
"source": [
"# convert all file types to pdf and then split to 1-p docs\n",
"raw_pdf = os.path.join(data_dir,\"allPdf\")\n",
"Path(raw_pdf).mkdir(parents=True, exist_ok=True)\n",
"convert2pdf(src = raw_files, pdfdst = raw_pdf)\n",
"print(\"Input files are stored at:\", raw_pdf)\n",
"\n",
"pdf_1p = os.path.join(data_dir,\"1p-pdf\")\n",
"Path(pdf_1p).mkdir(parents=True, exist_ok=True)\n",
"pdf_split(src = raw_pdf, dst = pdf_1p)\n",
"print(\"Processed files are stored at:\", pdf_1p)"
]
},
{
"cell_type": "markdown",
"id": "c7219171",
"metadata": {},
"source": [
"#### Get initial Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7ab8343",
"metadata": {},
"outputs": [],
"source": [
"#get initial file count\n",
"fls = glob.glob(os.path.join(pdf_1p,\"*.pdf\"))\n",
"initial_file_cnt = len(fls)\n",
"print(initial_file_cnt)\n",
"\n",
"#results directory\n",
"results_dir = os.path.join(data_dir,\"Results\")\n",
"Path(results_dir).mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fce047f7",
"metadata": {},
"outputs": [],
"source": [
"# generate SAS signature for storage container\n",
"sas_url = fr_get_sas_url(dst_container)\n",
"#sas_url = \"\"\n",
"sas_url "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d6bd362",
"metadata": {},
"outputs": [],
"source": [
"# Call below function, if you want to clean up the destination blob container\n",
"#deleteContainerData()"
]
},
{
"cell_type": "markdown",
"id": "91cbc126",
"metadata": {},
"source": [
"# FR Template identification process"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3027cb3",
"metadata": {},
"outputs": [],
"source": [
"iteration = 1\n",
"\n",
"while(iteration):\n",
" \n",
" ########################################################\n",
" # create directory for current iteration\n",
" ########################################################\n",
" iter_fld = \"I\"+str(iteration)\n",
" iter_dir = os.path.join(results_dir,iter_fld)\n",
" Path(iter_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" ########################################################\n",
" # sample files for training\n",
" ########################################################\n",
" train_fld = \"trainset\"\n",
" train_dir = os.path.join(iter_dir,train_fld)\n",
" Path(train_dir).mkdir(parents=True, exist_ok=True)\n",
" sample_training_data(src_fld = pdf_1p, dst_fld = train_dir)\n",
"\n",
" ########################################################\n",
" # upload the files to blob\n",
" ########################################################\n",
" blob_path = os.path.join(iter_fld,train_fld)\n",
" upload2blob(local_path = train_dir, container_path = blob_path)\n",
" \n",
" ########################################################\n",
" #train FR unsupervised model\n",
" ########################################################\n",
" model_file = iter_fld+\"-model-details.json\"\n",
" train_fr_model(sas_url = sas_url, folder_path = blob_path.replace(\"\\\\\", \"/\"), model_file = model_file)\n",
"\n",
" ########################################################\n",
" #if model is created infer using the model\n",
" ########################################################\n",
" iter_model_file = os.path.join(iter_dir, model_file)\n",
" if os.path.exists(model_file):\n",
" shutil.copyfile(model_file, iter_model_file)\n",
"\n",
" infer_fld = \"fr-json\"\n",
" infer_dir = os.path.join(iter_dir,infer_fld)\n",
" Path(infer_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" #Start FR inferencing\n",
" fr_model_inference(src_dir = pdf_1p, json_dir = infer_dir, model_file = iter_model_file, thread_cnt = 10)\n",
"\n",
" ########################################################\n",
" # Segregate files to clusters\n",
" ########################################################\n",
" clust_dir = os.path.join(results_dir,\"clusters\")\n",
" Path(clust_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" cluster_file = os.path.join(iter_dir, iter_fld+\"-clusters.csv\")\n",
"\n",
" files_clustered = segregate_data(src_dir = pdf_1p, result_dir= infer_dir, cluster_dir = clust_dir, \n",
" prefix = iter_fld, cluster_file = cluster_file)\n",
"\n",
" print(\"Identified clusters for:\", files_clustered, \"files\")\n",
"\n",
" ########################################################\n",
" # Upload iteration results to blob storage\n",
" ########################################################\n",
" \n",
" upload2blob(local_path = iter_dir, container_path = iter_fld) #train data, model details and clusters\n",
" upload2blob(local_path = clust_dir, container_path = \"clusters\") #Files segregated into clusters\n",
" \n",
" ########################################################\n",
" # decide on next iteration\n",
" ########################################################\n",
" moved_percent = files_clustered * 100 / initial_file_cnt\n",
"\n",
" if (moved_percent < 5) | (initial_file_cnt < 500):\n",
" iteration = 0\n",
" else:\n",
" iteration = iteration + 1\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d7eacb0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "6506c5172df6811f7bf01b57d8469d1357a539c9dc33b21363eaf6ce598c5969"
},
"kernelspec": {
"display_name": "Python 3.6 - AzureML",
"language": "python",
"name": "python3-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

images/architecture.gif (new binary file, 69 KiB; binary file not shown)

images/fr-config-file.jpg (new binary file, 14 KiB; binary file not shown)

images/storage-account-params.jpg (new binary file, 20 KiB; binary file not shown)