Genalog is an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

data-generation data-science machine-learning ner ocr-recognition python synthetic-data synthetic-data-generation synthetic-images text-alignment

Jianjie Liu 5339c76d05 Merge pull request #15 from microsoft/laserprec/ci_templates Add the follow CI pipelines and use template to construct all pipelines: 1. Nightly Build 2. PR Gate with full test matrix for unit and e2e tests		2021-01-28 00:27:04 -05:00
devops	Add publish artifacts step	2021-01-27 22:43:00 -05:00
example	Code Migration from Azure DevOps (#2 )	2020-07-17 12:23:20 -04:00
genalog	Raise errors when writing to disk	2021-01-27 10:07:28 -05:00
tests	Mark more io tests	2021-01-27 22:18:36 -05:00
.gitignore	Code Migration from Azure DevOps (#2 )	2020-07-17 12:23:20 -04:00
CODEOWNERS	Update CODEOWNERS	2021-01-26 10:17:00 -05:00
CODE_OF_CONDUCT.md	Initial CODE_OF_CONDUCT.md commit	2020-06-15 13:35:09 -07:00
LICENSE	Code Migration from Azure DevOps (#2 )	2020-07-17 12:23:20 -04:00
MANIFEST.in	Tox-runnable	2021-01-26 09:41:38 -05:00
README.md	Update build badge url	2021-01-25 17:16:19 -05:00
SECURITY.md	Initial SECURITY.md commit	2020-06-15 13:35:11 -07:00
VERSION.txt	Version bump to alpha3 to test release pipeline	2021-01-25 17:18:50 -05:00
requirements-dev.txt	Raise errors when writing to disk	2021-01-27 10:07:28 -05:00
requirements.txt	Remove deprecated azure-storage dependencies	2021-01-25 11:39:51 -05:00
setup.py	Tox-runnable	2021-01-26 09:41:38 -05:00
tox.ini	Add detail test summary report	2021-01-27 22:18:36 -05:00

README.md

Genalog - Synthetic Data Generator

Genalog is an open source, cross-platform Python package for generating synthetic document images from text data. The tool also lets you apply various degradations to these images. Its purpose is to provide a fast and efficient way to generate synthetic documents from text data by reusing layouts from templates that you write in simple HTML.
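
As a rough illustration of the template idea, the sketch below fills a simple HTML layout with text using only the Python standard library. It is not Genalog's API: the template string, the `render_document` helper, and the field names are assumptions made for this example.

```python
from html import escape
from string import Template

# Hypothetical HTML layout; a real template would also carry fonts, columns, etc.
PAGE_TEMPLATE = Template(
    "<html><body>"
    "<h1>$title</h1>"
    "<p>$body</p>"
    "</body></html>"
)


def render_document(title: str, body: str) -> str:
    """Fill the layout with escaped text so the markup stays well-formed."""
    return PAGE_TEMPLATE.substitute(title=escape(title), body=escape(body))


if __name__ == "__main__":
    print(render_document("Sample", "Text with <angle brackets> & ampersands."))
```

Keeping the layout in the template and the text in plain strings means the same text corpus can be rendered with many different layouts.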

Overview

Genalog has various capabilities:

Flexible format Image Generation
Custom image degradation
Extract Text from Images using Cognitive Search Pipeline
Get OCR Performance Metrics

The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language, and to imitate most of the OCR noise found in scanned text documents.
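
One common way to quantify OCR noise is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the ground-truth length. The sketch below is a generic implementation, not Genalog's metrics code; `levenshtein` and `character_error_rate` are names chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance between the two strings per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)
```

For instance, reading "New York" as "Ncw Yoik" is two substitutions over eight characters.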

Getting Started

The following is a summary of the common application scenarios of Genalog. Please refer to the Jupyter notebook examples, which make use of Genalog's core code base and the repository utilities.

	Steps	In-depth Jupyter Notebook Examples	Quick Start Guides
1	Create Template for Image Generation	Demo Notebook	Here is our guide to Document Generation
2	Degrade Prebuilt Images	Demo Notebook	Here is our guide to Image Degradation
3	Get Text From Images Using OCR	Demo Notebook	Here is our guide to Extracting Text
4	Align Text Produced from OCR with Ground Truth Text	Demo Notebook	Here is our guide to Text Alignment
5	NER Label Propagation from Ground Truth to OCR Tokens	Demo Notebook	Here is our guide to Label Propagation
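
Steps 4 and 5 can be pictured with a small token-level sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in aligner; Genalog's own alignment and label propagation are described in the guides above and may work differently. The function `propagate_labels` and the sample tokens are made up for this example.

```python
from difflib import SequenceMatcher


def propagate_labels(gt_tokens, gt_labels, ocr_tokens, default="O"):
    """Copy NER labels from ground-truth tokens onto aligned OCR tokens."""
    ocr_labels = [default] * len(ocr_tokens)
    matcher = SequenceMatcher(a=gt_tokens, b=ocr_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" blocks match exactly; "replace" blocks of the same length
        # are treated as one-to-one OCR misreadings of the same tokens.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(i2 - i1):
                ocr_labels[j1 + offset] = gt_labels[i1 + offset]
    return ocr_labels


if __name__ == "__main__":
    gt = ["Visit", "New", "York", "today"]
    labels = ["O", "B-LOC", "I-LOC", "O"]
    print(propagate_labels(gt, labels, ["Visit", "Ncw", "York", "today"]))
```

In this simplified version, tokens that OCR splits or merges fall back to the default label; handling those cases well is the hard part that a dedicated aligner addresses.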

We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

	Scenario	In-depth Jupyter Notebook
1	Synthetic Dataset Generation with LABELED NER Dataset	Demo Notebook
2	Synthetic Dataset Batch Generation with Varying Degradation	Demo Notebook
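
The batch scenario boils down to applying different degradation settings to the same clean image. The sketch below shows that pattern on a tiny grayscale image stored as nested lists. The effect functions (`salt_noise`, `invert`), the `(name, kwargs)` pipeline format, and `degrade` are all hypothetical stand-ins for this example, not Genalog's degradation API.

```python
import random


def salt_noise(image, amount, seed=0):
    """Set roughly `amount` of the pixels to white (255)."""
    rng = random.Random(seed)
    return [[255 if rng.random() < amount else px for px in row] for row in image]


def invert(image):
    """Flip dark and light pixels."""
    return [[255 - px for px in row] for row in image]


# Hypothetical registry mapping effect names to functions.
EFFECTS = {"salt": salt_noise, "invert": invert}


def degrade(image, pipeline):
    """Apply each (effect_name, kwargs) step in order and return a new image."""
    for name, kwargs in pipeline:
        image = EFFECTS[name](image, **kwargs)
    return image


clean = [[0] * 8 for _ in range(8)]
pipelines = {
    "light": [("salt", {"amount": 0.05})],
    "heavy": [("salt", {"amount": 0.5}), ("invert", {})],
}
# One degraded copy of the clean image per named configuration.
batch = {label: degrade(clean, steps) for label, steps in pipelines.items()}
```

Describing each variant as data rather than code makes it easy to sweep over many degradation settings when building a dataset.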

Installation

Genalog is currently in a pre-release stage. The current stable release is published to TestPyPI.

pip install -i https://test.pypi.org/simple/ genalog

Installation from Source:

Create and activate the virtual environment in which you want to install the package:
1. python -m venv .env
2. source .env/bin/activate or on Windows .env\Scripts\activate.bat
3. pip install --upgrade pip setuptools
Then clone the repository and install it in editable mode:
git clone https://github.com/microsoft/genalog.git
cd genalog
pip install -e .
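
After installing, a quick check like the one below confirms that the package is visible to the active interpreter. It relies only on `importlib.metadata` (Python 3.8+); the helper name `installed_version` is chosen for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("genalog")
    print(f"genalog {found}" if found else "genalog is not installed in this environment")
```

If the check reports that the package is missing, the most likely cause is that the virtual environment was not activated before running pip.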

Other Requirements:

If you want to use Azure's OCR capabilities to extract text from the images, you will need the following resources:
1. Azure Cognitive Search Service (see the Quickstart Guide)
2. Azure Blob Storage (see the Quickstart Guide)
See the Azure docs for more information on Azure Cognitive Search.

Repo Structure

Tools-Synthetic-Data-Generator
├────genalog
│       ├──── generation                     # generate text images
│       ├──── degradation                    # methods for image degradation
│       ├──── ocr                            # running the Azure Search Pipeline
│       └──── text                           # methods to align OCR output text with input text
├────example                                 # Example Jupyter Notebooks for Various Synthetic Data Generation Scenarios
├────tests                                   # PyTest files
├────README.md                               # Main Readme file
└────LICENSE                                 # License file

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contribution Guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

Collaborators

Genalog was originally developed by the MAIDAP team at Microsoft Cambridge NERD in association with the Text Analytics Team in Redmond.