
Genalog Core

This is the core of the package. It contains the components needed to generate new documents, degrade them, and extract text from the degraded images using Azure's OCR capabilities.

Image Generation

This directory contains the class implementations for image generation, which is built on Jinja templates. You can create a Jinja HTML template for any image layout and specify content variables to fill the template with content. This lets you describe a document layout as declaratively as possible.

Here is our guide to Image Generation

Image Degradation

This directory contains the class implementations for degrading your images so that they simulate real-world document degradation.

Here is our guide to Image Degradation

Extract Text from Images

This directory contains the class implementations for extracting text from images using the Azure OCR process.

Here is our guide to Extract Text from Images

Text Alignment

This directory contains the class implementations for text alignment. These capabilities are needed when you align the original text with its noisy counterpart, that is, the OCR output of a degraded document, which typically contains recognition errors. The method genalog.text.alignment.align() uses Biopython's implementation of the Needleman-Wunsch algorithm. The algorithm uses dynamic programming to search exhaustively over all candidate alignments, assigns a weighted score to each candidate, and returns those with the highest score. Note that it has quadratic time and space complexity, so it is inefficient for aligning longer strings.
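
To make the quadratic cost concrete, here is a minimal pure-Python sketch of Needleman-Wunsch global alignment. It is not Biopython's implementation and not the code behind genalog.text.alignment.align(). The function name, the scoring values (match, mismatch, gap), and the "@" gap character are assumptions chosen for illustration. The sketch returns a single best alignment, not every top-scoring candidate.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="@"):
    """Globally align strings a and b; return (aligned_a, aligned_b, score).

    Illustrative sketch only: scoring values and gap_char are assumptions.
    """
    n, m = len(a), len(b)

    # The (n+1) x (m+1) score table is the source of the quadratic
    # time and space cost.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap inserted in b
                score[i][j - 1] + gap,            # gap inserted in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1

    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_gt, aligned_ocr, best = needleman_wunsch("New York", "Nw York")
print(aligned_gt)   # New York
print(aligned_ocr)  # N@w York
```

Both returned strings have the same length, so character i of one lines up with character i of the other, and the gap character marks where the OCR output dropped a character.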

For more efficient alignment of longer documents, we also include an implementation of the RETAS method from the paper "A Fast Alignment Scheme for Automatic OCR Evaluation of Books" in genalog.text.anchor.align_w_anchor(). We recommend using this method for inputs longer than 200 characters.
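
The idea behind anchor-based alignment can be sketched as follows: words that occur exactly once in both texts serve as anchors, the texts are cut at those anchors, and each short segment can then be aligned independently with a quadratic algorithm. This is a simplified illustration and not the code in genalog.text.anchor. The function names are hypothetical, and the greedy in-order filter below is an assumption made for brevity in place of the anchor selection described in the paper.

```python
from collections import Counter


def find_anchors(gt_tokens, ocr_tokens):
    """Return (gt_index, ocr_index) pairs for words unique to both lists.

    Hypothetical helper: anchors are kept greedily so that their OCR
    positions are strictly increasing.
    """
    gt_counts = Counter(gt_tokens)
    ocr_counts = Counter(ocr_tokens)
    unique = {w for w, c in gt_counts.items() if c == 1 and ocr_counts[w] == 1}
    ocr_index = {w: j for j, w in enumerate(ocr_tokens) if w in unique}

    anchors = []
    last_j = -1
    for i, w in enumerate(gt_tokens):
        if w in unique and ocr_index[w] > last_j:
            last_j = ocr_index[w]
            anchors.append((i, last_j))
    return anchors


def split_at_anchors(gt_tokens, ocr_tokens, anchors):
    """Cut both token lists at the anchors into paired segments."""
    segments = []
    gi = oi = 0
    for i, j in anchors:
        # Tokens between the previous anchor and this one.
        if gt_tokens[gi:i] or ocr_tokens[oi:j]:
            segments.append((gt_tokens[gi:i], ocr_tokens[oi:j]))
        # The anchor word itself, which matches exactly.
        segments.append(([gt_tokens[i]], [ocr_tokens[j]]))
        gi, oi = i + 1, j + 1
    # Tokens after the last anchor.
    if gt_tokens[gi:] or ocr_tokens[oi:]:
        segments.append((gt_tokens[gi:], ocr_tokens[oi:]))
    return segments


gt = "the quick brown fox jumps over the lazy dog".split()
ocr = "the qu1ck brown fox jumps ovr the lazy d0g".split()
anchors = find_anchors(gt, ocr)
for gt_seg, ocr_seg in split_at_anchors(gt, ocr, anchors):
    print(gt_seg, ocr_seg)
```

Because each segment between anchors is short, running a quadratic aligner on every segment costs far less than running it once on the whole document.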

Here is our guide to Text Alignment