* Add documentation in jupyter-book
* Add Trademark Notice
This commit is contained in:
Jianjie Liu 2021-07-19 17:21:45 -04:00 committed by GitHub
Parent 6180948ce5
Commit 0e982f2724
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
40 changed files: 1494 additions and 324 deletions


@ -2,7 +2,7 @@
[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
Genalog is an open source, cross-platform python package allowing to generate synthetic document images with text data. Tool also allows you to add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
`Genalog` is an open source, cross-platform python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
Overview
-------------------------------------
@ -85,16 +85,23 @@ If you are running on Windows, MacOS, or other Linux distributions, please see [
Repo Structure
-------------------
Tools-Synthetic-Data-Generator
genalog
├────genalog
│ ├─── generation # generate text images
│ ├──── degradation # methods for image degradation
│ ├──── ocr # running the Azure Search Pipeline
│ └──── text # methods to Align OCR Output Text with Input Text
├────examples # Example Jupyter Notebooks for Various Synthetic Data Generation Scenarios
├────tests # PyTest files
├────README.md # Main Readme file
└────LICENSE # License file
│ └──── text # methods to Align OCR Output Text with
├────devops # CI/CD pipelines
├────docs # online documentation
├────examples # example Jupyter Notebooks for Various
├────tests # tests
├────tox.ini # CI orchestration and configurations
├────README.md
└────LICENSE
Trademark Notice
--------------------
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
Microsoft Open Source Code of Conduct
-------------------------------------
@ -118,7 +125,6 @@ For more information see the [Code of Conduct FAQ](https://opensource.microsoft.
or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Collaborators
-------------------------------------
Genalog was originally developed by the [MAIDAP team at Microsoft Cambridge NERD](http://www.microsoftnewengland.com/nerd-ai/) in association with the Text Analytics Team in Redmond.

docs/.gitignore (vendored)

@ -1,3 +1,3 @@
_build/
_static/
_templates/
**/example.txt
**/_build
**/data


@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@ -1,67 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0, os.path.abspath('..'))
sys.path.insert(0, os.path.abspath('../genalog'))
sys.path.insert(0, os.path.abspath('../genalog/degradation'))
# -- Project information -----------------------------------------------------
project = 'genalog'
copyright = '2021, Microsoft'
author = 'Microsoft'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.napoleon',
'sphinx.ext.coverage',
]
# The master toctree document.
master_doc = 'index'
autodoc_member_order = 'groupwise'
autoclass_content = 'both'
# Napoleon settings
napoleon_google_docstring = True
napoleon_numpy_docstring = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']


@ -1,29 +0,0 @@
genalog.degradation package
===========================
Submodules
----------
genalog.degradation.degrader module
-----------------------------------
.. automodule:: genalog.degradation.degrader
:members:
:undoc-members:
:show-inheritance:
genalog.degradation.effect module
---------------------------------
.. automodule:: genalog.degradation.effect
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: genalog.degradation
:members:
:undoc-members:
:show-inheritance:


@ -1,32 +0,0 @@
genalog package
===============
Subpackages
-----------
.. toctree::
:maxdepth: 4
genalog.degradation
genalog.generation
genalog.ocr
genalog.text
Submodules
----------
genalog.pipeline module
-----------------------
.. automodule:: genalog.pipeline
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: genalog
:members:
:undoc-members:
:show-inheritance:


@ -0,0 +1,46 @@
title : <h1 style="font-size:2em;text-align:center;color:#FF5733">Genalog</h1>
author: Jianjie Liu and Amit Gupte
# logo: 'qe-logo-large.png'
# Short description about the book
description: >-
Guide for end-to-end synthetic analog document generation
execute:
execute_notebooks : off
# Interact link settings
notebook_interface : "notebook"
# Launch button settings
repository:
url : https://github.com/microsoft/genalog
path_to_book : /docs/genalog_docs
branch : main
launch_buttons:
notebook_interface : classic
# HTML-specific settings
html:
home_page_in_navbar : false
use_repository_button : true
# # LaTeX settings
# bibtex_bibfiles:
# - _bibliography/references.bib
# latex:
# latex_engine : "xelatex"
# latex_documents:
# targetname: book.tex
sphinx:
extra_extensions:
- sphinx_inline_tabs
- sphinx.ext.autodoc
- sphinx.ext.napoleon
- sphinx.ext.viewcode
config:
napoleon_google_docstring: True
autodoc_member_order: groupwise
autoclass_content: both


@ -0,0 +1,24 @@
root: index
format: jb-book
defaults:
numbered: false
parts:
- caption: Getting Started
chapters:
- file: installation
- file: generation_pipeline
- file: e2e_dataset_pipeline
- caption: Fabricating Document & Noise
chapters:
- file: doc_generation
- file: doc_degradation
- caption: Handling Noisy Text
chapters:
- file: text_alignment
- file: ocr_label_propagation
- caption: API Documentation
chapters:
- file: docstring/genalog.degradation
- file: docstring/genalog.generation
- file: docstring/genalog.ocr
- file: docstring/genalog.text


@ -0,0 +1,257 @@
# Degrade a document
The `genalog.degradation` module lets you degrade any image with real-world degradation effects.
## Download a sample image
We can download a [sample image](https://github.com/microsoft/genalog/blob/main/example/sample/degradation/text_zoomed.png) from our repo, but you are welcome to skip this step and use an image you generated in the [previous page](document-generation) or elsewhere.
```python
import cv2
import requests
sample_img_url = "https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/degradation/text_zoomed.png"
sample_img = "text_zoomed.png"
# Download the sample image and save it to disk
r = requests.get(sample_img_url, allow_redirects=True)
with open(sample_img, "wb") as f:
    f.write(r.content)
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
```
## Degrader
The `Degrader` class is the standard way to apply multiple degradations to an image.
```python
import cv2
from genalog.degradation.degrader import Degrader
from matplotlib import pyplot as plt
# We are applying degradation effects to the images in the following sequence:
# blur -> bleed_through -> salt
degradations = [
("blur", {"radius": 3}),
("bleed_through", {"alpha": 0.8}),
("salt", {"amount": 0.5}),
]
# All of the referenced degradation effects are in submodule `genalog.degradation.effect`
degrader = Degrader(degradations)
dst = degrader.apply_effects(src)
plt.imshow(dst, cmap="gray")
```
```{image} static/degrader.png
:width: 40%
:align: center
```
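To make the `(name, kwargs)` convention concrete, here is a minimal sketch of how a sequence of degradations could be applied one after another. It is not `genalog`'s implementation: `EFFECTS`, `invert`, `add_offset`, and `apply_effects_sketch` are hypothetical names, and the "image" is a plain nested list rather than an OpenCV array.

```python
# Hypothetical stand-ins for degradation effects; the real ones live in
# `genalog.degradation.effect` and operate on image arrays.
def invert(img, max_value=255):
    """Flip dark and light pixels."""
    return [[max_value - px for px in row] for row in img]


def add_offset(img, value=0, max_value=255):
    """Brighten every pixel by `value`, clipped to [0, max_value]."""
    return [[min(max(px + value, 0), max_value) for px in row] for row in img]


# Registry mapping effect names to functions (an assumption of this sketch).
EFFECTS = {"invert": invert, "add_offset": add_offset}


def apply_effects_sketch(img, degradations):
    """Apply each (effect_name, kwargs) pair in order, feeding the output forward."""
    current = img
    for name, kwargs in degradations:
        current = EFFECTS[name](current, **kwargs)
    return current


toy_img = [[0, 100], [200, 255]]
toy_degradations = [("invert", {}), ("add_offset", {"value": 10})]
toy_result = apply_effects_sketch(toy_img, toy_degradations)
```

The point of the sketch is the ordering: each effect receives the output of the previous one, so swapping two entries in the list generally changes the result.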
### Advanced Degradation Configurations
`genalog` provides an enumeration, `ImageState`, to reference the image at different states in the degradation process. For example:
1. `ImageState.ORIGINAL_STATE` refers to the original state of the image before applying any degradation, while
1. `ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.
This is most useful when you want to combine multiple layers of degradation, like the following examples.
```python
from genalog.degradation.degrader import Degrader, ImageState
degradations = [
("morphology", {"operation": "open", "kernel_shape":(9,9), "kernel_type":"plus"}),
("morphology", {"operation": "close", "kernel_shape":(9,1), "kernel_type":"ones"}),
("salt", {"amount": 0.7}),
("overlay", {
"src": ImageState.ORIGINAL_STATE,
"background": ImageState.CURRENT_STATE,
}),
("bleed_through", {
"src": ImageState.CURRENT_STATE,
"background": ImageState.ORIGINAL_STATE,
"alpha": 0.90,
"offset_x": -5,
"offset_y": -5,
}),
("pepper", {"amount": 0.005}),
("blur", {"radius": 3}),
("salt", {"amount": 0.15}),
]
degrader = Degrader(degradations)
dst = degrader.apply_effects(src)
plt.imshow(dst, cmap="gray")
```
```{image} static/degrader_heavy.png
:width: 40%
:align: center
```
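One way to picture the two states is as placeholders that are swapped for actual images just before each effect runs. The sketch below illustrates that idea only; `ImageStateSketch` and `resolve_kwargs` are hypothetical names, not part of `genalog`, and strings stand in for image arrays.

```python
from enum import Enum, auto


class ImageStateSketch(Enum):
    """Hypothetical mirror of `ImageState`, used only for this illustration."""
    ORIGINAL_STATE = auto()
    CURRENT_STATE = auto()


def resolve_kwargs(kwargs, original, current):
    """Replace state placeholders in `kwargs` with the matching image."""
    lookup = {
        ImageStateSketch.ORIGINAL_STATE: original,
        ImageStateSketch.CURRENT_STATE: current,
    }
    resolved = {}
    for key, value in kwargs.items():
        if isinstance(value, ImageStateSketch):
            resolved[key] = lookup[value]
        else:
            resolved[key] = value  # ordinary arguments pass through unchanged
    return resolved


original_img = "pristine image"    # strings standing in for image arrays
current_img = "image after blur"
resolved = resolve_kwargs(
    {
        "src": ImageStateSketch.CURRENT_STATE,
        "background": ImageStateSketch.ORIGINAL_STATE,
        "alpha": 0.9,
    },
    original_img,
    current_img,
)
```

Under this reading, a `bleed_through` entry with `"background": ImageState.ORIGINAL_STATE` blends the degraded page with the clean original rather than with itself.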
## Blur
Blur occurs when a scanner cannot focus on the document properly, which leaves the document looking foggy or hazy.
```python
# Import Genalog Degradations and other libraries
import genalog.degradation.effect as effect
import cv2
from matplotlib import pyplot as plt
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
# Add noise to the Image
blurred = effect.blur(src, radius=7) # the larger the radius, the lower the contrast
plt.imshow(blurred, cmap="gray")
plt.title('blurred', fontsize=6)
plt.show()
```
```{image} static/blur.png
:width: 60%
:align: center
```
## Bleed Through
This effect tries to mimic the seepage of ink from one side of a printed page to the other.
```python
# Import Genalog Degradations and other libraries
import genalog.degradation.effect as effect
import cv2
from matplotlib import pyplot as plt
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
# Add noise to the Image
bleed_through = effect.bleed_through(src, alpha=0.9) # the higher the alpha, the less visible the effect
plt.imshow(bleed_through, cmap="gray")
plt.title('bleed_through', fontsize=6)
plt.show()
```
```{image} static/bleed_through.png
:width: 60%
:align: center
```
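The role of `alpha` is easier to see in a small worked example. The sketch below assumes one plausible model of bleed-through, a weighted blend of the front page with a horizontally mirrored copy of the reverse side; `bleed_through_sketch` is a hypothetical helper on nested lists and is not claimed to match `effect.bleed_through` internally.

```python
def bleed_through_sketch(front, back, alpha=0.8):
    """Blend `front` with a horizontally mirrored `back`, weighting `front` by alpha."""
    blended = []
    for front_row, back_row in zip(front, back):
        mirrored = back_row[::-1]  # the reverse side is seen flipped left-to-right
        blended.append([
            round(alpha * f + (1 - alpha) * b)
            for f, b in zip(front_row, mirrored)
        ])
    return blended


front_page = [[255, 255, 0]]   # white, white, black ink
back_page = [[255, 255, 0]]    # ink on the right of the reverse side, left once mirrored
faint = bleed_through_sketch(front_page, back_page, alpha=0.9)
strong = bleed_through_sketch(front_page, back_page, alpha=0.5)
```

With `alpha=0.9` the first pixel stays close to white, while `alpha=0.5` darkens it much more, which agrees with the comment in the example above that a higher `alpha` makes the effect less visible.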
## Salt and Pepper noise
In this effect we randomly sprinkle "salt" (white pixels) and "pepper" (dark pixels) onto the original image to imitate ink degradation and page degradation.
```python
# Import Genalog Degradations and other libraries
import genalog.degradation.effect as effect
import cv2
from matplotlib import pyplot as plt
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
# Add noise to the Image
salted = effect.salt(src, amount=0.4) # amount is the percentage of pixels to be salted (whitened)
plt.imshow(salted, cmap="gray")
plt.title('Salted', fontsize=6)
plt.show()
```
```{image} static/salt_pepper.png
:width: 70%
:align: center
```
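As a rough illustration of what "salting" means, the sketch below whitens a fixed fraction of randomly chosen pixels in a nested-list image. `salt_sketch` is a hypothetical helper written for this page; how `effect.salt` selects pixels internally is not shown here and may differ.

```python
import random


def salt_sketch(img, amount=0.1, seed=None, white=255):
    """Set roughly `amount` of the pixels to white, chosen at random."""
    rng = random.Random(seed)
    height, width = len(img), len(img[0])
    n_salted = round(amount * height * width)
    coords = [(r, c) for r in range(height) for c in range(width)]
    out = [row[:] for row in img]  # copy so the input stays untouched
    for r, c in rng.sample(coords, n_salted):
        out[r][c] = white
    return out


dark_img = [[0] * 10 for _ in range(10)]
salted_img = salt_sketch(dark_img, amount=0.4, seed=0)
n_white = sum(px == 255 for row in salted_img for px in row)
```

"Pepper" follows the same pattern with a dark value instead of `white`.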
## Morphological Degradations
Morphological operations are structural degradations commonly applied to binary images. For more information, see this [overview of morphological operations](http://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm). By convention, these binary images show the subject, or foreground, in white on a black background. Our example image, however, has the subject in black on a white background, so each morphological degradation has the opposite effect to what its name suggests.
### Erode and Open
```python
# Import Genalog Degradations and other libraries
import genalog.degradation.effect as effect
import cv2
from matplotlib import pyplot as plt
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
# Add noise to the Image
kernel = effect.create_2D_kernel((5,5), kernel_type="ones")
erode = effect.erode(src, kernel)
opened = effect.open(src, kernel) # retains more of the foreground shape than erosion, given the same kernel (named `opened` to avoid shadowing the built-in `open`)
# display input and output image
fig = plt.figure(figsize=(6, 4), dpi=300)
fig.add_subplot(1,3,1)
plt.imshow(src, cmap="gray")
plt.title('src', fontsize=6)
fig.add_subplot(1,3,2)
plt.imshow(opened, cmap="gray")
plt.title('open', fontsize=6)
fig.add_subplot(1,3,3)
plt.imshow(erode, cmap="gray")
plt.title('erode', fontsize=6)
plt.show()
```
```{image} static/open_erode.png
:width: 80%
:align: center
```
Here, "opening" thickens the foreground structures (text) and joins neighboring character strokes together. From another perspective, we are "eroding" away the white background by expanding the foreground.
### Dilate and Close
```python
# Load in sample image
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
kernel = effect.create_2D_kernel((3,3), kernel_type="ones")
dilate = effect.dilate(src, kernel)
close = effect.close(src, kernel) # less destructive than dilation, given the same kernel
# display input and output image
fig = plt.figure(figsize=(6, 4), dpi=300)
fig.add_subplot(1,3,1)
plt.imshow(src, cmap="gray")
plt.title('src', fontsize=6)
fig.add_subplot(1,3,2)
plt.imshow(close, cmap="gray")
plt.title('close', fontsize=6)
fig.add_subplot(1,3,3)
plt.imshow(dilate, cmap="gray")
plt.title('dilate', fontsize=6)
plt.show()
```
```{image} static/close_dilate.png
:width: 80%
:align: center
```
We are "closing" or "dilating" the white background, thus chipping away the foreground structures (text). This effect can mimic the effect of degrading ink or a printer running out of ink.
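The inverted behavior on dark-on-white images can be checked on a tiny example. The sketch below treats dilation as a sliding-window maximum and erosion as a sliding-window minimum over a square window, which is the usual grayscale definition; `morph_sketch` is a hypothetical helper, and clipping the window at the image border is an assumption of this sketch.

```python
def morph_sketch(img, size, reduce_fn):
    """Apply `reduce_fn` (max = dilate, min = erode) over a size x size window."""
    height, width = len(img), len(img[0])
    half = size // 2
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            # Collect the neighborhood, clipped at the image border.
            window = [
                img[rr][cc]
                for rr in range(max(r - half, 0), min(r + half + 1, height))
                for cc in range(max(c - half, 0), min(c + half + 1, width))
            ]
            row.append(reduce_fn(window))
        out.append(row)
    return out


# One row of a white page (255) with a 3-pixel-wide black stroke (0).
page = [[255, 255, 0, 0, 0, 255, 255]]
dilated = morph_sketch(page, 3, max)  # white grows, so the stroke thins to 1 pixel
eroded = morph_sketch(page, 3, min)   # white shrinks, so the stroke widens to 5 pixels
```

Dilation expands whatever is bright, so on a white page it eats into the black text, and erosion does the reverse.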
### Kernel Size and Shape
An important element of the morphological degradation is the [structuring element](http://homepages.inf.ed.ac.uk/rbf/HIPR2/strctel.htm), or the kernel used. With proper size and shape of the kernel, one can extract interesting structures of the source image.
````{toggle}
```python
elliptical_kernel = effect.create_2D_kernel((4,4), kernel_type="ellipse")
vertical_kernel = effect.create_2D_kernel((5,1), kernel_type="ones")
horizontal_kernel = effect.create_2D_kernel((1,5), kernel_type="ones")
upper_tri_kernel = effect.create_2D_kernel((5,5), kernel_type="upper_triangle")
lower_tri_kernel = effect.create_2D_kernel((5,5), kernel_type="lower_triangle")
x_kernel = effect.create_2D_kernel((4,4), kernel_type="x")
plus_kernel = effect.create_2D_kernel((6,6), kernel_type="plus")
dilate_w_elliptical_k = effect.dilate(src, elliptical_kernel)
dilate_w_vertical_k = effect.dilate(src, vertical_kernel)
dilate_w_horizontal_k = effect.dilate(src, horizontal_kernel)
dilate_w_upper_tri_k = effect.dilate(src, upper_tri_kernel)
dilate_w_lower_tri_k = effect.dilate(src, lower_tri_kernel)
dilate_w_x_kernel = effect.dilate(src, x_kernel)
dilate_w_plus_kernel = effect.dilate(src, plus_kernel)
```
````
```{image} static/kernel_morph.png
:width: 80%
:align: center
```
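To give a feel for what a structuring element looks like, the sketch below builds a few binary kernels as nested lists. `make_kernel_sketch` is a hypothetical helper covering only three shapes; the exact layouts produced by `effect.create_2D_kernel` (for example for even sizes) may differ.

```python
def make_kernel_sketch(shape, kernel_type="ones"):
    """Build a binary structuring element as a nested list of 0s and 1s."""
    rows, cols = shape
    if kernel_type == "ones":
        return [[1] * cols for _ in range(rows)]
    if kernel_type == "plus":
        mid_r, mid_c = rows // 2, cols // 2
        # Middle row and middle column are set.
        return [[1 if r == mid_r or c == mid_c else 0 for c in range(cols)]
                for r in range(rows)]
    if kernel_type == "x":
        # Both diagonals are set; this branch assumes a square kernel.
        return [[1 if r == c or r + c == cols - 1 else 0 for c in range(cols)]
                for r in range(rows)]
    raise ValueError(f"unsupported kernel_type: {kernel_type}")


vertical = make_kernel_sketch((3, 1), "ones")
plus = make_kernel_sketch((3, 3), "plus")
cross = make_kernel_sketch((3, 3), "x")
```

A tall, narrow kernel such as `vertical` only looks at pixels above and below, so it affects horizontal strokes far more than vertical ones.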


@ -0,0 +1,139 @@
(document-generation)=
# Create a document
`genalog` allows you to generate synthetic documents from **any** given text.
To generate the synthetic documents, there are two important concepts to be familiar with:
1. `Template` - controls the layout of the document (e.g. font, language, position of the content, etc.)
2. `Content` - items used to fill the template (e.g. text, images, tables, lists, etc.)
We use an HTML templating engine [(Jinja2)](https://jinja.palletsprojects.com/en/3.0.x/) to build our HTML templates, and an HTML-to-PDF converter [(Weasyprint)](https://weasyprint.readthedocs.io/en/latest/) to print the HTML as a PDF or an image.
We provide **three** standard templates with different document layouts:
````{tab} columns.html.jinja
```{figure} static/columns_Times_11px.png
:width: 30%
```
````
````{tab} letter.html.jinja
```{figure} static/letter_Times_11px.png
:width: 30%
```
````
````{tab} text_block.html.jinja
```{figure} static/text_block_Times_11px.png
:width: 30%
```
````
You can find the source code of these templates in path [`genalog/generation/templates`](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates).
## Document Content
The goal is to be able to generate synthetic documents from ANY text input. Here we load a sample file from our repo, but you may use any text you like.
```python
import requests
sample_text_url = "https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt"
r = requests.get(sample_text_url, allow_redirects=True)
text = r.content.decode("ascii")
```
### Initialize `CompositeContent`
To properly initialize the content that populates a document template, we need to create a `CompositeContent` object.
```python
from genalog.generation.content import CompositeContent, ContentType
# Initialize CompositeContent Object
paragraphs = text.split('\n\n') # split paragraphs by `\n\n`
content_types = [ContentType.PARAGRAPH] * len(paragraphs)
content = CompositeContent(paragraphs, content_types)
```
A `CompositeContent` is a list of pairs, each made of a body of text and its `ContentType`. Here we declare a list of multiple `ContentType.PARAGRAPH`s.
```{note}
`ContentType` is an enumeration dictating the supported content types (e.g. `ContentType.PARAGRAPH`, `ContentType.TITLE`, `ContentType.COMPOSITE`). This enumeration controls the collection of CSS styles to be applied to the associated content. If you change to `ContentType.TITLE`, for example, the paragraph will inherit the style of a title section (bolded text, enlarged font size, etc.).
```
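The pairing of text bodies with content types can be pictured with plain Python. The sketch below is an illustration only: `ContentTypeSketch` is a hypothetical stand-in for `ContentType`, and the sample text is made up for this example.

```python
from enum import Enum, auto


class ContentTypeSketch(Enum):
    """Hypothetical stand-in for `ContentType`."""
    TITLE = auto()
    PARAGRAPH = auto()


sample = "A Short Title\n\nFirst paragraph.\n\nSecond paragraph."
bodies = sample.split("\n\n")
# Treat the first block as a title and the rest as paragraphs.
types = [ContentTypeSketch.TITLE] + [ContentTypeSketch.PARAGRAPH] * (len(bodies) - 1)
pairs = list(zip(bodies, types))
```

The two lists must have the same length, with one content type per body of text, which is why the example above builds `content_types` from `len(paragraphs)`.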
### Populate Content Into a Template
Once we have initialized a `CompositeContent` object, we can populate the content into any standard template via the `DocumentGenerator` class.
```python
from genalog.generation.document import DocumentGenerator
default_generator = DocumentGenerator()
print(f"Available default templates: {default_generator.template_list}")
print(f"Default styles to generate: {default_generator.styles_to_generate}")
```
The `DocumentGenerator` has default styles. The above code snippet will show the default configurations and the names of the 3 standard templates. You will use the information to select the template you want to generate. The three templates are `["columns.html.jinja", "letter.html.jinja", "text_block.html.jinja"]`
```python
# Select specific template, content and create the generator
doc_gen = default_generator.create_generator(content, ["columns.html.jinja", "letter.html.jinja", "text_block.html.jinja"])
# we will use the `CompositeContent` object initialized from above cell
# python generator
for doc in doc_gen:
    template_name = doc.template.name.replace(".html.jinja", "")
    doc.render_png(target=f"example_{template_name}.png", resolution=300) # in dots per inch
```
You can also retrieve the raw image bytes by omitting the `target` argument.
```python
from genalog.generation.document import DocumentGenerator
from IPython.display import Image, display
doc_gen = default_generator.create_generator(content, ['text_block.html.jinja'])
for doc in doc_gen:
    image_byte = doc.render_png(resolution=100)
    display(Image(image_byte))
```
Alternatively, you can save the document as a PDF file.
```python
# Select specific template, content and create the generator
doc_gen = default_generator.create_generator(content, ['text_block.html.jinja'])
# we will use the `CompositeContent` object initialized from above cell
# python generator
for doc in doc_gen:
    doc.render_pdf(target="example_text_block.pdf")
```
### Changing Document Styles
You can alter the document styles, including font family, font size, hyphenation, and text alignment. These style properties mirror their CSS counterparts, so you can use standard CSS values for the following properties.
```python
from genalog.generation.document import DocumentGenerator
from IPython.display import Image, display
# You can add as many options as you like. A new document will be generated per combination of the styles
new_style_combinations = {
"hyphenate": [True],
"font_size": ["11px", "12px"], # most CSS units are supported `px`, `cm`, `em`, etc...
"font_family": ["Times"],
"text_align": ["justify"]
}
default_generator = DocumentGenerator()
default_generator.set_styles_to_generate(new_style_combinations)
# Examine the list of all style combinations to generate
print(f"Styles to generate: {default_generator.styles_to_generate}")
# `content` is the `CompositeContent` object initialized earlier
doc_gen = default_generator.create_generator(content, ["columns.html.jinja", "letter.html.jinja"])
for doc in doc_gen:
    print(doc.styles)
    print(doc.template.name)
    image_byte = doc.render_png(resolution=300)
    display(Image(image_byte))
```
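"One document per combination of the styles" can be read as a Cartesian product over the listed values. The sketch below shows that reading with `itertools.product`; `expand_styles_sketch` is a hypothetical helper, and whether `DocumentGenerator` expands styles in exactly this order is an assumption.

```python
from itertools import product


def expand_styles_sketch(style_combinations):
    """Return one style dict per combination of the listed values."""
    keys = list(style_combinations)
    value_lists = (style_combinations[key] for key in keys)
    return [dict(zip(keys, values)) for values in product(*value_lists)]


styles = expand_styles_sketch({
    "hyphenate": [True],
    "font_size": ["11px", "12px"],
    "font_family": ["Times"],
    "text_align": ["justify"],
})
```

Under this reading, the dictionary above yields two styles, one per font size, and rendering them with two templates would produce four documents.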


@ -0,0 +1,15 @@
genalog.degradation
====================
Image Degrader
-----------------------------------
.. automodule:: genalog.degradation.degrader
:members:
Degradation Effects
---------------------------------
.. automodule:: genalog.degradation.effect
:members:
:show-inheritance:


@ -1,15 +1,11 @@
genalog.generation package
genalog.generation
==========================
Submodules
----------
genalog.generation.content module
---------------------------------
.. automodule:: genalog.generation.content
:members:
:undoc-members:
:show-inheritance:
genalog.generation.document module
@ -17,13 +13,4 @@ genalog.generation.document module
.. automodule:: genalog.generation.document
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: genalog.generation
:members:
:undoc-members:
:show-inheritance:


@ -1,53 +1,33 @@
genalog.ocr package
genalog.ocr
===================
Submodules
----------
genalog.ocr.blob\_client module
-------------------------------
.. automodule:: genalog.ocr.blob_client
:members:
:undoc-members:
:show-inheritance:
genalog.ocr.common module
-------------------------
.. automodule:: genalog.ocr.common
:members:
:undoc-members:
:show-inheritance:
genalog.ocr.grok module
-----------------------
.. automodule:: genalog.ocr.grok
:members:
:undoc-members:
:show-inheritance:
genalog.ocr.metrics module
--------------------------
.. automodule:: genalog.ocr.metrics
:members:
:undoc-members:
:show-inheritance:
genalog.ocr.rest\_client module
-------------------------------
.. automodule:: genalog.ocr.rest_client
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
genalog.ocr.blob\_client module
-------------------------------
.. automodule:: genalog.ocr
.. automodule:: genalog.ocr.blob_client
:members:
:undoc-members:
:show-inheritance:


@ -1,69 +1,47 @@
genalog.text package
genalog.text
====================
Submodules
----------
genalog.text.alignment module
-----------------------------
.. automodule:: genalog.text.alignment
:members:
:undoc-members:
:show-inheritance:
genalog.text.anchor module
--------------------------
.. automodule:: genalog.text.anchor
:members:
:undoc-members:
:show-inheritance:
genalog.text.conll\_format module
---------------------------------
.. automodule:: genalog.text.conll_format
:members:
:undoc-members:
:show-inheritance:
genalog.text.lcs module
-----------------------
.. automodule:: genalog.text.lcs
:members:
:undoc-members:
:show-inheritance:
genalog.text.ner\_label module
------------------------------
.. automodule:: genalog.text.ner_label
:members:
:undoc-members:
:show-inheritance:
:private-members: _propagate_label_to_ocr
genalog.text.preprocess module
------------------------------
.. automodule:: genalog.text.preprocess
:members:
:undoc-members:
:show-inheritance:
genalog.text.splitter module
----------------------------
.. automodule:: genalog.text.splitter
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: genalog.text
:members:
:undoc-members:
:show-inheritance:


@ -0,0 +1,12 @@
# OCR-NER Dataset Generation
```{image} static/labeled_synthetic_pipeline.png
:width: 80%
:align: center
```
If you were brought here by our paper [insert link here], you may be interested in the data preparation pipeline built with `genalog`. The figure above shows the steps involved in transforming a Named-Entity Recognition (NER) dataset like [CoNLL 2003](https://deepai.org/dataset/conll-2003-english) into one with synthetic Optical Character Recognition (OCR) errors. This OCR-NER dataset is useful for training an NER model that is robust to common OCR mistakes. You can find the full dataset preparation pipeline in this [notebook](https://github.com/microsoft/genalog/blob/main/example/dataset_generation.ipynb) from our repo.
We believe this methodology of inducing OCR errors onto the dataset can be applied to other NLP tasks to improve model performance against inherent noise from OCR outputs. We welcome the community to contribute if this fits your use cases.


@ -0,0 +1,367 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python3",
"display_name": "Python 3.6.9 64-bit ('.env': venv)"
},
"metadata": {
"interpreter": {
"hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
}
},
"interpreter": {
"hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"# Generate your synthetic document\n",
"\n",
"\n",
"```{figure} static/analog_doc_gen_pipeline.png\n",
":width: 500px\n",
"```\n",
"\n",
"Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatically generate degraded documents from a body of text."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from genalog.pipeline import AnalogDocumentGeneration\n"
]
},
{
"source": [
"## Configurations\n",
"\n",
"To use the pipeline, you will need to supply the following information:\n",
"\n",
"### CSS Style Combinations\n",
"\n",
"`STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generated per style combination)\n",
"\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"STYLE_COMBINATIONS = {\n",
" \"language\": [\"en_US\"],\n",
" \"font_family\": [\"Segoe UI\"],\n",
" \"font_size\": [\"12px\"],\n",
" \"text_align\": [\"justify\"],\n",
" \"hyphenate\": [True],\n",
"}"
]
},
{
"source": [
"```{note}\n",
"Genalog depends on Weasyprint as the engine to render these CSS styles. Most of these fields are standard CSS properties and accept common values as specified in [W3C CSS Properties](https://www.w3.org/Style/CSS/all-properties.en.html). For details, please see [Weasyprint Documentation](https://weasyprint.readthedocs.io/en/stable/features.html#fonts).\n",
"```"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"### Choose a Prebuilt HTML Template\n",
"\n",
"`HTML_TEMPLATE`: name of html template used to generate the synthetic images. The `genalog` package has the following default templates: \n",
"\n",
"````{tab} columns.html.jinja\n",
"```{figure} static/columns_Times_11px.png\n",
":width: 30%\n",
"Document template with 2 columns \n",
"```\n",
"````\n",
"````{tab} letter.html.jinja\n",
"```{figure} static/letter_Times_11px.png\n",
":width: 30%\n",
"Letter-like document template\n",
"```\n",
"````\n",
"````{tab} text_block.html.jinja\n",
"```{figure} static/text_block_Times_11px.png\n",
":width: 30%\n",
"Simple text block template\n",
"```\n",
"````"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"HTML_TEMPLATE = \"text_block.html.jinja\"\n"
]
},
{
"source": [
"### Image Degradations\n",
"\n",
"`DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"````{tab} bleed_through\n",
"```{figure} static/bleed_through.png\n",
":name: Bleed-through\n",
":width: 90%\n",
"Mimics a document printed on two sides. Valid values: [0,1].\n",
"```\n",
"````\n",
"````{tab} blur\n",
"```{figure} static/blur.png\n",
":name: Blur\n",
":width: 90%\n",
"Lowers image quality. Units are in pixels.\n",
"```\n",
"````\n",
"````{tab} salt/pepper\n",
"```{figure} static/salt_pepper.png\n",
":name: Salt/Pepper\n",
":width: 65%\n",
"Mimics ink degradation. Valid values: [0, 1].\n",
"```\n",
"````\n",
"`````{tab} close/dilate\n",
"```{figure} static/close_dilate.png\n",
":name: Close/Dilate\n",
"Degrades printing quality.\n",
"```\n",
"````{margin}\n",
"```{note}\n",
"For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
"```\n",
"````\n",
"`````\n",
"`````{tab} open/erode\n",
"```{figure} static/open_erode.png\n",
":name: Open/Errode\n",
"Ink overflows\n",
"```\n",
"````{margin}\n",
"```{note}\n",
"For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
"```\n",
"````\n",
"`````"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from genalog.degradation.degrader import ImageState\n",
"\n",
"DEGRADATIONS = [\n",
" (\"blur\", {\"radius\": 5}),\n",
" (\"bleed_through\", {\n",
" \"src\": ImageState.CURRENT_STATE,\n",
" \"background\": ImageState.ORIGINAL_STATE,\n",
" \"alpha\": 0.8,\n",
" \"offset_x\": -6,\n",
" \"offset_y\": -12,\n",
" }),\n",
" (\"morphology\", {\"operation\": \"open\", \"kernel_shape\":(9,9), \"kernel_type\":\"plus\"}),\n",
" (\"pepper\", {\"amount\": 0.005}),\n",
" (\"salt\", {\"amount\": 0.15}),\n",
"]"
]
},
{
"source": [
"```{note}\n",
"`ImageState.ORIGINAL_STATE` refers to the origin state of the image before applying any degradation, while\n",
"`ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.\n",
"```\n",
"\n",
"The example above will apply degradation effects to synthetic images in the sequence of: \n",
" \n",
" blur -> bleed_through -> morphological operation (open) -> pepper -> salt\n",
" \n",
"For the full list of supported degradation effects, please see [documentation on degradation](https://github.com/microsoft/genalog/blob/main/genalog/degradation/README.md)."
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"We use `Jinja` to prepare html templates. You can find example of these Jinja templates in [our source code](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates)."
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Document Generation\n",
"\n",
"With the above configurations, we can go ahead and start generate synthetic document.\n",
"\n",
"### Load Sample Text content\n",
"\n",
"You can use **any** text documents as the content of the generated images. For the sake of the tutorial, you can use the [sample text](https://github.com/microsoft/genalog/blob/main/example/sample/generation/example.txt) from our repo."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"sample_text_url = \"https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt\"\n",
"sample_text = \"example.txt\"\n",
"\n",
"r = requests.get(sample_text_url, allow_redirects=True)\n",
"open(sample_text, 'wb').write(r.content)\n"
]
},
{
"source": [
"### Generate Synthetic Documents\n",
"\n",
"Next, we can supply the three aforementioned configurations in initalizing `AnalogDocumentGeneration` object"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from genalog.pipeline import AnalogDocumentGeneration\n",
"\n",
"IMG_RESOLUTION = 300 # dots per inch (dpi) of the generated pdf/image\n",
"\n",
"doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION, template_path=None)"
]
},
{
"source": [
"To use custom templates, please set `template_path` to the folder of containing them. You can find more information from our [`document_generation.ipynb`](https://github.com/microsoft/genalog/blob/main/example/document_generation.ipynb).\n",
"\n",
"Once initialized, you can call `generate_img()` method to get the synthetic documents as images"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# for custom templates, please set template_path.\n",
"img_array = doc_generation.generate_img(sample_text, HTML_TEMPLATE, target_folder=None) # returns the raw image bytes if target_folder is not specified"
]
},
{
"source": [
"```{note}\n",
"Setting `target_folder` to `None` will return the raw image bytes as a `Numpy.ndarray`. Otherwise the generated image will be save on the disk as a PNG file in the specified path.\n",
"```"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"### Display the Document"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"from IPython.core.display import Image, display\n",
"\n",
"_, encoded_image = cv2.imencode('.png', img_array)\n",
"display(Image(data=encoded_image, width=600))"
]
},
{
"source": [
"## Document Generation (Multi-process)\n",
"\n",
"To scale up the generation across multiple text files, you can use `generate_dataset_multiprocess`. The method will split the list of text filenames into batches and run document generation across different batches as subprocesses in parallel"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from genalog.pipeline import generate_dataset_multiprocess\n",
"\n",
"DST_PATH = \"data\" # where on disk to write the generated image\n",
"\n",
"generate_dataset_multiprocess(\n",
" [sample_text], DST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, \n",
" resolution=IMG_RESOLUTION, batch_size=5\n",
")"
]
},
{
"source": [
"```{note}\n",
"`[sample_text]` is a list of filenames to generate the synthetic dataset over.\n",
"```"
],
"cell_type": "markdown",
"metadata": {}
}
]
}
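
The `DEGRADATIONS` list in the notebook above is a sequence of `(name, kwargs)` tuples applied in order. The following is a minimal sketch of how such a list could be dispatched, assuming toy stand-in effects on a grayscale image stored as nested lists. The names `TOY_EFFECTS`, `apply_degradations` and `_set_random_pixels` are hypothetical and are not part of `genalog`; the real effects live in `genalog.degradation.effect` and operate on image arrays.

```python
import random


def _set_random_pixels(img, amount, value, rng):
    """Return a copy of `img` with a fraction `amount` of pixels set to `value`."""
    height, width = len(img), len(img[0])
    n_pixels = round(amount * height * width)
    out = [row[:] for row in img]  # copy so the input image is left untouched
    for idx in rng.sample(range(height * width), n_pixels):
        out[idx // width][idx % width] = value
    return out


def salt(img, amount, rng):
    """Toy 'salt' effect: turn random pixels white (255)."""
    return _set_random_pixels(img, amount, 255, rng)


def pepper(img, amount, rng):
    """Toy 'pepper' effect: turn random pixels black (0)."""
    return _set_random_pixels(img, amount, 0, rng)


# Hypothetical registry mapping effect names to functions
TOY_EFFECTS = {"salt": salt, "pepper": pepper}


def apply_degradations(img, degradations, seed=0):
    """Apply each (name, kwargs) tuple in order, feeding the result forward."""
    rng = random.Random(seed)
    for name, kwargs in degradations:
        img = TOY_EFFECTS[name](img, rng=rng, **kwargs)
    return img


gray = [[128] * 10 for _ in range(10)]  # 10x10 mid-gray image
noisy = apply_degradations(
    gray, [("pepper", {"amount": 0.05}), ("salt", {"amount": 0.10})]
)
print(sum(row.count(255) for row in noisy), "white pixels")
```

Because each effect receives the output of the previous one, the order of the tuples matters: here the salt step may overwrite pixels that the pepper step darkened.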

View file

@ -0,0 +1,93 @@
# Synthetic Document Generator
[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
````{margin}
```sh
pip install genalog
```
````
`genalog` is an open source, cross-platform Python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you can create in simple HTML format.
`genalog` provides several document templates as a start. You can alter the document layout using standard CSS properties like `font-family`, `font-size`, `text-align`, etc. Here are some of the example generated documents:
````{tab} Multi-Column
```{figure} static/columns_Times_11px.png
:width: 60%
:name: two-columns-index
Document template with 2 columns
```
````
````{tab} Letter-like
```{figure} static/letter_Times_11px.png
:width: 60%
:name: letter-like-index
Letter-like document template
```
````
````{tab} Simple Text Block
```{figure} static/text_block_Times_11px.png
:width: 60%
:name: text-block-index
Simple text block template
```
````
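
The CSS properties mentioned above can be varied together to produce several document styles from one template. The sketch below assumes each property maps to a list of candidate values (the shape suggested by the `STYLE_COMBINATIONS` argument used in the generation notebook) and expands them into every combination with `itertools.product`. The helper name `expand_styles` and the sample values are illustrative assumptions, not `genalog`'s API.

```python
from itertools import product

# Assumed shape: one list of candidate values per CSS-like property
style_combinations = {
    "font_family": ["Times", "Segoe UI"],
    "font_size": ["11px", "12px"],
    "text_align": ["justify"],
}


def expand_styles(combinations):
    """Return one dict per combination of the candidate values."""
    keys = list(combinations)
    return [
        dict(zip(keys, values)) for values in product(*combinations.values())
    ]


styles = expand_styles(style_combinations)
for style in styles:
    print(style)
```

With two fonts, two sizes and one alignment, this yields four style dictionaries, one per generated document variant.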
Once a document is generated, you can combine various image degradation effects and apply them to the synthetic documents. Here are some of the degradation effects:
````{tab} Bleed-through
```{figure} static/bleed_through.png
:name: bleed-through-index
:width: 80%
Mimics a document printed on two sides
```
````
````{tab} Blur
```{figure} static/blur.png
:name: blur-index
:width: 80%
Lowers image quality
```
````
````{tab} Salt/Pepper
```{figure} static/salt_pepper.png
:name: salt/pepper-index
:width: 50%
Mimics ink degradation
```
````
`````{tab} Close/Dilate
```{figure} static/close_dilate.png
:name: close-dilate-index
:width: 90%
Degrades printing quality
```
````{margin}
```{note}
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
```
````
`````
`````{tab} Open/Erode
```{figure} static/open_erode.png
:name: open-erode-index
:width: 90%
Ink overflows
```
````{margin}
```{note}
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
```
````
`````
````{tab} Combined Effects
```{figure} static/degrader.png
:width: 40%
:name: combined-effects-index
Combining various degradation effects: blur, salt, open, and bleed-through
```
````
In addition to document generation and degradation, `genalog` also provides an efficient implementation for [text alignment](text-alignment-page) between the source and noisy text.

View file

@ -0,0 +1,28 @@
# Installation
Genalog is supported across Windows, Mac and Linux on Python 3.6+. However, there are *additional* installation steps for Windows and Mac users.
````{tab} pip
```sh
pip install genalog
```
````
````{tab} source
```sh
git clone https://github.com/microsoft/genalog.git && cd genalog && pip install -e .
```
````
## Extra Steps for Windows & Mac Users
We have a dependency on [`Weasyprint`](https://weasyprint.readthedocs.io/en/stable/install.html) for image generation, which in turn has non-python dependencies including `Pango`, `cairo` and `GDK-PixBuf` that need to be installed separately.
So far, `Pango`, `cairo` and `GDK-PixBuf` libraries are available in `Ubuntu-18.04` and later by default.
If you are running on Windows, MacOS, or other Linux distributions, please see [installation instructions from WeasyPrint](https://weasyprint.readthedocs.io/en/stable/install.html).
```{note}
If you encounter errors like `no library called "libcairo-2" was found`, the three extra dependencies above are probably missing.
```
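
To check up front whether the native dependencies are visible to Python, a quick probe with the standard library's `ctypes.util.find_library` can help. This is only a sketch: the library names in `NATIVE_LIBS` are assumptions about how `cairo`, `Pango` and `GDK-PixBuf` are commonly named on Linux, and the names (and the lookup behavior) can differ on Windows and macOS.

```python
from ctypes.util import find_library

# Assumed library names; adjust for your platform if needed
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0")


def check_native_libs(names=NATIVE_LIBS):
    """Map each library name to True if the system linker can locate it."""
    return {name: find_library(name) is not None for name in names}


status = check_native_libs()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry is a hint rather than proof, since `find_library` relies on platform-specific search tools; the WeasyPrint installation instructions remain the authoritative reference.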

View file

@ -0,0 +1,211 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(label-propagation-page)=\n",
"# Propagation of NER labels\n",
"\n",
"In the context of Named Entity Recognition (NER), typical datasets contain the text tokens and the NER labels for each of the tokens. For example:\n",
"\n",
"````{margin}\n",
"```{note}\n",
"`B-P` is short for \"Beginning-Place\"\n",
"and `I-P` is short for \"Inside-Place\"\n",
"whereas `O` means \"Other\".\n",
"See [IOB Tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for more details\n",
"```\n",
"````\n",
" NER Labels: B-P I-P O O\n",
" Text: New York is big\n"
]
},
{
"source": [
"Now, imagine we have obtained a noisy version of the grouth truth text through the OCR process, for example. The problem becomes: how can we label the noisy tokens?\n",
"\n",
"\n",
" NER Labels: B-P I-P O O\n",
" GT Text: New York is big\n",
" Noisy Text: New Yo rkis big\n",
" NER Labels: ? ? ? ?\n",
"\n",
"We can utilize text alignment and **propagate** the NER labels onto the noisy tokens. We will demonstrate how in the rest of this document.\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Tokenization\n",
"\n",
"To ensure consistent interpretation of the text alignment results, we need to first tokenize the grouth truth and the OCR'ed (nosiy) text."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from genalog.text import ner_label\n",
"from genalog.text import preprocess\n",
"\n",
"gt_txt = \"New York is big\"\n",
"ocr_txt = \"New Yo rkis big\"\n",
"\n",
"# Input to the method\n",
"gt_labels = [\"B-P\", \"I-P\", \"O\", \"O\"]\n",
"gt_tokens = preprocess.tokenize(gt_txt) # tokenize into list of tokens\n",
"ocr_tokens = preprocess.tokenize(ocr_txt)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['B-P', 'I-P', 'O', 'O']\n",
"['New', 'York', 'is', 'big']\n",
"['New', 'Yo', 'rkis', 'big']\n"
]
}
],
"source": [
"# Inputs to the method\n",
"print(gt_labels)\n",
"print(gt_tokens)\n",
"print(ocr_tokens)"
]
},
{
"source": [
"## Label Propagation\n",
"\n",
"We then can run label propagation to obtain the NER labels for the OCR'ed (noisy) tokens."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Method returns a tuple of 4 elements (gt_tokens, gt_labels, ocr_tokens, ocr_labels, gap_char)\n",
"ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OCR labels: ['B-P', 'I-P', 'I-P', 'O']\n",
"Aligned ground truth: New Yo@rk is big\n",
"Alinged OCR text: New Yo rk@is big\n"
]
}
],
"source": [
"# Outputs\n",
"print(f\"OCR labels: {ocr_labels}\")\n",
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
"print(f\"Alinged OCR text: {aligned_ocr}\")"
]
},
{
"source": [
"## Display Result After Propagation"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"B-P I-P O O \n",
"New York is big \n",
"New Yo@rk is big\n",
"||||||.||.||||||\n",
"New Yo rk@is big\n",
"New Yo rkis big \n",
"B-P I-P I-P O \n",
"\n"
]
}
],
"source": [
"print(ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr))"
]
},
{
"source": [
"## Final Results\n",
"\n",
"Formatting the OCR tokens and their NER labels."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"B-P I-P I-P O \n",
"New Yo rkis big \n",
"\n"
]
}
],
"source": [
"# Format tokens and labels\n",
"print(ner_label.format_labels(ocr_tokens, ocr_labels))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
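
To make the propagation step in the notebook above more concrete, here is a deliberately simplified sketch. It assumes we already have a token mapping from each OCR token to the ground truth tokens it overlaps (hand-written here, in the shape returned by `alignment.parse_alignment`), and it uses one simple rule: an OCR token inherits the label of the first ground truth token it maps to, and a `B-` label becomes `I-` when the previous OCR token started from the same ground truth token. The function `propagate_labels` is hypothetical and is not `genalog`'s actual algorithm, which handles more cases.

```python
def propagate_labels(gt_labels, ocr_to_gt, default="O"):
    """Assign each OCR token the label of the first gt token it maps to."""
    ocr_labels = []
    prev_gt = None  # first gt index used by the previous OCR token
    for gt_indices in ocr_to_gt:
        if not gt_indices:
            # OCR token with no counterpart in the ground truth
            ocr_labels.append(default)
            prev_gt = None
            continue
        first_gt = gt_indices[0]
        label = gt_labels[first_gt]
        # A gt token split across OCR tokens: only the first piece keeps "B-"
        if label.startswith("B-") and first_gt == prev_gt:
            label = "I-" + label[2:]
        ocr_labels.append(label)
        prev_gt = first_gt
    return ocr_labels


gt_labels = ["B-P", "I-P", "O", "O"]   # New York is big
ocr_to_gt = [[0], [1], [1, 2], [3]]    # New Yo rkis big (assumed mapping)
ocr_labels = propagate_labels(gt_labels, ocr_to_gt)
print(ocr_labels)
```

For this toy input the rule gives the same labels as the notebook output, but that agreement should not be expected for every input.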

Binary data
docs/genalog_docs/static/analog_doc_gen_pipeline.png Normal file

Binary file not shown.

Size: 40 KiB

Binary data
docs/genalog_docs/static/bleed_through.png Normal file

Binary file not shown.

Size: 200 KiB

Binary data
docs/genalog_docs/static/blur.png Normal file

Binary file not shown.

Size: 190 KiB

Binary data
docs/genalog_docs/static/close_dilate.png Normal file

Binary file not shown.

Size: 242 KiB

Binary data
docs/genalog_docs/static/columns_Times_11px.png Normal file

Binary file not shown.

Size: 196 KiB

Binary data
docs/genalog_docs/static/degrader.png Normal file

Binary file not shown.

Size: 92 KiB

Binary data
docs/genalog_docs/static/degrader_heavy.png Normal file

Binary file not shown.

Size: 104 KiB

Binary data
docs/genalog_docs/static/kernel_morph.png Normal file

Binary file not shown.

Size: 1.1 MiB

Binary data
docs/genalog_docs/static/labeled_synthetic_pipeline.png Normal file

Binary file not shown.

Size: 54 KiB

Binary data
docs/genalog_docs/static/letter_Times_11px.png Normal file

Binary file not shown.

Size: 157 KiB

Binary data
docs/genalog_docs/static/open_erode.png Normal file

Binary file not shown.

Size: 259 KiB

Binary data
docs/genalog_docs/static/salt_pepper.png Normal file

Binary file not shown.

Size: 512 KiB

Binary data
docs/genalog_docs/static/text_block_Times_11px.png Normal file

Binary file not shown.

Size: 142 KiB

View file

@ -0,0 +1,237 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(text-alignment-page)=\n",
"# Text alignment\n",
"\n",
"````{margin}\n",
"```{note}\n",
"There are many OCR engines you can use to work with `genalog`, including [Azure Cognitve Services](https://docs.microsoft.com/en-us/python/api/overview/azure/cognitiveservices-vision-computervision-readme?view=azure-python) and [Tesseract](https://github.com/tesseract-ocr/tesseract).\n",
"```\n",
"````\n",
"\n",
"`genalog` provides text alignment capabilities. This is most useful in the following situations after you have ran Opitcal Character Recognition (OCR) on the synthetic documents:\n",
"\n",
"- Text alignment between noisy (OCR result) and grouth truth text\n",
"- NER label propagation using text alignment results (we will cover this in the next page)\n",
"\n",
"`genalog` provides two methods of alignment:\n",
"1. `genalog.text.anchor.align_w_anchor()`\n",
"1. `genalog.text.alignment.align()`\n",
"\n",
"`align_w_anchor()` implements the Recursive Text Alignment Scheme (RETAS) from the paper [A Fast Alignment Scheme for Automatic OCR Evaluation of Books](https://ieeexplore.ieee.org/abstract/document/6065412) and works best on longer text strings, while `align()` implement the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) and works best on shorter strings. \n",
"\n",
"We recommend using the `align_w_anchor()` method on inputs longer than **200 characters**. Both methods share the same function contract and are interchangeable. \n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"gt_txt = \"New York is big\"\n",
"noise_txt = \"New Yo rkis\""
]
},
{
"source": [
"## RETAS Method\n",
"\n",
"This is our implementation of The Recursive Text Alignment Scheme (RETAS) from the paper [A Fast Alignment Scheme for Automatic OCR Evaluation of Books](https://ieeexplore.ieee.org/abstract/document/6065412), as the original paper did not release the algorithm written in Python.\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aligned ground truth: New Yo@rk is big\n",
"Aligned noise: New Yo rk@is@@@@\n"
]
}
],
"source": [
"from genalog.text import anchor\n",
"\n",
"# Extra whitespaces are removed\n",
"aligned_gt, aligned_noise = anchor.align_w_anchor(gt_txt, noise_txt)\n",
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
"print(f\"Aligned noise: {aligned_noise}\")"
]
},
{
"source": [
"```{hint}\n",
"`@` is the default gap character inserted by the alignment algorithm, you can change the gap character by providing the keyword-argument `anchor.align_w_anchor(gt_txt, noise_txt, gap_char=<NEW_CHAR>)`\n",
"```"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Needleman-Wunsch Algorithm\n",
"\n",
"We use [Biopython](https://biopython.org/)'s implementation of the Needleman-Wunsch algorithm for text alignment.\n",
"This algorithm is an exhaustive search for all possible candidates with dynamic programming. \n",
"It produces weighted score for each candidate and returns those having the highest score. \n",
"(**NOTE** that multiple candidates can share the same score)"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aligned ground truth: New Yo@rk is big\n",
"Aligned noise: New Yo rk@is@@@@\n"
]
}
],
"source": [
"# Needleman-Wunsch alignment ONLY\n",
"from genalog.text import alignment\n",
"\n",
"aligned_gt, aligned_noise = alignment.align(gt_txt, noise_txt)\n",
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
"print(f\"Aligned noise: {aligned_noise}\")"
]
},
{
"source": [
"### Advanced Algorithm Configurations\n",
"\n",
"The Needleman-Wunsch Algorithm algorithm has 4 hyperparameters for tuning candidate scores:\n",
"1. **Match Reward** - how much the algorithm rewards matching characters\n",
"1. **Mismatch Penalty** - how much the algorithm penalizes mismatching characters\n",
"1. **Gap Penalty** - how much the algorithm penalizes for creating a gap with a GAP_CHAR (defaults to '@')\n",
"1. **Gap Extension Penalty** - how much the algorithm penalizes for extending a gap (ex \"@@@@\")\n",
"\n",
"You can find the default values for these four parameters as a constant in the package:\n",
"1. `genalog.text.alignment.MATCH_REWARD`\n",
"1. `genalog.text.alignment.MISMATCH_PENALTY`\n",
"1. `genalog.text.alignment.GAP_PENALTY`\n",
"1. `genalog.text.alignment.GAP_EXT_PENALTY`"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Interpret the Alignment Results\n",
"\n",
"`genalog` provide additional functionality to interpret the alignment results and produce a relational mapping between the tokens in the noisy and grouth truth text."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gt_to_noise: [[0], [1, 2], [2], []]\n",
"noise_to_gt: [[0], [1], [1, 2], []]\n"
]
}
],
"source": [
"from genalog.text import alignment\n",
"\n",
"# Process the aligned strings to find out how the tokens are related\n",
"gt_to_noise_mapping, noise_to_gt_mapping = alignment.parse_alignment(aligned_gt, aligned_noise, gap_char=\"@\")\n",
"print(f\"gt_to_noise: {gt_to_noise_mapping}\")\n",
"print(f\"noise_to_gt: {noise_to_gt_mapping}\")"
]
},
{
"source": [
"Recall that the ground truth is `New York is big` while the noisy text is `New Yo rkis`.\n",
"\n",
"`gt_to_noise: [[0], [1, 2], [2], []]` can be interpreted as: \"the **0th** gt token (`New`) maps to the **0th** noisy token (`New`), the **1st** gt token (`York`) maps to the **1st and 2nd** nosity tokens (`Yo` and `rkis`), the **2nd** token (`is`) maps to the **2nd** noisy token (`rkis`), and finally, the last gt token (`big`) cannot be mapped to any noisy token.\"\n",
"\n",
"And the vice versa for `noise_to_gt: [[0], [1], [1, 2], []]`\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## Formatting Alignment Results\n",
"\n",
"You can use `genalog.alignment._format_alignment()` for better visual understanding of the alignment results"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New Yo@rk is @ big@\n",
"||||||.||.||||||||.\n",
"New Yo rk@is @ big \n",
"\n"
]
}
],
"source": [
"# Format aligned string for better display\n",
"print(alignment._format_alignment(aligned_gt, aligned_noise))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
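
For readers who want to see the dynamic programming behind the Needleman-Wunsch section of the notebook above, here is a compact pure-Python sketch. It is not the Biopython implementation that `genalog` uses: it assumes a single linear gap penalty (no separate gap extension penalty), returns only one optimal alignment, and the default scores are arbitrary illustrative values rather than `genalog`'s constants.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="@"):
    """Globally align strings `a` and `b`; return the two gapped strings."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap inserted in b
                score[i][j - 1] + gap,            # gap inserted in a
            )

    # Trace back from the bottom-right corner to recover one alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


aligned_gt, aligned_noise = needleman_wunsch("New York is big", "New Yo rkis")
print(aligned_gt)
print(aligned_noise)
```

Both returned strings always have the same length, and removing the gap character from each recovers the original inputs. Because several alignments can share the top score, the exact gap placement may differ from what `alignment.align()` prints.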

View file

@ -1,22 +0,0 @@
.. genalog documentation master file, created by
sphinx-quickstart on Thu Jan 28 15:19:33 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to genalog's documentation!
===================================
.. toctree::
:maxdepth: 2
:caption: Contents:
genalog/genalog
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View file

@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View file

@ -1,2 +1,4 @@
jupyter-book
sphinx
sphinx_rtd_theme
sphinx_inline_tabs
ghp-import

View file

@ -32,7 +32,7 @@
"# Analog Document Generation\n",
"\n",
"<p float=\"left\">\n",
" <img src=\"static/analog_doc_gen_pipeline.png\" width=\"800\" />\n",
" <img src=\"static\\analog_doc_gen_pipeline.png\" width=\"800\" />\n",
"</p>\n",
"\n",
"Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatic generate documents with degradation from a body of text."

View file

@ -339,7 +339,7 @@ def parse_alignment(aligned_gt, aligned_noise, gap_char=GAP_CHAR):
gap_char (char, optional) : gap char used in alignment algorithm. Defaults to GAP_CHAR.
Returns:
tuple -- a tuple ``(gt_to_noise_mapping, noise_to_gt_mapping)`` of two 2D int arrays:
tuple : ``(gt_to_noise_mapping, noise_to_gt_mapping)`` of two 2D int arrays:
where each array defines the mapping between aligned gt tokens
to noise tokens and vice versa.

View file

@ -201,8 +201,7 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
gt_tokens (list) : a list of ground truth string tokens
ocr_tokens (list) : a list of OCR'ed text tokens
gap_char (char, optional) : gap char used in alignment algorithm. Defaults to ``alignment.GAP_CHAR``.
use_anchor (bool, optional) : use faster alignment method with anchors if set to True
. Defaults to True.
use_anchor (bool, optional) : use faster alignment method with anchors if set to True. Defaults to True.
Raises:
GapCharError:
@ -210,12 +209,11 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
to set of all possible gap characters (GAP_CHAR_SET)
Returns:
tuple : a tuple of 3 elements ``(ocr_labels, aligned_gt, aligned_ocr, gap_char)``
where
1. ``ocr_labels`` is a list of NER label for the corresponding ocr tokens
2. ``aligned_gt`` is the ground truth string aligned with the ocr text
3. ``aligned_ocr`` is the ocr text aligned with ground true
4. ``gap_char`` is the char used to alignment for inserting gaps
tuple : a tuple of 4 elements ``(ocr_labels, aligned_gt, aligned_ocr, gap_char)`` where
1. ``ocr_labels`` is a list of NER labels for the corresponding ocr tokens
2. ``aligned_gt`` is the ground truth string aligned with the ocr text
3. ``aligned_ocr`` is the ocr text aligned with the ground truth
4. ``gap_char`` is the char used by the alignment for inserting gaps
"""
# Find a set of suitable GAP_CHAR based not in the set of input characters
gap_char_candidates, input_char_set = _find_gap_char_candidates(
@ -241,14 +239,14 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
def _propagate_label_to_ocr(
gt_labels, gt_tokens, ocr_tokens, gap_char=alignment.GAP_CHAR, use_anchor=True
):
"""Propagate NER label for ground truth tokens to to ocr tokens. Low level implementation
r"""Propagate NER label for ground truth tokens to ocr tokens. Low level implementation
NOTE: that `gt_tokens` and `ocr_tokens` MUST NOT contain invalid tokens.
Invalid tokens are:
1. non-atomic tokens, or space-separated string ("New York")
2. multiple occurrences of the GAP_CHAR ('@@@')
3. empty string ("")
4. string with spaces (" ")
1. non-atomic tokens, or space-separated string ("New York")
2. multiple occurrences of the GAP_CHAR ('@@@')
3. empty string ("")
4. string with spaces (" ")
::
@ -261,7 +259,7 @@ def _propagate_label_to_ocr(
gt label B-p I-p B-p I-p B-p I-p B-p I-p B-p I-p I-p
| | | | | | | | | | |
gt_token New York New York New York New York New York City
/ \\ / \\ \\/ /\\ / | | |
/ \ / \ \ / /\ / | | |
ocr_token N ew Yo rk NewYork N ew@York New York City
| | | | | | | | | |
ocr label B-p I-p I-p I-p B-p B-p I-p B-p B-p I-p
@ -274,7 +272,7 @@ def _propagate_label_to_ocr(
gt label O V O O V W O O
| | | | | | | |
gt_token something is big this is huge is big
/ \\ \\ \\/ /\\ /\\/ |
/ \ \ \ / /\ /\ / |
ocr_token so me thing isbig th isi shuge is
| | | | | | | |
ocr label o o o V O O V O
@ -288,37 +286,32 @@ def _propagate_label_to_ocr(
Defaults to True.
Raises:
ValueError: when
1. there is unequal number of gt_tokens and gt_labels
2. there is a non-atomic token in gt_tokens or ocr_tokens
3. there is an empty string in gt_tokens or ocr_tokens
4. there is a token full of space characters only in gt_tokens or ocr_tokens
5. gt_to_ocr_mapping has more tokens than gt_tokens
1. there is unequal number of gt_tokens and gt_labels
2. there is a non-atomic token in gt_tokens or ocr_tokens
3. there is an empty string in gt_tokens or ocr_tokens
4. there is a token full of space characters only in gt_tokens or ocr_tokens
5. gt_to_ocr_mapping has more tokens than gt_tokens
GapCharError: when
1. there is a token consisted of GAP_CHAR only
1. there is a token consisted of GAP_CHAR only
Returns:
a tuple of 4 elements:
(ocr_labels, aligned_gt, aligned_ocr, gap_char)
a tuple of 4 elements: (ocr_labels, aligned_gt, aligned_ocr, gap_char)
where
`ocr_labels` is a list of NER label for the corresponding ocr tokens
`aligned_gt` is the ground truth string aligned with the ocr text
`aligned_ocr` is the ocr text aligned with ground true
`gap_char` is the char used to alignment for inserting gaps
`ocr_labels` is a list of NER label for the corresponding ocr tokens
`aligned_gt` is the ground truth string aligned with the ocr text
`aligned_ocr` is the ocr text aligned with ground true
`gap_char` is the char used to alignment for inserting gaps
For example,
given input:
For example, given input:
gt_labels: ["B-place", "I-place", "o", "o"]
gt_tokens: ["New", "York", "is", "big"]
ocr_tokens: ["N", "ewYork", "big"]
>>> _propagate_label_to_ocr(
["B-place", "I-place", "o", "o"],
["New", "York", "is", "big"],
["N", "ewYork", "big"]
)
(["B-place", "I-place", "o"], "N@ew York is big", "N ew@York@@@ big", '@')
output:
(
["B-place", "I-place", "o"],
"N@ew York is big",
"N ew@York@@@ big"
)
"""
# Pseudo-algorithm:

View file

@ -16,12 +16,12 @@ setuptools.setup(
name="genalog",
install_requires=requirements,
version=BUILD_VERSION,
author="Team Enki",
author_email="ta_nerds@microsoft.com",
author="Jianjie Liu & Amit Gupte",
author_email="ta_maidap_fy20_h2@microsoft.com",
description="Tools for generating analog document (images) from raw text",
long_description=long_description,
long_description_content_type="text/markdown",
url='https://msazure.visualstudio.com/DefaultCollection/Cognitive%20Services/_git/Tools-Synthetic-Data-Generator',
url='https://github.com/microsoft/genalog',
packages=setuptools.find_packages(exclude=['tests', 'tests.*']),
package_data={'': [
'genalog/generation/templates/*.jinja'