Laserprec/jupyter book doc (#28)
* Add documentation in jupyter-book
* Add Trademark Notice
22
README.md
|
@ -2,7 +2,7 @@
|
|||
|
||||
[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
|
||||
|
||||
Genalog is an open source, cross-platform python package allowing to generate synthetic document images with text data. Tool also allows you to add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
|
||||
`Genalog` is an open source, cross-platform python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
|
||||
|
||||
Overview
|
||||
-------------------------------------
|
||||
|
@ -85,16 +85,23 @@ If you are running on Windows, MacOS, or other Linux distributions, please see [
|
|||
|
||||
Repo Structure
|
||||
-------------------
|
||||
Tools-Synthetic-Data-Generator
|
||||
genalog
|
||||
├────genalog
|
||||
│ ├─── generation # generate text images
|
||||
│ ├──── degradation # methods for image degradation
|
||||
│ ├──── ocr # running the Azure Search Pipeline
|
||||
│ └──── text # methods to Align OCR Output Text with Input Text
|
||||
├────examples # Example Jupyter Notebooks for Various Synthetic Data Generation Scenarios
|
||||
├────tests # PyTest files
|
||||
├────README.md # Main Readme file
|
||||
└────LICENSE # License file
|
||||
│ └──── text # methods to align OCR output text with input text
|
||||
├────devops # CI/CD pipelines
|
||||
├────docs # online documentation
|
||||
├────examples # example Jupyter notebooks for various synthetic data generation scenarios
|
||||
├────tests # tests
|
||||
├────tox.ini # CI orchestration and configurations
|
||||
├────README.md
|
||||
└────LICENSE
|
||||
|
||||
Trademark Notice
|
||||
--------------------
|
||||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
|
||||
|
||||
Microsoft Open Source Code of Conduct
|
||||
-------------------------------------
|
||||
|
@ -118,7 +125,6 @@ For more information see the [Code of Conduct FAQ](https://opensource.microsoft.
|
|||
or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
|
||||
|
||||
|
||||
|
||||
Collaborators
|
||||
-------------------------------------
|
||||
Genalog was originally developed by the [MAIDAP team at Microsoft Cambridge NERD](http://www.microsoftnewengland.com/nerd-ai/) in association with the Text Analytics Team in Redmond.
|
||||
|
|
|
@ -1,3 +1,3 @@
|
|||
_build/
|
||||
_static/
|
||||
_templates/
|
||||
**/example.txt
|
||||
**/_build
|
||||
**/data
|
|
@ -1,20 +0,0 @@
|
|||
# Minimal makefile for Sphinx documentation
|
||||
#
|
||||
|
||||
# You can set these variables from the command line, and also
|
||||
# from the environment for the first two.
|
||||
SPHINXOPTS ?=
|
||||
SPHINXBUILD ?= sphinx-build
|
||||
SOURCEDIR = .
|
||||
BUILDDIR = _build
|
||||
|
||||
# Put it first so that "make" without argument is like "make help".
|
||||
help:
|
||||
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
||||
|
||||
.PHONY: help Makefile
|
||||
|
||||
# Catch-all target: route all unknown targets to Sphinx using the new
|
||||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
|
||||
%: Makefile
|
||||
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
67
docs/conf.py
|
@ -1,67 +0,0 @@
|
|||
# Configuration file for the Sphinx documentation builder.
|
||||
#
|
||||
# This file only contains a selection of the most common options. For a full
|
||||
# list see the documentation:
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html
|
||||
|
||||
# -- Path setup --------------------------------------------------------------
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
|
||||
import os
|
||||
import sys
|
||||
sys.path.insert(0, os.path.abspath('.'))
|
||||
sys.path.insert(0, os.path.abspath('..'))
|
||||
sys.path.insert(0, os.path.abspath('../genalog'))
|
||||
sys.path.insert(0, os.path.abspath('../genalog/degradation'))
|
||||
|
||||
|
||||
# -- Project information -----------------------------------------------------
|
||||
|
||||
project = 'genalog'
|
||||
copyright = '2021, Microsoft'
|
||||
author = 'Microsoft'
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = [
|
||||
'sphinx.ext.autodoc',
|
||||
'sphinx.ext.napoleon',
|
||||
'sphinx.ext.coverage',
|
||||
]
|
||||
|
||||
# The master toctree document.
|
||||
master_doc = 'index'
|
||||
autodoc_member_order = 'groupwise'
|
||||
autoclass_content = 'both'
|
||||
|
||||
# Napoleon settings
|
||||
napoleon_google_docstring = True
|
||||
napoleon_numpy_docstring = True
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
templates_path = ['_templates']
|
||||
|
||||
# List of patterns, relative to source directory, that match files and
|
||||
# directories to ignore when looking for source files.
|
||||
# This pattern also affects html_static_path and html_extra_path.
|
||||
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
|
||||
|
||||
|
||||
# -- Options for HTML output -------------------------------------------------
|
||||
|
||||
# The theme to use for HTML and HTML Help pages. See the documentation for
|
||||
# a list of builtin themes.
|
||||
#
|
||||
html_theme = 'sphinx_rtd_theme'
|
||||
|
||||
# Add any paths that contain custom static files (such as style sheets) here,
|
||||
# relative to this directory. They are copied after the builtin static files,
|
||||
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||
html_static_path = ['_static']
|
|
@ -1,29 +0,0 @@
|
|||
genalog.degradation package
|
||||
===========================
|
||||
|
||||
Submodules
|
||||
----------
|
||||
|
||||
genalog.degradation.degrader module
|
||||
-----------------------------------
|
||||
|
||||
.. automodule:: genalog.degradation.degrader
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.degradation.effect module
|
||||
---------------------------------
|
||||
|
||||
.. automodule:: genalog.degradation.effect
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
Module contents
|
||||
---------------
|
||||
|
||||
.. automodule:: genalog.degradation
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
|
@ -1,32 +0,0 @@
|
|||
genalog package
|
||||
===============
|
||||
|
||||
Subpackages
|
||||
-----------
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 4
|
||||
|
||||
genalog.degradation
|
||||
genalog.generation
|
||||
genalog.ocr
|
||||
genalog.text
|
||||
|
||||
Submodules
|
||||
----------
|
||||
|
||||
genalog.pipeline module
|
||||
-----------------------
|
||||
|
||||
.. automodule:: genalog.pipeline
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
Module contents
|
||||
---------------
|
||||
|
||||
.. automodule:: genalog
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
|
@ -0,0 +1,46 @@
|
|||
title : <h1 style="font-size:2em;text-align:center;color:#FF5733">Genalog</h1>
|
||||
author: Jianjie Liu and Amit Gupte
|
||||
# logo: 'qe-logo-large.png'
|
||||
|
||||
# Short description about the book
|
||||
description: >-
|
||||
Guide for end-to-end synthetic analog document generation
|
||||
|
||||
execute:
|
||||
execute_notebooks : off
|
||||
|
||||
# Interact link settings
|
||||
notebook_interface : "notebook"
|
||||
|
||||
# Launch button settings
|
||||
repository:
|
||||
url : https://github.com/microsoft/genalog
|
||||
path_to_book : /docs/genalog_docs
|
||||
branch : main
|
||||
|
||||
launch_buttons:
|
||||
notebook_interface : classic
|
||||
|
||||
# HTML-specific settings
|
||||
html:
|
||||
home_page_in_navbar : false
|
||||
use_repository_button : true
|
||||
|
||||
# # LaTeX settings
|
||||
# bibtex_bibfiles:
|
||||
# - _bibliography/references.bib
|
||||
# latex:
|
||||
# latex_engine : "xelatex"
|
||||
# latex_documents:
|
||||
# targetname: book.tex
|
||||
|
||||
sphinx:
|
||||
extra_extensions:
|
||||
- sphinx_inline_tabs
|
||||
- sphinx.ext.autodoc
|
||||
- sphinx.ext.napoleon
|
||||
- sphinx.ext.viewcode
|
||||
config:
|
||||
napoleon_google_docstring: True
|
||||
autodoc_member_order: groupwise
|
||||
autoclass_content: both
|
|
@ -0,0 +1,24 @@
|
|||
root: index
|
||||
format: jb-book
|
||||
defaults:
|
||||
numbered: false
|
||||
parts:
|
||||
- caption: Getting Started
|
||||
chapters:
|
||||
- file: installation
|
||||
- file: generation_pipeline
|
||||
- file: e2e_dataset_pipeline
|
||||
- caption: Fabricating Document & Noise
|
||||
chapters:
|
||||
- file: doc_generation
|
||||
- file: doc_degradation
|
||||
- caption: Handling Noisy Text
|
||||
chapters:
|
||||
- file: text_alignment
|
||||
- file: ocr_label_propagation
|
||||
- caption: API Documentation
|
||||
chapters:
|
||||
- file: docstring/genalog.degradation
|
||||
- file: docstring/genalog.generation
|
||||
- file: docstring/genalog.ocr
|
||||
- file: docstring/genalog.text
|
|
@ -0,0 +1,257 @@
|
|||
# Degrade a document
|
||||
|
||||
The `genalog.degradation` module allows you to degrade any image with real-world degradation effects.
|
||||
|
||||
## Download a sample image
|
||||
We can download a [sample image](https://github.com/microsoft/genalog/blob/main/example/sample/degradation/text_zoomed.png) from our repo, but you are welcome to skip this step and use an image you generated in the [previous page](document-generation) or elsewhere.
|
||||
|
||||
```python
|
||||
import cv2
import requests
|
||||
|
||||
sample_img_url = "https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/degradation/text_zoomed.png"
|
||||
sample_img = "text_zoomed.png"
|
||||
|
||||
r = requests.get(sample_img_url, allow_redirects=True)
|
||||
open(sample_img, 'wb').write(r.content)
|
||||
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
```
|
||||
|
||||
## Degrader
|
||||
|
||||
The `Degrader` class is the standard way to apply multiple degradations to an image.
|
||||
|
||||
```python
|
||||
import cv2
|
||||
from genalog.degradation.degrader import Degrader
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
# We are applying degradation effects to the images in the following sequence:
|
||||
# blur -> bleed_through -> salt
|
||||
degradations = [
|
||||
("blur", {"radius": 3}),
|
||||
("bleed_through", {"alpha": 0.8}),
|
||||
("salt", {"amount": 0.5}),
|
||||
]
|
||||
# All of the referenced degradation effects are in submodule `genalog.degradation.effect`
|
||||
|
||||
degrader = Degrader(degradations)
|
||||
dst = degrader.apply_effects(src)
|
||||
plt.imshow(dst, cmap="gray")
|
||||
```
|
||||
|
||||
```{image} static/degrader.png
|
||||
:width: 40%
|
||||
:align: center
|
||||
```
|
||||
|
||||
### Advanced Degradation Configurations
|
||||
|
||||
`genalog` provides an enumeration `ImageState` to reference the image at different states in the degradation process. For example:
|
||||
|
||||
1. `ImageState.ORIGINAL_STATE` refers to the original state of the image before applying any degradation, while
|
||||
1. `ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.
|
||||
|
||||
This is most useful when you want to combine multiple layers of degradation, as in the following example.
|
||||
|
||||
```python
|
||||
from genalog.degradation.degrader import Degrader, ImageState
|
||||
|
||||
degradations = [
|
||||
("morphology", {"operation": "open", "kernel_shape":(9,9), "kernel_type":"plus"}),
|
||||
("morphology", {"operation": "close", "kernel_shape":(9,1), "kernel_type":"ones"}),
|
||||
("salt", {"amount": 0.7}),
|
||||
("overlay", {
|
||||
"src": ImageState.ORIGINAL_STATE,
|
||||
"background": ImageState.CURRENT_STATE,
|
||||
}),
|
||||
("bleed_through", {
|
||||
"src": ImageState.CURRENT_STATE,
|
||||
"background": ImageState.ORIGINAL_STATE,
|
||||
"alpha": 0.90,
|
||||
"offset_x": -5,
|
||||
"offset_y": -5,
|
||||
}),
|
||||
("pepper", {"amount": 0.005}),
|
||||
("blur", {"radius": 3}),
|
||||
("salt", {"amount": 0.15}),
|
||||
]
|
||||
|
||||
degrader = Degrader(degradations)
|
||||
dst = degrader.apply_effects(src)
|
||||
plt.imshow(dst, cmap="gray")
|
||||
```
|
||||
|
||||
```{image} static/degrader_heavy.png
|
||||
:width: 40%
|
||||
:align: center
|
||||
```
|
||||
|
||||
## Blur
|
||||
|
||||
An effect that occurs when the scanner cannot focus on the document properly, resulting in a foggy or hazy-looking document.
|
||||
|
||||
```python
|
||||
# Import Genalog Degradations and other libraries
|
||||
import genalog.degradation.effect as effect
|
||||
import cv2
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
# Add noise to the Image
|
||||
blurred = effect.blur(src, radius=7) # the larger the radius, the lower the contrast
|
||||
plt.imshow(blurred, cmap="gray")
|
||||
plt.title('blurred', fontsize=6)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{image} static/blur.png
|
||||
:width: 60%
|
||||
:align: center
|
||||
```
|
||||
|
||||
## Bleed Through
|
||||
This effect tries to mimic the seepage of ink from one side of a printed page to the other.
|
||||
```python
|
||||
# Import Genalog Degradations and other libraries
|
||||
import genalog.degradation.effect as effect
|
||||
import cv2
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
# Add noise to the Image
|
||||
bleed_through = effect.bleed_through(src, alpha=0.9)  # the higher the alpha, the less visible the effect
|
||||
plt.imshow(bleed_through, cmap="gray")
|
||||
plt.title('bleed_through', fontsize=6)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{image} static/bleed_through.png
|
||||
:width: 60%
|
||||
:align: center
|
||||
```
|
||||
|
||||
## Salt and Pepper noise
|
||||
In this effect we randomly sprinkle "salt" (white pixels) and "pepper" (dark pixels) onto the original image to imitate ink degradation and page degradation.
|
||||
```python
|
||||
# Import Genalog Degradations and other libraries
|
||||
import genalog.degradation.effect as effect
|
||||
import cv2
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
# Add noise to the Image
|
||||
salted = effect.salt(src, amount=0.4) # amount is the percentage of pixels to be salted (whitened)
|
||||
plt.imshow(salted, cmap="gray")
|
||||
plt.title('Salted', fontsize=6)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{image} static/salt_pepper.png
|
||||
:width: 70%
|
||||
:align: center
|
||||
```
|
||||
|
||||
## Morphological Degradations
|
||||
|
||||
Morphological operations are structural degradations commonly applied to binary images. For more information, please see [this reference](http://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm). The convention for binary images is to have the subject, or foreground, in white on a black background. However, our example image has the subject in black on a white background, so each morphological degradation will have an effect opposite to its name.
|
||||
|
||||
### Erode and Open
|
||||
|
||||
```python
|
||||
# Import Genalog Degradations and other libraries
|
||||
import genalog.degradation.effect as effect
|
||||
import cv2
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
# Add noise to the Image
|
||||
kernel = effect.create_2D_kernel((5,5), kernel_type="ones")
|
||||
erode = effect.erode(src, kernel)
|
||||
opened = effect.open(src, kernel) # retains more of the foreground shape than erosion, given the same kernel
|
||||
|
||||
# display input and output image
|
||||
fig = plt.figure(figsize=(6, 4), dpi=300)
|
||||
fig.add_subplot(1,3,1)
|
||||
plt.imshow(src, cmap="gray")
|
||||
plt.title('src', fontsize=6)
|
||||
fig.add_subplot(1,3,2)
|
||||
plt.imshow(opened, cmap="gray")
|
||||
plt.title('open', fontsize=6)
|
||||
fig.add_subplot(1,3,3)
|
||||
plt.imshow(erode, cmap="gray")
|
||||
plt.title('erode', fontsize=6)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{image} static/open_erode.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
```
|
||||
|
||||
Here we are "opening" up the foreground structures (text) and joining the character structures together. From another perspective, we are "eroding" away the white background by expanding the foreground.
|
||||
|
||||
### Dilate and Close
|
||||
|
||||
```python
|
||||
# Load in sample image
|
||||
src = cv2.imread(sample_img, cv2.IMREAD_GRAYSCALE)
|
||||
kernel = effect.create_2D_kernel((3,3), kernel_type="ones")
|
||||
dilate = effect.dilate(src, kernel)
|
||||
close = effect.close(src, kernel) # less destructive than dilation, given the same kernel
|
||||
|
||||
# display input and output image
|
||||
fig = plt.figure(figsize=(6, 4), dpi=300)
|
||||
fig.add_subplot(1,3,1)
|
||||
plt.imshow(src, cmap="gray")
|
||||
plt.title('src', fontsize=6)
|
||||
fig.add_subplot(1,3,2)
|
||||
plt.imshow(close, cmap="gray")
|
||||
plt.title('close', fontsize=6)
|
||||
fig.add_subplot(1,3,3)
|
||||
plt.imshow(dilate, cmap="gray")
|
||||
plt.title('dilate', fontsize=6)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{image} static/close_dilate.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
```
|
||||
|
||||
We are "closing" or "dilating" the white background, thus chipping away the foreground structures (text). This can mimic degrading ink or a printer running out of ink.
|
||||
|
||||
### Kernel Size and Shape
|
||||
|
||||
An important element of morphological degradation is the [structuring element](http://homepages.inf.ed.ac.uk/rbf/HIPR2/strctel.htm), or kernel, used. With the proper kernel size and shape, you can extract interesting structures from the source image.
|
||||
|
||||
````{toggle}
|
||||
```python
|
||||
elliptical_kernel = effect.create_2D_kernel((4,4), kernel_type="ellipse")
|
||||
vertical_kernel = effect.create_2D_kernel((5,1), kernel_type="ones")
|
||||
horizontal_kernel = effect.create_2D_kernel((1,5), kernel_type="ones")
|
||||
upper_tri_kernel = effect.create_2D_kernel((5,5), kernel_type="upper_triangle")
|
||||
lower_tri_kernel = effect.create_2D_kernel((5,5), kernel_type="lower_triangle")
|
||||
x_kernel = effect.create_2D_kernel((4,4), kernel_type="x")
|
||||
plus_kernel = effect.create_2D_kernel((6,6), kernel_type="plus")
|
||||
|
||||
dilate_w_elliptical_k = effect.dilate(src, elliptical_kernel)
|
||||
dilate_w_vertical_k = effect.dilate(src, vertical_kernel)
|
||||
dilate_w_horizontal_k = effect.dilate(src, horizontal_kernel)
|
||||
dilate_w_upper_tri_k = effect.dilate(src, upper_tri_kernel)
|
||||
dilate_w_lower_tri_k = effect.dilate(src, lower_tri_kernel)
|
||||
dilate_w_x_kernel = effect.dilate(src, x_kernel)
|
||||
dilate_w_plus_kernel = effect.dilate(src, plus_kernel)
|
||||
```
|
||||
````
|
||||
|
||||
```{image} static/kernel_morph.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
```
|
|
@ -0,0 +1,139 @@
|
|||
(document-generation)=
|
||||
# Create a document
|
||||
|
||||
`genalog` allows you to generate synthetic documents from **any** given text.
|
||||
|
||||
To generate the synthetic documents, there are two important concepts to be familiar with:
|
||||
|
||||
1. `Template` - controls the layout of the document (i.e. font, language, position of the content, etc.)
|
||||
2. `Content` - items to be used to fill the template (i.e. text, images, tables, lists, etc)
|
||||
|
||||
We use an HTML templating engine [(Jinja2)](https://jinja.palletsprojects.com/en/3.0.x/) to build our HTML templates and an HTML-to-PDF converter [(Weasyprint)](https://weasyprint.readthedocs.io/en/latest/) to render the HTML as a PDF or an image.
|
||||
|
||||
We provide **three** standard templates with the following document layouts:
|
||||
|
||||
````{tab} columns.html.jinja
|
||||
```{figure} static/columns_Times_11px.png
|
||||
:width: 30%
|
||||
```
|
||||
````
|
||||
````{tab} letter.html.jinja
|
||||
```{figure} static/letter_Times_11px.png
|
||||
:width: 30%
|
||||
```
|
||||
````
|
||||
````{tab} text_block.html.jinja
|
||||
```{figure} static/text_block_Times_11px.png
|
||||
:width: 30%
|
||||
```
|
||||
````
|
||||
|
||||
You can find the source code of these templates under [`genalog/generation/templates`](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates).
|
||||
|
||||
## Document Content
|
||||
|
||||
The goal is to be able to generate synthetic documents from ANY text input. Here we load a sample file from our repo, but you may use any text as well.
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
sample_text_url = "https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt"
|
||||
|
||||
r = requests.get(sample_text_url, allow_redirects=True)
|
||||
text = r.content.decode("ascii")
|
||||
```
|
||||
### Initialize `CompositeContent`
|
||||
To prepare the content that will populate a document template, we need to create a `CompositeContent` object.
|
||||
|
||||
```python
|
||||
from genalog.generation.content import CompositeContent, ContentType
|
||||
|
||||
# Initialize CompositeContent Object
|
||||
paragraphs = text.split('\n\n') # split paragraphs by `\n\n`
|
||||
content_types = [ContentType.PARAGRAPH] * len(paragraphs)
|
||||
content = CompositeContent(paragraphs, content_types)
|
||||
```
|
||||
A `CompositeContent` is a list of pairs of text bodies and their `ContentType`. Here we declare a list of multiple `ContentType.PARAGRAPH`s.
|
||||
|
||||
```{note}
|
||||
`ContentType` is an enumeration of the supported content types (e.g. `ContentType.PARAGRAPH`, `ContentType.TITLE`, `ContentType.COMPOSITE`). This enumeration controls the collection of CSS styles applied to the associated content. If you change to `ContentType.TITLE`, for example, the paragraph will inherit the style of a title section (bold text, enlarged font size, etc.).
|
||||
```
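
For illustration, here is a minimal sketch that mixes content types: it prepends a placeholder title string (purely illustrative, not part of the sample text) as a `ContentType.TITLE`, producing a `titled_content` object that we will reuse in the style examples below.

```python
# A minimal sketch: prepend an illustrative title to the sample paragraphs.
# "A Sample Title" is a placeholder string, not part of the downloaded text.
titled_content = CompositeContent(
    ["A Sample Title"] + paragraphs,
    [ContentType.TITLE] + content_types,
)
```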
|
||||
|
||||
### Populate Content Into a Template
|
||||
|
||||
Once we have initialized a `CompositeContent` object, we can populate the content into any standard template via the `DocumentGenerator` class.
|
||||
|
||||
```python
|
||||
from genalog.generation.document import DocumentGenerator
|
||||
default_generator = DocumentGenerator()
|
||||
|
||||
print(f"Available default templates: {default_generator.template_list}")
|
||||
print(f"Default styles to generate: {default_generator.styles_to_generate}")
|
||||
```
|
||||
|
||||
The `DocumentGenerator` comes with default styles. The code snippet above shows the default configurations and the names of the three standard templates; use this information to select the template(s) you want to generate. The three templates are `["columns.html.jinja", "letter.html.jinja", "text_block.html.jinja"]`.
|
||||
|
||||
```python
|
||||
# Select specific template, content and create the generator
|
||||
doc_gen = default_generator.create_generator(content, ["columns.html.jinja", "letter.html.jinja", "text_block.html.jinja"])
|
||||
# we will use the `CompositeContent` object initialized from above cell
|
||||
|
||||
# python generator
|
||||
for doc in doc_gen:
|
||||
template_name = doc.template.name.replace(".html.jinja", "")
|
||||
doc.render_png(target=f"example_{template_name}.png", resolution=300) #in dots per inch
|
||||
```
|
||||
You can also retrieve the raw image bytes without specifying the `target`:
|
||||
|
||||
```python
|
||||
from genalog.generation.document import DocumentGenerator
|
||||
from IPython.core.display import Image, display
|
||||
|
||||
doc_gen = default_generator.create_generator(content, ['text_block.html.jinja'])
|
||||
|
||||
for doc in doc_gen:
|
||||
image_byte = doc.render_png(resolution=100)
|
||||
display(Image(image_byte))
|
||||
```
|
||||
|
||||
Alternatively, you can save the document as a PDF file.
|
||||
|
||||
```python
|
||||
# Select specific template, content and create the generator
|
||||
doc_gen = default_generator.create_generator(content, ['text_block.html.jinja'])
|
||||
# we will use the `CompositeContent` object initialized from above cell
|
||||
|
||||
# python generator
|
||||
for doc in doc_gen:
|
||||
doc.render_pdf(target="example_text_block.pdf")
|
||||
```
|
||||
|
||||
### Changing Document Styles
|
||||
|
||||
You can alter the document styles, including font family, font size, hyphenation, and text alignment. These properties mirror their CSS counterparts, and you can use standard CSS values for them.
|
||||
|
||||
```python
|
||||
from genalog.generation.document import DocumentGenerator
|
||||
from IPython.core.display import Image, display
|
||||
|
||||
# You can add as many options as possible. A new document will be generated per combination of the styles
|
||||
new_style_combinations = {
|
||||
"hyphenate": [True],
|
||||
"font_size": ["11px", "12px"], # most CSS units are supported `px`, `cm`, `em`, etc...
|
||||
"font_family": ["Times"],
|
||||
"text_align": ["justify"]
|
||||
}
|
||||
|
||||
default_generator = DocumentGenerator()
|
||||
default_generator.set_styles_to_generate(new_style_combinations)
|
||||
# Examine the list of all style combinations to generate
|
||||
print(f"Styles to generate: {default_generator.styles_to_generate}")
|
||||
|
||||
doc_gen = default_generator.create_generator(titled_content, ["columns.html.jinja", "letter.html.jinja"])
|
||||
|
||||
for doc in doc_gen:
|
||||
print(doc.styles)
|
||||
print(doc.template.name)
|
||||
image_byte = doc.render_png(resolution=300)
|
||||
display(Image(image_byte))
|
||||
```
|
|
@ -0,0 +1,15 @@
|
|||
genalog.degradation
|
||||
====================
|
||||
|
||||
Image Degrader
|
||||
-----------------------------------
|
||||
|
||||
.. automodule:: genalog.degradation.degrader
|
||||
:members:
|
||||
|
||||
Degradation Effects
|
||||
---------------------------------
|
||||
|
||||
.. automodule:: genalog.degradation.effect
|
||||
:members:
|
||||
:show-inheritance:
|
|
@ -1,15 +1,11 @@
|
|||
genalog.generation package
|
||||
genalog.generation
|
||||
==========================
|
||||
|
||||
Submodules
|
||||
----------
|
||||
|
||||
genalog.generation.content module
|
||||
---------------------------------
|
||||
|
||||
.. automodule:: genalog.generation.content
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.generation.document module
|
||||
|
@ -17,13 +13,4 @@ genalog.generation.document module
|
|||
|
||||
.. automodule:: genalog.generation.document
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
Module contents
|
||||
---------------
|
||||
|
||||
.. automodule:: genalog.generation
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
|
@ -1,53 +1,33 @@
|
|||
genalog.ocr package
|
||||
genalog.ocr
|
||||
===================
|
||||
|
||||
Submodules
|
||||
----------
|
||||
|
||||
genalog.ocr.blob\_client module
|
||||
-------------------------------
|
||||
|
||||
.. automodule:: genalog.ocr.blob_client
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.ocr.common module
|
||||
-------------------------
|
||||
|
||||
.. automodule:: genalog.ocr.common
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.ocr.grok module
|
||||
-----------------------
|
||||
|
||||
.. automodule:: genalog.ocr.grok
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.ocr.metrics module
|
||||
--------------------------
|
||||
|
||||
.. automodule:: genalog.ocr.metrics
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.ocr.rest\_client module
|
||||
-------------------------------
|
||||
|
||||
.. automodule:: genalog.ocr.rest_client
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
Module contents
|
||||
---------------
|
||||
genalog.ocr.blob\_client module
|
||||
-------------------------------
|
||||
|
||||
.. automodule:: genalog.ocr
|
||||
.. automodule:: genalog.ocr.blob_client
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
|
@ -1,69 +1,47 @@
|
|||
genalog.text package
|
||||
genalog.text
|
||||
====================
|
||||
|
||||
Submodules
|
||||
----------
|
||||
|
||||
genalog.text.alignment module
|
||||
-----------------------------
|
||||
|
||||
.. automodule:: genalog.text.alignment
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.text.anchor module
|
||||
--------------------------
|
||||
|
||||
.. automodule:: genalog.text.anchor
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.text.conll\_format module
|
||||
---------------------------------
|
||||
|
||||
.. automodule:: genalog.text.conll_format
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.text.lcs module
|
||||
-----------------------
|
||||
|
||||
.. automodule:: genalog.text.lcs
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.text.ner\_label module
|
||||
------------------------------
|
||||
|
||||
.. automodule:: genalog.text.ner_label
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
:private-members: _propagate_label_to_ocr
|
||||
|
||||
genalog.text.preprocess module
|
||||
------------------------------
|
||||
|
||||
.. automodule:: genalog.text.preprocess
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
genalog.text.splitter module
|
||||
----------------------------
|
||||
|
||||
.. automodule:: genalog.text.splitter
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
Module contents
|
||||
---------------
|
||||
|
||||
.. automodule:: genalog.text
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
|
@ -0,0 +1,12 @@
|
|||
# OCR-NER Dataset Generation
|
||||
|
||||
```{image} static/labeled_synthetic_pipeline.png
|
||||
:width: 80%
|
||||
:align: center
|
||||
```
|
||||
|
||||
If you were brought here by our paper [insert link here], you may be interested in the data preparation pipeline built with `genalog`. The figure above shows the steps involved in transforming a Named-Entity Recognition (NER) dataset like [CoNLL 2003](https://deepai.org/dataset/conll-2003-english) with synthetic Optical Character Recognition (OCR) errors. This OCR-NER dataset is useful for training an NER model that is robust against common OCR mistakes. You can find the full dataset preparation pipeline in this [notebook](https://github.com/microsoft/genalog/blob/main/example/dataset_generation.ipynb) from our repo.
|
||||
|
||||
We believe this methodology of inducing OCR errors into a dataset can be applied to other NLP tasks to improve model robustness against the inherent noise of OCR outputs. We welcome community contributions if this fits your use case.
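
Below is a minimal sketch of the final step of that pipeline, propagating ground truth NER labels onto noisy OCR tokens. The noisy string here is an illustrative stand-in for real OCR output; the full pipeline, including document generation, degradation, and the OCR step, lives in the linked notebook.

```python
from genalog.text import ner_label, preprocess

gt_text = "New York is big"           # ground truth text with known NER labels
gt_labels = ["B-P", "I-P", "O", "O"]
ocr_text = "New Yo rkis big"          # illustrative stand-in for noisy OCR output

# Tokenize both texts, then propagate the labels across the alignment
gt_tokens = preprocess.tokenize(gt_text)
ocr_tokens = preprocess.tokenize(ocr_text)
ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(
    gt_labels, gt_tokens, ocr_tokens
)
print(ner_label.format_labels(ocr_tokens, ocr_labels))
```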
|
||||
|
||||
|
|
@ -0,0 +1,367 @@
|
|||
{
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
},
|
||||
"orig_nbformat": 2,
|
||||
"kernelspec": {
|
||||
"name": "python3",
|
||||
"display_name": "Python 3.6.9 64-bit ('.env': venv)"
|
||||
},
|
||||
"metadata": {
|
||||
"interpreter": {
|
||||
"hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
|
||||
}
|
||||
},
|
||||
"interpreter": {
|
||||
"hash": "463957e7759ed5c981e4d097e7f970bbf621ad48bd269f8044dc509b219ad94f"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2,
|
||||
"cells": [
|
||||
{
|
||||
"source": [
|
||||
"# Generate your synthetic document\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```{figure} static/analog_doc_gen_pipeline.png\n",
|
||||
":width: 500px\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatic generate documents with degradation from a body of text."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from genalog.pipeline import AnalogDocumentGeneration\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Configurations\n",
|
||||
"\n",
|
||||
"To use the pipeline, you will need to supply the following information:\n",
|
||||
"\n",
|
||||
"### CSS Style Combinations\n",
|
||||
"\n",
|
||||
"`STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generate per style combination)\n",
|
||||
"\n"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"STYLE_COMBINATIONS = {\n",
|
||||
" \"language\": [\"en_US\"],\n",
|
||||
" \"font_family\": [\"Segeo UI\"],\n",
|
||||
" \"font_size\": [\"12px\"],\n",
|
||||
" \"text_align\": [\"justify\"],\n",
|
||||
" \"hyphenate\": [True],\n",
|
||||
"}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"```{note}\n",
|
||||
"Genalog depends on Weasyprint as the engine to render these CSS styles. Most of these fields are standard CSS properties and accepts common values as specified in [W3C CSS Properties](https://www.w3.org/Style/CSS/all-properties.en.html). For details, please see [Weasyprint Documentation](https://weasyprint.readthedocs.io/en/stable/features.html#fonts).\n",
|
||||
"```"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"### Choose a Prebuild HTML Template\n",
|
||||
"\n",
|
||||
"`HTML_TEMPLATE`: name of html template used to generate the synthetic images. The `genalog` package has the following default templates: \n",
|
||||
"\n",
|
||||
"````{tab} columns.html.jinja\n",
|
||||
"```{figure} static/columns_Times_11px.png\n",
|
||||
":width: 30%\n",
|
||||
"Document template with 2 columns \n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"````{tab} letter.html.jinja\n",
|
||||
"```{figure} static/letter_Times_11px.png\n",
|
||||
":width: 30%\n",
|
||||
"Letter-like document template\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"````{tab} text_block.html.jinja\n",
|
||||
"```{figure} static/text_block_Times_11px.png\n",
|
||||
":width: 30%\n",
|
||||
"Simple text block template\n",
|
||||
"```\n",
|
||||
"````"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"HTML_TEMPLATE = \"text_block.html.jinja\"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"### Image Degradations\n",
|
||||
"\n",
|
||||
"`DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.\n"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"````{tab} bleed_through\n",
|
||||
"```{figure} static/bleed_through.png\n",
|
||||
":name: Bleed-through\n",
|
||||
":width: 90%\n",
|
||||
"Mimics a document printed on two sides. Valid values: [0,1].\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"````{tab} blur\n",
|
||||
"```{figure} static/blur.png\n",
|
||||
":name: Blur\n",
|
||||
":width: 90%\n",
|
||||
"Lowers image quality. Unit are in number of pixels.\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"````{tab} salt/pepper\n",
|
||||
"```{figure} static/salt_pepper.png\n",
|
||||
":name: Salt/Pepper\n",
|
||||
":width: 65%\n",
|
||||
"Mimics ink degradation. Valid values: [0, 1].\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"`````{tab} close/dilate\n",
|
||||
"```{figure} static/close_dilate.png\n",
|
||||
":name: Close/Dilate\n",
|
||||
"Degrades printing quality.\n",
|
||||
"```\n",
|
||||
"````{margin}\n",
|
||||
"```{note}\n",
|
||||
"For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"`````\n",
|
||||
"`````{tab} open/erode\n",
|
||||
"```{figure} static/open_erode.png\n",
|
||||
":name: Open/Errode\n",
|
||||
"Ink overflows\n",
|
||||
"```\n",
|
||||
"````{margin}\n",
|
||||
"```{note}\n",
|
||||
"For more details on this degradation, see [Morphilogical Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"`````"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from genalog.degradation.degrader import ImageState\n",
|
||||
"\n",
|
||||
"DEGRADATIONS = [\n",
|
||||
" (\"blur\", {\"radius\": 5}),\n",
|
||||
" (\"bleed_through\", {\n",
|
||||
" \"src\": ImageState.CURRENT_STATE,\n",
|
||||
" \"background\": ImageState.ORIGINAL_STATE,\n",
|
||||
" \"alpha\": 0.8,\n",
|
||||
" \"offset_x\": -6,\n",
|
||||
" \"offset_y\": -12,\n",
|
||||
" }),\n",
|
||||
" (\"morphology\", {\"operation\": \"open\", \"kernel_shape\":(9,9), \"kernel_type\":\"plus\"}),\n",
|
||||
" (\"pepper\", {\"amount\": 0.005}),\n",
|
||||
" (\"salt\", {\"amount\": 0.15}),\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"```{note}\n",
|
||||
"`ImageState.ORIGINAL_STATE` refers to the origin state of the image before applying any degradation, while\n",
|
||||
"`ImageState.CURRENT_STATE` refers to the state of the image after applying the last degradation effect.\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"The example above will apply degradation effects to synthetic images in the sequence of: \n",
|
||||
" \n",
|
||||
" blur -> bleed_through -> morphological operation (open) -> pepper -> salt\n",
|
||||
" \n",
|
||||
"For the full list of supported degradation effects, please see [documentation on degradation](https://github.com/microsoft/genalog/blob/main/genalog/degradation/README.md)."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"We use `Jinja` to prepare html templates. You can find example of these Jinja templates in [our source code](https://github.com/microsoft/genalog/tree/main/genalog/generation/templates)."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Document Generation\n",
|
||||
"\n",
|
||||
"With the above configurations, we can go ahead and start generate synthetic document.\n",
|
||||
"\n",
|
||||
"### Load Sample Text content\n",
|
||||
"\n",
|
||||
"You can use **any** text documents as the content of the generated images. For the sake of the tutorial, you can use the [sample text](https://github.com/microsoft/genalog/blob/main/example/sample/generation/example.txt) from our repo."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"sample_text_url = \"https://raw.githubusercontent.com/microsoft/genalog/main/example/sample/generation/example.txt\"\n",
|
||||
"sample_text = \"example.txt\"\n",
|
||||
"\n",
|
||||
"r = requests.get(sample_text_url, allow_redirects=True)\n",
|
||||
"open(sample_text, 'wb').write(r.content)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"### Generate Synthetic Documents\n",
|
||||
"\n",
|
||||
"Next, we can supply the three aforementioned configurations in initalizing `AnalogDocumentGeneration` object"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from genalog.pipeline import AnalogDocumentGeneration\n",
|
||||
"\n",
|
||||
"IMG_RESOLUTION = 300 # dots per inch (dpi) of the generated pdf/image\n",
|
||||
"\n",
|
||||
"doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION, template_path=None)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"To use custom templates, please set `template_path` to the folder of containing them. You can find more information from our [`document_generation.ipynb`](https://github.com/microsoft/genalog/blob/main/example/document_generation.ipynb).\n",
|
||||
"\n",
|
||||
"Once initialized, you can call `generate_img()` method to get the synthetic documents as images"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# for custom templates, please set template_path.\n",
|
||||
"img_array = doc_generation.generate_img(sample_text, HTML_TEMPLATE, target_folder=None) # returns the raw image bytes if target_folder is not specified"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"```{note}\n",
|
||||
"Setting `target_folder` to `None` will return the raw image bytes as a `Numpy.ndarray`. Otherwise the generated image will be save on the disk as a PNG file in the specified path.\n",
|
||||
"```"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"### Display the Document"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import cv2\n",
|
||||
"from IPython.core.display import Image, display\n",
|
||||
"\n",
|
||||
"_, encoded_image = cv2.imencode('.png', img_array)\n",
|
||||
"display(Image(data=encoded_image, width=600))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Document Generation (Multi-process)\n",
|
||||
"\n",
|
||||
"To scale up the generation across multiple text files, you can use `generate_dataset_multiprocess`. The method will split the list of text filenames into batches and run document generation across different batches as subprocesses in parallel"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from genalog.pipeline import generate_dataset_multiprocess\n",
|
||||
"\n",
|
||||
"DST_PATH = \"data\" # where on disk to write the generated image\n",
|
||||
"\n",
|
||||
"generate_dataset_multiprocess(\n",
|
||||
" [sample_text], DST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, \n",
|
||||
" resolution=IMG_RESOLUTION, batch_size=5\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"```{note}\n",
|
||||
"`[sample_text]` is a list of filenames to generate the synthetic dataset over.\n",
|
||||
"```"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
}
|
||||
]
|
||||
}
|
|
@ -0,0 +1,93 @@
|
|||
# Synthetic Document Generator
|
||||
|
||||
[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
|
||||
|
||||
````{margin}
|
||||
```sh
|
||||
pip install genalog
|
||||
```
|
||||
````
|
||||
|
||||
`genalog` is an open source, cross-platform python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you can create in simple HTML format.
|
||||
|
||||
`genalog` provides several document templates as a start. You can alter the document layout using standard CSS properties like `font-family`, `font-size`, `text-align`, etc. Here are some of the example generated documents:
|
||||
|
||||
````{tab} Multi-Column
|
||||
```{figure} static/columns_Times_11px.png
|
||||
:width: 60%
|
||||
:name: two-columns-index
|
||||
Document template with 2 columns
|
||||
```
|
||||
````
|
||||
````{tab} Letter-like
|
||||
```{figure} static/letter_Times_11px.png
|
||||
:width: 60%
|
||||
:name: letter-like-index
|
||||
Letter-like document template
|
||||
```
|
||||
````
|
||||
````{tab} Simple Text Block
|
||||
```{figure} static/text_block_Times_11px.png
|
||||
:width: 60%
|
||||
:name: text-block-index
|
||||
Simple text block template
|
||||
```
|
||||
````
|
||||
|
||||
Once a document is generated, you can combine various image degradation effects and apply them to the synthetic documents. Here are some of the degradation effects:
|
||||
|
||||
````{tab} Bleed-through
|
||||
```{figure} static/bleed_through.png
|
||||
:name: bleed-through-index
|
||||
:width: 80%
|
||||
Mimics a document printed on two sides
|
||||
```
|
||||
````
|
||||
````{tab} Blur
|
||||
```{figure} static/blur.png
|
||||
:name: blur-index
|
||||
:width: 80%
|
||||
Lowers image quality
|
||||
```
|
||||
````
|
||||
````{tab} Salt/Pepper
|
||||
```{figure} static/salt_pepper.png
|
||||
:name: salt/pepper-index
|
||||
:width: 50%
|
||||
Mimics ink degradation
|
||||
```
|
||||
````
|
||||
`````{tab} Close/Dilate
|
||||
```{figure} static/close_dilate.png
|
||||
:name: close-dilate-index
|
||||
:width: 90%
|
||||
Degrades printing quality
|
||||
```
|
||||
````{margin}
|
||||
```{note}
|
||||
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
|
||||
```
|
||||
````
|
||||
`````
|
||||
`````{tab} Open/Erode
|
||||
```{figure} static/open_erode.png
|
||||
:name: open-erode-index
|
||||
:width: 90%
|
||||
Ink overflows
|
||||
```
|
||||
````{margin}
|
||||
```{note}
|
||||
For more details on this degradation, see [Morphological Operations](https://homepages.inf.ed.ac.uk/rbf/HIPR2/morops.htm)
|
||||
```
|
||||
````
|
||||
`````
|
||||
````{tab} Combined Effects
|
||||
```{figure} static/degrader.png
|
||||
:width: 40%
|
||||
:name: combined-effects-index
|
||||
Combining various degradation effects: blur, salt, open, and bleed-through
|
||||
```
|
||||
````
|
||||
|
||||
In addition to document generation and degradation, `genalog` also provides an efficient implementation of [text alignment](text-alignment-page) between the source and noisy text.
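
As a minimal sketch of that alignment API (see the [text alignment](text-alignment-page) page for the full walkthrough):

```python
from genalog.text import anchor

# Align a ground truth string with a noisy (OCR-like) string;
# "@" is the gap character inserted by the aligner.
aligned_gt, aligned_noise = anchor.align_w_anchor("New York is big", "New Yo rkis")
print(aligned_gt)     # New Yo@rk is big
print(aligned_noise)  # New Yo rk@is@@@@
```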
|
||||
|
|
@ -0,0 +1,28 @@
|
|||
# Installation
|
||||
|
||||
Genalog is supported on Windows, Mac, and Linux with Python 3.6+. However, there are *additional* installation steps for Windows and Mac users.
|
||||
|
||||
|
||||
````{tab} pip
|
||||
```sh
|
||||
pip install genalog
|
||||
```
|
||||
````
|
||||
````{tab} source
|
||||
```sh
|
||||
git clone https://github.com/microsoft/genalog.git && cd genalog && pip install -e .
|
||||
```
|
||||
````
|
||||
|
||||
## Extra Steps for Windows & Mac Users
|
||||
|
||||
We have a dependency on [`Weasyprint`](https://weasyprint.readthedocs.io/en/stable/install.html) for image generation, which in turn has non-Python dependencies, including `Pango`, `cairo`, and `GDK-PixBuf`, that need to be installed separately.
|
||||
|
||||
The `Pango`, `cairo`, and `GDK-PixBuf` libraries are available by default on `Ubuntu-18.04` and later.
|
||||
|
||||
If you are running on Windows, MacOS, or other Linux distributions, please see [installation instructions from WeasyPrint](https://weasyprint.readthedocs.io/en/stable/install.html).
|
||||
|
||||
```{note}
|
||||
If you encounter errors like `no library called "libcairo-2" was found`, this is probably because these three extra dependencies are missing.
|
||||
```
|
||||
|
|
@ -0,0 +1,211 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"(label-propagation-page)=\n",
|
||||
"# Propagation of NER labels\n",
|
||||
"\n",
|
||||
"In the context of Named Entity Recognition (NER), typical datasets contain the text tokens and the NER labels for each of the tokens. For example:\n",
|
||||
"\n",
|
||||
"````{margin}\n",
|
||||
"```{note}\n",
|
||||
"`B-P` is short for \"Beginning-Place\"\n",
|
||||
"and `I-P` is short for \"Inside-Place\"\n",
|
||||
"whereas `O` means \"Other\".\n",
|
||||
"See [IOB Tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for more details\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
" NER Labels: B-P I-P O O\n",
|
||||
" Text: New York is big\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"Now, imagine we have obtained a noisy version of the grouth truth text through the OCR process, for example. The problem becomes: how can we label the noisy tokens?\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" NER Labels: B-P I-P O O\n",
|
||||
" GT Text: New York is big\n",
|
||||
" Noisy Text: New Yo rkis big\n",
|
||||
" NER Labels: ? ? ? ?\n",
|
||||
"\n",
|
||||
"We can utilize text alignment and **propagate** the NER labels onto the noisy tokens. We will demonstrate how in the rest of this document.\n"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Tokenization\n",
|
||||
"\n",
|
||||
"To ensure consistent interpretation of the text alignment results, we need to first tokenize the grouth truth and the OCR'ed (nosiy) text."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from genalog.text import ner_label\n",
|
||||
"from genalog.text import preprocess\n",
|
||||
"\n",
|
||||
"gt_txt = \"New York is big\"\n",
|
||||
"ocr_txt = \"New Yo rkis big\"\n",
|
||||
"\n",
|
||||
"# Input to the method\n",
|
||||
"gt_labels = [\"B-P\", \"I-P\", \"O\", \"O\"]\n",
|
||||
"gt_tokens = preprocess.tokenize(gt_txt) # tokenize into list of tokens\n",
|
||||
"ocr_tokens = preprocess.tokenize(ocr_txt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['B-P', 'I-P', 'O', 'O']\n",
|
||||
"['New', 'York', 'is', 'big']\n",
|
||||
"['New', 'Yo', 'rkis', 'big']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Inputs to the method\n",
|
||||
"print(gt_labels)\n",
|
||||
"print(gt_tokens)\n",
|
||||
"print(ocr_tokens)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Label Propagation\n",
|
||||
"\n",
|
||||
"We then can run label propagation to obtain the NER labels for the OCR'ed (noisy) tokens."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Method returns a tuple of 4 elements (gt_tokens, gt_labels, ocr_tokens, ocr_labels, gap_char)\n",
|
||||
"ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"OCR labels: ['B-P', 'I-P', 'I-P', 'O']\n",
|
||||
"Aligned ground truth: New Yo@rk is big\n",
|
||||
"Alinged OCR text: New Yo rk@is big\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Outputs\n",
|
||||
"print(f\"OCR labels: {ocr_labels}\")\n",
|
||||
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
|
||||
"print(f\"Alinged OCR text: {aligned_ocr}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Display Result After Propagation"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"B-P I-P O O \n",
|
||||
"New York is big \n",
|
||||
"New Yo@rk is big\n",
|
||||
"||||||.||.||||||\n",
|
||||
"New Yo rk@is big\n",
|
||||
"New Yo rkis big \n",
|
||||
"B-P I-P I-P O \n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## Final Results\n",
|
||||
"\n",
|
||||
"Formatting the OCR tokens and their NER labels."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"B-P I-P I-P O \n",
|
||||
"New Yo rkis big \n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Format tokens and labels\n",
|
||||
"print(ner_label.format_labels(ocr_tokens, ocr_labels))"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
|
@ -0,0 +1,237 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"(text-alignment-page)=\n",
|
||||
"# Text alignment\n",
|
||||
"\n",
|
||||
"````{margin}\n",
|
||||
"```{note}\n",
|
||||
"There are many OCR engines you can use to work with `genalog`, including [Azure Cognitve Services](https://docs.microsoft.com/en-us/python/api/overview/azure/cognitiveservices-vision-computervision-readme?view=azure-python) and [Tesseract](https://github.com/tesseract-ocr/tesseract).\n",
|
||||
"```\n",
|
||||
"````\n",
|
||||
"\n",
|
||||
"`genalog` provides text alignment capabilities. This is most useful in the following situations after you have ran Opitcal Character Recognition (OCR) on the synthetic documents:\n",
|
||||
"\n",
|
||||
"- Text alignment between noisy (OCR result) and grouth truth text\n",
|
||||
"- NER label propagation using text alignment results (we will cover this in the next page)\n",
|
||||
"\n",
|
||||
"`genalog` provides two methods of alignment:\n",
|
||||
"1. `genalog.text.anchor.align_w_anchor()`\n",
|
||||
"1. `genalog.text.alignment.align()`\n",
|
||||
"\n",
|
||||
"`align_w_anchor()` implements the Recursive Text Alignment Scheme (RETAS) from the paper [A Fast Alignment Scheme for Automatic OCR Evaluation of Books](https://ieeexplore.ieee.org/abstract/document/6065412) and works best on longer text strings, while `align()` implement the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) and works best on shorter strings. \n",
|
||||
"\n",
|
||||
"We recommend using the `align_w_anchor()` method on inputs longer than **200 characters**. Both methods share the same function contract and are interchangeable. \n"
|
||||
]
|
||||
},
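The cell above recommends `align_w_anchor()` for inputs longer than roughly 200 characters and `align()` otherwise, and notes that the two share the same function contract. A minimal sketch of that recommendation follows; the helper name `align_text` and the hard-coded threshold are illustrative only, not part of the notebook:

```python
# Sketch: pick an aligner based on input length, per the recommendation above.
# Both functions share the same contract, so they can be swapped freely.
from genalog.text import alignment, anchor


def align_text(gt_txt: str, noise_txt: str):
    """Use the RETAS-based aligner for long inputs, plain Needleman-Wunsch otherwise."""
    if len(gt_txt) > 200:  # threshold suggested in the cell above
        return anchor.align_w_anchor(gt_txt, noise_txt)
    return alignment.align(gt_txt, noise_txt)


aligned_gt, aligned_noise = align_text("New York is big", "New Yo rkis")
```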
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"gt_txt = \"New York is big\"\n",
|
||||
"noise_txt = \"New Yo rkis\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"## RETAS Method\n",
|
||||
"\n",
|
||||
"This is our implementation of The Recursive Text Alignment Scheme (RETAS) from the paper [A Fast Alignment Scheme for Automatic OCR Evaluation of Books](https://ieeexplore.ieee.org/abstract/document/6065412), as the original paper did not release the algorithm written in Python.\n"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Aligned ground truth: New Yo@rk is big\n",
|
||||
"Aligned noise: New Yo rk@is@@@@\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from genalog.text import anchor\n",
|
||||
"\n",
|
||||
"# Extra whitespaces are removed\n",
|
||||
"aligned_gt, aligned_noise = anchor.align_w_anchor(gt_txt, noise_txt)\n",
|
||||
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
|
||||
"print(f\"Aligned noise: {aligned_noise}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"```{hint}\n",
|
||||
"`@` is the default gap character inserted by the alignment algorithm, you can change the gap character by providing the keyword-argument `anchor.align_w_anchor(gt_txt, noise_txt, gap_char=<NEW_CHAR>)`\n",
|
||||
"```"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
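Building on the hint above, a minimal sketch of swapping the default `@` gap character for another character that does not occur in your text (the choice of `#` here is purely illustrative); the expected outputs are shown only as examples:

```python
# Sketch: use a custom gap character via the gap_char keyword argument.
from genalog.text import anchor

aligned_gt, aligned_noise = anchor.align_w_anchor("New York is big", "New Yo rkis", gap_char="#")
print(aligned_gt)     # e.g. "New Yo#rk is big"
print(aligned_noise)  # e.g. "New Yo rk#is####"
```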
|
||||
{
|
||||
"source": [
|
||||
"## Needleman-Wunsch Algorithm\n",
|
||||
"\n",
|
||||
"We use [Biopython](https://biopython.org/)'s implementation of the Needleman-Wunsch algorithm for text alignment.\n",
|
||||
"This algorithm is an exhaustive search for all possible candidates with dynamic programming. \n",
|
||||
"It produces weighted score for each candidate and returns those having the highest score. \n",
|
||||
"(**NOTE** that multiple candidates can share the same score)"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 42,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Aligned ground truth: New Yo@rk is big\n",
|
||||
"Aligned noise: New Yo rk@is@@@@\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Needleman-Wunsch alignment ONLY\n",
|
||||
"from genalog.text import alignment\n",
|
||||
"\n",
|
||||
"aligned_gt, aligned_noise = alignment.align(gt_txt, noise_txt)\n",
|
||||
"print(f\"Aligned ground truth: {aligned_gt}\")\n",
|
||||
"print(f\"Aligned noise: {aligned_noise}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"### Advanced Algorithm Configurations\n",
|
||||
"\n",
|
||||
"The Needleman-Wunsch Algorithm algorithm has 4 hyperparameters for tuning candidate scores:\n",
|
||||
"1. **Match Reward** - how much the algorithm rewards matching characters\n",
|
||||
"1. **Mismatch Penalty** - how much the algorithm penalizes mismatching characters\n",
|
||||
"1. **Gap Penalty** - how much the algorithm penalizes for creating a gap with a GAP_CHAR (defaults to '@')\n",
|
||||
"1. **Gap Extension Penalty** - how much the algorithm penalizes for extending a gap (ex \"@@@@\")\n",
|
||||
"\n",
|
||||
"You can find the default values for these four parameters as a constant in the package:\n",
|
||||
"1. `genalog.text.alignment.MATCH_REWARD`\n",
|
||||
"1. `genalog.text.alignment.MISMATCH_PENALTY`\n",
|
||||
"1. `genalog.text.alignment.GAP_PENALTY`\n",
|
||||
"1. `genalog.text.alignment.GAP_EXT_PENALTY`"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
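A small sketch inspecting the four default constants listed above. It only prints their values; how (or whether) to override them when calling the aligner is not shown in this notebook, so consult the `genalog.text.alignment` API before changing them:

```python
# Sketch: inspect the default Needleman-Wunsch scoring constants named above.
from genalog.text import alignment

print(alignment.MATCH_REWARD)      # reward for matching characters
print(alignment.MISMATCH_PENALTY)  # penalty for mismatching characters
print(alignment.GAP_PENALTY)       # penalty for opening a gap (GAP_CHAR)
print(alignment.GAP_EXT_PENALTY)   # penalty for extending an existing gap
```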
|
||||
{
|
||||
"source": [
|
||||
"## Interpret the Alignment Results\n",
|
||||
"\n",
|
||||
"`genalog` provide additional functionality to interpret the alignment results and produce a relational mapping between the tokens in the noisy and grouth truth text."
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"gt_to_noise: [[0], [1, 2], [2], []]\n",
|
||||
"noise_to_gt: [[0], [1], [1, 2], []]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from genalog.text import alignment\n",
|
||||
"\n",
|
||||
"# Process the aligned strings to find out how the tokens are related\n",
|
||||
"gt_to_noise_mapping, noise_to_gt_mapping = alignment.parse_alignment(aligned_gt, aligned_noise, gap_char=\"@\")\n",
|
||||
"print(f\"gt_to_noise: {gt_to_noise_mapping}\")\n",
|
||||
"print(f\"noise_to_gt: {noise_to_gt_mapping}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"source": [
|
||||
"Recall that the ground truth is `New York is big` while the noisy text is `New Yo rkis`.\n",
|
||||
"\n",
|
||||
"`gt_to_noise: [[0], [1, 2], [2], []]` can be interpreted as: \"the **0th** gt token (`New`) maps to the **0th** noisy token (`New`), the **1st** gt token (`York`) maps to the **1st and 2nd** nosity tokens (`Yo` and `rkis`), the **2nd** token (`is`) maps to the **2nd** noisy token (`rkis`), and finally, the last gt token (`big`) cannot be mapped to any noisy token.\"\n",
|
||||
"\n",
|
||||
"And the vice versa for `noise_to_gt: [[0], [1], [1, 2], []]`\n"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
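To make the interpretation above concrete, here is a short, purely illustrative sketch that walks the `gt_to_noise` mapping printed earlier; the token lists are taken from the `gt_txt` and `noise_txt` strings defined at the top of this notebook, and the mapping is copied verbatim from the output above:

```python
# Sketch: look up which noisy tokens each ground-truth token ended up in.
gt_tokens = ["New", "York", "is", "big"]
noise_tokens = ["New", "Yo", "rkis"]
gt_to_noise_mapping = [[0], [1, 2], [2], []]  # output of alignment.parse_alignment() above

for i, gt_token in enumerate(gt_tokens):
    matches = [noise_tokens[j] for j in gt_to_noise_mapping[i]]
    print(f"{gt_token!r} -> {matches}")
# 'New' -> ['New'], 'York' -> ['Yo', 'rkis'], 'is' -> ['rkis'], 'big' -> []
```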
|
||||
{
|
||||
"source": [
|
||||
"## Formatting Alignment Results\n",
|
||||
"\n",
|
||||
"You can use `genalog.alignment._format_alignment()` for better visual understanding of the alignment results"
|
||||
],
|
||||
"cell_type": "markdown",
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"New Yo@rk is @ big@\n",
|
||||
"||||||.||.||||||||.\n",
|
||||
"New Yo rk@is @ big \n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Format aligned string for better display\n",
|
||||
"print(alignment._format_alignment(aligned_gt, aligned_noise))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
|
@ -1,22 +0,0 @@
|
|||
.. genalog documentation master file, created by
|
||||
sphinx-quickstart on Thu Jan 28 15:19:33 2021.
|
||||
You can adapt this file completely to your liking, but it should at least
|
||||
contain the root `toctree` directive.
|
||||
|
||||
Welcome to genalog's documentation!
|
||||
===================================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:caption: Contents:
|
||||
|
||||
genalog/genalog
|
||||
|
||||
|
||||
|
||||
Indices and tables
|
||||
==================
|
||||
|
||||
* :ref:`genindex`
|
||||
* :ref:`modindex`
|
||||
* :ref:`search`
|
|
@ -1,35 +0,0 @@
|
|||
@ECHO OFF
|
||||
|
||||
pushd %~dp0
|
||||
|
||||
REM Command file for Sphinx documentation
|
||||
|
||||
if "%SPHINXBUILD%" == "" (
|
||||
set SPHINXBUILD=sphinx-build
|
||||
)
|
||||
set SOURCEDIR=.
|
||||
set BUILDDIR=_build
|
||||
|
||||
if "%1" == "" goto help
|
||||
|
||||
%SPHINXBUILD% >NUL 2>NUL
|
||||
if errorlevel 9009 (
|
||||
echo.
|
||||
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
|
||||
echo.installed, then set the SPHINXBUILD environment variable to point
|
||||
echo.to the full path of the 'sphinx-build' executable. Alternatively you
|
||||
echo.may add the Sphinx directory to PATH.
|
||||
echo.
|
||||
echo.If you don't have Sphinx installed, grab it from
|
||||
echo.http://sphinx-doc.org/
|
||||
exit /b 1
|
||||
)
|
||||
|
||||
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
|
||||
goto end
|
||||
|
||||
:help
|
||||
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
|
||||
|
||||
:end
|
||||
popd
|
|
@ -1,2 +1,4 @@
|
|||
jupyter-book
|
||||
sphinx
|
||||
sphinx_rtd_theme
|
||||
sphinx_inline_tabs
|
||||
ghp-import
|
|
@ -32,7 +32,7 @@
|
|||
"# Analog Document Generation\n",
|
||||
"\n",
|
||||
"<p float=\"left\">\n",
|
||||
" <img src=\"static/analog_doc_gen_pipeline.png\" width=\"800\" />\n",
|
||||
" <img src=\"static\\analog_doc_gen_pipeline.png\" width=\"800\" />\n",
|
||||
"</p>\n",
|
||||
"\n",
|
||||
"Genalog provides a simple interface (`AnalogDocumentGeneration`) to programmatic generate documents with degradation from a body of text."
|
||||
|
|
|
@ -339,7 +339,7 @@ def parse_alignment(aligned_gt, aligned_noise, gap_char=GAP_CHAR):
|
|||
gap_char (char, optional) : gap char used in alignment algorithm. Defaults to GAP_CHAR.
|
||||
|
||||
Returns:
|
||||
tuple -- a tuple ``(gt_to_noise_mapping, noise_to_gt_mapping)`` of two 2D int arrays:
|
||||
tuple : ``(gt_to_noise_mapping, noise_to_gt_mapping)`` of two 2D int arrays:
|
||||
|
||||
where each array defines the mapping between aligned gt tokens
|
||||
to noise tokens and vice versa.
|
||||
|
|
|
@ -201,8 +201,7 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
|
|||
gt_tokens (list) : a list of ground truth string tokens
|
||||
ocr_tokens (list) : a list of OCR'ed text tokens
|
||||
gap_char (char, optional) : gap char used in alignment algorithm. Defaults to ``alignment.GAP_CHAR``.
|
||||
use_anchor (bool, optional) : use faster alignment method with anchors if set to True
|
||||
. Defaults to True.
|
||||
use_anchor (bool, optional) : use faster alignment method with anchors if set to True. Defaults to True.
|
||||
|
||||
Raises:
|
||||
GapCharError:
|
||||
|
@ -210,8 +209,7 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
|
|||
to set of all possible gap characters (GAP_CHAR_SET)
|
||||
|
||||
Returns:
|
||||
tuple : a tuple of 3 elements ``(ocr_labels, aligned_gt, aligned_ocr, gap_char)``
|
||||
where
|
||||
tuple : a tuple of 4 elements ``(ocr_labels, aligned_gt, aligned_ocr, gap_char)`` where
|
||||
1. ``ocr_labels`` is a list of NER label for the corresponding ocr tokens
|
||||
2. ``aligned_gt`` is the ground truth string aligned with the ocr text
|
||||
3. ``aligned_ocr`` is the ocr text aligned with ground truth
|
||||
|
@ -241,7 +239,7 @@ def propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True):
|
|||
def _propagate_label_to_ocr(
|
||||
gt_labels, gt_tokens, ocr_tokens, gap_char=alignment.GAP_CHAR, use_anchor=True
|
||||
):
|
||||
"""Propagate NER label for ground truth tokens to to ocr tokens. Low level implementation
|
||||
r"""Propagate NER label for ground truth tokens to to ocr tokens. Low level implementation
|
||||
|
||||
NOTE: that `gt_tokens` and `ocr_tokens` MUST NOT contain invalid tokens.
|
||||
Invalid tokens are:
|
||||
|
@ -261,7 +259,7 @@ def _propagate_label_to_ocr(
|
|||
gt label B-p I-p B-p I-p B-p I-p B-p I-p B-p I-p I-p
|
||||
| | | | | | | | | | |
|
||||
gt_token New York New York New York New York New York City
|
||||
/ \\ / \\ \\/ /\\ / | | |
|
||||
/ \ / \ \ / /\ / | | |
|
||||
ocr_token N ew Yo rk NewYork N ew@York New York City
|
||||
| | | | | | | | | |
|
||||
ocr label B-p I-p I-p I-p B-p B-p I-p B-p B-p I-p
|
||||
|
@ -274,7 +272,7 @@ def _propagate_label_to_ocr(
|
|||
gt label O V O O V W O O
|
||||
| | | | | | | |
|
||||
gt_token something is big this is huge is big
|
||||
/ \\ \\ \\/ /\\ /\\/ |
|
||||
/ \ \ \ / /\ /\ / |
|
||||
ocr_token so me thing isbig th isi shuge is
|
||||
| | | | | | | |
|
||||
ocr label o o o V O O V O
|
||||
|
@ -298,27 +296,22 @@ def _propagate_label_to_ocr(
|
|||
|
||||
|
||||
Returns:
|
||||
a tuple of 4 elements:
|
||||
(ocr_labels, aligned_gt, aligned_ocr, gap_char)
|
||||
a tuple of 4 elements: (ocr_labels, aligned_gt, aligned_ocr, gap_char)
|
||||
where
|
||||
`ocr_labels` is a list of NER label for the corresponding ocr tokens
|
||||
`aligned_gt` is the ground truth string aligned with the ocr text
|
||||
`aligned_ocr` is the ocr text aligned with ground truth
|
||||
`gap_char` is the char used by the alignment algorithm for inserting gaps
|
||||
|
||||
For example,
|
||||
given input:
|
||||
For example, given input:
|
||||
|
||||
gt_labels: ["B-place", "I-place", "o", "o"]
|
||||
gt_tokens: ["New", "York", "is", "big"]
|
||||
ocr_tokens: ["N", "ewYork", "big"]
|
||||
|
||||
output:
|
||||
(
|
||||
["B-place", "I-place", "o"],
|
||||
"N@ew York is big",
|
||||
"N ew@York@@@ big"
|
||||
>>> _propagate_label_to_ocr(
|
||||
["B-place", "I-place", "o", "o"],
|
||||
["New", "York", "is", "big"],
|
||||
["N", "ewYork", "big"]
|
||||
)
|
||||
(["B-place", "I-place", "o"], "N@ew York is big", "N ew@York@@@ big", '@')
|
||||
|
||||
"""
|
||||
# Pseudo-algorithm:
|
||||
|
||||
|
|
6
setup.py
|
@ -16,12 +16,12 @@ setuptools.setup(
|
|||
name="genalog",
|
||||
install_requires=requirements,
|
||||
version=BUILD_VERSION,
|
||||
author="Team Enki",
|
||||
author_email="ta_nerds@microsoft.com",
|
||||
author="Jianjie Liu & Amit Gupte",
|
||||
author_email="ta_maidap_fy20_h2@microsoft.com",
|
||||
description="Tools for generating analog document (images) from raw text",
|
||||
long_description=long_description,
|
||||
long_description_content_type="text/markdown",
|
||||
url='https://msazure.visualstudio.com/DefaultCollection/Cognitive%20Services/_git/Tools-Synthetic-Data-Generator',
|
||||
url='https://github.com/microsoft/genalog',
|
||||
packages=setuptools.find_packages(exclude=['tests', 'tests.*']),
|
||||
package_data={'': [
|
||||
'genalog/generation/templates/*.jinja'
|
||||
|
|