diff --git a/README.md b/README.md
index a2eb48f..2500711 100644
--- a/README.md
+++ b/README.md
@@ -4,9 +4,10 @@ DeepSpeech based forced alignment tool

## Installation

It is recommended to use this tool from within a virtual environment.
-There is a script for creating one with all requirements in the git-ignored dir `venv`:
+After cloning the project and changing into its root directory,
+you can create one with all requirements in the git-ignored dir `venv` using the provided script:

-```bash
+```shell script
$ bin/createenv.sh
$ ls venv
bin include lib lib64 pyvenv.cfg share
@@ -14,689 +15,52 @@ bin include lib lib64 pyvenv.cfg share

`bin/align.sh` will automatically use it.

-## Prerequisites
-
-### Language specific data
-
Internally DSAlign uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT engine.
For it to be able to function, it requires a couple of files that are specific to
the language of the speech data you want to align.
If you want to align English, there is already a helper script that will download and prepare
all required data:

-```bash
+```shell script
$ bin/getmodel.sh
[...]
$ ls models/en/
alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie
```

-### Dependencies for generating individual language models
+## Overview and documentation

-If you plan to let the tool generate individual language models per text (you should!),
-you have to get (essentially build) [KenLM](https://kheafield.com/code/kenlm/).
-Before doing this, you should install its [dependencies](https://kheafield.com/code/kenlm/dependencies/).
-For Debian based systems this can be done through:
-```bash
-$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
-```
+A typical application of the aligner is done in three phases:

-With all requirements fulfilled, there is a script for building and installing KenLM
-and the required DeepSpeech tools in the right location:
-```bash
-$ bin/lm-dependencies.sh
-```
+ 1. __Preparing__ the data. Although most of this has to be done individually,
+    there are some [tools for data preparation, statistics and maintenance](doc/tools.md).
+    All involved file formats are described [here](doc/files.md).
+ 2. __Aligning__ the data using [the alignment tool and its algorithm](doc/algo.md).
+ 3. __Exporting__ aligned data using [the data-set exporter](doc/export.md).

-If all went well, the alignment tool will find and use it to automatically create individual
-language models for each document.
+## Quickstart example

### Example data

-There is also a script for downloading and preparing some public domain speech and transcript data.
+There is a script for downloading and preparing some public domain speech and transcript data.
+It requires `ffmpeg` for some sample conversion.

-```bash
+```shell script
$ bin/gettestdata.sh
$ ls data
test1 test2
```

-## Using the tool
-
-```bash
-$ bin/align.sh --help
-[...]
-``` - ### Alignment using example data -```bash -$ bin/align.sh --output-max-cer 15 --loglevel 10 --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log +Now the aligner can be called either "manually" (specifying all involved files directly): + +```shell script +$ bin/align.sh --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log ``` -## The algorithm +Or "automatically" by specifying a so-called catalog file that bundles all involved paths: -### Step 1 - Splitting audio - -A voice activity detector (at the moment this is `webrtcvad`) is used -to split the provided audio data into voice fragments. -These fragments are essentially streams of continuous speech without any longer pauses -(e.g. sentences). - -`--audio-vad-aggressiveness ` can be used to influence the length of the -resulting fragments. - -### Step 2 - Preparation of original text - -STT transcripts are typically provided in a normalized textual form with -- no casing, -- no punctuation and -- normalized whitespace (single spaces only). - -So for being able to align STT transcripts with the original text it is necessary -to internally convert the original text into the same form. - -This happens in two steps: -1. Normalization of whitespace, lower-casing all text and -replacing some characters with spaces (e.g. dashes) -2. Removal of all characters that are not in the languages's alphabet -(see DeepSpeech model data) - -Be aware: *This conversion happens on text basis and will not remove unspoken content -like markup/markdown tags or artifacts. This should be done beforehand. -Reducing the difference of spoken and original text will improve alignment quality and speed.* - -In the very unlikely situation that you have to change the default behavior (of step 1), -there are some switches: - -`--text-keep-dashes` will prevent substitution of dashes with spaces. - -`--text-keep-ws` will keep whitespace untouched. - -`--text-keep-casing` will keep character casing as provided. - -### Step 4a (optional) - Generating document specific language model - -If the [dependencies][Dependencies for generating individual language models] for -individual language model generation got installed, this document-individual model will -now be generated by default. - -Assuming your text document is named `original.txt`, these files will be generated: -- `original.txt.clean` - cleaned version of the original text -- `original.txt.arpa` - text file with probabilities in ARPA format -- `original.txt.lm` - binary representation of the former one -- `original.txt.trie` - prefix-tree optimized for probability lookup - -`--stt-no-own-lm` deactivates creation of individual language models per document and -uses the one from model dir instead. - -### Step 4b - Transcription of voice fragments through STT - -After VAD splitting the resulting fragments are transcribed into textual phrases. -This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT. - -As this can take a longer time, all resulting phrases are - together with their -timestamps - saved as JSON into a transcription log file -(the `audio` parameter path with suffix `.tlog` instead of `.wav`). -Consecutive calls will look for that file and - if found - -load it and skip the transcription phase. - -`--stt-model-dir ` points DeepSpeech to the language specific model data directory. -It defaults to `models/en`. 
Use `bin/getmodel.sh` for preparing it. - -### Step 5 - Rough alignment - -The actual text alignment is based on a recursive divide and conquer approach: - -1. Construct an ordered list of of all phrases in the current interval -(at the beginning this is the list of all phrases that are to be aligned), -where long phrases close to the middle of the interval come first. -2. Iterate through the list and compute the best Smith-Waterman alignment -(see the following sub-sections) with the document's original text... -3. ...till there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth -dependent threshold (in most cases this should already be the first phrase). -4. Recursively continue with step 1 for the sub-intervals and original text ranges -to the left and right of the phrase and its aligned text range within the original text. -5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum -Smith-Waterman score on their recursion level. - -This approach assumes that all phrases were spoken in the same order as they appear in the -original transcript. It has the following advantages compared to individual -global phrase matching: - -- Long non-matching chunks of spoken text or the original transcript will automatically and -cleanly get ignored. -- Short phrases (with the risk of matching more than one time per document) will automatically -get aligned to their intended locations by longer ones who "squeeze" them in. -- Smith-Waterman score thresholds can be kept lower -(and thus better match lower quality STT transcripts), as there is a lower chance for - - long sequences to match at a wrong location and for - - shorter sequences to match at a wrong location within their shortened intervals - (as they are getting matched later and deeper in the recursion tree). - -#### Smith-Waterman candidate selection - -Finding the best match of a given phrase within the original (potentially long) transcript -using vanilla Smith-Waterman is not feasible. - -So this tool follows a two-phase approach where the first goal is to get a list of alignment -candidates. As the first step the original text is virtually partitioned into windows of the -same length as the search pattern. These windows are ordered descending by the number of 3-grams -they share with the pattern. -Best alignment candidates are now taken from the beginning of this ordered list. - -`--align-max-candidates ` sets the maximum number of candidate windows -taken from the beginning of the list for further alignment. - -`--align-candidate-threshold ` multiplied with the number of 3-grams of the predecessor -window it gives the minimum number of 3-grams the next candidate window has to have to also be -considered a candidate. - -#### Smith-Waterman alignment - -For each candidate, the best possible alignment is computed using the -[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm -within an extended interval of one window-size around the candidate window. - -`--align-match-score ` is the score per correctly matched character. Default: 100 - -`--align-mismatch-score ` is the score per non-matching (exchanged) character. Default: -100 - -`--align-gap-score ` is the score per character gap (removing 1 character from pattern or original). Default: -100 - -The overall best score for the best match is normalized to a value of about 100 maximum by dividing -it through the maximum character count of either the match or the pattern. 
- -### Step 6 - Gap alignment - -After recursive matching of fragments there are potential text leftovers between aligned original -texts. - -Some examples: -- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_) -on phrase endings to the left of the gap -- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual -alignments are now left unaligned in the gap -- Big unmatched chunks of text, like - - Preface, text summaries or any other kind of meta information - - Copyright headers/footers - - Table of contents -- Chapter headers (if not spoken as they appear) -- Captions of figures -- Contents of tables -- Line-headers like character names in drama scripts -- Dependent of the (pre-processing) quality: OCR leftovers like - - page headers - - page numbers - - reader's notes - -The basic challenge here is to figure out, if all or some of the gap text should be used to extend -the phrase to the left and/or to the right of the gap. - -As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result, -its score cannot be used for further fine-tuning. Therefore there is a collection of -so called test-distance algorithms to pick from using the `--align-similarity-algo` -parameter. - -Using the selected distance metric, the gap alignment is done by looking for the best scoring -extension of the left and right phrases up to their maximum extension. - -`--align-stretch-factor ` is the fraction of the text length that it could get -stretched at max. - -For many languages it is worth putting some emphasis on matching to words boundaries -(that is white-space separated sub-sequences). - -`--align-snap-factor ` allows to control the snappiness to word boundaries. - -If the best scoring extensions should overlap, the best scoring sum of non-overlapping -(but touching) extensions will win. - -### Step 7 - Selection, filtering and output - -Finally the best alignment of all candidate windows is selected as the winner. -It has to survive a series of filters for getting into the result file. - -For each text distance metric there are two filter parameters: - -`--output-min- ` only keeps utterances having the provided minimum value for the -metric with id `METRIC-ID` - -`--output-max- ` only keeps utterances having the provided maximum value for the -metric with id `METRIC-ID` - -For each text distance metric there's also the option to have it added to each utterance's entry: - -`--output-` adds the computed value for `` to the utterances array-entry - -Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100% -where numbers >1.0 are theoretically possible). - -### General options - -`--play` will play each aligned sample using the `play` command of the SoX audio toolkit - -`--text-context ` will add additional `CONTEXT-SIZE` characters around original -transcripts when logged - -## Export - -After files got successfully aligned, one would possibly want to export the aligned utterances -as machine learning training samples. - -This is where the export tool `bin/export.sh` comes in. - -### Step 1 - Reading the input - -The exporter takes either a single audio file (`--audio`) -plus a corresponding `.aligned` file (`--aligned`) or a series -of such pairs from a `.catalog` file (`--catalog`) as input. - -All of the following computations will be done on the joined list of all aligned -utterances of all input pairs. 
- -### Step 2 - (Pre-) Filtering - -The parameter `--filter ` allows to specify a Python expression that has access -to all data fields of an aligned utterance (as can be seen in `.aligned` file entries). - -This expression is now applied to each aligned utterance and in case it returns `True`, -the utterance will get excluded from all the following steps. -This is useful for excluding utterances that would not work as input for the planned -training or other kind of application. - -### Step 3 - Computing quality - -As with filtering, the parameter `--criteria ` allows for specifying a Python -expression that has access to all data fields of an aligned utterance. - -The expression is applied to each aligned utterance and its numerical return -value is assigned to each utterance as `quality`. - -### Step 4 - De-biasing - -This step is to (optionally) exclude utterances that would otherwise bias the data -(risk of overfitting). - -For each `--debias ` parameter the following procedure is applied: -1. Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob") -from each utternace and group all utterances accordingly -(e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...) -2. Compute the standard deviation (`sigma`) of the instance-counts of the groups -3. For each group: If the instance-count exceeds `sigma` times `--debias-sigma-factor `: - - Drop the number of exceeding utterances in order of their `quality` (lowest first) - -### Step 5 - Partitioning - -Training sets are often partitioned into several quality levels. - -For each `--partition ` parameter (ordered descending by `QUALITY`): -If the utterance's `quality` value is greater or equal `QUALITY`, assign it to `PARTITION`. - -Remaining utterances are assigned to partition `other`. - -### Step 6 - Splitting - -Training sets (actually their partitions) are typically split into sets `train`, `dev` -and `test` ([explanation](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)). - -This can get automated through parameter `--split` which will let the exporter split each -partition (or the entire set) accordingly. - -Parameter `--split-field` allows for specifying a meta data type that should be considered -atomic (e.g. "speaker" would result in all utterances of a speaker -instance - like "Alice" - to end up in one sub-set only). This atomic behavior will also hold -true across partitions. - -### Step 7 - Output - -For each partition/sub-set combination the following is done: - - Construction of a `name` (e.g. `good-dev` will represent the validation set of partition `good`). - - Writing all utterance audio fragments (as `.wav` files) into a sub-directory of `--target-dir ` - named `name` (using parameters `--channels ` and `--rate `). - - Writing an utterance list into `--target-dir ` named `name.(json|csv)` dependent on the - output format specified through `--format ` - -### Additional functionality - -Using `--dry-run` one can avoid any writing and get a preview on set-splits and so forth -(`--dry-run-fast` won't even load any sample). - -`--force` will force overwriting of samples and list files. - -`--workers ` allows for specifying the number of parallel workers. - -## File formats - -### Catalog files (.catalog) - -Catalog files (suffix `.catalog`) are used for organizing bigger data file collections and -defining relations among them. 
It is basically a JSON array of hash-tables where each entry stands -for a single audio file and its associated original transcript. - -So a typical catalog looks like this (`data/all.catalog` from this project): - -```javascript -[ - { - "audio": "test1/joined.mp3", - "tlog": "test1/joined.tlog", - "script": "test1/transcript.txt", - "aligned": "test1/joined.aligned" - }, - { - "audio": "test2/joined.mp3", - "tlog": "test2/joined.tlog", - "script": "test2/transcript.script", - "aligned": "test2/joined.aligned" - } -] +```shell script +$ bin/align.sh --catalog data/test1.catalog ``` - -- `audio` is a path to an audio file (of a format that `pydub` supports) -- `tlog` is the (supposed) path to the STT generated transcription log of the audio file -- `script` is the path to the original transcript of the audio file -(as `.txt` or `.script` file) -- `aligned` is the (supposed) path to a `.aligned` file - -Be aware: __All relative file paths are treated as relative to the catalog file's directory__. - -The tools `bin/align.sh`, `bin/statistics.sh` and `bin/export.sh` all support parameter -`--catalog`: - -The __alignment tool__ `bin/align.sh` requires either `tlog` to point to an existing -file or (if not) `audio` to point to an existing audio file for being able to transcribe -it and store it at the path indicated by `tlog`. Furthermore it requires `script` to -point to an existing script. It will write its alignment results to the path in `aligned`. - -The __export tool__ `bin/export.sh` requires `audio` and `aligned` to point to existing files. - -The __statistics tool__ `bin/statistics.sh` requires only `aligned` to point to existing files. - -Advantages of having a catalog file: - -- Simplified tool usage with only one parameter for defining all involved files (`--catalog`). -- A directory with many files has to be scanned just one time at catalog generation. -- Different file types can live at different and custom locations in the system. -This is important in case of read-only access rights to the original data. -It can also be used for avoiding to taint the original directory tree. -- Accumulated statistics -- Better progress indication (as the total number of files is available up front) -- Reduced tool startup overhead -- Allows for meta-data aware set-splitting on export - e.g. if some speakers are speaking -in several files. - -So especially in case of many files to process it is highly recommended to __first create -a catalog file__ with all paths present (even the ones not pointing to existing files yet). - - -### Script files (.script|.txt) - -The alignment tool requires an original script or (human transcript) of the provided audio. -These scripts can be represented in two basic forms: -- plain text files (`.txt`) or -- script files (`.script`) - -In case of plain text files the content is considered a continuous stream of text without -any assigned meta data. The only exception is option `--text-meaningful-newlines` which -tells the aligner to consider newlines as separators between utterances -in conjunction with option `--align-phrase-snap-factor`. - -If the original data source features utterance meta data, one should consider converting it -to the `.script` JSON file format which looks like this -(except of `data/test2/transcript.script`): - -```javascript -[ - // ... - { - "speaker": "Phebe", - "text": "Good shepherd, tell this youth what 'tis to love." - }, - { - "speaker": "Silvius", - "text": "It is to be all made of sighs and tears; And so am I for Phebe." 
- }, - // ... -] -``` - -_This and the following sub-sections are all using the same real world examples and excerpts_ - -It is basically again an array of hash-tables, where each hash-table represents an utterance with the -only mandatory field `text` for its textual representation. - -All other fields are considered meta data -(with the key called "meta data type" and the value "meta data instance"). - -### Transcription log files (.tlog) - -The alignment tool relies on timed STT transcripts of the provided audio. -These transcripts are either provided by some external processing -(even using a different STT system than DeepSpeech) or will get generated -as part of the alignment process. - -They are called transcription logs (`.tlog`) and are looking like this -(except of `data/test2/joined.tlog`): - -```javascript -[ - // ... - { - "start": 7491960, - "end": 7493040, - "transcript": "good shepherd" - }, - { - "start": 7493040, - "end": 7495110, - "transcript": "tell this youth what tis to love" - }, - { - "start": 7495380, - "end": 7498020, - "transcript": "it is to be made of soles and tears" - }, - { - "start": 7498470, - "end": 7500150, - "transcript": "and so a may for phoebe" - }, - // ... -] -``` - -The fields of each entry: -- `start`: time offset of the audio fragment in milliseconds from the beginning of the -aligned audio file (mandatory) -- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the -aligned audio file (mandatory) -- `transcript`: STT transcript of the utterance (mandatory) - -### Aligned files (.aligned) - -The result of aligning an audio file with an original transcript is written to an -`.aligned` JSON file consisting of an array of hash-tables of the following form: - -```javascript -[ - // ... - { - "start": 7491960, - "end": 7493040, - "transcript": "good shepherd", - "text-start": 98302, - "text-end": 98316, - "meta": { - "speaker": [ - "Phebe" - ] - }, - "aligned-raw": "Good shepherd,", - "aligned": "good shepherd", - "wng": 99.99999999999997, - "jaro_winkler": 100.0, - "levenshtein": 100.0, - "mra": 100.0, - "cer": 0.0 - }, - { - "start": 7493040, - "end": 7495110, - "transcript": "tell this youth what tis to love", - "text-start": 98317, - "text-end": 98351, - "meta": { - "speaker": [ - "Phebe" - ] - }, - "aligned-raw": "tell this youth what 'tis to love.", - "aligned": "tell this youth what 'tis to love", - "wng": 92.71730687405957, - "jaro_winkler": 100.0, - "levenshtein": 96.96969696969697, - "mra": 100.0, - "cer": 3.0303030303030303 - }, - { - "start": 7495380, - "end": 7498020, - "transcript": "it is to be made of soles and tears", - "text-start": 98352, - "text-end": 98392, - "meta": { - "speaker": [ - "Silvius" - ] - }, - "aligned-raw": "It is to be all made of sighs and tears;", - "aligned": "it is to be all made of sighs and tears", - "wng": 77.93921929148159, - "jaro_winkler": 100.0, - "levenshtein": 82.05128205128204, - "mra": 100.0, - "cer": 17.94871794871795 - }, - { - "start": 7498470, - "end": 7500150, - "transcript": "and so a may for phoebe", - "text-start": 98393, - "text-end": 98415, - "meta": { - "speaker": [ - "Silvius" - ] - }, - "aligned-raw": "And so am I for Phebe.", - "aligned": "and so am i for phebe", - "wng": 66.82687893873339, - "jaro_winkler": 98.47964113181504, - "levenshtein": 82.6086956521739, - "mra": 100.0, - "cer": 19.047619047619047 - }, - // ... 
-] -``` - -Each object array-entry represents an aligned audio fragment with the following attributes: -- `start`: time offset of the audio fragment in milliseconds from the beginning of the -aligned audio file -- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the -aligned audio file -- `transcript`: STT transcript used for aligning -- `text-start`: character offset of the fragment's associated original text within the -aligned text document -- `text-end`: character offset of the end of the fragment's associated original text within the -aligned text document -- `meta`: meta data hash-table with - - _key_: meta data type - - _value_: array of meta data instances coalesced from the `.script` entries that - this entry intersects with -- `aligned-raw`: __raw__ original text fragment that got aligned with the audio fragment -and its STT transcript -- `aligned`: __clean__ original text fragment that got aligned with the audio fragment -and its STT transcript -- `` For each `--output-` parameter the alignment tool adds an entry with the -computed value (in this case `wng`, `jaro_winkler`, `levenshtein`, `mra`, `cer`) - -## Text distance metrics - -This section lists all available text distance metrics along with their IDs for -command-line use. - -### Weighted N-grams (wng) - -The weighted N-gram score is computed as the sum of the number of weighted shared N-grams -between the two texts. -It ensures that: -- Shared N-gram instances near interval bounds (dependent on situation) get rated higher than -the ones near the center or opposite end -- Large shared N-gram instances are weighted higher than short ones - -`--align-min-ngram-size ` sets the start (minimum) N-gram size - -`--align-max-ngram-size ` sets the final (maximum) N-gram size - -`--align-ngram-size-factor ` sets a weight factor for the size preference - -`--align-ngram-position-factor ` sets a weight factor for the position preference - -### Jaro-Winkler (jaro_winkler) - -Jaro-Winkler is an edit distance metric described -[here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). - -### Editex (editex) - -Editex is a phonetic text distance algorithm described -[here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138&rep=rep1&type=pdf). - -### Levenshtein (levenshtein) - -Levenshtein is an edit distance metric described -[here](https://en.wikipedia.org/wiki/Levenshtein_distance). - -### MRA (mra) - -The "Match rating approach" is a phonetic text distance algorithm described -[here](https://en.wikipedia.org/wiki/Match_rating_approach). - -### Hamming (hamming) - -The Hamming distance is an edit distance metric described -[here](https://en.wikipedia.org/wiki/Hamming_distance). - -### Word error rate (wer) - -This is the same as Levenshtein - just on word level. - -Not available for gap alignment. - -### Character error rate (cer) - -This is the same as Levenshtein but using a different implementation. - -Not available for gap alignment. - -### Smith-Waterman score (sws) - -This is the final Smith-Waterman score coming from the rough alignment -step (but before gap alignment!). -It is described -[here](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -Not available for gap alignment. - -### Transcript length (tlen) - -The character length of the STT transcript. - -Not available for gap alignment. - -### Matched text length (mlen) - -The character length of the matched text of the original transcript (cleaned). - -Not available for gap alignment. 
diff --git a/align/catalog_tool.py b/align/catalog_tool.py index 9c7a466..acd5d28 100755 --- a/align/catalog_tool.py +++ b/align/catalog_tool.py @@ -20,9 +20,9 @@ def build_catalog(): for source_glob in CLI_ARGS.sources: catalog_paths.extend(glob(source_glob)) items = [] - for catalog_path in catalog_paths: - catalog_path = Path(catalog_path).absolute() - print('Loading catalog "{}"'.format(str(catalog_path))) + for catalog_original_path in catalog_paths: + catalog_path = Path(catalog_original_path).absolute() + print('Loading catalog "{}"'.format(str(catalog_original_path))) if not catalog_path.is_file(): fail('Unable to find catalog file "{}"'.format(str(catalog_path))) with open(catalog_path, 'r') as catalog_file: @@ -30,13 +30,13 @@ def build_catalog(): base_path = catalog_path.parent.absolute() for item in catalog_items: new_item = {} - for entry, entry_path in item.items(): - entry_path = Path(entry_path) - entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path) + for entry, entry_original_path in item.items(): + entry_path = Path(entry_original_path) + entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path).absolute() if ((len(CLI_ARGS.check) == 1 and CLI_ARGS.check[0] == 'all') or entry in CLI_ARGS.check) and not entry_path.is_file(): note = 'Catalog "{}" - Missing file for "{}" ("{}")'.format( - str(catalog_path), entry, str(entry_path)) + str(catalog_original_path), entry, str(entry_original_path)) if CLI_ARGS.on_miss == 'fail': fail(note + ' - aborting') if CLI_ARGS.on_miss == 'ignore': @@ -54,7 +54,7 @@ def build_catalog(): items.append(new_item) if CLI_ARGS.output is not None: catalog_path = Path(CLI_ARGS.output).absolute() - print('Writing catalog "{}"'.format(str(catalog_path))) + print('Writing catalog "{}"'.format(str(CLI_ARGS.output))) if CLI_ARGS.make_relative: base_path = catalog_path.parent for item in items: @@ -63,7 +63,7 @@ def build_catalog(): if CLI_ARGS.order_by is not None: items.sort(key=lambda i: i[CLI_ARGS.order_by] if CLI_ARGS.order_by in i else '') with open(catalog_path, 'w') as catalog_file: - json.dump(items, catalog_file) + json.dump(items, catalog_file, indent=2) def handle_args(): @@ -71,7 +71,7 @@ def handle_args(): 'converting paths within catalog files') parser.add_argument('--output', help='Write collected catalog items to this new catalog file') parser.add_argument('--make-relative', action='store_true', - help='Make all path entries of all items relative to target catalog file\'s parent directory') + help='Make all path entries of all items relative to new catalog file\'s parent directory') parser.add_argument('--check', help='Comma separated list of path entries to check for existence ' '("all" for checking every entry, default: no checks)') diff --git a/align/export.py b/align/export.py index 96f5392..9661245 100644 --- a/align/export.py +++ b/align/export.py @@ -338,7 +338,6 @@ def parse_args(): help='Take audio file as input (requires "--aligned ")') parser.add_argument('--aligned', type=str, help='Take alignment file ("<...>.aligned") as input (requires "--audio ")') - parser.add_argument('--catalog', type=str, help='Take alignment and audio file references of provided catalog ("<...>.catalog") as input') parser.add_argument('--ignore-missing', action="store_true", diff --git a/align/meta.py b/align/meta.py index 778a604..0b65f41 100644 --- a/align/meta.py +++ b/align/meta.py @@ -8,14 +8,14 @@ forbidden_keys = ['start', 'end', 'text', 'transcript'] def main(args): parser = 
argparse.ArgumentParser(description='Annotate .tlog or .script files by adding meta data')
     parser.add_argument('target', type=str, help='')
-    parser.add_argument('assignment', action='append', help='Meta data assignment of the form =')
+    parser.add_argument('assignments', nargs='+', help='Meta data assignments of the form <key>=<value>')
     args = parser.parse_args()

     with open(args.target, 'r') as json_file:
         entries = json.load(json_file)

-    for assign in args.assignment:
-        key, value = assign.split('=')
+    for assignment in args.assignments:
+        key, value = assignment.split('=')
         if key in forbidden_keys:
             print('Meta data key "{}" not allowed - forbidden: {}'.format(key, '|'.join(forbidden_keys)))
             sys.exit(1)
@@ -23,7 +23,7 @@ def main(args):
             entry[key] = value

     with open(args.target, 'w') as json_file:
-        json.dump(entries, json_file)
+        json.dump(entries, json_file, indent=2)


 if __name__ == '__main__':
diff --git a/align/stats.py b/align/stats.py
index f9a66b0..d5a6cce 100644
--- a/align/stats.py
+++ b/align/stats.py
@@ -126,6 +126,8 @@ def main(args):
                         help='Read alignment references of provided catalog ("<...>.catalog") as input')
     parser.add_argument('--no-progress', action='store_true',
                         help='Prevents showing progress bars')
+    parser.add_argument('--progress-interval', type=float, default=1.0,
+                        help='Progress indication interval in seconds')

     args = parser.parse_args()
diff --git a/doc/algo.md b/doc/algo.md
new file mode 100644
index 0000000..27ec307
--- /dev/null
+++ b/doc/algo.md
@@ -0,0 +1,197 @@
+## Alignment algorithm and its parameters
+
+### Step 1 - Splitting audio
+
+A voice activity detector (at the moment this is `webrtcvad`) is used
+to split the provided audio data into voice fragments.
+These fragments are essentially streams of continuous speech without any longer pauses
+(e.g. sentences).
+
+`--audio-vad-aggressiveness <INT>` can be used to influence the length of the
+resulting fragments.
+
+### Step 2 - Preparation of original text
+
+STT transcripts are typically provided in a normalized textual form with
+- no casing,
+- no punctuation and
+- normalized whitespace (single spaces only).
+
+So to be able to align STT transcripts with the original text, it is necessary
+to internally convert the original text into the same form.
+
+This happens in two steps:
+1. Normalization of whitespace, lower-casing all text and
+replacing some characters with spaces (e.g. dashes)
+2. Removal of all characters that are not in the language's alphabet
+(see DeepSpeech model data)
+
+Be aware: *This conversion happens on a pure text basis and will not remove unspoken content
+like markup/markdown tags or artifacts. This should be done beforehand.
+Reducing the difference between spoken and original text will improve alignment quality and speed.*
+
+In the very unlikely situation that you have to change the default behavior (of step 1),
+there are some switches:
+
+`--text-keep-dashes` will prevent substitution of dashes with spaces.
+
+`--text-keep-ws` will keep whitespace untouched.
+
+`--text-keep-casing` will keep character casing as provided.
+
+### Step 3 (optional) - Generating document specific language model
+
+If the [dependencies](lm.md) for
+individual language model generation are installed, such a document-specific model will
+now be generated by default.
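+
+For orientation, here is roughly what that generation amounts to when done by hand with the
+KenLM tools and DeepSpeech's `generate_trie` (a sketch only - the aligner runs the equivalent
+steps internally, and tool locations, argument order and flags may differ on your setup):
+
+```shell script
+# build an ARPA model from the cleaned text (file names as listed below)
+$ lmplz --order 5 --text original.txt.clean --arpa original.txt.arpa
+# convert the ARPA file into KenLM's binary format
+$ build_binary original.txt.arpa original.txt.lm
+# build the trie against the model's alphabet
+$ generate_trie models/en/alphabet.txt original.txt.lm original.txt.trie
+```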
+
+Assuming your text document is named `original.txt`, these files will be generated:
+- `original.txt.clean` - cleaned version of the original text
+- `original.txt.arpa` - text file with probabilities in ARPA format
+- `original.txt.lm` - binary representation of the former one
+- `original.txt.trie` - prefix-tree optimized for probability lookup
+
+`--stt-no-own-lm` deactivates creation of individual language models per document and
+uses the one from the model dir instead.
+
+### Step 4 - Transcription of voice fragments through STT
+
+After VAD splitting the resulting fragments are transcribed into textual phrases.
+This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT.
+
+As this can take a long time, all resulting phrases are - together with their
+timestamps - saved as JSON into a transcription log file
+(the `audio` parameter path with suffix `.tlog` instead of `.wav`).
+Consecutive calls will look for that file and - if found -
+load it and skip the transcription phase.
+
+`--stt-model-dir <DIR>` points DeepSpeech to the language specific model data directory.
+It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it.
+
+### Step 5 - Rough alignment
+
+The actual text alignment is based on a recursive divide and conquer approach:
+
+1. Construct an ordered list of all phrases in the current interval
+(at the beginning this is the list of all phrases that are to be aligned),
+where long phrases close to the middle of the interval come first.
+2. Iterate through the list and compute the best Smith-Waterman alignment
+(see the following sub-sections) with the document's original text...
+3. ...until there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth
+dependent threshold (in most cases this should already be the first phrase).
+4. Recursively continue with step 1 for the sub-intervals and original text ranges
+to the left and right of the phrase and its aligned text range within the original text.
+5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum
+Smith-Waterman score on their recursion level.
+
+This approach assumes that all phrases were spoken in the same order as they appear in the
+original transcript. It has the following advantages compared to individual
+global phrase matching:
+
+- Long non-matching chunks of spoken text or the original transcript will automatically and
+cleanly get ignored.
+- Short phrases (with the risk of matching more than once per document) will automatically
+get aligned to their intended locations by longer ones that "squeeze" them in.
+- Smith-Waterman score thresholds can be kept lower
+(and thus better match lower quality STT transcripts), as there is a lower chance for
+  - long sequences to match at a wrong location and for
+  - shorter sequences to match at a wrong location within their shortened intervals
+    (as they are getting matched later and deeper in the recursion tree).
+
+#### Smith-Waterman candidate selection
+
+Finding the best match of a given phrase within the original (potentially long) transcript
+using vanilla Smith-Waterman is not feasible.
+
+So this tool follows a two-phase approach where the first goal is to get a list of alignment
+candidates. As a first step the original text is virtually partitioned into windows of the
+same length as the search pattern. These windows are ordered descending by the number of 3-grams
+they share with the pattern.
+Best alignment candidates are now taken from the beginning of this ordered list.
+
+`--align-max-candidates <INT>` sets the maximum number of candidate windows
+taken from the beginning of the list for further alignment.
+
+`--align-candidate-threshold <FLOAT>`, multiplied by the number of 3-grams of the predecessor
+window, gives the minimum number of 3-grams the next candidate window has to have to also be
+considered a candidate.
+
+#### Smith-Waterman alignment
+
+For each candidate, the best possible alignment is computed using the
+[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm
+within an extended interval of one window-size around the candidate window.
+
+`--align-match-score <INT>` is the score per correctly matched character. Default: 100
+
+`--align-mismatch-score <INT>` is the score per non-matching (exchanged) character. Default: -100
+
+`--align-gap-score <INT>` is the score per character gap (removing 1 character from pattern or original). Default: -100
+
+The overall best score for the best match is normalized to a value of about 100 maximum by dividing
+it by the maximum character count of either the match or the pattern.
+
+### Step 6 - Gap alignment
+
+After recursive matching of fragments there are potential text leftovers between aligned original
+texts.
+
+Some examples:
+- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_)
+on phrase endings to the left of the gap
+- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual
+alignments are now left unaligned in the gap
+- Big unmatched chunks of text, like
+  - Preface, text summaries or any other kind of meta information
+  - Copyright headers/footers
+  - Table of contents
+- Chapter headers (if not spoken as they appear)
+- Captions of figures
+- Contents of tables
+- Line-headers like character names in drama scripts
+- Depending on the (pre-processing) quality: OCR leftovers like
+  - page headers
+  - page numbers
+  - reader's notes
+
+The basic challenge here is to figure out whether all or some of the gap text should be used to extend
+the phrase to the left and/or to the right of the gap.
+
+As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result,
+its score cannot be used for further fine-tuning. Therefore there is a collection of
+so-called [text-distance metrics](metrics.md) to pick from using the `--align-similarity-algo <METRIC-ID>`
+parameter.
+
+Using the selected distance metric, the gap alignment is done by looking for the best scoring
+extension of the left and right phrases up to their maximum extension.
+
+`--align-stretch-factor <FLOAT>` is the fraction of the text length by which it can
+be stretched at most.
+
+For many languages it is worth putting some emphasis on matching to word boundaries
+(that is, whitespace-separated sub-sequences).
+
+`--align-snap-factor <FLOAT>` allows controlling the snappiness to word boundaries.
+
+Should the best scoring extensions overlap, the best scoring sum of non-overlapping
+(but touching) extensions wins.
+
+### Step 7 - Selection, filtering and output
+
+Finally the best alignment of all candidate windows is selected as the winner.
+It has to survive a series of filters to get into the result file.
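+
+For illustration, a filtered call could look like this (the threshold values are made up;
+the `--output-...` parameters are explained below, and the available metric IDs - such as
+`cer` and `sws` - are listed in [metrics.md](metrics.md)):
+
+```shell script
+$ bin/align.sh --catalog data/all.catalog --output-max-cer 0.15 --output-min-sws 0.5 --output-cer --output-sws
+```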
+
+For each text distance metric there are two filter parameters:
+
+`--output-min-<METRIC-ID> <FLOAT>` only keeps utterances having the provided minimum value for the
+metric with id `METRIC-ID`
+
+`--output-max-<METRIC-ID> <FLOAT>` only keeps utterances having the provided maximum value for the
+metric with id `METRIC-ID`
+
+For each text distance metric there's also the option to have it added to each utterance's entry:
+
+`--output-<METRIC-ID>` adds the computed value for `<METRIC-ID>` to the utterance's array-entry
+
+Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%,
+where numbers >1.0 are theoretically possible).
diff --git a/doc/export.md b/doc/export.md
new file mode 100644
index 0000000..b1843a6
--- /dev/null
+++ b/doc/export.md
@@ -0,0 +1,129 @@
+## Export
+
+After files have been successfully aligned, one will likely want to export the aligned utterances
+as machine learning training samples.
+
+This is where the export tool `bin/export.sh` comes in.
+
+### Step 1 - Reading the input
+
+The exporter takes either a single audio file (`--audio