Mirror of https://github.com/mozilla/DSAlign.git

Commit 39a633a434 (parent f6a16d92a0): Updated documentation and minor tool fixes

Changed: README.md (680 lines)

DeepSpeech based forced alignment tool

## Installation

It is recommended to use this tool from within a virtual environment.
After cloning and changing to the root of the project,
there is a script for creating one with all requirements in the git-ignored dir `venv`:

```bash
$ bin/createenv.sh
$ ls venv
bin include lib lib64 pyvenv.cfg share
```

`bin/align.sh` will automatically use it.

## Prerequisites

### Language specific data

Internally DSAlign uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT engine.
For it to be able to function, it requires a couple of files that are specific to
the language of the speech data you want to align.
If you want to align English, there is already a helper script that will download and prepare
all required data:

```bash
$ bin/getmodel.sh
[...]
$ ls models/en/
alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie
```

### Dependencies for generating individual language models

If you plan to let the tool generate individual language models per text (you should!),
you have to get (essentially build) [KenLM](https://kheafield.com/code/kenlm/).
Before doing this, you should install its [dependencies](https://kheafield.com/code/kenlm/dependencies/).
For Debian based systems this can be done through:

```bash
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
```

With all requirements fulfilled, there is a script for building and installing KenLM
and the required DeepSpeech tools in the right location:

```bash
$ bin/lm-dependencies.sh
```

If all went well, the alignment tool will find and use it to automatically create individual
language models for each document.

## Overview and documentation

A typical application of the aligner is done in three phases:

1. __Preparing__ the data. Although most of this has to be done individually,
   there are some [tools for data preparation, statistics and maintenance](doc/tools.md).
   All involved file formats are described [here](doc/files.md).
2. __Aligning__ the data using [the alignment tool and its algorithm](doc/algo.md).
3. __Exporting__ aligned data using [the data-set exporter](doc/export.md).

## Quickstart example

### Example data

There is a script for downloading and preparing some public domain speech and transcript data.
It requires `ffmpeg` for some sample conversion.

```bash
$ bin/gettestdata.sh
$ ls data
test1 test2
```

## Using the tool

```bash
$ bin/align.sh --help
[...]
```

### Alignment using example data

Now the aligner can be called either "manually" (specifying all involved files directly):

```bash
$ bin/align.sh --output-max-cer 15 --loglevel 10 --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log
```

Or "automatically" by specifying a so-called catalog file that bundles all involved paths:

```bash
$ bin/align.sh --catalog data/test1.catalog
```

## The algorithm

### Step 1 - Splitting audio

A voice activity detector (at the moment this is `webrtcvad`) is used
to split the provided audio data into voice fragments.
These fragments are essentially streams of continuous speech without any longer pauses
(e.g. sentences).

`--audio-vad-aggressiveness <AGGRESSIVENESS>` can be used to influence the length of the
resulting fragments.
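The splitting idea can be sketched as follows: frames classified as speech or silence are merged into fragments, and only a sufficiently long pause ends a fragment. This is a minimal illustration with a stand-in `is_speech` callback, not DSAlign's actual `webrtcvad`-based code:

```python
def split_into_fragments(frames, is_speech, min_silence_frames=3):
    """Group consecutive speech frames into fragments.

    A fragment ends once `min_silence_frames` non-speech frames occur in a row.
    Returns (start_index, end_index_exclusive) pairs.
    """
    fragments = []
    start = None   # index where the current fragment began
    silence = 0    # consecutive non-speech frames seen so far
    for i, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                fragments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        fragments.append((start, len(frames) - silence))
    return fragments

# 1 stands for a speech frame, 0 for a silence frame
print(split_into_fragments([1, 1, 0, 1, 0, 0, 0, 1, 1, 1], lambda f: f == 1))
```

Note how the short pause inside the first fragment is kept, while the three-frame silence splits the stream, which mirrors the effect of tuning the VAD aggressiveness.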

### Step 2 - Preparation of original text

STT transcripts are typically provided in a normalized textual form with

- no casing,
- no punctuation and
- normalized whitespace (single spaces only).

So to be able to align STT transcripts with the original text, it is necessary
to internally convert the original text into the same form.

This happens in two steps:

1. Normalization of whitespace, lower-casing all text and
   replacing some characters with spaces (e.g. dashes)
2. Removal of all characters that are not in the language's alphabet
   (see DeepSpeech model data)

Be aware: *This conversion happens on a text basis and will not remove unspoken content
like markup/markdown tags or artifacts. This should be done beforehand.
Reducing the difference between spoken and original text will improve alignment quality and speed.*

In the very unlikely situation that you have to change the default behavior (of step 1),
there are some switches:

`--text-keep-dashes` will prevent substitution of dashes with spaces.

`--text-keep-ws` will keep whitespace untouched.

`--text-keep-casing` will keep character casing as provided.
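A minimal sketch of these two normalization steps (assuming a simple set-based alphabet; not DSAlign's exact implementation):

```python
ALPHABET = set("abcdefghijklmnopqrstuvwxyz '")

def normalize(text, alphabet, keep_dashes=False):
    text = text.lower()                      # step 1: lower-casing
    if not keep_dashes:
        text = text.replace("-", " ")        # dashes become spaces by default
    text = " ".join(text.split())            # normalize whitespace to single spaces
    # step 2: drop everything outside the model alphabet
    return "".join(c for c in text if c in alphabet)

print(normalize("Well-known  FACT:\nit's TRUE!", ALPHABET))
```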

### Step 4a (optional) - Generating document specific language model

If the [dependencies](#dependencies-for-generating-individual-language-models) for
individual language model generation got installed, this document-individual model will
now be generated by default.

Assuming your text document is named `original.txt`, these files will be generated:

- `original.txt.clean` - cleaned version of the original text
- `original.txt.arpa` - text file with probabilities in ARPA format
- `original.txt.lm` - binary representation of the former one
- `original.txt.trie` - prefix-tree optimized for probability lookup

`--stt-no-own-lm` deactivates creation of individual language models per document and
uses the one from the model dir instead.

### Step 4b - Transcription of voice fragments through STT

After VAD splitting, the resulting fragments are transcribed into textual phrases.
This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT.

As this can take a long time, all resulting phrases are - together with their
timestamps - saved as JSON into a transcription log file
(the `audio` parameter path with suffix `.tlog` instead of `.wav`).
Subsequent calls will look for that file and - if found -
load it and skip the transcription phase.

`--stt-model-dir <DIR>` points DeepSpeech to the language specific model data directory.
It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it.

### Step 5 - Rough alignment

The actual text alignment is based on a recursive divide and conquer approach:

1. Construct an ordered list of all phrases in the current interval
   (at the beginning this is the list of all phrases that are to be aligned),
   where long phrases close to the middle of the interval come first.
2. Iterate through the list and compute the best Smith-Waterman alignment
   (see the following sub-sections) with the document's original text...
3. ...till there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth
   dependent threshold (in most cases this should already be the first phrase).
4. Recursively continue with step 1 for the sub-intervals and original text ranges
   to the left and right of the phrase and its aligned text range within the original text.
5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum
   Smith-Waterman score on their recursion level.

This approach assumes that all phrases were spoken in the same order as they appear in the
original transcript. It has the following advantages compared to individual
global phrase matching:

- Long non-matching chunks of spoken text or the original transcript will automatically and
  cleanly get ignored.
- Short phrases (with the risk of matching more than one time per document) will automatically
  get aligned to their intended locations by longer ones that "squeeze" them in.
- Smith-Waterman score thresholds can be kept lower
  (and thus better match lower quality STT transcripts), as there is a lower chance for
  - long sequences to match at a wrong location and for
  - shorter sequences to match at a wrong location within their shortened intervals
    (as they are getting matched later and deeper in the recursion tree).

#### Smith-Waterman candidate selection

Finding the best match of a given phrase within the original (potentially long) transcript
using vanilla Smith-Waterman is not feasible.

So this tool follows a two-phase approach where the first goal is to get a list of alignment
candidates. As the first step the original text is virtually partitioned into windows of the
same length as the search pattern. These windows are ordered descending by the number of 3-grams
they share with the pattern.
The best alignment candidates are now taken from the beginning of this ordered list.

`--align-max-candidates <CANDIDATES>` sets the maximum number of candidate windows
taken from the beginning of the list for further alignment.

`--align-candidate-threshold <THRESHOLD>`, multiplied with the number of 3-grams of the predecessor
window, gives the minimum number of 3-grams the next candidate window has to share with the pattern
to also be considered a candidate.
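The window ranking can be sketched like this, assuming character-stepped windows of pattern length (the real tool's window placement may differ):

```python
def ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_windows(original, pattern, max_candidates=2):
    """Rank pattern-length windows of `original` by shared 3-grams."""
    size = len(pattern)
    pattern_grams = ngrams(pattern)
    scored = []
    for start in range(len(original) - size + 1):
        shared = len(pattern_grams & ngrams(original[start:start + size]))
        scored.append((shared, start))
    scored.sort(key=lambda item: (-item[0], item[1]))   # most shared 3-grams first
    return scored[:max_candidates]

print(candidate_windows("the quick brown fox jumps over the lazy dog", "quick brown"))
```

The window covering the true location shares all of the pattern's 3-grams and therefore ranks first; neighboring windows score slightly lower.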

#### Smith-Waterman alignment

For each candidate, the best possible alignment is computed using the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm
within an extended interval of one window-size around the candidate window.

`--align-match-score <SCORE>` is the score per correctly matched character. Default: 100

`--align-mismatch-score <SCORE>` is the score per non-matching (exchanged) character. Default: -100

`--align-gap-score <SCORE>` is the score per character gap (removing 1 character from pattern or original). Default: -100

The overall best score for the best match is normalized to a maximum of about 100 by dividing
it by the maximum character count of either the match or the pattern.
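A compact character-level Smith-Waterman sketch using the default scores above; it returns only the normalized best score, not the matched range as the real tool does:

```python
def sw_score(a, b, match=100, mismatch=-100, gap=-100):
    """Best local (Smith-Waterman) score of a vs b, normalized by length."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0,
                             rows[i - 1][j - 1] + step,   # match/mismatch
                             rows[i - 1][j] + gap,        # gap in b
                             rows[i][j - 1] + gap)        # gap in a
            best = max(best, rows[i][j])
    # normalize to a maximum of about 100 by the longer sequence's length
    return best / max(len(a), len(b))

print(sw_score("good shepherd", "good shepperd"))
```

A perfect match of equal-length strings scores exactly 100; each mismatched or gapped character pulls the normalized score down.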

### Step 6 - Gap alignment

After recursive matching of fragments there are potential text leftovers between aligned original
texts.

Some examples:

- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_)
  on phrase endings to the left of the gap
- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual
  alignments are now left unaligned in the gap
- Big unmatched chunks of text, like
  - Preface, text summaries or any other kind of meta information
  - Copyright headers/footers
  - Table of contents
  - Chapter headers (if not spoken as they appear)
  - Captions of figures
  - Contents of tables
  - Line-headers like character names in drama scripts
- Dependent on the (pre-processing) quality: OCR leftovers like
  - page headers
  - page numbers
  - reader's notes

The basic challenge here is to figure out if all or some of the gap text should be used to extend
the phrase to the left and/or to the right of the gap.

As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result,
its score cannot be used for further fine-tuning. Therefore there is a collection of
so-called text-distance algorithms to pick from using the `--align-similarity-algo`
parameter.

Using the selected distance metric, the gap alignment is done by looking for the best scoring
extension of the left and right phrases up to their maximum extension.

`--align-stretch-factor <FRACTION>` is the fraction of the text length that a phrase could get
stretched at max.

For many languages it is worth putting some emphasis on matching word boundaries
(that is, white-space separated sub-sequences).

`--align-snap-factor <FACTOR>` allows controlling the snappiness to word boundaries.

Should the best scoring extensions overlap, the best scoring sum of non-overlapping
(but touching) extensions will win.
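The extension search can be sketched as trying every split of the gap text, scoring both sides with a text-distance function, and keeping the best-scoring split; here `difflib` stands in for the selectable text-distance algorithms, so this is an illustration of the idea, not the tool's implementation:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def align_gap(left_stt, right_stt, left_text, gap, right_text):
    """Extend the left/right phrases into the gap at the best-scoring split."""
    best = max(
        range(len(gap) + 1),
        key=lambda k: ratio(left_stt, left_text + gap[:k])
                      + ratio(right_stt, gap[k:] + right_text),
    )
    return left_text + gap[:best], gap[best:] + right_text

# the STT transcript heard "walked", but only "she walk" got aligned;
# gap alignment pulls the missing "ed home" into the left phrase
left, right = align_gap("she walked home", "and then slept",
                        left_text="she walk", gap="ed home", right_text=" and then slept")
print(left)
```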

### Step 7 - Selection, filtering and output

Finally the best alignment of all candidate windows is selected as the winner.
It has to survive a series of filters for getting into the result file.

For each text distance metric there are two filter parameters:

`--output-min-<METRIC-ID> <VALUE>` only keeps utterances having the provided minimum value for the
metric with id `METRIC-ID`

`--output-max-<METRIC-ID> <VALUE>` only keeps utterances having the provided maximum value for the
metric with id `METRIC-ID`

For each text distance metric there's also the option to have it added to each utterance's entry:

`--output-<METRIC-ID>` adds the computed value for `<METRIC-ID>` to the utterance's array-entry

Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%,
where numbers >1.0 are theoretically possible).

### General options

`--play` will play each aligned sample using the `play` command of the SoX audio toolkit.

`--text-context <CONTEXT-SIZE>` will add additional `CONTEXT-SIZE` characters around original
transcripts when logged.

## Export

After files got successfully aligned, one would possibly want to export the aligned utterances
as machine learning training samples.

This is where the export tool `bin/export.sh` comes in.

### Step 1 - Reading the input

The exporter takes either a single audio file (`--audio`)
plus a corresponding `.aligned` file (`--aligned`) or a series
of such pairs from a `.catalog` file (`--catalog`) as input.

All of the following computations will be done on the joined list of all aligned
utterances of all input pairs.

### Step 2 - (Pre-) Filtering

The parameter `--filter <EXPR>` allows specifying a Python expression that has access
to all data fields of an aligned utterance (as can be seen in `.aligned` file entries).

This expression is applied to each aligned utterance and in case it returns `True`,
the utterance will get excluded from all the following steps.
This is useful for excluding utterances that would not work as input for the planned
training or other kind of application.
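The mechanism can be sketched like this: the expression is evaluated with the utterance's fields in scope, and a `True` result drops the utterance. This is an illustration with hypothetical sample data, not the exporter's actual code:

```python
utterances = [
    {"transcript": "good shepherd", "cer": 0.0},
    {"transcript": "and so a may for phoebe", "cer": 19.05},
]

expr = "cer > 10"   # the kind of expression one would pass to --filter
# an utterance is dropped when the expression evaluates to True for it
kept = [u for u in utterances if not eval(expr, {}, u)]
print([u["transcript"] for u in kept])
```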

### Step 3 - Computing quality

As with filtering, the parameter `--criteria <EXPR>` allows for specifying a Python
expression that has access to all data fields of an aligned utterance.

The expression is applied to each aligned utterance and its numerical return
value is assigned to each utterance as `quality`.

### Step 4 - De-biasing

This step is to (optionally) exclude utterances that would otherwise bias the data
(risk of overfitting).

For each `--debias <META DATA TYPE>` parameter the following procedure is applied:

1. Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob")
   from each utterance and group all utterances accordingly
   (e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...)
2. Compute the standard deviation (`sigma`) of the instance-counts of the groups
3. For each group: If the instance-count exceeds `sigma` times `--debias-sigma-factor <FACTOR>`:
   - Drop the number of exceeding utterances in order of their `quality` (lowest first)
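The procedure above can be sketched as follows, under the assumptions that `sigma` is the population standard deviation of the group sizes and that oversized groups are cut down to the limit, keeping the highest `quality` utterances:

```python
from statistics import pstdev

def debias(utterances, meta_type, sigma_factor):
    # group utterances by their meta data instance for the given type
    groups = {}
    for u in utterances:
        groups.setdefault(u["meta"][meta_type], []).append(u)
    sigma = pstdev([len(g) for g in groups.values()])
    limit = sigma * sigma_factor
    kept = []
    for g in groups.values():
        if len(g) > limit:
            # drop the exceeding utterances, lowest quality first
            g = sorted(g, key=lambda u: u["quality"], reverse=True)[:int(limit)]
        kept.extend(g)
    return kept

utterances = (
    [{"meta": {"speaker": "Alice"}, "quality": q} for q in (5, 6)]
    + [{"meta": {"speaker": "Bob"}, "quality": q} for q in range(1, 9)]
)
result = debias(utterances, "speaker", sigma_factor=1.0)
print(len(result))
```

With group sizes 2 and 8, `sigma` is 3, so Bob's group is trimmed to its 3 best-quality utterances while Alice's small group survives untouched.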

### Step 5 - Partitioning

Training sets are often partitioned into several quality levels.

For each `--partition <QUALITY:PARTITION>` parameter (ordered descending by `QUALITY`):
If the utterance's `quality` value is greater or equal `QUALITY`, assign it to `PARTITION`.

Remaining utterances are assigned to partition `other`.
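The assignment rule can be sketched as follows (an illustration of the rule, not the exporter's code): thresholds are checked in descending order, the first match wins, and everything else goes to `other`:

```python
def assign_partition(quality, partitions):
    """partitions: (threshold, name) pairs; first matching threshold wins."""
    for threshold, name in sorted(partitions, reverse=True):
        if quality >= threshold:
            return name
    return "other"

partitions = [(90, "good"), (75, "fair")]
print([assign_partition(q, partitions) for q in (95, 80, 50)])
```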

### Step 6 - Splitting

Training sets (actually their partitions) are typically split into sets `train`, `dev`
and `test` ([explanation](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)).

This can get automated through the parameter `--split`, which will let the exporter split each
partition (or the entire set) accordingly.

The parameter `--split-field` allows for specifying a meta data type that should be considered
atomic (e.g. "speaker" would result in all utterances of a speaker
instance - like "Alice" - ending up in one sub-set only). This atomic behavior will also hold
true across partitions.

### Step 7 - Output

For each partition/sub-set combination the following is done:

- Construction of a `name` (e.g. `good-dev` will represent the validation set of partition `good`).
- Writing all utterance audio fragments (as `.wav` files) into a sub-directory of `--target-dir <DIR>`
  named `name` (using parameters `--channels <N>` and `--rate <RATE>`).
- Writing an utterance list into `--target-dir <DIR>` named `name.(json|csv)`, dependent on the
  output format specified through `--format <FORMAT>`.

### Additional functionality

Using `--dry-run` one can avoid any writing and get a preview of set-splits and so forth
(`--dry-run-fast` won't even load any sample).

`--force` will force overwriting of samples and list files.

`--workers <N>` allows for specifying the number of parallel workers.

## File formats

### Catalog files (.catalog)

Catalog files (suffix `.catalog`) are used for organizing bigger data file collections and
defining relations among them. A catalog is basically a JSON array of hash-tables where each entry stands
for a single audio file and its associated original transcript.

So a typical catalog looks like this (`data/all.catalog` from this project):

```javascript
[
  {
    "audio": "test1/joined.mp3",
    "tlog": "test1/joined.tlog",
    "script": "test1/transcript.txt",
    "aligned": "test1/joined.aligned"
  },
  {
    "audio": "test2/joined.mp3",
    "tlog": "test2/joined.tlog",
    "script": "test2/transcript.script",
    "aligned": "test2/joined.aligned"
  }
]
```

- `audio` is a path to an audio file (of a format that `pydub` supports)
- `tlog` is the (supposed) path to the STT generated transcription log of the audio file
- `script` is the path to the original transcript of the audio file
  (as `.txt` or `.script` file)
- `aligned` is the (supposed) path to a `.aligned` file

Be aware: __All relative file paths are treated as relative to the catalog file's directory__.

The tools `bin/align.sh`, `bin/statistics.sh` and `bin/export.sh` all support the parameter
`--catalog`:

The __alignment tool__ `bin/align.sh` requires either `tlog` to point to an existing
file or (if not) `audio` to point to an existing audio file for being able to transcribe
it and store it at the path indicated by `tlog`. Furthermore it requires `script` to
point to an existing script. It will write its alignment results to the path in `aligned`.

The __export tool__ `bin/export.sh` requires `audio` and `aligned` to point to existing files.

The __statistics tool__ `bin/statistics.sh` requires only `aligned` to point to existing files.

Advantages of having a catalog file:

- Simplified tool usage with only one parameter for defining all involved files (`--catalog`).
- A directory with many files has to be scanned just one time at catalog generation.
- Different file types can live at different and custom locations in the system.
  This is important in case of read-only access rights to the original data.
  It can also be used to avoid tainting the original directory tree.
- Accumulated statistics
- Better progress indication (as the total number of files is available up front)
- Reduced tool startup overhead
- Allows for meta-data aware set-splitting on export - e.g. if some speakers are speaking
  in several files.

So especially in case of many files to process it is highly recommended to __first create
a catalog file__ with all paths present (even the ones not pointing to existing files yet).
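Since all relative paths are resolved against the catalog file's directory, loading a catalog can be sketched like this (an illustration, not the tools' actual loader; the demo writes a tiny one-entry catalog into a temp dir):

```python
import json
import tempfile
from pathlib import Path

def load_catalog(catalog_path):
    """Load a .catalog file, resolving entry paths against its directory."""
    catalog_path = Path(catalog_path)
    entries = json.loads(catalog_path.read_text())
    for entry in entries:
        for key in ("audio", "tlog", "script", "aligned"):
            if key in entry:
                entry[key] = str(catalog_path.parent / entry[key])
    return entries

# demo: write a tiny catalog into a temp dir and load it back
tmp = Path(tempfile.mkdtemp())
(tmp / "all.catalog").write_text(json.dumps([{"audio": "test1/joined.mp3"}]))
entries = load_catalog(tmp / "all.catalog")
print(entries[0]["audio"])
```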

### Script files (.script|.txt)

The alignment tool requires an original script (or human transcript) of the provided audio.
These scripts can be represented in two basic forms:

- plain text files (`.txt`) or
- script files (`.script`)

In case of plain text files the content is considered a continuous stream of text without
any assigned meta data. The only exception is the option `--text-meaningful-newlines`, which
tells the aligner to consider newlines as separators between utterances
in conjunction with option `--align-phrase-snap-factor`.

If the original data source features utterance meta data, one should consider converting it
to the `.script` JSON file format, which looks like this
(excerpt of `data/test2/transcript.script`):

```javascript
[
  // ...
  {
    "speaker": "Phebe",
    "text": "Good shepherd, tell this youth what 'tis to love."
  },
  {
    "speaker": "Silvius",
    "text": "It is to be all made of sighs and tears; And so am I for Phebe."
  },
  // ...
]
```

_This and the following sub-sections are all using the same real world examples and excerpts._

It is basically again an array of hash-tables, where each hash-table represents an utterance with the
only mandatory field `text` for its textual representation.

All other fields are considered meta data
(with the key called "meta data type" and the value "meta data instance").

### Transcription log files (.tlog)

The alignment tool relies on timed STT transcripts of the provided audio.
These transcripts are either provided by some external processing
(even using a different STT system than DeepSpeech) or will get generated
as part of the alignment process.

They are called transcription logs (`.tlog`) and look like this
(excerpt of `data/test2/joined.tlog`):

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd"
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love"
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears"
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe"
  },
  // ...
]
```

The fields of each entry:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `transcript`: STT transcript of the utterance (mandatory)
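Since `start` and `end` are millisecond offsets, a fragment's duration is simply `end - start`. A small sketch using entries shaped like the example above:

```python
entries = [
    {"start": 7491960, "end": 7493040, "transcript": "good shepherd"},
    {"start": 7493040, "end": 7495110, "transcript": "tell this youth what tis to love"},
]

# sum up the transcribed speech time across entries
total_ms = sum(e["end"] - e["start"] for e in entries)
print(total_ms / 1000.0)   # total transcribed time in seconds
```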
### Aligned files (.aligned)
|
|
||||||
|
|
||||||
The result of aligning an audio file with an original transcript is written to an
|
|
||||||
`.aligned` JSON file consisting of an array of hash-tables of the following form:
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
[
|
|
||||||
// ...
|
|
||||||
{
|
|
||||||
"start": 7491960,
|
|
||||||
"end": 7493040,
|
|
||||||
"transcript": "good shepherd",
|
|
||||||
"text-start": 98302,
|
|
||||||
"text-end": 98316,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Phebe"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "Good shepherd,",
|
|
||||||
"aligned": "good shepherd",
|
|
||||||
"wng": 99.99999999999997,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 100.0,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 0.0
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7493040,
|
|
||||||
"end": 7495110,
|
|
||||||
"transcript": "tell this youth what tis to love",
|
|
||||||
"text-start": 98317,
|
|
||||||
"text-end": 98351,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Phebe"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "tell this youth what 'tis to love.",
|
|
||||||
"aligned": "tell this youth what 'tis to love",
|
|
||||||
"wng": 92.71730687405957,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 96.96969696969697,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 3.0303030303030303
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7495380,
|
|
||||||
"end": 7498020,
|
|
||||||
"transcript": "it is to be made of soles and tears",
|
|
||||||
"text-start": 98352,
|
|
||||||
"text-end": 98392,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Silvius"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "It is to be all made of sighs and tears;",
|
|
||||||
"aligned": "it is to be all made of sighs and tears",
|
|
||||||
"wng": 77.93921929148159,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 82.05128205128204,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 17.94871794871795
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7498470,
|
|
||||||
"end": 7500150,
|
|
||||||
"transcript": "and so a may for phoebe",
|
|
||||||
"text-start": 98393,
|
|
||||||
"text-end": 98415,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Silvius"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "And so am I for Phebe.",
|
|
||||||
"aligned": "and so am i for phebe",
|
|
||||||
"wng": 66.82687893873339,
|
|
||||||
"jaro_winkler": 98.47964113181504,
|
|
||||||
"levenshtein": 82.6086956521739,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 19.047619047619047
|
|
||||||
},
|
|
||||||
// ...
|
|
||||||
]
|
|
||||||
```
|
|
||||||
|
|
||||||
Each array entry represents an aligned audio fragment with the following attributes:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the aligned audio file
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the aligned audio file
- `transcript`: STT transcript used for aligning
- `text-start`: character offset of the fragment's associated original text within the aligned text document
- `text-end`: character offset of the end of the fragment's associated original text within the aligned text document
- `meta`: meta data hash-table with
  - _key_: meta data type
  - _value_: array of meta data instances coalesced from the `.script` entries that this entry intersects with
- `aligned-raw`: __raw__ original text fragment that got aligned with the audio fragment and its STT transcript
- `aligned`: __clean__ original text fragment that got aligned with the audio fragment and its STT transcript
- `<metric>`: for each `--output-<metric>` parameter the alignment tool adds an entry with the computed value (in this case `wng`, `jaro_winkler`, `levenshtein`, `mra`, `cer`)
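Such entries can be consumed directly with Python's `json` module. A minimal sketch (the inlined data mirrors one of the entries above; the 20% `cer` cut-off is just an example value):

```python
import json

# Inline stand-in for a loaded `.aligned` file (fields taken from the excerpt above)
aligned_json = """
[
  {"start": 7498470, "end": 7500150,
   "transcript": "and so a may for phoebe",
   "aligned": "and so am i for phebe",
   "cer": 19.047619047619047}
]
"""

fragments = json.loads(aligned_json)
for fragment in fragments:
    duration_ms = fragment["end"] - fragment["start"]  # both offsets are milliseconds
    keep = fragment["cer"] < 20.0                      # e.g. drop high-error fragments
    print(duration_ms, keep)  # → 1680 True
```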
## Text distance metrics

This section lists all available text distance metrics along with their IDs for
command-line use.

### Weighted N-grams (wng)

The weighted N-gram score is computed as the sum of the weights of all N-grams
the two texts share.
It ensures that:

- Shared N-gram instances near interval bounds (dependent on the situation) get rated higher than
  ones near the center or the opposite end
- Long shared N-gram instances are weighted higher than short ones

`--align-min-ngram-size <SIZE>` sets the start (minimum) N-gram size

`--align-max-ngram-size <SIZE>` sets the final (maximum) N-gram size

`--align-ngram-size-factor <FACTOR>` sets a weight factor for the size preference

`--align-ngram-position-factor <FACTOR>` sets a weight factor for the position preference
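The exact weighting is internal to the tool; the following is only an illustrative sketch of the two stated preferences (size and position), with made-up factor semantics:

```python
def ngrams(text, n):
    """All character N-grams of the given size, with their start positions."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

def weighted_ngram_score(a, b, min_n=1, max_n=3,
                         size_factor=1.0, position_factor=1.0):
    """Toy weighted N-gram score: sum over all N-gram sizes of the shared
    instances, where longer N-grams and instances near the start of `a`
    contribute more. Not the tool's exact formula."""
    score = 0.0
    for n in range(min_n, max_n + 1):
        b_grams = {gram for gram, _ in ngrams(b, n)}
        for gram, pos in ngrams(a, n):
            if gram in b_grams:
                size_weight = n * size_factor
                position_weight = 1.0 - position_factor * pos / max(len(a), 1)
                score += size_weight * max(position_weight, 0.0)
    return score
```

Identical texts score highest, and a single exchanged character already removes all N-grams crossing it.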
### Jaro-Winkler (jaro_winkler)

Jaro-Winkler is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).

### Editex (editex)

Editex is a phonetic text distance algorithm described
[here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138&rep=rep1&type=pdf).

### Levenshtein (levenshtein)

Levenshtein is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Levenshtein_distance).
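For reference, a minimal dynamic-programming implementation; the percentage scaling mirrors the `levenshtein` values seen in the `.aligned` excerpt above, but is an assumption about the tool's exact normalization:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, two rows of memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Distance scaled to a 0-100 similarity (assumed normalization)."""
    longest = max(len(a), len(b))
    return 100.0 * (1.0 - levenshtein(a, b) / longest) if longest else 100.0
```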
### MRA (mra)

The "Match rating approach" is a phonetic text distance algorithm described
[here](https://en.wikipedia.org/wiki/Match_rating_approach).

### Hamming (hamming)

The Hamming distance is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Hamming_distance).

### Word error rate (wer)

This is the same as Levenshtein - just on word level.

Not available for gap alignment.

### Character error rate (cer)

This is the same as Levenshtein but using a different implementation.

Not available for gap alignment.

### Smith-Waterman score (sws)

This is the final Smith-Waterman score coming from the rough alignment
step (but before gap alignment!).
It is described
[here](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

Not available for gap alignment.

### Transcript length (tlen)

The character length of the STT transcript.

Not available for gap alignment.

### Matched text length (mlen)

The character length of the matched text of the original transcript (cleaned).

Not available for gap alignment.
```diff
@@ -20,9 +20,9 @@ def build_catalog():
     for source_glob in CLI_ARGS.sources:
         catalog_paths.extend(glob(source_glob))
     items = []
-    for catalog_path in catalog_paths:
-        catalog_path = Path(catalog_path).absolute()
-        print('Loading catalog "{}"'.format(str(catalog_path)))
+    for catalog_original_path in catalog_paths:
+        catalog_path = Path(catalog_original_path).absolute()
+        print('Loading catalog "{}"'.format(str(catalog_original_path)))
         if not catalog_path.is_file():
             fail('Unable to find catalog file "{}"'.format(str(catalog_path)))
         with open(catalog_path, 'r') as catalog_file:
@@ -30,13 +30,13 @@ def build_catalog():
         base_path = catalog_path.parent.absolute()
         for item in catalog_items:
             new_item = {}
-            for entry, entry_path in item.items():
-                entry_path = Path(entry_path)
-                entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path)
+            for entry, entry_original_path in item.items():
+                entry_path = Path(entry_original_path)
+                entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path).absolute()
                 if ((len(CLI_ARGS.check) == 1 and CLI_ARGS.check[0] == 'all')
                         or entry in CLI_ARGS.check) and not entry_path.is_file():
                     note = 'Catalog "{}" - Missing file for "{}" ("{}")'.format(
-                        str(catalog_path), entry, str(entry_path))
+                        str(catalog_original_path), entry, str(entry_original_path))
                     if CLI_ARGS.on_miss == 'fail':
                         fail(note + ' - aborting')
                     if CLI_ARGS.on_miss == 'ignore':
@@ -54,7 +54,7 @@ def build_catalog():
             items.append(new_item)
     if CLI_ARGS.output is not None:
         catalog_path = Path(CLI_ARGS.output).absolute()
-        print('Writing catalog "{}"'.format(str(catalog_path)))
+        print('Writing catalog "{}"'.format(str(CLI_ARGS.output)))
         if CLI_ARGS.make_relative:
             base_path = catalog_path.parent
             for item in items:
@@ -63,7 +63,7 @@ def build_catalog():
     if CLI_ARGS.order_by is not None:
         items.sort(key=lambda i: i[CLI_ARGS.order_by] if CLI_ARGS.order_by in i else '')
     with open(catalog_path, 'w') as catalog_file:
-        json.dump(items, catalog_file)
+        json.dump(items, catalog_file, indent=2)


 def handle_args():
@@ -71,7 +71,7 @@ def handle_args():
                             'converting paths within catalog files')
     parser.add_argument('--output', help='Write collected catalog items to this new catalog file')
     parser.add_argument('--make-relative', action='store_true',
-                        help='Make all path entries of all items relative to target catalog file\'s parent directory')
+                        help='Make all path entries of all items relative to new catalog file\'s parent directory')
     parser.add_argument('--check',
                         help='Comma separated list of path entries to check for existence '
                              '("all" for checking every entry, default: no checks)')
@@ -338,7 +338,6 @@ def parse_args():
                         help='Take audio file as input (requires "--aligned <file>")')
     parser.add_argument('--aligned', type=str,
                         help='Take alignment file ("<...>.aligned") as input (requires "--audio <file>")')
-
     parser.add_argument('--catalog', type=str,
                         help='Take alignment and audio file references of provided catalog ("<...>.catalog") as input')
     parser.add_argument('--ignore-missing', action="store_true",
@@ -8,14 +8,14 @@ forbidden_keys = ['start', 'end', 'text', 'transcript']

 def main(args):
     parser = argparse.ArgumentParser(description='Annotate .tlog or .script files by adding meta data')
     parser.add_argument('target', type=str, help='')
-    parser.add_argument('assignment', action='append', help='Meta data assignment of the form <key>=<value>')
+    parser.add_argument('assignments', nargs='+', help='Meta data assignments of the form <key>=<value>')
     args = parser.parse_args()

     with open(args.target, 'r') as json_file:
         entries = json.load(json_file)

-    for assign in args.assignment:
-        key, value = assign.split('=')
+    for assignment in args.assignments:
+        key, value = assignment.split('=')
         if key in forbidden_keys:
             print('Meta data key "{}" not allowed - forbidden: {}'.format(key, '|'.join(forbidden_keys)))
             sys.exit(1)
@@ -23,7 +23,7 @@ def main(args):
             entry[key] = value

     with open(args.target, 'w') as json_file:
-        json.dump(entries, json_file)
+        json.dump(entries, json_file, indent=2)


 if __name__ == '__main__':
@@ -126,6 +126,8 @@ def main(args):
                         help='Read alignment references of provided catalog ("<...>.catalog") as input')
     parser.add_argument('--no-progress', action='store_true',
                         help='Prevents showing progress bars')
+    parser.add_argument('--progress-interval', type=float, default=1.0,
+                        help='Progress indication interval in seconds')

     args = parser.parse_args()

```
## Alignment algorithm and its parameters

### Step 1 - Splitting audio

A voice activity detector (at the moment this is `webrtcvad`) is used
to split the provided audio data into voice fragments.
These fragments are essentially streams of continuous speech without any longer pauses
(e.g. sentences).

`--audio-vad-aggressiveness <AGGRESSIVENESS>` can be used to influence the length of the
resulting fragments.
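The actual splitting is done by `webrtcvad`; purely to illustrate the idea, here is a toy energy-based stand-in (frame representation and threshold are made up for the example):

```python
def split_voiced(frames, threshold=0.1):
    """Toy VAD: mark frames whose mean absolute amplitude exceeds a threshold
    as voiced, then merge consecutive voiced frames into fragments.
    Returns (first_frame, last_frame) index pairs."""
    fragments, current = [], []
    for index, frame in enumerate(frames):
        energy = sum(abs(sample) for sample in frame) / len(frame)
        if energy > threshold:
            current.append(index)
        elif current:
            fragments.append((current[0], current[-1]))
            current = []
    if current:
        fragments.append((current[0], current[-1]))
    return fragments

frames = [[0.0, 0.0], [0.5, 0.4], [0.6, 0.2], [0.0, 0.0], [0.3, 0.3]]
print(split_voiced(frames))  # → [(1, 2), (4, 4)]
```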
### Step 2 - Preparation of original text

STT transcripts are typically provided in a normalized textual form with
- no casing,
- no punctuation and
- normalized whitespace (single spaces only).

So to be able to align STT transcripts with the original text, it is necessary
to internally convert the original text into the same form.

This happens in two steps:
1. Normalization of whitespace, lower-casing all text and
   replacing some characters with spaces (e.g. dashes)
2. Removal of all characters that are not in the language's alphabet
   (see DeepSpeech model data)

Be aware: *This conversion happens on text basis and will not remove unspoken content
like markup/markdown tags or artifacts. This should be done beforehand.
Reducing the difference between spoken and original text will improve alignment quality and speed.*

In the very unlikely situation that you have to change the default behavior (of step 1),
there are some switches:

`--text-keep-dashes` will prevent substitution of dashes with spaces.

`--text-keep-ws` will keep whitespace untouched.

`--text-keep-casing` will keep character casing as provided.
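The two steps above can be sketched as follows (the alphabet here is an assumption modeled on the English DeepSpeech alphabet; the real one comes from the model data):

```python
import re

# Assumed alphabet; the tool reads it from the DeepSpeech model data instead
ALPHABET = set("abcdefghijklmnopqrstuvwxyz' ")

def clean_text(text, alphabet=ALPHABET):
    text = text.lower().replace('-', ' ')              # step 1: casing and dashes
    text = ''.join(c for c in text if c in alphabet)   # step 2: out-of-alphabet chars
    return re.sub(r'\s+', ' ', text).strip()           # step 1: normalize whitespace

print(clean_text("Don't  stop -- NOW!"))  # → don't stop now
```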
### Step 3 (optional) - Generating document specific language model

If the [dependencies](lm.md) for
individual language model generation got installed, this document-individual model will
now be generated by default.

Assuming your text document is named `original.txt`, these files will be generated:
- `original.txt.clean` - cleaned version of the original text
- `original.txt.arpa` - text file with probabilities in ARPA format
- `original.txt.lm` - binary representation of the former one
- `original.txt.trie` - prefix-tree optimized for probability lookup

`--stt-no-own-lm` deactivates creation of individual language models per document and
uses the one from the model dir instead.
### Step 4 - Transcription of voice fragments through STT

After VAD splitting, the resulting fragments are transcribed into textual phrases.
This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT.

As this can take a long time, all resulting phrases are - together with their
timestamps - saved as JSON into a transcription log file
(the `audio` parameter path with suffix `.tlog` instead of `.wav`).
Subsequent calls will look for that file and - if found -
load it and skip the transcription phase.

`--stt-model-dir <DIR>` points DeepSpeech to the language specific model data directory.
It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it.
### Step 5 - Rough alignment

The actual text alignment is based on a recursive divide and conquer approach:

1. Construct an ordered list of all phrases in the current interval
   (at the beginning this is the list of all phrases that are to be aligned),
   where long phrases close to the middle of the interval come first.
2. Iterate through the list and compute the best Smith-Waterman alignment
   (see the following sub-sections) with the document's original text...
3. ...till there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth
   dependent threshold (in most cases this should already be the first phrase).
4. Recursively continue with step 1 for the sub-intervals and original text ranges
   to the left and right of the phrase and its aligned text range within the original text.
5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum
   Smith-Waterman score on their recursion level.

This approach assumes that all phrases were spoken in the same order as they appear in the
original transcript. It has the following advantages compared to individual
global phrase matching:

- Long non-matching chunks of spoken text or the original transcript will automatically and
  cleanly get ignored.
- Short phrases (with the risk of matching more than once per document) will automatically
  get aligned to their intended locations by longer ones that "squeeze" them in.
- Smith-Waterman score thresholds can be kept lower
  (and thus better match lower quality STT transcripts), as there is a lower chance for
  - long sequences to match at a wrong location and for
  - shorter sequences to match at a wrong location within their shortened intervals
    (as they are getting matched later and deeper in the recursion tree).
#### Smith-Waterman candidate selection

Finding the best match of a given phrase within the original (potentially long) transcript
using vanilla Smith-Waterman is not feasible.

So this tool follows a two-phase approach where the first goal is to get a list of alignment
candidates. As the first step, the original text is virtually partitioned into windows of the
same length as the search pattern. These windows are ordered descending by the number of 3-grams
they share with the pattern.
The best alignment candidates are then taken from the beginning of this ordered list.

`--align-max-candidates <CANDIDATES>` sets the maximum number of candidate windows
taken from the beginning of the list for further alignment.

`--align-candidate-threshold <THRESHOLD>`, multiplied with the number of 3-grams of the predecessor
window, gives the minimum number of 3-grams the next candidate window has to have to also be
considered a candidate.
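A sketch of this candidate selection (the non-overlapping partition stride and the default values are assumptions for the example):

```python
def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_windows(original, pattern, max_candidates=10, threshold=0.92):
    """Rank pattern-sized windows of the original text by shared 3-grams and
    keep the top ones, stopping when the share drops below the threshold."""
    pattern_grams = char_ngrams(pattern)
    size = len(pattern)
    scored = []
    for start in range(0, max(len(original) - size, 0) + 1, size):
        window = original[start:start + size]
        scored.append((len(pattern_grams & char_ngrams(window)), start))
    scored.sort(reverse=True)
    candidates, last = [], None
    for shared, start in scored[:max_candidates]:
        if last is not None and shared < threshold * last:
            break  # next window shares too few 3-grams with the pattern
        candidates.append(start)
        last = shared
    return candidates

original = "the quick brown fox jumps over the lazy dog"
print(candidate_windows(original, "brown fox")[0])  # → 9
```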
#### Smith-Waterman alignment

For each candidate, the best possible alignment is computed using the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm
within an extended interval of one window-size around the candidate window.

`--align-match-score <SCORE>` is the score per correctly matched character. Default: 100

`--align-mismatch-score <SCORE>` is the score per non-matching (exchanged) character. Default: -100

`--align-gap-score <SCORE>` is the score per character gap (removing 1 character from pattern or original). Default: -100

The overall best score of the best match is normalized to a maximum of about 100 by dividing
it by the maximum character count of either the match or the pattern.
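A compact Smith-Waterman sketch with the default scores above; as a simplification it normalizes by the longer of the two full strings rather than by the matched span:

```python
def smith_waterman(pattern, text, match=100, mismatch=-100, gap=-100):
    """Local alignment: best cell of the DP matrix, kept in two rows of memory,
    normalized to ~100 max (simplified normalization, see lead-in)."""
    previous = [0] * (len(text) + 1)
    best = 0
    for i in range(1, len(pattern) + 1):
        current = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            diag = previous[j - 1] + (match if pattern[i - 1] == text[j - 1] else mismatch)
            current[j] = max(0,                    # local alignment floor
                             diag,                 # match / mismatch
                             previous[j] + gap,    # gap in text
                             current[j - 1] + gap) # gap in pattern
            best = max(best, current[j])
        previous = current
    return best / max(len(pattern), len(text), 1)

print(smith_waterman("abc", "abc"))  # → 100.0
```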
### Step 6 - Gap alignment

After recursive matching of fragments there are potential text leftovers between aligned original
texts.

Some examples:
- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_)
  on phrase endings to the left of the gap
- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual
  alignments are now left unaligned in the gap
- Big unmatched chunks of text, like
  - Preface, text summaries or any other kind of meta information
  - Copyright headers/footers
  - Table of contents
  - Chapter headers (if not spoken as they appear)
  - Captions of figures
  - Contents of tables
  - Line-headers like character names in drama scripts
- Depending on the (pre-processing) quality: OCR leftovers like
  - page headers
  - page numbers
  - reader's notes

The basic challenge here is to figure out if all or some of the gap text should be used to extend
the phrase to the left and/or to the right of the gap.

As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result,
its score cannot be used for further fine-tuning. Therefore there is a collection of
so-called [text distance metrics](metrics.md) to pick from using the `--align-similarity-algo <METRIC-ID>`
parameter.

Using the selected distance metric, the gap alignment is done by looking for the best scoring
extension of the left and right phrases up to their maximum extension.

`--align-stretch-factor <FRACTION>` is the fraction of the text length that it could get
stretched at max.

For many languages it is worth putting some emphasis on matching word boundaries
(that is: white-space separated sub-sequences).

`--align-snap-factor <FACTOR>` allows controlling the snappiness to word boundaries.

Should the best scoring extensions overlap, the best scoring sum of non-overlapping
(but touching) extensions wins.
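A sketch of a rightward extension search (the scoring callable and the `difflib` ratio stand in for the configured text distance metric; leftward extension and boundary snapping are omitted):

```python
import difflib

def best_extension(phrase, gap_text, score, max_stretch):
    """Try all right-extensions of the phrase into the gap, keep the best-scoring one."""
    best_text, best_score = phrase, score(phrase)
    for take in range(1, min(len(gap_text), max_stretch) + 1):
        candidate = phrase + gap_text[:take]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
    return best_text

# Hypothetical metric: similarity to the fragment's STT transcript
transcript = "and so am i for phebe"
score = lambda text: difflib.SequenceMatcher(None, text, transcript).ratio()

print(best_extension("and so am i for", " phebe the next", score, 10))
# → and so am i for phebe
```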
### Step 7 - Selection, filtering and output

Finally the best alignment of all candidate windows is selected as the winner.
It has to survive a series of filters to get into the result file.

For each text distance metric there are two filter parameters:

`--output-min-<METRIC-ID> <VALUE>` only keeps utterances having the provided minimum value for the
metric with id `METRIC-ID`

`--output-max-<METRIC-ID> <VALUE>` only keeps utterances having the provided maximum value for the
metric with id `METRIC-ID`

For each text distance metric there's also the option to have it added to each utterance's entry:

`--output-<METRIC-ID>` adds the computed value for `<METRIC-ID>` to the utterance's array-entry

Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%,
where numbers >1.0 are theoretically possible).
## Export

After files got successfully aligned, one would typically want to export the aligned utterances
as machine learning training samples.

This is where the export tool `bin/export.sh` comes in.

### Step 1 - Reading the input

The exporter takes either a single audio file (`--audio <AUDIO>`)
plus a corresponding `.aligned` file (`--aligned <ALIGNED>`) or a series
of such pairs from a `.catalog` file (`--catalog <CATALOG>`) as input.

All of the following computations will be done on the joined list of all aligned
utterances of all input pairs.

Option `--ignore-missing` will not fail on missing file references in the catalog
and instead just ignore the affected catalog entry.
### Step 2 - (Pre-) Filtering

The parameter `--filter <EXPR>` allows specifying a Python expression that has access
to all data fields of an aligned utterance (as can be seen in `.aligned` file entries).

This expression is applied to each aligned utterance and in case it returns `True`,
the utterance will get excluded from all the following steps.
This is useful for excluding utterances that would not work as input for the planned
training or other kind of application.
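The mechanism can be sketched as evaluating the expression with the utterance's fields as local variables (the exact evaluation context is an assumption; field names are taken from the `.aligned` format):

```python
# Inline stand-ins for aligned utterances
utterances = [
    {"transcript": "and so a may for phoebe", "cer": 19.0},
    {"transcript": "it is to be all made of sighs and tears", "cer": 45.0},
]

filter_expr = "cer > 40"  # e.g. exclude utterances with a high character error rate

# Evaluate the expression per utterance; True means "exclude"
kept = [u for u in utterances
        if not eval(filter_expr, {"__builtins__": {}}, dict(u))]
print(len(kept))  # → 1
```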
### Step 3 - Computing quality

As with filtering, the parameter `--criteria <EXPR>` allows for specifying a Python
expression that has access to all data fields of an aligned utterance.

The expression is applied to each aligned utterance and its numerical return
value is assigned to each utterance as `quality`.

### Step 4 - De-biasing

This step (optionally) excludes utterances that would otherwise bias the data
(risk of overfitting).

For each `--debias <META DATA TYPE>` parameter the following procedure is applied:
1. Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob")
   from each utterance and group all utterances accordingly
   (e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...)
2. Compute the standard deviation (`sigma`) of the instance-counts of the groups
3. For each group: If the instance-count exceeds `sigma` times `--debias-sigma-factor <FACTOR>`:
   - Drop the number of exceeding utterances in order of their `quality` (lowest first)
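The procedure above can be sketched like this (a simplified reading: utterances carrying several instances would be counted once per group, and the cap rounding is a guess):

```python
import statistics
from collections import defaultdict

def debias(utterances, meta_type, sigma_factor=1.0):
    """Cap over-represented meta data groups, dropping lowest quality first."""
    groups = defaultdict(list)
    for utterance in utterances:
        for instance in utterance.get(meta_type, []):
            groups[instance].append(utterance)
    sigma = statistics.pstdev(len(g) for g in groups.values())
    cap = max(int(sigma * sigma_factor), 1)
    kept = []
    for group in groups.values():
        group.sort(key=lambda u: u["quality"], reverse=True)  # best quality first
        kept.extend(group[:cap])                              # drop the exceeding rest
    return kept

utterances = ([{"speaker": ["Alice"], "quality": q} for q in (1, 2)] +
              [{"speaker": ["Bob"], "quality": q} for q in (1, 2, 3, 4, 5)])
kept = debias(utterances, "speaker", sigma_factor=2.0)
print(len(kept))  # → 5
```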
### Step 5 - Partitioning

Training sets are often partitioned into several quality levels.

For each `--partition <QUALITY:PARTITION>` parameter (ordered descending by `QUALITY`):
If the utterance's `quality` value is greater than or equal to `QUALITY`, it is assigned to `PARTITION`.

Remaining utterances are assigned to partition `other`.
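In other words, the first matching threshold wins (the `--partition 0.9:good --partition 0.5:fair` configuration below is hypothetical):

```python
def assign_partition(quality, partitions):
    """partitions: (QUALITY, PARTITION) pairs, ordered descending by QUALITY."""
    for threshold, name in partitions:
        if quality >= threshold:
            return name
    return "other"

partitions = [(0.9, "good"), (0.5, "fair")]
print(assign_partition(0.95, partitions))  # → good
```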
### Step 6 - Splitting

Training sets (actually their partitions) are typically split into sets `train`, `dev`
and `test` ([explanation](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)).

This can get automated through parameter `--split` which will let the exporter split each
partition (or the entire set) accordingly.

Parameter `--split-field` allows for specifying a meta data type that should be considered
atomic (e.g. "speaker" would result in all utterances of a speaker
instance - like "Alice" - ending up in one sub-set only). This atomic behavior will also hold
true across partitions.

Option `--split-drop-multiple` allows for dropping all samples with multiple `--split-field` assignments - e.g. a
sample with more than one "speaker".

In contrast, option `--split-drop-unknown` allows for dropping all samples with no `--split-field` assignment.

With option `--assign-{train|dev|test} <VALUES>` one can pre-assign values (of the comma-separated list)
to the specified set.

Option `--split-seed <SEED>` sets an integer random seed for the split operation.
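The atomic split can be sketched as shuffling and cutting the field instances rather than the samples themselves (ratios and the cutting scheme are assumptions for the example):

```python
import random

def split_atomic(samples, field, ratios=(0.8, 0.1, 0.1), seed=42):
    """Keep all samples of one field instance (e.g. one speaker) in the same sub-set."""
    rng = random.Random(seed)
    instances = sorted({s[field] for s in samples})
    rng.shuffle(instances)
    train_cut = int(len(instances) * ratios[0])
    dev_cut = train_cut + int(len(instances) * ratios[1])
    assignment = {inst: ("train" if i < train_cut else "dev" if i < dev_cut else "test")
                  for i, inst in enumerate(instances)}
    return {name: [s for s in samples if assignment[s[field]] == name]
            for name in ("train", "dev", "test")}

# 10 hypothetical speakers with 2 samples each
samples = [{"speaker": "s{}".format(i), "n": j} for i in range(10) for j in range(2)]
sets = split_atomic(samples, "speaker")
print(len(sets["train"]), len(sets["dev"]), len(sets["test"]))  # → 16 2 2
```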
### Step 7 - Output

For each partition/sub-set combination the following is done:

- Construction of a `name` (e.g. `good-dev` will represent the validation set of partition `good`).
- All samples are lazy-loaded and potentially re-sampled to match the parameters:
  - `--channels <N>`: Number of audio channels - 1 for mono (default), 2 for stereo
  - `--rate <RATE>`: Sample rate - default: 16000
  - `--width <WIDTH>`: Sample width in bytes - default: 2 (16 bit)

`--workers <WORKERS>` can be used to specify how many parallel processes should be used for loading and re-sampling.

`--tmp-dir <DIR>` overrides the system default temporary directory that is used for converting samples.

`--skip-damaged` allows for just skipping export of samples that cannot be loaded.

- If option `--target-dir <DIR>` is provided, all output will be written to the provided target directory.
  This can be done in two different ways:
  1. With the additional option `--sdb` each set will be written to a so-called Sample-DB
     that can be used by DeepSpeech. It will be written as `<name>.sdb` into the target directory.
     SDB export can be controlled with the following additional options:
     - `--sdb-bucket-size <SIZE>`: SDB bucket size (using units like "1GB") for external sorting of the samples
     - `--sdb-workers <WORKERS>`: Number of parallel workers for preparing and compressing SDB entries
     - `--sdb-buffered-samples <SAMPLES>`: Number of samples per bucket buffer during the last phase of external sorting
     - `--sdb-audio-type <TYPE>`: Internal audio type for storing SDB samples - `wav` or `opus` (default)
  2. Without option `--sdb` all samples are written as WAV-files into sub-directory `<name>`
     of the target directory, with a list of samples in a `<name>.csv` file next to it with columns
     `wav_filename`, `wav_filesize`, `transcript`.

  If not omitted through option `--no-meta`, a CSV file called `<name>.meta` is written to the target directory.
  For each written sample it provides the following columns:
  `sample`, `split_entity`, `catalog_index`, `source_audio_file`, `aligned_file`, `alignment_index`.

  Throughout this process option `--force` allows overwriting any existing files.
- If instead option `--target-tar <TAR-FILE>` is provided, the same file structure as with `--target-dir <DIR>`
  is written directly to the specified tar-file.
  This output variant does not support writing SDBs.

### Additional functionality

Option `--plan <PLAN>` can be used to cache all computational steps before the actual output writing.
The plan will be loaded if existing, or generated otherwise.
This allows for writing several output formats using the same sample set distribution and without having to load
alignment files and re-calculate quality metrics, de-biasing, partitioning or splitting.

Using `--dry-run` one can avoid any writing and get a preview of set-splits and so forth
(`--dry-run-fast` won't even load any sample).
## File formats

### Catalog files (.catalog)

Catalog files (suffix `.catalog`) are used for organizing bigger data file collections and
defining relations among them. A catalog is basically a JSON array of hash-tables where each entry stands
for a single audio file and its associated original transcript.

So a typical catalog looks like this (`data/all.catalog` from this project):

```javascript
[
    {
        "audio": "test1/joined.mp3",
        "tlog": "test1/joined.tlog",
        "script": "test1/transcript.txt",
        "aligned": "test1/joined.aligned"
    },
    {
        "audio": "test2/joined.mp3",
        "tlog": "test2/joined.tlog",
        "script": "test2/transcript.script",
        "aligned": "test2/joined.aligned"
    }
]
```

- `audio` is a path to an audio file (of a format that `pydub` supports)
- `tlog` is the (supposed) path to the STT generated transcription log of the audio file
- `script` is the path to the original transcript of the audio file
  (as `.txt` or `.script` file)
- `aligned` is the (supposed) path to a `.aligned` file

Be aware: __All relative file paths are treated as relative to the catalog file's directory__.
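That path resolution rule can be sketched as follows (the catalog data and base directory here are hypothetical):

```python
from pathlib import Path

def resolve_catalog_paths(items, base):
    """Resolve relative path entries against the catalog file's directory."""
    base = Path(base)
    resolved = []
    for item in items:
        new_item = {}
        for key, value in item.items():
            path = Path(value)
            new_item[key] = str(path if path.is_absolute() else base / path)
        resolved.append(new_item)
    return resolved

catalog = [{"audio": "test1/joined.mp3", "script": "test1/transcript.txt"}]
print(resolve_catalog_paths(catalog, "/data/all")[0]["audio"])
```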
The tools `bin/align.sh`, `bin/statistics.sh` and `bin/export.sh` all support parameter
|
||||||
|
`--catalog`:
|
||||||
|
|
||||||
|
The __alignment tool__ `bin/align.sh` requires either `tlog` to point to an existing
|
||||||
|
file or (if not) `audio` to point to an existing audio file for being able to transcribe
|
||||||
|
it and store it at the path indicated by `tlog`. Furthermore it requires `script` to
|
||||||
|
point to an existing script. It will write its alignment results to the path in `aligned`.
|
||||||
|
|
||||||
|
The __export tool__ `bin/export.sh` requires `audio` and `aligned` to point to existing files.
|
||||||
|
|
||||||
|
The __statistics tool__ `bin/statistics.sh` requires only `aligned` to point to existing files.
|
||||||
|
|
||||||
|
Advantages of having a catalog file:
|
||||||
|
|
||||||
|
- Simplified tool usage with only one parameter for defining all involved files (`--catalog`).
|
||||||
|
- A directory with many files has to be scanned just one time at catalog generation.
|
||||||
|
- Different file types can live at different and custom locations in the system.
|
||||||
|
This is important in case of read-only access rights to the original data.
|
||||||
|
It can also be used for avoiding to taint the original directory tree.
|
||||||
|
- Accumulated statistics
|
||||||
|
- Better progress indication (as the total number of files is available up front)
|
||||||
|
- Reduced tool startup overhead
|
||||||
|
- Allows for meta-data aware set-splitting on export - e.g. if some speakers are speaking
|
||||||
|
in several files.
|
||||||
|
|
||||||
|
So especially in case of many files to process it is highly recommended to __first create
|
||||||
|
a catalog file__ with all paths present (even the ones not pointing to existing files yet).
|
||||||
|
|
||||||
|
|
||||||
|
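The relative-path rule above can be sketched in a few lines of Python. `load_catalog` is a hypothetical helper (not part of DSAlign's API), assuming a catalog is plain JSON as shown:

```python
import json
import os


def load_catalog(catalog_path):
    """Load a .catalog file and resolve all relative path entries against
    the catalog file's own directory, as the DSAlign tools do."""
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    with open(catalog_path, "r", encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        for entry, path in item.items():
            if not os.path.isabs(path):
                item[entry] = os.path.join(base_dir, path)
    return items
```

With this, `load_catalog("data/all.catalog")` would return absolute paths regardless of the current working directory.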
### Script files (.script|.txt)

The alignment tool requires an original script (or human transcript) of the provided audio.
These scripts can be represented in two basic forms:

- plain text files (`.txt`) or
- script files (`.script`)

In case of plain text files the content is considered a continuous stream of text without
any assigned meta data. The only exception is the option `--text-meaningful-newlines`, which
tells the aligner to consider newlines as separators between utterances,
in conjunction with the option `--align-phrase-snap-factor`.

If the original data source features utterance meta data, one should consider converting it
to the `.script` JSON file format, which looks like this
(excerpt of `data/test2/transcript.script`):

```javascript
[
  // ...
  {
    "speaker": "Phebe",
    "text": "Good shepherd, tell this youth what 'tis to love."
  },
  {
    "speaker": "Silvius",
    "text": "It is to be all made of sighs and tears; And so am I for Phebe."
  },
  // ...
]
```

_This and the following sub-sections all use the same real-world examples and excerpts._

It is basically again an array of hash-tables, where each hash-table represents an utterance with the
only mandatory field `text` for its textual representation.

All other fields are considered meta data
(with the key called the "meta data type" and the value a "meta data instance").

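If your source material carries speaker annotations, the conversion is straightforward. The sketch below assumes a hypothetical input format with one utterance per line of the form `Speaker: text`; this is not a DSAlign tool, just an illustration of producing the `.script` structure:

```python
def text_to_script(lines):
    """Convert lines of the hypothetical form "Speaker: text" into the
    .script array-of-hash-tables structure described above."""
    utterances = []
    for line in lines:
        speaker, sep, text = line.partition(":")
        if sep:
            utterances.append({"speaker": speaker.strip(), "text": text.strip()})
        else:
            # no speaker annotation: keep only the mandatory "text" field
            utterances.append({"text": line.strip()})
    return utterances
```

The result can then be serialized with `json.dump(..., indent=4)` and saved with a `.script` suffix.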
### Transcription log files (.tlog)

The alignment tool relies on timed STT transcripts of the provided audio.
These transcripts are either provided by some external process
(possibly even using a different STT system than DeepSpeech) or are generated
as part of the alignment process.

They are called transcription logs (`.tlog`) and look like this
(excerpt of `data/test2/joined.tlog`):

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd"
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love"
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears"
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe"
  },
  // ...
]
```

The fields of each entry:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `transcript`: STT transcript of the utterance (mandatory)

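Externally produced transcription logs are worth sanity-checking before alignment. Here is a minimal sketch (a hypothetical helper, not part of DSAlign) that validates the three mandatory fields and the millisecond ordering described above:

```python
def check_tlog(entries):
    """Return a list of problems found in tlog entries: missing mandatory
    fields, or a fragment whose start time lies after its end time."""
    problems = []
    for i, entry in enumerate(entries):
        for field in ("start", "end", "transcript"):
            if field not in entry:
                problems.append("entry %d: missing %r" % (i, field))
        if "start" in entry and "end" in entry and entry["start"] > entry["end"]:
            problems.append("entry %d: start after end" % i)
    return problems
```

An empty result means the log at least has the right shape; it says nothing about transcript quality.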
### Aligned files (.aligned)

The result of aligning an audio file with an original transcript is written to an
`.aligned` JSON file consisting of an array of hash-tables of the following form:

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd",
    "text-start": 98302,
    "text-end": 98316,
    "meta": {
      "speaker": [
        "Phebe"
      ]
    },
    "aligned-raw": "Good shepherd,",
    "aligned": "good shepherd",
    "wng": 99.99999999999997,
    "jaro_winkler": 100.0,
    "levenshtein": 100.0,
    "mra": 100.0,
    "cer": 0.0
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love",
    "text-start": 98317,
    "text-end": 98351,
    "meta": {
      "speaker": [
        "Phebe"
      ]
    },
    "aligned-raw": "tell this youth what 'tis to love.",
    "aligned": "tell this youth what 'tis to love",
    "wng": 92.71730687405957,
    "jaro_winkler": 100.0,
    "levenshtein": 96.96969696969697,
    "mra": 100.0,
    "cer": 3.0303030303030303
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears",
    "text-start": 98352,
    "text-end": 98392,
    "meta": {
      "speaker": [
        "Silvius"
      ]
    },
    "aligned-raw": "It is to be all made of sighs and tears;",
    "aligned": "it is to be all made of sighs and tears",
    "wng": 77.93921929148159,
    "jaro_winkler": 100.0,
    "levenshtein": 82.05128205128204,
    "mra": 100.0,
    "cer": 17.94871794871795
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe",
    "text-start": 98393,
    "text-end": 98415,
    "meta": {
      "speaker": [
        "Silvius"
      ]
    },
    "aligned-raw": "And so am I for Phebe.",
    "aligned": "and so am i for phebe",
    "wng": 66.82687893873339,
    "jaro_winkler": 98.47964113181504,
    "levenshtein": 82.6086956521739,
    "mra": 100.0,
    "cer": 19.047619047619047
  },
  // ...
]
```

Each array entry represents an aligned audio fragment with the following attributes:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file
- `transcript`: STT transcript used for aligning
- `text-start`: character offset of the fragment's associated original text within the
  aligned text document
- `text-end`: character offset of the end of the fragment's associated original text within the
  aligned text document
- `meta`: meta data hash-table with
  - _key_: meta data type
  - _value_: array of meta data instances coalesced from the `.script` entries that
    this entry intersects with
- `aligned-raw`: __raw__ original text fragment that got aligned with the audio fragment
  and its STT transcript
- `aligned`: __clean__ original text fragment that got aligned with the audio fragment
  and its STT transcript
- `<metric>`: for each `--output-<metric>` parameter the alignment tool adds an entry with the
  computed value (in this case `wng`, `jaro_winkler`, `levenshtein`, `mra`, `cer`)

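A common post-processing step is to keep only fragments whose quality passes some bar. The sketch below (a hypothetical helper, not a DSAlign tool) filters an `.aligned` file on an error metric such as `cer`, assuming that metric was requested via the corresponding `--output-<metric>` parameter during alignment:

```python
import json


def filter_aligned(aligned_path, metric="cer", threshold=10.0):
    """Keep only fragments of an .aligned file whose error metric stays
    below a threshold (lower is better for error metrics like cer)."""
    with open(aligned_path, "r", encoding="utf-8") as f:
        fragments = json.load(f)
    return [frag for frag in fragments
            if frag.get(metric, float("inf")) < threshold]
```

Note that similarity metrics such as `wng` or `jaro_winkler` run the other way (higher is better), so the comparison would have to be inverted for them.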
## Individual language models

If you plan to let the tool generate individual language models per text,
you have to get (essentially build) [KenLM](https://kheafield.com/code/kenlm/).
Before doing this, you should install its [dependencies](https://kheafield.com/code/kenlm/dependencies/).
For Debian-based systems this can be done through:

```bash
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
```

With all requirements fulfilled, there is a script for building and installing KenLM
and the required DeepSpeech tools in the right location:

```bash
$ bin/lm-dependencies.sh
```

If all went well, the alignment tool will find and use them to automatically create individual
language models for each document.

## Text distance metrics

This section lists all available text distance metrics along with their IDs for
command-line use.

### Weighted N-grams (wng)

The weighted N-gram score is computed as the sum of the weights of all N-grams shared
between the two texts.
It ensures that:

- Shared N-gram instances near interval bounds (dependent on situation) get rated higher than
  ones near the center or the opposite end
- Long shared N-gram instances are weighted higher than short ones

`--align-min-ngram-size <SIZE>` sets the start (minimum) N-gram size

`--align-max-ngram-size <SIZE>` sets the final (maximum) N-gram size

`--align-ngram-size-factor <FACTOR>` sets a weight factor for the size preference

`--align-ngram-position-factor <FACTOR>` sets a weight factor for the position preference

### Jaro-Winkler (jaro_winkler)

Jaro-Winkler is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).

### Editex (editex)

Editex is a phonetic text distance algorithm described
[here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138&rep=rep1&type=pdf).

### Levenshtein (levenshtein)

Levenshtein is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Levenshtein_distance).

### MRA (mra)

The "Match rating approach" is a phonetic text distance algorithm described
[here](https://en.wikipedia.org/wiki/Match_rating_approach).

### Hamming (hamming)

The Hamming distance is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Hamming_distance).

### Word error rate (wer)

This is the same as Levenshtein - just on word level.

Not available for gap alignment.

### Character error rate (cer)

This is the same as Levenshtein but using a different implementation.

Not available for gap alignment.

### Smith-Waterman score (sws)

This is the final Smith-Waterman score coming from the rough alignment
step (but before gap alignment!).
It is described
[here](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

Not available for gap alignment.

### Transcript length (tlen)

The character length of the STT transcript.

Not available for gap alignment.

### Matched text length (mlen)

The character length of the matched text of the original transcript (cleaned).

Not available for gap alignment.

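To make the edit-distance metrics concrete, here is a minimal sketch of Levenshtein distance and a character error rate derived from it (assuming `cer` is the edit distance relative to the reference length, in percent; `cer` is a hypothetical helper, not DSAlign's implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def cer(transcript, reference):
    """Character error rate in percent: edit distance over reference length."""
    return 100.0 * levenshtein(transcript, reference) / len(reference)
```

For example, `"tell this youth what tis to love"` versus `"tell this youth what 'tis to love"` differs by a single inserted apostrophe over 33 reference characters, which reproduces the `cer` value of roughly 3.03 seen in the `.aligned` example earlier.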
## Tools

### Statistics tool

The statistics tool `bin/statistics.sh` can be used for displaying aggregated statistics of
all passed alignment files. Alignment files can be specified directly through the
`--aligned <ALIGNED-FILE>` multi-option and indirectly through the `--catalog <CATALOG-FILE>` multi-option.

Example call:

```shell script
DSAlign$ bin/statistics.sh --catalog data/all.catalog
Reading catalog
2 of 2 : 100.00% (elapsed: 00:00:00, speed: 94.27 it/s, ETA: 00:00:00)
Total number of files: 2

Total number of utterances: 5,949

Total aligned utterance character length: 202,191

Total utterance duration: 3:53:28.410000 (3 hours)

Overall number of instances of meta type "speaker": 27

100 most frequent "speaker" instances:
Rosalind 678
Touchstone 401
Orlando 310
Jaques 303
Celia 281
Oliver 125
Phebe 108
Duke Senior 87
Silvius 86
Adam 81
Corin 68
Duke Frederick 53
Le Beau 52
First Lord 49
Charles 33
Amiens 27
Audrey 27
Second Page 22
Hymen 19
Jaques De Boys 16
Second Lord 12
William 12
Forester 8
First Page 7
Sir Oliver Martext 4
Dennis 3
A Lord 1
```

### Catalog tool

The catalog tool allows for maintenance of catalog files.
It takes multiple catalog files (supporting wildcards) and allows for applying several checks and tweaks before
potentially exporting them to a new combined catalog file.

Options:

- `--output <CATALOG>`: Writes all items of all passed catalogs into the specified new catalog.
- `--make-relative`: Makes all path entries of all items relative to the parent directory of the
  new catalog (see `--output`).
- `--order-by <ENTRY>`: Entry that should be used for sorting items in the new catalog (see `--output`).
- `--check <ENTRIES>`: Checks file existence of all passed (comma-separated) entries of each catalog
  item (e.g. `--check aligned,audio` will check if the `aligned` and `audio` file paths of each catalog item exist).
  `--check all` will check all entries of each item.
- `--on-miss fail|drop|remove|ignore`: What to do if a checked (`--check`) file does not exist.
  - `fail`: tool will exit with an error status (default)
  - `drop`: the catalog item with all its entries will be removed (see `--output`)
  - `remove`: the missing entry within the catalog item will be removed (see `--output`)
  - `ignore`: just logs the missing entry

Example usage:

```shell script
$ cat a.catalog
[
  {
    "entry1": "is/not/existing/x",
    "entry2": "is/existing/x"
  }
]
$ cat b.catalog
[
  {
    "entry1": "is/not/existing/y",
    "entry2": "is/existing/y"
  }
]
$ bin/catalog_tool.sh --check all --on-miss remove --output c.catalog --make-relative a.catalog b.catalog
Loading catalog "a.catalog"
Catalog "a.catalog" - Missing file for "entry1" ("is/not/existing/x") - removing entry from item
Loading catalog "b.catalog"
Catalog "b.catalog" - Missing file for "entry1" ("is/not/existing/y") - removing entry from item
Writing catalog "c.catalog"
$ cat c.catalog
[
  {
    "entry2": "is/existing/x"
  },
  {
    "entry2": "is/existing/y"
  }
]
```

### Meta data annotation tool

The meta data annotation tool allows for assigning meta data fields to all items of script files or transcription logs.
It takes only two kinds of parameters: the file and a series of `<key>=<value>` assignments.

Example usage:

```shell script
$ cat a.tlog
[
  {
    "start": 330.0,
    "end": 2820.0,
    "transcript": "some text without a meaning"
  },
  {
    "start": 3456.0,
    "end": 5123.0,
    "transcript": "some other text without a meaning"
  }
]
$ bin/meta.sh a.tlog speaker=alice year=2020
$ cat a.tlog
[
  {
    "start": 330.0,
    "end": 2820.0,
    "transcript": "some text without a meaning",
    "speaker": "alice",
    "year": "2020"
  },
  {
    "start": 3456.0,
    "end": 5123.0,
    "transcript": "some other text without a meaning",
    "speaker": "alice",
    "year": "2020"
  }
]
```
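
The annotation step above amounts to merging key/value pairs into every item of a JSON array file. A minimal sketch of that behavior (an illustration, not the actual `bin/meta.sh` implementation; note that all values are stored as strings, as in the example):

```python
import json


def annotate(path, assignments):
    """Add each "key=value" assignment to every item of a JSON array
    file, rewriting the file in place."""
    with open(path, "r", encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        for assignment in assignments:
            key, _, value = assignment.partition("=")
            item[key] = value
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=4)
```

For instance, `annotate("a.tlog", ["speaker=alice", "year=2020"])` would produce the annotated file shown above.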