Mirror of https://github.com/mozilla/DSAlign.git

Commit 39a633a434 (parent f6a16d92a0): Updated documentation and minor tool fixes

Changed: README.md (680 lines)

DeepSpeech based forced alignment tool

## Installation

It is recommended to use this tool from within a virtual environment.
After cloning and changing to the root of the project,
there is a script for creating one with all requirements in the git-ignored dir `venv`:

```bash
$ bin/createenv.sh
$ ls venv
bin include lib lib64 pyvenv.cfg share
```

`bin/align.sh` will automatically use it.

## Prerequisites

### Language specific data

Internally DSAlign uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT engine.
For it to be able to function, it requires a couple of files that are specific to
the language of the speech data you want to align.
If you want to align English, there is already a helper script that will download and prepare
all required data:

```bash
$ bin/getmodel.sh
[...]
$ ls models/en/
alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie
```

### Dependencies for generating individual language models

If you plan to let the tool generate individual language models per text (you should!),
you have to get (essentially build) [KenLM](https://kheafield.com/code/kenlm/).
Before doing this, you should install its [dependencies](https://kheafield.com/code/kenlm/dependencies/).
For Debian based systems this can be done through:

```bash
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
```

With all requirements fulfilled, there is a script for building and installing KenLM
and the required DeepSpeech tools in the right location:

```bash
$ bin/lm-dependencies.sh
```

If all went well, the alignment tool will find and use it to automatically create individual
language models for each document.

## Overview and documentation

A typical application of the aligner is done in three phases:

1. __Preparing__ the data. Although most of this has to be done individually,
   there are some [tools for data preparation, statistics and maintenance](doc/tools.md).
   All involved file formats are described [here](doc/files.md).
2. __Aligning__ the data using [the alignment tool and its algorithm](doc/algo.md).
3. __Exporting__ aligned data using [the data-set exporter](doc/export.md).

## Quickstart example

### Example data

There is a script for downloading and preparing some public domain speech and transcript data.
It requires `ffmpeg` for some sample conversion.

```bash
$ bin/gettestdata.sh
$ ls data
test1 test2
```

## Using the tool

```bash
$ bin/align.sh --help
[...]
```

### Alignment using example data

Now the aligner can be called either "manually" (specifying all involved files directly):

```bash
$ bin/align.sh --output-max-cer 15 --loglevel 10 --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log
```

Or "automatically" by specifying a so-called catalog file that bundles all involved paths:

```bash
$ bin/align.sh --catalog data/test1.catalog
```

## The algorithm

### Step 1 - Splitting audio

A voice activity detector (at the moment this is `webrtcvad`) is used
to split the provided audio data into voice fragments.
These fragments are essentially streams of continuous speech without any longer pauses
(e.g. sentences).

`--audio-vad-aggressiveness <AGGRESSIVENESS>` can be used to influence the length of the
resulting fragments.
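The splitting idea can be sketched as follows: frames classified as speech or silence are merged into fragments, and only a sufficiently long pause ends a fragment. This is a minimal illustration with a stand-in `is_speech` callback, not DSAlign's actual `webrtcvad`-based code:

```python
def split_into_fragments(frames, is_speech, min_silence_frames=3):
    """Group consecutive speech frames into fragments.

    A fragment ends once `min_silence_frames` non-speech frames occur in a row.
    Returns (start_index, end_index_exclusive) pairs.
    """
    fragments = []
    start = None   # index where the current fragment began
    silence = 0    # consecutive non-speech frames seen so far
    for i, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                fragments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        fragments.append((start, len(frames) - silence))
    return fragments

# 1 stands for a speech frame, 0 for a silence frame
print(split_into_fragments([1, 1, 0, 1, 0, 0, 0, 1, 1, 1], lambda f: f == 1))
```

Note how the short pause inside the first fragment is kept, while the three-frame silence splits the stream, which mirrors the effect of tuning the VAD aggressiveness.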

### Step 2 - Preparation of original text

STT transcripts are typically provided in a normalized textual form with

- no casing,
- no punctuation and
- normalized whitespace (single spaces only).

So to be able to align STT transcripts with the original text, it is necessary
to internally convert the original text into the same form.

This happens in two steps:

1. Normalization of whitespace, lower-casing all text and
   replacing some characters with spaces (e.g. dashes)
2. Removal of all characters that are not in the language's alphabet
   (see DeepSpeech model data)

Be aware: *This conversion happens on a text basis and will not remove unspoken content
like markup/markdown tags or artifacts. This should be done beforehand.
Reducing the difference between spoken and original text will improve alignment quality and speed.*

In the very unlikely situation that you have to change the default behavior (of step 1),
there are some switches:

`--text-keep-dashes` will prevent substitution of dashes with spaces.

`--text-keep-ws` will keep whitespace untouched.

`--text-keep-casing` will keep character casing as provided.
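A minimal sketch of these two normalization steps (assuming a simple set-based alphabet; not DSAlign's exact implementation):

```python
ALPHABET = set("abcdefghijklmnopqrstuvwxyz '")

def normalize(text, alphabet, keep_dashes=False):
    text = text.lower()                      # step 1: lower-casing
    if not keep_dashes:
        text = text.replace("-", " ")        # dashes become spaces by default
    text = " ".join(text.split())            # normalize whitespace to single spaces
    # step 2: drop everything outside the model alphabet
    return "".join(c for c in text if c in alphabet)

print(normalize("Well-known  FACT:\nit's TRUE!", ALPHABET))
```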

### Step 4a (optional) - Generating document specific language model

If the [dependencies](#dependencies-for-generating-individual-language-models) for
individual language model generation got installed, this document-individual model will
now be generated by default.

Assuming your text document is named `original.txt`, these files will be generated:

- `original.txt.clean` - cleaned version of the original text
- `original.txt.arpa` - text file with probabilities in ARPA format
- `original.txt.lm` - binary representation of the former one
- `original.txt.trie` - prefix-tree optimized for probability lookup

`--stt-no-own-lm` deactivates creation of individual language models per document and
uses the one from the model dir instead.

### Step 4b - Transcription of voice fragments through STT

After VAD splitting, the resulting fragments are transcribed into textual phrases.
This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT.

As this can take a long time, all resulting phrases are - together with their
timestamps - saved as JSON into a transcription log file
(the `audio` parameter path with suffix `.tlog` instead of `.wav`).
Subsequent calls will look for that file and - if found -
load it and skip the transcription phase.

`--stt-model-dir <DIR>` points DeepSpeech to the language specific model data directory.
It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it.

### Step 5 - Rough alignment

The actual text alignment is based on a recursive divide and conquer approach:

1. Construct an ordered list of all phrases in the current interval
   (at the beginning this is the list of all phrases that are to be aligned),
   where long phrases close to the middle of the interval come first.
2. Iterate through the list and compute the best Smith-Waterman alignment
   (see the following sub-sections) with the document's original text...
3. ...till there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth
   dependent threshold (in most cases this should already be the first phrase).
4. Recursively continue with step 1 for the sub-intervals and original text ranges
   to the left and right of the phrase and its aligned text range within the original text.
5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum
   Smith-Waterman score on their recursion level.

This approach assumes that all phrases were spoken in the same order as they appear in the
original transcript. It has the following advantages compared to individual
global phrase matching:

- Long non-matching chunks of spoken text or the original transcript will automatically and
  cleanly get ignored.
- Short phrases (with the risk of matching more than one time per document) will automatically
  get aligned to their intended locations by longer ones that "squeeze" them in.
- Smith-Waterman score thresholds can be kept lower
  (and thus better match lower quality STT transcripts), as there is a lower chance for
  - long sequences to match at a wrong location and for
  - shorter sequences to match at a wrong location within their shortened intervals
    (as they are getting matched later and deeper in the recursion tree).

#### Smith-Waterman candidate selection

Finding the best match of a given phrase within the original (potentially long) transcript
using vanilla Smith-Waterman is not feasible.

So this tool follows a two-phase approach where the first goal is to get a list of alignment
candidates. As the first step the original text is virtually partitioned into windows of the
same length as the search pattern. These windows are ordered descending by the number of 3-grams
they share with the pattern.
The best alignment candidates are now taken from the beginning of this ordered list.

`--align-max-candidates <CANDIDATES>` sets the maximum number of candidate windows
taken from the beginning of the list for further alignment.

`--align-candidate-threshold <THRESHOLD>`, multiplied with the number of 3-grams of the predecessor
window, gives the minimum number of 3-grams the next candidate window has to share with the pattern
to also be considered a candidate.
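The window ranking can be sketched like this, assuming character-stepped windows of pattern length (the real tool's window placement may differ):

```python
def ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_windows(original, pattern, max_candidates=2):
    """Rank pattern-length windows of `original` by shared 3-grams."""
    size = len(pattern)
    pattern_grams = ngrams(pattern)
    scored = []
    for start in range(len(original) - size + 1):
        shared = len(pattern_grams & ngrams(original[start:start + size]))
        scored.append((shared, start))
    scored.sort(key=lambda item: (-item[0], item[1]))   # most shared 3-grams first
    return scored[:max_candidates]

print(candidate_windows("the quick brown fox jumps over the lazy dog", "quick brown"))
```

The window covering the true location shares all of the pattern's 3-grams and therefore ranks first; neighboring windows score slightly lower.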

#### Smith-Waterman alignment

For each candidate, the best possible alignment is computed using the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm
within an extended interval of one window-size around the candidate window.

`--align-match-score <SCORE>` is the score per correctly matched character. Default: 100

`--align-mismatch-score <SCORE>` is the score per non-matching (exchanged) character. Default: -100

`--align-gap-score <SCORE>` is the score per character gap (removing 1 character from pattern or original). Default: -100

The overall best score for the best match is normalized to a maximum of about 100 by dividing
it by the maximum character count of either the match or the pattern.
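A compact character-level Smith-Waterman sketch using the default scores above; it returns only the normalized best score, not the matched range as the real tool does:

```python
def sw_score(a, b, match=100, mismatch=-100, gap=-100):
    """Best local (Smith-Waterman) score of a vs b, normalized by length."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0,
                             rows[i - 1][j - 1] + step,   # match/mismatch
                             rows[i - 1][j] + gap,        # gap in b
                             rows[i][j - 1] + gap)        # gap in a
            best = max(best, rows[i][j])
    # normalize to a maximum of about 100 by the longer sequence's length
    return best / max(len(a), len(b))

print(sw_score("good shepherd", "good shepperd"))
```

A perfect match of equal-length strings scores exactly 100; each mismatched or gapped character pulls the normalized score down.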

### Step 6 - Gap alignment

After recursive matching of fragments there are potential text leftovers between aligned original
texts.

Some examples:

- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_)
  on phrase endings to the left of the gap
- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual
  alignments are now left unaligned in the gap
- Big unmatched chunks of text, like
  - Preface, text summaries or any other kind of meta information
  - Copyright headers/footers
  - Table of contents
  - Chapter headers (if not spoken as they appear)
  - Captions of figures
  - Contents of tables
  - Line-headers like character names in drama scripts
- Dependent on the (pre-processing) quality: OCR leftovers like
  - page headers
  - page numbers
  - reader's notes

The basic challenge here is to figure out if all or some of the gap text should be used to extend
the phrase to the left and/or to the right of the gap.

As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result,
its score cannot be used for further fine-tuning. Therefore there is a collection of
so-called text-distance algorithms to pick from using the `--align-similarity-algo`
parameter.

Using the selected distance metric, the gap alignment is done by looking for the best scoring
extension of the left and right phrases up to their maximum extension.

`--align-stretch-factor <FRACTION>` is the fraction of the text length that a phrase could get
stretched at max.

For many languages it is worth putting some emphasis on matching word boundaries
(that is, white-space separated sub-sequences).

`--align-snap-factor <FACTOR>` allows controlling the snappiness to word boundaries.

Should the best scoring extensions overlap, the best scoring sum of non-overlapping
(but touching) extensions will win.
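The extension search can be sketched as trying every split of the gap text, scoring both sides with a text-distance function, and keeping the best-scoring split; here `difflib` stands in for the selectable text-distance algorithms, so this is an illustration of the idea, not the tool's implementation:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def align_gap(left_stt, right_stt, left_text, gap, right_text):
    """Extend the left/right phrases into the gap at the best-scoring split."""
    best = max(
        range(len(gap) + 1),
        key=lambda k: ratio(left_stt, left_text + gap[:k])
                      + ratio(right_stt, gap[k:] + right_text),
    )
    return left_text + gap[:best], gap[best:] + right_text

# the STT transcript heard "walked", but only "she walk" got aligned;
# gap alignment pulls the missing "ed home" into the left phrase
left, right = align_gap("she walked home", "and then slept",
                        left_text="she walk", gap="ed home", right_text=" and then slept")
print(left)
```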

### Step 7 - Selection, filtering and output

Finally the best alignment of all candidate windows is selected as the winner.
It has to survive a series of filters for getting into the result file.

For each text distance metric there are two filter parameters:

`--output-min-<METRIC-ID> <VALUE>` only keeps utterances having the provided minimum value for the
metric with id `METRIC-ID`

`--output-max-<METRIC-ID> <VALUE>` only keeps utterances having the provided maximum value for the
metric with id `METRIC-ID`

For each text distance metric there's also the option to have it added to each utterance's entry:

`--output-<METRIC-ID>` adds the computed value for `<METRIC-ID>` to the utterance's array-entry

Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%,
where numbers >1.0 are theoretically possible).

### General options

`--play` will play each aligned sample using the `play` command of the SoX audio toolkit.

`--text-context <CONTEXT-SIZE>` will add additional `CONTEXT-SIZE` characters around original
transcripts when logged.

## Export

After files got successfully aligned, one would possibly want to export the aligned utterances
as machine learning training samples.

This is where the export tool `bin/export.sh` comes in.

### Step 1 - Reading the input

The exporter takes either a single audio file (`--audio`)
plus a corresponding `.aligned` file (`--aligned`) or a series
of such pairs from a `.catalog` file (`--catalog`) as input.

All of the following computations will be done on the joined list of all aligned
utterances of all input pairs.

### Step 2 - (Pre-) Filtering

The parameter `--filter <EXPR>` allows specifying a Python expression that has access
to all data fields of an aligned utterance (as can be seen in `.aligned` file entries).

This expression is applied to each aligned utterance and in case it returns `True`,
the utterance will get excluded from all the following steps.
This is useful for excluding utterances that would not work as input for the planned
training or other kind of application.
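The mechanism can be sketched like this: the expression is evaluated with the utterance's fields in scope, and a `True` result drops the utterance. This is an illustration with hypothetical sample data, not the exporter's actual code:

```python
utterances = [
    {"transcript": "good shepherd", "cer": 0.0},
    {"transcript": "and so a may for phoebe", "cer": 19.05},
]

expr = "cer > 10"   # the kind of expression one would pass to --filter
# an utterance is dropped when the expression evaluates to True for it
kept = [u for u in utterances if not eval(expr, {}, u)]
print([u["transcript"] for u in kept])
```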

### Step 3 - Computing quality

As with filtering, the parameter `--criteria <EXPR>` allows for specifying a Python
expression that has access to all data fields of an aligned utterance.

The expression is applied to each aligned utterance and its numerical return
value is assigned to each utterance as `quality`.

### Step 4 - De-biasing

This step is to (optionally) exclude utterances that would otherwise bias the data
(risk of overfitting).

For each `--debias <META DATA TYPE>` parameter the following procedure is applied:

1. Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob")
   from each utterance and group all utterances accordingly
   (e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...)
2. Compute the standard deviation (`sigma`) of the instance-counts of the groups
3. For each group: If the instance-count exceeds `sigma` times `--debias-sigma-factor <FACTOR>`:
   - Drop the number of exceeding utterances in order of their `quality` (lowest first)
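The procedure above can be sketched as follows, under the assumptions that `sigma` is the population standard deviation of the group sizes and that oversized groups are cut down to the limit, keeping the highest `quality` utterances:

```python
from statistics import pstdev

def debias(utterances, meta_type, sigma_factor):
    # group utterances by their meta data instance for the given type
    groups = {}
    for u in utterances:
        groups.setdefault(u["meta"][meta_type], []).append(u)
    sigma = pstdev([len(g) for g in groups.values()])
    limit = sigma * sigma_factor
    kept = []
    for g in groups.values():
        if len(g) > limit:
            # drop the exceeding utterances, lowest quality first
            g = sorted(g, key=lambda u: u["quality"], reverse=True)[:int(limit)]
        kept.extend(g)
    return kept

utterances = (
    [{"meta": {"speaker": "Alice"}, "quality": q} for q in (5, 6)]
    + [{"meta": {"speaker": "Bob"}, "quality": q} for q in range(1, 9)]
)
result = debias(utterances, "speaker", sigma_factor=1.0)
print(len(result))
```

With group sizes 2 and 8, `sigma` is 3, so Bob's group is trimmed to its 3 best-quality utterances while Alice's small group survives untouched.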

### Step 5 - Partitioning

Training sets are often partitioned into several quality levels.

For each `--partition <QUALITY:PARTITION>` parameter (ordered descending by `QUALITY`):
If the utterance's `quality` value is greater or equal `QUALITY`, assign it to `PARTITION`.

Remaining utterances are assigned to partition `other`.
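The assignment rule can be sketched as follows (an illustration of the rule, not the exporter's code): thresholds are checked in descending order, the first match wins, and everything else goes to `other`:

```python
def assign_partition(quality, partitions):
    """partitions: (threshold, name) pairs; first matching threshold wins."""
    for threshold, name in sorted(partitions, reverse=True):
        if quality >= threshold:
            return name
    return "other"

partitions = [(90, "good"), (75, "fair")]
print([assign_partition(q, partitions) for q in (95, 80, 50)])
```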

### Step 6 - Splitting

Training sets (actually their partitions) are typically split into sets `train`, `dev`
and `test` ([explanation](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)).

This can get automated through the parameter `--split`, which will let the exporter split each
partition (or the entire set) accordingly.

The parameter `--split-field` allows for specifying a meta data type that should be considered
atomic (e.g. "speaker" would result in all utterances of a speaker
instance - like "Alice" - ending up in one sub-set only). This atomic behavior will also hold
true across partitions.

### Step 7 - Output

For each partition/sub-set combination the following is done:

- Construction of a `name` (e.g. `good-dev` will represent the validation set of partition `good`).
- Writing all utterance audio fragments (as `.wav` files) into a sub-directory of `--target-dir <DIR>`
  named `name` (using parameters `--channels <N>` and `--rate <RATE>`).
- Writing an utterance list into `--target-dir <DIR>` named `name.(json|csv)`, dependent on the
  output format specified through `--format <FORMAT>`.

### Additional functionality

Using `--dry-run` one can avoid any writing and get a preview of set-splits and so forth
(`--dry-run-fast` won't even load any sample).

`--force` will force overwriting of samples and list files.

`--workers <N>` allows for specifying the number of parallel workers.

## File formats

### Catalog files (.catalog)

Catalog files (suffix `.catalog`) are used for organizing bigger data file collections and
defining relations among them. A catalog is basically a JSON array of hash-tables where each entry stands
for a single audio file and its associated original transcript.

So a typical catalog looks like this (`data/all.catalog` from this project):

```javascript
[
  {
    "audio": "test1/joined.mp3",
    "tlog": "test1/joined.tlog",
    "script": "test1/transcript.txt",
    "aligned": "test1/joined.aligned"
  },
  {
    "audio": "test2/joined.mp3",
    "tlog": "test2/joined.tlog",
    "script": "test2/transcript.script",
    "aligned": "test2/joined.aligned"
  }
]
```

- `audio` is a path to an audio file (of a format that `pydub` supports)
- `tlog` is the (supposed) path to the STT generated transcription log of the audio file
- `script` is the path to the original transcript of the audio file
  (as `.txt` or `.script` file)
- `aligned` is the (supposed) path to a `.aligned` file

Be aware: __All relative file paths are treated as relative to the catalog file's directory__.

The tools `bin/align.sh`, `bin/statistics.sh` and `bin/export.sh` all support the parameter
`--catalog`:

The __alignment tool__ `bin/align.sh` requires either `tlog` to point to an existing
file or (if not) `audio` to point to an existing audio file for being able to transcribe
it and store it at the path indicated by `tlog`. Furthermore it requires `script` to
point to an existing script. It will write its alignment results to the path in `aligned`.

The __export tool__ `bin/export.sh` requires `audio` and `aligned` to point to existing files.

The __statistics tool__ `bin/statistics.sh` requires only `aligned` to point to existing files.

Advantages of having a catalog file:

- Simplified tool usage with only one parameter for defining all involved files (`--catalog`).
- A directory with many files has to be scanned just one time at catalog generation.
- Different file types can live at different and custom locations in the system.
  This is important in case of read-only access rights to the original data.
  It can also be used to avoid tainting the original directory tree.
- Accumulated statistics
- Better progress indication (as the total number of files is available up front)
- Reduced tool startup overhead
- Allows for meta-data aware set-splitting on export - e.g. if some speakers are speaking
  in several files.

So especially in case of many files to process it is highly recommended to __first create
a catalog file__ with all paths present (even the ones not pointing to existing files yet).
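Since all relative paths are resolved against the catalog file's directory, loading a catalog can be sketched like this (an illustration, not the tools' actual loader; the demo writes a tiny one-entry catalog into a temp dir):

```python
import json
import tempfile
from pathlib import Path

def load_catalog(catalog_path):
    """Load a .catalog file, resolving entry paths against its directory."""
    catalog_path = Path(catalog_path)
    entries = json.loads(catalog_path.read_text())
    for entry in entries:
        for key in ("audio", "tlog", "script", "aligned"):
            if key in entry:
                entry[key] = str(catalog_path.parent / entry[key])
    return entries

# demo: write a tiny catalog into a temp dir and load it back
tmp = Path(tempfile.mkdtemp())
(tmp / "all.catalog").write_text(json.dumps([{"audio": "test1/joined.mp3"}]))
entries = load_catalog(tmp / "all.catalog")
print(entries[0]["audio"])
```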

### Script files (.script|.txt)

The alignment tool requires an original script (or human transcript) of the provided audio.
These scripts can be represented in two basic forms:

- plain text files (`.txt`) or
- script files (`.script`)

In case of plain text files the content is considered a continuous stream of text without
any assigned meta data. The only exception is the option `--text-meaningful-newlines`, which
tells the aligner to consider newlines as separators between utterances
in conjunction with option `--align-phrase-snap-factor`.

If the original data source features utterance meta data, one should consider converting it
to the `.script` JSON file format, which looks like this
(excerpt of `data/test2/transcript.script`):

```javascript
[
  // ...
  {
    "speaker": "Phebe",
    "text": "Good shepherd, tell this youth what 'tis to love."
  },
  {
    "speaker": "Silvius",
    "text": "It is to be all made of sighs and tears; And so am I for Phebe."
  },
  // ...
]
```

_This and the following sub-sections are all using the same real world examples and excerpts._

It is basically again an array of hash-tables, where each hash-table represents an utterance with the
only mandatory field `text` for its textual representation.

All other fields are considered meta data
(with the key called "meta data type" and the value "meta data instance").

### Transcription log files (.tlog)

The alignment tool relies on timed STT transcripts of the provided audio.
These transcripts are either provided by some external processing
(even using a different STT system than DeepSpeech) or will get generated
as part of the alignment process.

They are called transcription logs (`.tlog`) and look like this
(excerpt of `data/test2/joined.tlog`):

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd"
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love"
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears"
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe"
  },
  // ...
]
```

The fields of each entry:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `transcript`: STT transcript of the utterance (mandatory)
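Since `start` and `end` are millisecond offsets, a fragment's duration is simply `end - start`. A small sketch using entries shaped like the example above:

```python
entries = [
    {"start": 7491960, "end": 7493040, "transcript": "good shepherd"},
    {"start": 7493040, "end": 7495110, "transcript": "tell this youth what tis to love"},
]

# sum up the transcribed speech time across entries
total_ms = sum(e["end"] - e["start"] for e in entries)
print(total_ms / 1000.0)   # total transcribed time in seconds
```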
### Aligned files (.aligned)
|
|
||||||
|
|
||||||
The result of aligning an audio file with an original transcript is written to an
|
|
||||||
`.aligned` JSON file consisting of an array of hash-tables of the following form:
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
[
|
|
||||||
// ...
|
|
||||||
{
|
|
||||||
"start": 7491960,
|
|
||||||
"end": 7493040,
|
|
||||||
"transcript": "good shepherd",
|
|
||||||
"text-start": 98302,
|
|
||||||
"text-end": 98316,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Phebe"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "Good shepherd,",
|
|
||||||
"aligned": "good shepherd",
|
|
||||||
"wng": 99.99999999999997,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 100.0,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 0.0
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7493040,
|
|
||||||
"end": 7495110,
|
|
||||||
"transcript": "tell this youth what tis to love",
|
|
||||||
"text-start": 98317,
|
|
||||||
"text-end": 98351,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Phebe"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "tell this youth what 'tis to love.",
|
|
||||||
"aligned": "tell this youth what 'tis to love",
|
|
||||||
"wng": 92.71730687405957,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 96.96969696969697,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 3.0303030303030303
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7495380,
|
|
||||||
"end": 7498020,
|
|
||||||
"transcript": "it is to be made of soles and tears",
|
|
||||||
"text-start": 98352,
|
|
||||||
"text-end": 98392,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Silvius"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "It is to be all made of sighs and tears;",
|
|
||||||
"aligned": "it is to be all made of sighs and tears",
|
|
||||||
"wng": 77.93921929148159,
|
|
||||||
"jaro_winkler": 100.0,
|
|
||||||
"levenshtein": 82.05128205128204,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 17.94871794871795
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"start": 7498470,
|
|
||||||
"end": 7500150,
|
|
||||||
"transcript": "and so a may for phoebe",
|
|
||||||
"text-start": 98393,
|
|
||||||
"text-end": 98415,
|
|
||||||
"meta": {
|
|
||||||
"speaker": [
|
|
||||||
"Silvius"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"aligned-raw": "And so am I for Phebe.",
|
|
||||||
"aligned": "and so am i for phebe",
|
|
||||||
"wng": 66.82687893873339,
|
|
||||||
"jaro_winkler": 98.47964113181504,
|
|
||||||
"levenshtein": 82.6086956521739,
|
|
||||||
"mra": 100.0,
|
|
||||||
"cer": 19.047619047619047
|
|
||||||
},
|
|
||||||
// ...
|
|
||||||
]
|
|
||||||
```
|
|
||||||
|
|
||||||
Each array entry represents an aligned audio fragment with the following attributes:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the aligned audio file
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the aligned audio file
- `transcript`: STT transcript used for aligning
- `text-start`: character offset of the fragment's associated original text within the aligned text document
- `text-end`: character offset of the end of the fragment's associated original text within the aligned text document
- `meta`: meta data hash-table with
  - _key_: meta data type
  - _value_: array of meta data instances coalesced from the `.script` entries that this entry intersects with
- `aligned-raw`: __raw__ original text fragment that got aligned with the audio fragment and its STT transcript
- `aligned`: __clean__ original text fragment that got aligned with the audio fragment and its STT transcript
- `<metric>`: for each `--output-<metric>` parameter the alignment tool adds an entry with the computed value (in this case `wng`, `jaro_winkler`, `levenshtein`, `mra`, `cer`)
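Such entries can be consumed directly with Python's `json` module. A minimal sketch (the inlined data mirrors one of the entries above; the 20% `cer` cut-off is just an example value):

```python
import json

# Inline stand-in for a loaded `.aligned` file (fields taken from the excerpt above)
aligned_json = """
[
  {"start": 7498470, "end": 7500150,
   "transcript": "and so a may for phoebe",
   "aligned": "and so am i for phebe",
   "cer": 19.047619047619047}
]
"""

fragments = json.loads(aligned_json)
for fragment in fragments:
    duration_ms = fragment["end"] - fragment["start"]  # both offsets are milliseconds
    keep = fragment["cer"] < 20.0                      # e.g. drop high-error fragments
    print(duration_ms, keep)  # → 1680 True
```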
## Text distance metrics

This section lists all available text distance metrics along with their IDs for
command-line use.

### Weighted N-grams (wng)

The weighted N-gram score is computed as the sum of the weights of all N-grams
the two texts share.
It ensures that:

- Shared N-gram instances near interval bounds (dependent on the situation) get rated higher than
  ones near the center or the opposite end
- Long shared N-gram instances are weighted higher than short ones

`--align-min-ngram-size <SIZE>` sets the start (minimum) N-gram size

`--align-max-ngram-size <SIZE>` sets the final (maximum) N-gram size

`--align-ngram-size-factor <FACTOR>` sets a weight factor for the size preference

`--align-ngram-position-factor <FACTOR>` sets a weight factor for the position preference
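The exact weighting is internal to the tool; the following is only an illustrative sketch of the two stated preferences (size and position), with made-up factor semantics:

```python
def ngrams(text, n):
    """All character N-grams of the given size, with their start positions."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

def weighted_ngram_score(a, b, min_n=1, max_n=3,
                         size_factor=1.0, position_factor=1.0):
    """Toy weighted N-gram score: sum over all N-gram sizes of the shared
    instances, where longer N-grams and instances near the start of `a`
    contribute more. Not the tool's exact formula."""
    score = 0.0
    for n in range(min_n, max_n + 1):
        b_grams = {gram for gram, _ in ngrams(b, n)}
        for gram, pos in ngrams(a, n):
            if gram in b_grams:
                size_weight = n * size_factor
                position_weight = 1.0 - position_factor * pos / max(len(a), 1)
                score += size_weight * max(position_weight, 0.0)
    return score
```

Identical texts score highest, and a single exchanged character already removes all N-grams crossing it.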
### Jaro-Winkler (jaro_winkler)

Jaro-Winkler is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).

### Editex (editex)

Editex is a phonetic text distance algorithm described
[here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138&rep=rep1&type=pdf).

### Levenshtein (levenshtein)

Levenshtein is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Levenshtein_distance).
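For reference, a minimal dynamic-programming implementation; the percentage scaling mirrors the `levenshtein` values seen in the `.aligned` excerpt above, but is an assumption about the tool's exact normalization:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, two rows of memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Distance scaled to a 0-100 similarity (assumed normalization)."""
    longest = max(len(a), len(b))
    return 100.0 * (1.0 - levenshtein(a, b) / longest) if longest else 100.0
```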
### MRA (mra)

The "Match rating approach" is a phonetic text distance algorithm described
[here](https://en.wikipedia.org/wiki/Match_rating_approach).

### Hamming (hamming)

The Hamming distance is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Hamming_distance).

### Word error rate (wer)

This is the same as Levenshtein - just on word level.

Not available for gap alignment.

### Character error rate (cer)

This is the same as Levenshtein but using a different implementation.

Not available for gap alignment.

### Smith-Waterman score (sws)

This is the final Smith-Waterman score coming from the rough alignment
step (but before gap alignment!).
It is described
[here](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

Not available for gap alignment.

### Transcript length (tlen)

The character length of the STT transcript.

Not available for gap alignment.

### Matched text length (mlen)

The character length of the matched text of the original transcript (cleaned).

Not available for gap alignment.
```diff
@@ -20,9 +20,9 @@ def build_catalog():
     for source_glob in CLI_ARGS.sources:
         catalog_paths.extend(glob(source_glob))
     items = []
-    for catalog_path in catalog_paths:
-        catalog_path = Path(catalog_path).absolute()
-        print('Loading catalog "{}"'.format(str(catalog_path)))
+    for catalog_original_path in catalog_paths:
+        catalog_path = Path(catalog_original_path).absolute()
+        print('Loading catalog "{}"'.format(str(catalog_original_path)))
         if not catalog_path.is_file():
             fail('Unable to find catalog file "{}"'.format(str(catalog_path)))
         with open(catalog_path, 'r') as catalog_file:
@@ -30,13 +30,13 @@ def build_catalog():
         base_path = catalog_path.parent.absolute()
         for item in catalog_items:
             new_item = {}
-            for entry, entry_path in item.items():
-                entry_path = Path(entry_path)
-                entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path)
+            for entry, entry_original_path in item.items():
+                entry_path = Path(entry_original_path)
+                entry_path = entry_path if entry_path.is_absolute() else (base_path / entry_path).absolute()
                 if ((len(CLI_ARGS.check) == 1 and CLI_ARGS.check[0] == 'all')
                         or entry in CLI_ARGS.check) and not entry_path.is_file():
                     note = 'Catalog "{}" - Missing file for "{}" ("{}")'.format(
-                        str(catalog_path), entry, str(entry_path))
+                        str(catalog_original_path), entry, str(entry_original_path))
                     if CLI_ARGS.on_miss == 'fail':
                         fail(note + ' - aborting')
                     if CLI_ARGS.on_miss == 'ignore':
@@ -54,7 +54,7 @@ def build_catalog():
             items.append(new_item)
     if CLI_ARGS.output is not None:
         catalog_path = Path(CLI_ARGS.output).absolute()
-        print('Writing catalog "{}"'.format(str(catalog_path)))
+        print('Writing catalog "{}"'.format(str(CLI_ARGS.output)))
         if CLI_ARGS.make_relative:
             base_path = catalog_path.parent
             for item in items:
@@ -63,7 +63,7 @@ def build_catalog():
     if CLI_ARGS.order_by is not None:
         items.sort(key=lambda i: i[CLI_ARGS.order_by] if CLI_ARGS.order_by in i else '')
     with open(catalog_path, 'w') as catalog_file:
-        json.dump(items, catalog_file)
+        json.dump(items, catalog_file, indent=2)


 def handle_args():
@@ -71,7 +71,7 @@ def handle_args():
                             'converting paths within catalog files')
     parser.add_argument('--output', help='Write collected catalog items to this new catalog file')
     parser.add_argument('--make-relative', action='store_true',
-                        help='Make all path entries of all items relative to target catalog file\'s parent directory')
+                        help='Make all path entries of all items relative to new catalog file\'s parent directory')
     parser.add_argument('--check',
                         help='Comma separated list of path entries to check for existence '
                              '("all" for checking every entry, default: no checks)')
@@ -338,7 +338,6 @@ def parse_args():
                         help='Take audio file as input (requires "--aligned <file>")')
     parser.add_argument('--aligned', type=str,
                         help='Take alignment file ("<...>.aligned") as input (requires "--audio <file>")')
-
     parser.add_argument('--catalog', type=str,
                         help='Take alignment and audio file references of provided catalog ("<...>.catalog") as input')
     parser.add_argument('--ignore-missing', action="store_true",
@@ -8,14 +8,14 @@ forbidden_keys = ['start', 'end', 'text', 'transcript']

 def main(args):
     parser = argparse.ArgumentParser(description='Annotate .tlog or .script files by adding meta data')
     parser.add_argument('target', type=str, help='')
-    parser.add_argument('assignment', action='append', help='Meta data assignment of the form <key>=<value>')
+    parser.add_argument('assignments', nargs='+', help='Meta data assignments of the form <key>=<value>')
     args = parser.parse_args()

     with open(args.target, 'r') as json_file:
         entries = json.load(json_file)

-    for assign in args.assignment:
-        key, value = assign.split('=')
+    for assignment in args.assignments:
+        key, value = assignment.split('=')
         if key in forbidden_keys:
             print('Meta data key "{}" not allowed - forbidden: {}'.format(key, '|'.join(forbidden_keys)))
             sys.exit(1)
@@ -23,7 +23,7 @@ def main(args):
             entry[key] = value

     with open(args.target, 'w') as json_file:
-        json.dump(entries, json_file)
+        json.dump(entries, json_file, indent=2)


 if __name__ == '__main__':
@@ -126,6 +126,8 @@ def main(args):
                         help='Read alignment references of provided catalog ("<...>.catalog") as input')
     parser.add_argument('--no-progress', action='store_true',
                         help='Prevents showing progress bars')
+    parser.add_argument('--progress-interval', type=float, default=1.0,
+                        help='Progress indication interval in seconds')

     args = parser.parse_args()

```
## Alignment algorithm and its parameters

### Step 1 - Splitting audio

A voice activity detector (at the moment this is `webrtcvad`) is used
to split the provided audio data into voice fragments.
These fragments are essentially streams of continuous speech without any longer pauses
(e.g. sentences).

`--audio-vad-aggressiveness <AGGRESSIVENESS>` can be used to influence the length of the
resulting fragments.
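The actual splitting is done by `webrtcvad`; purely to illustrate the idea, here is a toy energy-based stand-in (frame representation and threshold are made up for the example):

```python
def split_voiced(frames, threshold=0.1):
    """Toy VAD: mark frames whose mean absolute amplitude exceeds a threshold
    as voiced, then merge consecutive voiced frames into fragments.
    Returns (first_frame, last_frame) index pairs."""
    fragments, current = [], []
    for index, frame in enumerate(frames):
        energy = sum(abs(sample) for sample in frame) / len(frame)
        if energy > threshold:
            current.append(index)
        elif current:
            fragments.append((current[0], current[-1]))
            current = []
    if current:
        fragments.append((current[0], current[-1]))
    return fragments

frames = [[0.0, 0.0], [0.5, 0.4], [0.6, 0.2], [0.0, 0.0], [0.3, 0.3]]
print(split_voiced(frames))  # → [(1, 2), (4, 4)]
```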
### Step 2 - Preparation of original text

STT transcripts are typically provided in a normalized textual form with
- no casing,
- no punctuation and
- normalized whitespace (single spaces only).

So to be able to align STT transcripts with the original text, it is necessary
to internally convert the original text into the same form.

This happens in two steps:
1. Normalization of whitespace, lower-casing all text and
   replacing some characters with spaces (e.g. dashes)
2. Removal of all characters that are not in the language's alphabet
   (see DeepSpeech model data)

Be aware: *This conversion happens on text basis and will not remove unspoken content
like markup/markdown tags or artifacts. This should be done beforehand.
Reducing the difference between spoken and original text will improve alignment quality and speed.*

In the very unlikely situation that you have to change the default behavior (of step 1),
there are some switches:

`--text-keep-dashes` will prevent substitution of dashes with spaces.

`--text-keep-ws` will keep whitespace untouched.

`--text-keep-casing` will keep character casing as provided.
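The two steps above can be sketched as follows (the alphabet here is an assumption modeled on the English DeepSpeech alphabet; the real one comes from the model data):

```python
import re

# Assumed alphabet; the tool reads it from the DeepSpeech model data instead
ALPHABET = set("abcdefghijklmnopqrstuvwxyz' ")

def clean_text(text, alphabet=ALPHABET):
    text = text.lower().replace('-', ' ')              # step 1: casing and dashes
    text = ''.join(c for c in text if c in alphabet)   # step 2: out-of-alphabet chars
    return re.sub(r'\s+', ' ', text).strip()           # step 1: normalize whitespace

print(clean_text("Don't  stop -- NOW!"))  # → don't stop now
```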
### Step 3 (optional) - Generating document specific language model

If the [dependencies](lm.md) for
individual language model generation got installed, this document-individual model will
now be generated by default.

Assuming your text document is named `original.txt`, these files will be generated:
- `original.txt.clean` - cleaned version of the original text
- `original.txt.arpa` - text file with probabilities in ARPA format
- `original.txt.lm` - binary representation of the former one
- `original.txt.trie` - prefix-tree optimized for probability lookup

`--stt-no-own-lm` deactivates creation of individual language models per document and
uses the one from the model dir instead.
### Step 4 - Transcription of voice fragments through STT

After VAD splitting, the resulting fragments are transcribed into textual phrases.
This transcription is done through [DeepSpeech](https://github.com/mozilla/DeepSpeech/) STT.

As this can take a long time, all resulting phrases are - together with their
timestamps - saved as JSON into a transcription log file
(the `audio` parameter path with suffix `.tlog` instead of `.wav`).
Subsequent calls will look for that file and - if found -
load it and skip the transcription phase.

`--stt-model-dir <DIR>` points DeepSpeech to the language specific model data directory.
It defaults to `models/en`. Use `bin/getmodel.sh` for preparing it.
### Step 5 - Rough alignment

The actual text alignment is based on a recursive divide and conquer approach:

1. Construct an ordered list of all phrases in the current interval
   (at the beginning this is the list of all phrases that are to be aligned),
   where long phrases close to the middle of the interval come first.
2. Iterate through the list and compute the best Smith-Waterman alignment
   (see the following sub-sections) with the document's original text...
3. ...till there is a phrase whose Smith-Waterman alignment score surpasses a (low) recursion-depth
   dependent threshold (in most cases this should already be the first phrase).
4. Recursively continue with step 1 for the sub-intervals and original text ranges
   to the left and right of the phrase and its aligned text range within the original text.
5. Return all phrases in order of appearance (depth-first) that were aligned with the minimum
   Smith-Waterman score on their recursion level.

This approach assumes that all phrases were spoken in the same order as they appear in the
original transcript. It has the following advantages compared to individual
global phrase matching:

- Long non-matching chunks of spoken text or the original transcript will automatically and
  cleanly get ignored.
- Short phrases (with the risk of matching more than once per document) will automatically
  get aligned to their intended locations by longer ones that "squeeze" them in.
- Smith-Waterman score thresholds can be kept lower
  (and thus better match lower quality STT transcripts), as there is a lower chance for
  - long sequences to match at a wrong location and for
  - shorter sequences to match at a wrong location within their shortened intervals
    (as they are getting matched later and deeper in the recursion tree).
#### Smith-Waterman candidate selection

Finding the best match of a given phrase within the original (potentially long) transcript
using vanilla Smith-Waterman is not feasible.

So this tool follows a two-phase approach where the first goal is to get a list of alignment
candidates. As the first step, the original text is virtually partitioned into windows of the
same length as the search pattern. These windows are ordered descending by the number of 3-grams
they share with the pattern.
The best alignment candidates are then taken from the beginning of this ordered list.

`--align-max-candidates <CANDIDATES>` sets the maximum number of candidate windows
taken from the beginning of the list for further alignment.

`--align-candidate-threshold <THRESHOLD>`, multiplied with the number of 3-grams of the predecessor
window, gives the minimum number of 3-grams the next candidate window has to have to also be
considered a candidate.
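A sketch of this candidate selection (the non-overlapping partition stride and the default values are assumptions for the example):

```python
def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_windows(original, pattern, max_candidates=10, threshold=0.92):
    """Rank pattern-sized windows of the original text by shared 3-grams and
    keep the top ones, stopping when the share drops below the threshold."""
    pattern_grams = char_ngrams(pattern)
    size = len(pattern)
    scored = []
    for start in range(0, max(len(original) - size, 0) + 1, size):
        window = original[start:start + size]
        scored.append((len(pattern_grams & char_ngrams(window)), start))
    scored.sort(reverse=True)
    candidates, last = [], None
    for shared, start in scored[:max_candidates]:
        if last is not None and shared < threshold * last:
            break  # next window shares too few 3-grams with the pattern
        candidates.append(start)
        last = shared
    return candidates

original = "the quick brown fox jumps over the lazy dog"
print(candidate_windows(original, "brown fox")[0])  # → 9
```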
#### Smith-Waterman alignment

For each candidate, the best possible alignment is computed using the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) algorithm
within an extended interval of one window-size around the candidate window.

`--align-match-score <SCORE>` is the score per correctly matched character. Default: 100

`--align-mismatch-score <SCORE>` is the score per non-matching (exchanged) character. Default: -100

`--align-gap-score <SCORE>` is the score per character gap (removing 1 character from pattern or original). Default: -100

The overall best score of the best match is normalized to a maximum of about 100 by dividing
it by the maximum character count of either the match or the pattern.
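A compact Smith-Waterman sketch with the default scores above; as a simplification it normalizes by the longer of the two full strings rather than by the matched span:

```python
def smith_waterman(pattern, text, match=100, mismatch=-100, gap=-100):
    """Local alignment: best cell of the DP matrix, kept in two rows of memory,
    normalized to ~100 max (simplified normalization, see lead-in)."""
    previous = [0] * (len(text) + 1)
    best = 0
    for i in range(1, len(pattern) + 1):
        current = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            diag = previous[j - 1] + (match if pattern[i - 1] == text[j - 1] else mismatch)
            current[j] = max(0,                    # local alignment floor
                             diag,                 # match / mismatch
                             previous[j] + gap,    # gap in text
                             current[j - 1] + gap) # gap in pattern
            best = max(best, current[j])
        previous = current
    return best / max(len(pattern), len(text), 1)

print(smith_waterman("abc", "abc"))  # → 100.0
```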
### Step 6 - Gap alignment

After recursive matching of fragments there are potential text leftovers between aligned original
texts.

Some examples:
- Often: Missing (and therefore unaligned) STT transcripts of word-endings (e.g. English past tense endings _-d_ and _-ed_)
  on phrase endings to the left of the gap
- Seldom: Phrase beginnings or endings that were wrongly matched on unspoken (but written) text whose actual
  alignments are now left unaligned in the gap
- Big unmatched chunks of text, like
  - Preface, text summaries or any other kind of meta information
  - Copyright headers/footers
  - Table of contents
  - Chapter headers (if not spoken as they appear)
  - Captions of figures
  - Contents of tables
  - Line-headers like character names in drama scripts
- Depending on the (pre-processing) quality: OCR leftovers like
  - page headers
  - page numbers
  - reader's notes

The basic challenge here is to figure out if all or some of the gap text should be used to extend
the phrase to the left and/or to the right of the gap.

As Smith-Waterman alignment led to the current (potentially incomplete or even wrong) result,
its score cannot be used for further fine-tuning. Therefore there is a collection of
so-called [text distance metrics](metrics.md) to pick from using the `--align-similarity-algo <METRIC-ID>`
parameter.

Using the selected distance metric, the gap alignment is done by looking for the best scoring
extension of the left and right phrases up to their maximum extension.

`--align-stretch-factor <FRACTION>` is the fraction of the text length that it could get
stretched at max.

For many languages it is worth putting some emphasis on matching word boundaries
(that is: white-space separated sub-sequences).

`--align-snap-factor <FACTOR>` allows controlling the snappiness to word boundaries.

Should the best scoring extensions overlap, the best scoring sum of non-overlapping
(but touching) extensions wins.
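A sketch of a rightward extension search (the scoring callable and the `difflib` ratio stand in for the configured text distance metric; leftward extension and boundary snapping are omitted):

```python
import difflib

def best_extension(phrase, gap_text, score, max_stretch):
    """Try all right-extensions of the phrase into the gap, keep the best-scoring one."""
    best_text, best_score = phrase, score(phrase)
    for take in range(1, min(len(gap_text), max_stretch) + 1):
        candidate = phrase + gap_text[:take]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
    return best_text

# Hypothetical metric: similarity to the fragment's STT transcript
transcript = "and so am i for phebe"
score = lambda text: difflib.SequenceMatcher(None, text, transcript).ratio()

print(best_extension("and so am i for", " phebe the next", score, 10))
# → and so am i for phebe
```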
### Step 7 - Selection, filtering and output

Finally the best alignment of all candidate windows is selected as the winner.
It has to survive a series of filters to get into the result file.

For each text distance metric there are two filter parameters:

`--output-min-<METRIC-ID> <VALUE>` only keeps utterances having the provided minimum value for the
metric with id `METRIC-ID`

`--output-max-<METRIC-ID> <VALUE>` only keeps utterances having the provided maximum value for the
metric with id `METRIC-ID`

For each text distance metric there's also the option to have it added to each utterance's entry:

`--output-<METRIC-ID>` adds the computed value for `<METRIC-ID>` to the utterance's array-entry

Error rates and scores are provided as fractional values (typically between 0.0 = 0% and 1.0 = 100%,
where numbers >1.0 are theoretically possible).
## Export

After files got successfully aligned, one would typically want to export the aligned utterances
as machine learning training samples.

This is where the export tool `bin/export.sh` comes in.

### Step 1 - Reading the input

The exporter takes either a single audio file (`--audio <AUDIO>`)
plus a corresponding `.aligned` file (`--aligned <ALIGNED>`) or a series
of such pairs from a `.catalog` file (`--catalog <CATALOG>`) as input.

All of the following computations will be done on the joined list of all aligned
utterances of all input pairs.

Option `--ignore-missing` will not fail on missing file references in the catalog
and instead just ignore the affected catalog entry.
### Step 2 - (Pre-) Filtering

The parameter `--filter <EXPR>` allows specifying a Python expression that has access
to all data fields of an aligned utterance (as can be seen in `.aligned` file entries).

This expression is applied to each aligned utterance and in case it returns `True`,
the utterance will get excluded from all the following steps.
This is useful for excluding utterances that would not work as input for the planned
training or other kind of application.
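The mechanism can be sketched as evaluating the expression with the utterance's fields as local variables (the exact evaluation context is an assumption; field names are taken from the `.aligned` format):

```python
# Inline stand-ins for aligned utterances
utterances = [
    {"transcript": "and so a may for phoebe", "cer": 19.0},
    {"transcript": "it is to be all made of sighs and tears", "cer": 45.0},
]

filter_expr = "cer > 40"  # e.g. exclude utterances with a high character error rate

# Evaluate the expression per utterance; True means "exclude"
kept = [u for u in utterances
        if not eval(filter_expr, {"__builtins__": {}}, dict(u))]
print(len(kept))  # → 1
```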
### Step 3 - Computing quality

As with filtering, the parameter `--criteria <EXPR>` allows for specifying a Python
expression that has access to all data fields of an aligned utterance.

The expression is applied to each aligned utterance and its numerical return
value is assigned to each utterance as `quality`.

### Step 4 - De-biasing

This step (optionally) excludes utterances that would otherwise bias the data
(risk of overfitting).

For each `--debias <META DATA TYPE>` parameter the following procedure is applied:
1. Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob")
   from each utterance and group all utterances accordingly
   (e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...)
2. Compute the standard deviation (`sigma`) of the instance-counts of the groups
3. For each group: If the instance-count exceeds `sigma` times `--debias-sigma-factor <FACTOR>`:
   - Drop the number of exceeding utterances in order of their `quality` (lowest first)
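The procedure above can be sketched like this (a simplified reading: utterances carrying several instances would be counted once per group, and the cap rounding is a guess):

```python
import statistics
from collections import defaultdict

def debias(utterances, meta_type, sigma_factor=1.0):
    """Cap over-represented meta data groups, dropping lowest quality first."""
    groups = defaultdict(list)
    for utterance in utterances:
        for instance in utterance.get(meta_type, []):
            groups[instance].append(utterance)
    sigma = statistics.pstdev(len(g) for g in groups.values())
    cap = max(int(sigma * sigma_factor), 1)
    kept = []
    for group in groups.values():
        group.sort(key=lambda u: u["quality"], reverse=True)  # best quality first
        kept.extend(group[:cap])                              # drop the exceeding rest
    return kept

utterances = ([{"speaker": ["Alice"], "quality": q} for q in (1, 2)] +
              [{"speaker": ["Bob"], "quality": q} for q in (1, 2, 3, 4, 5)])
kept = debias(utterances, "speaker", sigma_factor=2.0)
print(len(kept))  # → 5
```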
### Step 5 - Partitioning

Training sets are often partitioned into several quality levels.

For each `--partition <QUALITY:PARTITION>` parameter (ordered descending by `QUALITY`):
If the utterance's `quality` value is greater than or equal to `QUALITY`, it is assigned to `PARTITION`.

Remaining utterances are assigned to partition `other`.
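In other words, the first matching threshold wins (the `--partition 0.9:good --partition 0.5:fair` configuration below is hypothetical):

```python
def assign_partition(quality, partitions):
    """partitions: (QUALITY, PARTITION) pairs, ordered descending by QUALITY."""
    for threshold, name in partitions:
        if quality >= threshold:
            return name
    return "other"

partitions = [(0.9, "good"), (0.5, "fair")]
print(assign_partition(0.95, partitions))  # → good
```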
### Step 6 - Splitting

Training sets (actually their partitions) are typically split into sets `train`, `dev`
and `test` ([explanation](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)).

This can get automated through parameter `--split` which will let the exporter split each
partition (or the entire set) accordingly.

Parameter `--split-field` allows for specifying a meta data type that should be considered
atomic (e.g. "speaker" would result in all utterances of a speaker
instance - like "Alice" - ending up in one sub-set only). This atomic behavior will also hold
true across partitions.

Option `--split-drop-multiple` allows for dropping all samples with multiple `--split-field` assignments - e.g. a
sample with more than one "speaker".

In contrast, option `--split-drop-unknown` allows for dropping all samples with no `--split-field` assignment.

With option `--assign-{train|dev|test} <VALUES>` one can pre-assign values (of the comma-separated list)
to the specified set.

Option `--split-seed <SEED>` sets an integer random seed for the split operation.
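The atomic split can be sketched as shuffling and cutting the field instances rather than the samples themselves (ratios and the cutting scheme are assumptions for the example):

```python
import random

def split_atomic(samples, field, ratios=(0.8, 0.1, 0.1), seed=42):
    """Keep all samples of one field instance (e.g. one speaker) in the same sub-set."""
    rng = random.Random(seed)
    instances = sorted({s[field] for s in samples})
    rng.shuffle(instances)
    train_cut = int(len(instances) * ratios[0])
    dev_cut = train_cut + int(len(instances) * ratios[1])
    assignment = {inst: ("train" if i < train_cut else "dev" if i < dev_cut else "test")
                  for i, inst in enumerate(instances)}
    return {name: [s for s in samples if assignment[s[field]] == name]
            for name in ("train", "dev", "test")}

# 10 hypothetical speakers with 2 samples each
samples = [{"speaker": "s{}".format(i), "n": j} for i in range(10) for j in range(2)]
sets = split_atomic(samples, "speaker")
print(len(sets["train"]), len(sets["dev"]), len(sets["test"]))  # → 16 2 2
```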
### Step 7 - Output

For each partition/sub-set combination the following is done:

- Construction of a `name` (e.g. `good-dev` will represent the validation set of partition `good`).
- All samples are lazy-loaded and potentially re-sampled to match the parameters:
  - `--channels <N>`: Number of audio channels - 1 for mono (default), 2 for stereo
  - `--rate <RATE>`: Sample rate - default: 16000
  - `--width <WIDTH>`: Sample width in bytes - default: 2 (16 bit)

`--workers <WORKERS>` can be used to specify how many parallel processes should be used for loading and re-sampling.

`--tmp-dir <DIR>` overrides the system default temporary directory that is used for converting samples.

`--skip-damaged` allows for just skipping export of samples that cannot be loaded.

- If option `--target-dir <DIR>` is provided, all output will be written to the provided target directory.
  This can be done in two different ways:
  1. With the additional option `--sdb` each set will be written to a so-called Sample-DB
     that can be used by DeepSpeech. It will be written as `<name>.sdb` into the target directory.
     SDB export can be controlled with the following additional options:
     - `--sdb-bucket-size <SIZE>`: SDB bucket size (using units like "1GB") for external sorting of the samples
     - `--sdb-workers <WORKERS>`: Number of parallel workers for preparing and compressing SDB entries
     - `--sdb-buffered-samples <SAMPLES>`: Number of samples per bucket buffer during the last phase of external sorting
     - `--sdb-audio-type <TYPE>`: Internal audio type for storing SDB samples - `wav` or `opus` (default)
  2. Without option `--sdb` all samples are written as WAV-files into sub-directory `<name>`
     of the target directory, with a list of samples in a `<name>.csv` file next to it with columns
     `wav_filename`, `wav_filesize`, `transcript`.

  If not omitted through option `--no-meta`, a CSV file called `<name>.meta` is written to the target directory.
  For each written sample it provides the following columns:
  `sample`, `split_entity`, `catalog_index`, `source_audio_file`, `aligned_file`, `alignment_index`.

  Throughout this process option `--force` allows overwriting any existing files.
- If instead option `--target-tar <TAR-FILE>` is provided, the same file structure as with `--target-dir <DIR>`
  is written directly to the specified tar-file.
  This output variant does not support writing SDBs.

### Additional functionality

Option `--plan <PLAN>` can be used to cache all computational steps before the actual output writing.
The plan will be loaded if existing, or generated otherwise.
This allows for writing several output formats using the same sample set distribution and without having to load
alignment files and re-calculate quality metrics, de-biasing, partitioning or splitting.

Using `--dry-run` one can avoid any writing and get a preview of set-splits and so forth
(`--dry-run-fast` won't even load any sample).
## File formats

### Catalog files (.catalog)

Catalog files (suffix `.catalog`) are used for organizing bigger data file collections and
defining relations among them. A catalog is basically a JSON array of hash-tables where each entry stands
for a single audio file and its associated original transcript.

So a typical catalog looks like this (`data/all.catalog` from this project):

```javascript
[
    {
        "audio": "test1/joined.mp3",
        "tlog": "test1/joined.tlog",
        "script": "test1/transcript.txt",
        "aligned": "test1/joined.aligned"
    },
    {
        "audio": "test2/joined.mp3",
        "tlog": "test2/joined.tlog",
        "script": "test2/transcript.script",
        "aligned": "test2/joined.aligned"
    }
]
```

- `audio` is a path to an audio file (of a format that `pydub` supports)
- `tlog` is the (supposed) path to the STT generated transcription log of the audio file
- `script` is the path to the original transcript of the audio file
  (as `.txt` or `.script` file)
- `aligned` is the (supposed) path to a `.aligned` file

Be aware: __All relative file paths are treated as relative to the catalog file's directory__.
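That path resolution rule can be sketched as follows (the catalog data and base directory here are hypothetical):

```python
from pathlib import Path

def resolve_catalog_paths(items, base):
    """Resolve relative path entries against the catalog file's directory."""
    base = Path(base)
    resolved = []
    for item in items:
        new_item = {}
        for key, value in item.items():
            path = Path(value)
            new_item[key] = str(path if path.is_absolute() else base / path)
        resolved.append(new_item)
    return resolved

catalog = [{"audio": "test1/joined.mp3", "script": "test1/transcript.txt"}]
print(resolve_catalog_paths(catalog, "/data/all")[0]["audio"])
```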
The tools `bin/align.sh`, `bin/statistics.sh` and `bin/export.sh` all support parameter
|
||||||
|
`--catalog`:
|
||||||
|
|
||||||
|
The __alignment tool__ `bin/align.sh` requires either `tlog` to point to an existing
|
||||||
|
file or (if not) `audio` to point to an existing audio file for being able to transcribe
|
||||||
|
it and store it at the path indicated by `tlog`. Furthermore it requires `script` to
|
||||||
|
point to an existing script. It will write its alignment results to the path in `aligned`.
|
||||||
|
|
||||||
|
The __export tool__ `bin/export.sh` requires `audio` and `aligned` to point to existing files.
|
||||||
|
|
||||||
|
The __statistics tool__ `bin/statistics.sh` requires only `aligned` to point to existing files.
|
||||||
|
|
||||||
|
Advantages of having a catalog file:
|
||||||
|
|
||||||
|
- Simplified tool usage with only one parameter for defining all involved files (`--catalog`).
|
||||||
|
- A directory with many files has to be scanned just one time at catalog generation.
|
||||||
|
- Different file types can live at different and custom locations in the system.
|
||||||
|
This is important in case of read-only access rights to the original data.
|
||||||
|
It can also be used for avoiding to taint the original directory tree.
|
||||||
|
- Accumulated statistics
|
||||||
|
- Better progress indication (as the total number of files is available up front)
|
||||||
|
- Reduced tool startup overhead
|
||||||
|
- Allows for meta-data aware set-splitting on export - e.g. if some speakers are speaking
|
||||||
|
in several files.
|
||||||
|
|
||||||
|
So especially in case of many files to process it is highly recommended to __first create
|
||||||
|
a catalog file__ with all paths present (even the ones not pointing to existing files yet).
|
||||||
|
|
||||||
|
|
||||||
|
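The relative-path rule above can be sketched in a few lines of Python. `load_catalog` is a hypothetical helper (not part of DSAlign's API), assuming a catalog is plain JSON as shown:

```python
import json
import os


def load_catalog(catalog_path):
    """Load a .catalog file and resolve all relative path entries against
    the catalog file's own directory, as the DSAlign tools do."""
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    with open(catalog_path, "r", encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        for entry, path in item.items():
            if not os.path.isabs(path):
                item[entry] = os.path.join(base_dir, path)
    return items
```

With this, `load_catalog("data/all.catalog")` would return absolute paths regardless of the current working directory.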
### Script files (.script|.txt)

The alignment tool requires an original script (or human transcript) of the provided audio.
These scripts can be represented in two basic forms:

- plain text files (`.txt`) or
- script files (`.script`)

In case of plain text files the content is considered a continuous stream of text without
any assigned meta data. The only exception is the option `--text-meaningful-newlines`, which
tells the aligner to consider newlines as separators between utterances,
in conjunction with the option `--align-phrase-snap-factor`.

If the original data source features utterance meta data, one should consider converting it
to the `.script` JSON file format, which looks like this
(excerpt of `data/test2/transcript.script`):

```javascript
[
  // ...
  {
    "speaker": "Phebe",
    "text": "Good shepherd, tell this youth what 'tis to love."
  },
  {
    "speaker": "Silvius",
    "text": "It is to be all made of sighs and tears; And so am I for Phebe."
  },
  // ...
]
```

_This and the following sub-sections all use the same real-world examples and excerpts._

It is basically again an array of hash-tables, where each hash-table represents an utterance with the
only mandatory field `text` for its textual representation.

All other fields are considered meta data
(with the key called the "meta data type" and the value a "meta data instance").

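If your source material carries speaker annotations, the conversion is straightforward. The sketch below assumes a hypothetical input format with one utterance per line of the form `Speaker: text`; this is not a DSAlign tool, just an illustration of producing the `.script` structure:

```python
def text_to_script(lines):
    """Convert lines of the hypothetical form "Speaker: text" into the
    .script array-of-hash-tables structure described above."""
    utterances = []
    for line in lines:
        speaker, sep, text = line.partition(":")
        if sep:
            utterances.append({"speaker": speaker.strip(), "text": text.strip()})
        else:
            # no speaker annotation: keep only the mandatory "text" field
            utterances.append({"text": line.strip()})
    return utterances
```

The result can then be serialized with `json.dump(..., indent=4)` and saved with a `.script` suffix.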
### Transcription log files (.tlog)

The alignment tool relies on timed STT transcripts of the provided audio.
These transcripts are either provided by some external process
(possibly even using a different STT system than DeepSpeech) or are generated
as part of the alignment process.

They are called transcription logs (`.tlog`) and look like this
(excerpt of `data/test2/joined.tlog`):

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd"
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love"
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears"
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe"
  },
  // ...
]
```

The fields of each entry:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file (mandatory)
- `transcript`: STT transcript of the utterance (mandatory)

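Externally produced transcription logs are worth sanity-checking before alignment. Here is a minimal sketch (a hypothetical helper, not part of DSAlign) that validates the three mandatory fields and the millisecond ordering described above:

```python
def check_tlog(entries):
    """Return a list of problems found in tlog entries: missing mandatory
    fields, or a fragment whose start time lies after its end time."""
    problems = []
    for i, entry in enumerate(entries):
        for field in ("start", "end", "transcript"):
            if field not in entry:
                problems.append("entry %d: missing %r" % (i, field))
        if "start" in entry and "end" in entry and entry["start"] > entry["end"]:
            problems.append("entry %d: start after end" % i)
    return problems
```

An empty result means the log at least has the right shape; it says nothing about transcript quality.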
### Aligned files (.aligned)

The result of aligning an audio file with an original transcript is written to an
`.aligned` JSON file consisting of an array of hash-tables of the following form:

```javascript
[
  // ...
  {
    "start": 7491960,
    "end": 7493040,
    "transcript": "good shepherd",
    "text-start": 98302,
    "text-end": 98316,
    "meta": {
      "speaker": [
        "Phebe"
      ]
    },
    "aligned-raw": "Good shepherd,",
    "aligned": "good shepherd",
    "wng": 99.99999999999997,
    "jaro_winkler": 100.0,
    "levenshtein": 100.0,
    "mra": 100.0,
    "cer": 0.0
  },
  {
    "start": 7493040,
    "end": 7495110,
    "transcript": "tell this youth what tis to love",
    "text-start": 98317,
    "text-end": 98351,
    "meta": {
      "speaker": [
        "Phebe"
      ]
    },
    "aligned-raw": "tell this youth what 'tis to love.",
    "aligned": "tell this youth what 'tis to love",
    "wng": 92.71730687405957,
    "jaro_winkler": 100.0,
    "levenshtein": 96.96969696969697,
    "mra": 100.0,
    "cer": 3.0303030303030303
  },
  {
    "start": 7495380,
    "end": 7498020,
    "transcript": "it is to be made of soles and tears",
    "text-start": 98352,
    "text-end": 98392,
    "meta": {
      "speaker": [
        "Silvius"
      ]
    },
    "aligned-raw": "It is to be all made of sighs and tears;",
    "aligned": "it is to be all made of sighs and tears",
    "wng": 77.93921929148159,
    "jaro_winkler": 100.0,
    "levenshtein": 82.05128205128204,
    "mra": 100.0,
    "cer": 17.94871794871795
  },
  {
    "start": 7498470,
    "end": 7500150,
    "transcript": "and so a may for phoebe",
    "text-start": 98393,
    "text-end": 98415,
    "meta": {
      "speaker": [
        "Silvius"
      ]
    },
    "aligned-raw": "And so am I for Phebe.",
    "aligned": "and so am i for phebe",
    "wng": 66.82687893873339,
    "jaro_winkler": 98.47964113181504,
    "levenshtein": 82.6086956521739,
    "mra": 100.0,
    "cer": 19.047619047619047
  },
  // ...
]
```

Each array entry represents an aligned audio fragment with the following attributes:

- `start`: time offset of the audio fragment in milliseconds from the beginning of the
  aligned audio file
- `end`: time offset of the audio fragment's end in milliseconds from the beginning of the
  aligned audio file
- `transcript`: STT transcript used for aligning
- `text-start`: character offset of the fragment's associated original text within the
  aligned text document
- `text-end`: character offset of the end of the fragment's associated original text within the
  aligned text document
- `meta`: meta data hash-table with
  - _key_: meta data type
  - _value_: array of meta data instances coalesced from the `.script` entries that
    this entry intersects with
- `aligned-raw`: __raw__ original text fragment that got aligned with the audio fragment
  and its STT transcript
- `aligned`: __clean__ original text fragment that got aligned with the audio fragment
  and its STT transcript
- `<metric>`: for each `--output-<metric>` parameter the alignment tool adds an entry with the
  computed value (in this case `wng`, `jaro_winkler`, `levenshtein`, `mra`, `cer`)

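A common post-processing step is to keep only fragments whose quality passes some bar. The sketch below (a hypothetical helper, not a DSAlign tool) filters an `.aligned` file on an error metric such as `cer`, assuming that metric was requested via the corresponding `--output-<metric>` parameter during alignment:

```python
import json


def filter_aligned(aligned_path, metric="cer", threshold=10.0):
    """Keep only fragments of an .aligned file whose error metric stays
    below a threshold (lower is better for error metrics like cer)."""
    with open(aligned_path, "r", encoding="utf-8") as f:
        fragments = json.load(f)
    return [frag for frag in fragments
            if frag.get(metric, float("inf")) < threshold]
```

Note that similarity metrics such as `wng` or `jaro_winkler` run the other way (higher is better), so the comparison would have to be inverted for them.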
## Individual language models

If you plan to let the tool generate individual language models per text,
you have to get (essentially build) [KenLM](https://kheafield.com/code/kenlm/).
Before doing this, you should install its [dependencies](https://kheafield.com/code/kenlm/dependencies/).
For Debian-based systems this can be done through:

```bash
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
```

With all requirements fulfilled, there is a script for building and installing KenLM
and the required DeepSpeech tools in the right location:

```bash
$ bin/lm-dependencies.sh
```

If all went well, the alignment tool will find and use them to automatically create individual
language models for each document.

## Text distance metrics

This section lists all available text distance metrics along with their IDs for
command-line use.

### Weighted N-grams (wng)

The weighted N-gram score is computed as the sum of the weights of all N-grams shared
between the two texts.
It ensures that:

- Shared N-gram instances near interval bounds (dependent on situation) get rated higher than
  ones near the center or the opposite end
- Long shared N-gram instances are weighted higher than short ones

`--align-min-ngram-size <SIZE>` sets the start (minimum) N-gram size

`--align-max-ngram-size <SIZE>` sets the final (maximum) N-gram size

`--align-ngram-size-factor <FACTOR>` sets a weight factor for the size preference

`--align-ngram-position-factor <FACTOR>` sets a weight factor for the position preference

### Jaro-Winkler (jaro_winkler)

Jaro-Winkler is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).

### Editex (editex)

Editex is a phonetic text distance algorithm described
[here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2138&rep=rep1&type=pdf).

### Levenshtein (levenshtein)

Levenshtein is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Levenshtein_distance).

### MRA (mra)

The "Match rating approach" is a phonetic text distance algorithm described
[here](https://en.wikipedia.org/wiki/Match_rating_approach).

### Hamming (hamming)

The Hamming distance is an edit distance metric described
[here](https://en.wikipedia.org/wiki/Hamming_distance).

### Word error rate (wer)

This is the same as Levenshtein - just on word level.

Not available for gap alignment.

### Character error rate (cer)

This is the same as Levenshtein but using a different implementation.

Not available for gap alignment.

### Smith-Waterman score (sws)

This is the final Smith-Waterman score coming from the rough alignment
step (but before gap alignment!).
It is described
[here](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

Not available for gap alignment.

### Transcript length (tlen)

The character length of the STT transcript.

Not available for gap alignment.

### Matched text length (mlen)

The character length of the matched text of the original transcript (cleaned).

Not available for gap alignment.

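To make the edit-distance metrics concrete, here is a minimal sketch of Levenshtein distance and a character error rate derived from it (assuming `cer` is the edit distance relative to the reference length, in percent; `cer` is a hypothetical helper, not DSAlign's implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def cer(transcript, reference):
    """Character error rate in percent: edit distance over reference length."""
    return 100.0 * levenshtein(transcript, reference) / len(reference)
```

For example, `"tell this youth what tis to love"` versus `"tell this youth what 'tis to love"` differs by a single inserted apostrophe over 33 reference characters, which reproduces the `cer` value of roughly 3.03 seen in the `.aligned` example earlier.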
## Tools

### Statistics tool

The statistics tool `bin/statistics.sh` can be used for displaying aggregated statistics of
all passed alignment files. Alignment files can be specified directly through the
`--aligned <ALIGNED-FILE>` multi-option and indirectly through the `--catalog <CATALOG-FILE>` multi-option.

Example call:

```shell script
DSAlign$ bin/statistics.sh --catalog data/all.catalog
Reading catalog
2 of 2 : 100.00% (elapsed: 00:00:00, speed: 94.27 it/s, ETA: 00:00:00)
Total number of files: 2

Total number of utterances: 5,949

Total aligned utterance character length: 202,191

Total utterance duration: 3:53:28.410000 (3 hours)

Overall number of instances of meta type "speaker": 27

100 most frequent "speaker" instances:
Rosalind 678
Touchstone 401
Orlando 310
Jaques 303
Celia 281
Oliver 125
Phebe 108
Duke Senior 87
Silvius 86
Adam 81
Corin 68
Duke Frederick 53
Le Beau 52
First Lord 49
Charles 33
Amiens 27
Audrey 27
Second Page 22
Hymen 19
Jaques De Boys 16
Second Lord 12
William 12
Forester 8
First Page 7
Sir Oliver Martext 4
Dennis 3
A Lord 1
```

### Catalog tool

The catalog tool allows for maintenance of catalog files.
It takes multiple catalog files (supporting wildcards) and allows for applying several checks and tweaks before
potentially exporting them to a new combined catalog file.

Options:

- `--output <CATALOG>`: Writes all items of all passed catalogs into the specified new catalog.
- `--make-relative`: Makes all path entries of all items relative to the parent directory of the
  new catalog (see `--output`).
- `--order-by <ENTRY>`: Entry that should be used for sorting items in the new catalog (see `--output`).
- `--check <ENTRIES>`: Checks file existence of all passed (comma-separated) entries of each catalog
  item (e.g. `--check aligned,audio` will check if the `aligned` and `audio` file paths of each catalog item exist).
  `--check all` will check all entries of each item.
- `--on-miss fail|drop|remove|ignore`: What to do if a checked (`--check`) file does not exist.
  - `fail`: tool will exit with an error status (default)
  - `drop`: the catalog item with all its entries will be removed (see `--output`)
  - `remove`: the missing entry within the catalog item will be removed (see `--output`)
  - `ignore`: just logs the missing entry

Example usage:

```shell script
$ cat a.catalog
[
  {
    "entry1": "is/not/existing/x",
    "entry2": "is/existing/x"
  }
]
$ cat b.catalog
[
  {
    "entry1": "is/not/existing/y",
    "entry2": "is/existing/y"
  }
]
$ bin/catalog_tool.sh --check all --on-miss remove --output c.catalog --make-relative a.catalog b.catalog
Loading catalog "a.catalog"
Catalog "a.catalog" - Missing file for "entry1" ("is/not/existing/x") - removing entry from item
Loading catalog "b.catalog"
Catalog "b.catalog" - Missing file for "entry1" ("is/not/existing/y") - removing entry from item
Writing catalog "c.catalog"
$ cat c.catalog
[
  {
    "entry2": "is/existing/x"
  },
  {
    "entry2": "is/existing/y"
  }
]
```

### Meta data annotation tool

The meta data annotation tool allows for assigning meta data fields to all items of script files or transcription logs.
It takes only two kinds of parameters: the file and a series of `<key>=<value>` assignments.

Example usage:

```shell script
$ cat a.tlog
[
  {
    "start": 330.0,
    "end": 2820.0,
    "transcript": "some text without a meaning"
  },
  {
    "start": 3456.0,
    "end": 5123.0,
    "transcript": "some other text without a meaning"
  }
]
$ bin/meta.sh a.tlog speaker=alice year=2020
$ cat a.tlog
[
  {
    "start": 330.0,
    "end": 2820.0,
    "transcript": "some text without a meaning",
    "speaker": "alice",
    "year": "2020"
  },
  {
    "start": 3456.0,
    "end": 5123.0,
    "transcript": "some other text without a meaning",
    "speaker": "alice",
    "year": "2020"
  }
]
```
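
The annotation step above amounts to merging key/value pairs into every item of a JSON array file. A minimal sketch of that behavior (an illustration, not the actual `bin/meta.sh` implementation; note that all values are stored as strings, as in the example):

```python
import json


def annotate(path, assignments):
    """Add each "key=value" assignment to every item of a JSON array
    file, rewriting the file in place."""
    with open(path, "r", encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        for assignment in assignments:
            key, _, value = assignment.partition("=")
            item[key] = value
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=4)
```

For instance, `annotate("a.tlog", ["speaker=alice", "year=2020"])` would produce the annotated file shown above.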