Merge pull request #21 from JRMeyer/feature/add-scorer

Added information around scorer / language model. Resolves #6
2021-02-08 23:00:53 +11:00 · 2021-02-08 23:00:53 +11:00 · f1de135b60
--- a/ALPHABET.md
+++ b/ALPHABET.md
@@ -1,4 +1,4 @@
-[Home](README.md) | [Next - Formatting your training data](DATA_FORMATTING.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
+[Home](README.md) | [Previous - Formatting your training data](DATA_FORMATTING.md) | [Next - Scorer - language model for determining which words occur together](SCORER.md)

 # The alphabet.txt file

@@ -39,4 +39,4 @@ ValueError: Alphabet cannot encode transcript "panggil ambulan！" while process

 ---

-[Home](README.md) | [Next - Formatting your training data](DATA_FORMATTING.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
+[Home](README.md) | [Previous - Formatting your training data](DATA_FORMATTING.md) | [Next - Scorer - language model for determining which words occur together](SCORER.md)
--- a/AM_vs_LM.md
+++ b/AM_vs_LM.md
@@ -1,4 +1,4 @@
-[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Setting up your DeepSpeech training environment](ENVIRONMENT.md)
+[Home](README.md) | [Previous - Scorer - language model for determining which words occur together](SCORER.md) | [Next - Setting up your DeepSpeech training environment](ENVIRONMENT.md)

 # Acoustic model vs. Language model

@@ -18,4 +18,4 @@ The language model is a n-gram model trained with kenlm, and the training data i

 ---

-[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Setting up your DeepSpeech training environment](ENVIRONMENT.md)
+[Home](README.md) | [Previous - Scorer - language model for determining which words occur together](SCORER.md) | [Next - Setting up your DeepSpeech training environment](ENVIRONMENT.md)
--- a/README.md
+++ b/README.md
@@ -22,10 +22,14 @@ Once you know what you can achieve with the DeepSpeech Playbook, this section pr

 Before you can train a model, you will need to collect and format your _corpus_ of data. This section provides an overview of the data format required for DeepSpeech, and walks through an example in prepping a dataset from Common Voice.

-## [The alphabet.txt file](ALPHABET.txt)
+## [The alphabet.txt file](ALPHABET.md)

 If you are training a model that uses a different alphabet to English, for example a language with diacritical marks, then you will need to modify the `alphabet.txt` file.

+## [Building your own scorer](SCORER.md)
+
+Learn what the scorer does and how to build your own.
+
 ## [Acoustic model and language model](AM_vs_LM.md)

 Learn about the differences between DeepSpeech's _acoustic_ model and _language_ model and how they combine to provide end to end speech recognition.
--- a/SCORER.md
+++ b/SCORER.md
@@ -0,0 +1,32 @@
+[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)
+
+# Scorer - language model for determining which words occur together
+
+## Contents
+
+- [Scorer - language model for determining which words occur together](#scorer---language-model-for-determining-which-words-occur-together)
+  * [Contents](#contents)
+    + [What is a scorer?](#what-is-a-scorer)
+    + [Building your own scorer](#building-your-own-scorer)
+
+### What is a scorer?
+
+A scorer is a _language model_ that DeepSpeech uses to improve the accuracy of transcription. A language model predicts which words are likely to follow one another. For example, the word `chicken` is often followed by the words `nuggets`, `soup` or `rissoles`, but is unlikely to be followed by the word `purple`. The scorer assigns probabilities to such word sequences.
+
+The default scorer used by DeepSpeech is trained on the LibriSpeech dataset. The LibriSpeech dataset is based on [LibriVox](https://librivox.org/) - an open collection of out-of-copyright and public domain works.
+
+You may need to build your own scorer, that is, your own _language model_, if:
+
+* You are training DeepSpeech in another language
+* You are training a speech recognition model for a particular domain - such as technical words, medical transcription, agricultural terms and so on
+* You want to improve the accuracy of transcription
+
+DeepSpeech supports the _optional_ use of an external scorer. If you're not sure whether you need to build your own scorer, stick with the default one to begin with.
+
+### Building your own scorer
+
+Building your own scorer is beyond the scope of the DeepSpeech Playbook, but the [DeepSpeech documentation covers how to do this](https://deepspeech.readthedocs.io/en/latest/Scorer.html). DeepSpeech includes scripts that make building an external scorer (a _language model_) easier.
+
+---
+
+[Home](README.md) | [Previous - The alphabet.txt file](ALPHABET.md) | [Next - Acoustic Model and Language Model](AM_vs_LM.md)