Updated README, some code beautification

Tilman Kamp 2019-04-02 19:41:33 +02:00
Parent 7dc236bab4
Commit 94c088be87
2 changed files with 18 additions and 29 deletions

README.md

@@ -251,46 +251,35 @@ Please ensure you have the required [CUDA dependency](#cuda-dependency).
 ### Common Voice training data
 The Common Voice corpus consists of voice samples that were donated through Mozilla's [Common Voice](https://voice.mozilla.org/) Initiative.
+You can download individual CommonVoice v2.0 language packs from [here](https://voice.mozilla.org/data).
+After extraction of such a pack, you'll find the following contents:
+- the `*.tsv` files output by CorporaCreator for the downloaded language
+- the mp3 audio files they reference in a `clips` sub-directory.
-We provide an importer (`bin/import_cv2.py`) which automates preparation of the Common Voice (v2.0) corpus as such:
+For bringing this data into a form that DeepSpeech understands, you have to run the CommonVoice v2.0 importer (`bin/import_cv2.py`):
 ```bash
-bin/import_cv2.py /path/to/audio/data_dir /path/to/tsv_dir
+bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive
 ```
-You should have already downloaded the Common Voice v2.0 data from [here](https://voice.mozilla.org/data). The `import_cv2.py` script assumes as input (1) the audio downloaded from Common Voice v2.0 for a certain language, in addition to (2) the `*.tsv` files output by CorporaCreator (included in Common Voice v2.0 download). As output, the script returns the data and transcripts in a state usable by `DeepSpeech.py` (i.e. `*.csv` and `.WAV` data).
+Providing a filter alphabet is optional. It will exclude all samples whose transcripts contain characters not in the specified alphabet.
+Running the importer with `-h` will show you some additional options.
-Please be aware that training with the Common Voice corpus archive requires a lot of free disk space and quite some time to conclude. As this process creates a huge number of small files, using an SSD drive is highly recommended. If the import script gets interrupted, it will try to continue from where it stopped the next time you run it. Unfortunately, there are some cases where it will need to start over. Once the import is done, the directory will contain a bunch of CSV files.
+Once the import is done, the `clips` sub-directory will contain for each required `.mp3` an additional `.wav` file.
+It will also add the following `.csv` files:
-The following files are official user-validated sets for training, validating and testing:
+- `clips/train.csv`
+- `clips/dev.csv`
+- `clips/test.csv`
-- `cv-valid-train.csv`
-- `cv-valid-dev.csv`
-- `cv-valid-test.csv`
-The following files are the non-validated unofficial sets for training, validating and testing:
-- `cv-other-train.csv`
-- `cv-other-dev.csv`
-- `cv-other-test.csv`
-`cv-invalid.csv` contains all samples that users flagged as invalid.
-A sub-directory called `cv_corpus_{version}` contains the mp3 and wav files that were extracted from an archive named `cv_corpus_{version}.tar.gz`.
-All entries in the CSV files refer to their samples by absolute paths. So moving this sub-directory would require another import or tweaking the CSV files accordingly.
+All entries in these CSV files refer to their samples by absolute paths. So moving this sub-directory would require another import or tweaking the CSV files accordingly.
 To use Common Voice data during training, validation and testing, you pass (comma separated combinations of) their filenames into `--train_files`, `--dev_files`, `--test_files` parameters of `DeepSpeech.py`.
-If, for example, Common Voice was imported into `../data/CV`, `DeepSpeech.py` could be called like this:
+If, for example, Common Voice language `en` was extracted to `../data/CV/en/`, `DeepSpeech.py` could be called like this:
 ```bash
-./DeepSpeech.py --train_files ../data/CV/cv-valid-train.csv --dev_files ../data/CV/cv-valid-dev.csv --test_files ../data/CV/cv-valid-test.csv
-```
-If you are brave enough, you can also include the `other` dataset, which contains not-yet-validated content:
-```bash
-./DeepSpeech.py --train_files ../data/CV/cv-valid-train.csv,../data/CV/cv-other-train.csv --dev_files ../data/CV/cv-valid-dev.csv --test_files ../data/CV/cv-valid-test.csv
+./DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
 ```
 ### Training a model
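The README notes that the generated CSV files reference their samples by absolute paths, so moving the `clips` directory means either re-importing or tweaking the CSVs. A minimal sketch of the tweaking option, assuming the importer's usual `wav_filename` column (`rewrite_csv_paths` is a hypothetical helper, not part of this repo):

```python
import csv

def rewrite_csv_paths(csv_path, old_prefix, new_prefix):
    # Load the whole CSV, swap the path prefix on each sample,
    # then write it back in place.
    with open(csv_path, newline='') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        if row['wav_filename'].startswith(old_prefix):
            row['wav_filename'] = new_prefix + row['wav_filename'][len(old_prefix):]
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

You would run this once per CSV (`train.csv`, `dev.csv`, `test.csv`) after moving the directory.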

bin/import_cv2.py

@@ -45,7 +45,7 @@ def _preprocess_data(tsv_dir, audio_dir, label_filter):
 def _maybe_convert_set(input_tsv, audio_dir, label_filter):
-    output_csv = path.join(audio_dir,os.path.split(input_tsv)[-1].replace('tsv', 'csv'))
+    output_csv = path.join(audio_dir, os.path.split(input_tsv)[-1].replace('tsv', 'csv'))
     print("Saving new DeepSpeech-formatted CSV file to: ", output_csv)
     # Get audiofile path and transcript for each sentence in tsv
@@ -56,7 +56,7 @@ def _maybe_convert_set(input_tsv, audio_dir, label_filter):
         samples.append((row['path'], row['sentence']))
     # Keep track of how many samples are good vs. problematic
-    counter = { 'all': 0, 'failed': 0, 'invalid_label': 0, 'too_short': 0, 'too_long': 0 }
+    counter = {'all': 0, 'failed': 0, 'invalid_label': 0, 'too_short': 0, 'too_long': 0}
     lock = RLock()
     num_samples = len(samples)
     rows = []
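The `counter` dictionary and `RLock` in the hunk above follow a common pattern: worker threads tally per-sample outcomes under a lock so that concurrent increments don't race. A minimal sketch of that pattern (the worker function and its inputs are illustrative, not taken from `import_cv2.py`):

```python
from threading import RLock, Thread

counter = {'all': 0, 'failed': 0, 'too_short': 0, 'too_long': 0}
lock = RLock()

def process_one(sample_ok):
    # Each worker classifies its sample, then updates the shared
    # counters under the lock so increments don't interleave.
    with lock:
        counter['all'] += 1
        if not sample_ok:
            counter['failed'] += 1

threads = [Thread(target=process_one, args=(i % 3 != 0,)) for i in range(9)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter['all'] == 9, counter['failed'] == 3
```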