CNTK/Tutorials/CNTK_208_Speech_Connectioni...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria\n",
    "This tutorial assumes familiarity with 10\\* CNTK tutorials and basic knowledge of data representation in acoustic modelling tasks. It introduces some CNTK building blocks that can be used in training deep networks for speech recognition on the example of CTC training criteria.\n",
    "\n",
    "## Introduction\n",
    "CNTK implementation of CTC is based on the paper by A. Graves et al. *\"Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks\"*. CTC is a popular training criteria for sequence learning tasks, such as speech or handwriting. It doesn't require segmentation of training data nor post-processing of network outpus to convert them to labels. Thereby, it significantly simplifies training and decoding processes while achieving state of the art accuracy.\n",
    "\n",
    "CTC training runs on several sequences in parallel either on GPU or CPU, achieving maximal utilization of the hardware.  \n",
    "![](http://cntk.ai/jup/cntk208_speech_image.png \"Segment of 'Hey Cortana' speech\")\n",
    "\n",
    "\n",
    "First let us import some of the necessary libraries including CNTK and setup the testing environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Current directory D:\\users\\vadimma\\cntk_tut\\CNTK\\Tutorials\n",
      "Changed to data directory ..\\Tests\\EndToEndTests\\Speech\\Data\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import cntk as C\n",
    "import numpy as np\n",
    "\n",
    "# Select the right target device\n",
    "import cntk.tests.test_utils\n",
    "cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)\n",
    "\n",
    "data_dir = os.path.join(\"..\", \"Tests\", \"EndToEndTests\", \"Speech\", \"Data\")\n",
    "print(\"Current directory {0}\".format(os.getcwd()))\n",
    "\n",
    "if os.path.exists(data_dir):\n",
    "    if os.path.realpath(data_dir) != os.path.realpath(os.getcwd()):\n",
    "        os.chdir(data_dir)\n",
    "        print(\"Changed to data directory {0}\".format(data_dir))\n",
    "else:\n",
    "    print(\"Data directory not available locally. Downloading data.\")\n",
    "    try:\n",
    "        from urllib.request import urlretrieve\n",
    "    except ImportError:\n",
    "        from urllib import urlretrieve\n",
    "    for dir in ['GlobalStats', 'Features']:\n",
    "        if not os.path.exists(dir):\n",
    "            os.mkdir(dir)\n",
    "    for file in ['glob_0000.scp', 'glob_0000.write.scp', 'glob_0000.mlf', 'state_ctc.list', 'GlobalStats/mean.363', 'GlobalStats/var.363', 'Features/000000000.chunk']:\n",
    "        if os.path.exists(file):\n",
    "            print('Already downloaded %s' % file)\n",
    "        else:\n",
    "            print('Downloading %s' % file)\n",
    "            urlretrieve('https://github.com/Microsoft/CNTK/raw/release/2.6/Tests/EndToEndTests/Speech/Data/%s' % file, file)   \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read data\n",
    "\n",
    "CNTK consumes Acoustic Model (AM) training data in HTK/MLF format and typically expects 3 input files\n",
    "* [SCP file with features](https://github.com/Microsoft/CNTK/blob/master/Tests/EndToEndTests/Speech/Data/glob_0000.scp). SCP file contains mapping of utterance ids to corresponding feature files.\n",
    "* [MLF file with labels](https://github.com/Microsoft/CNTK/blob/master/Tests/EndToEndTests/Speech/Data/glob_0000.mlf). MLF (master label file) is a traditional format for representing transcription alignment to features. Even though the referenced MLF file contains label boundaries, they are not needed during CTC training and ignored. For more details on feature/label formats, refer to a copy of HTK book, e.g. [here](http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/)\n",
    "* [States list file](https://github.com/Microsoft/CNTK/blob/master/Tests/EndToEndTests/Speech/Data/state_ctc.list). This file contains the list of all labels (states) in the training set. The blank label, required by CTC, is located in the end of the file at index (line) 132, assuming 0-based indexing.\n",
    "\n",
    "CNTK provides flexible and efficient readers `HTKFeatureDeserializer`/`HTKMLFDeserializer` for acoustic features and labels. These readers follow [convention over configuration principle](https://en.wikipedia.org/wiki/Convention_over_configuration) and greatly simply training procedure. At the same time, they take care of various optimizations of reading from disk/network, CPU and GPU asynchronous prefetching which resuls in significant speed up of model training. \n",
    "\n",
    "**Note**: Currently, CTC training expects label and feature inputs of **the same dimension**, yet the labels don't have to be aligned. An easy way to generate the label file is to have uniform (equal) distribution of the labels across the feature frames. Obviously, some labels will be mis-aligned with this setup, but CTC criteria will take care of it during training, see the original publication for reference. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Type of features/labels and dimensions are application specific\n",
    "# Here we use rather small dimensional feature and the label set for the sake of keeping the train set compact.\n",
    "feature_dimension = 33\n",
    "feature = C.sequence.input((feature_dimension))\n",
    "\n",
    "label_dimension = 133\n",
    "label = C.sequence.input((label_dimension))\n",
    "\n",
    "train_feature_filepath = \"glob_0000.scp\"\n",
    "train_label_filepath = \"glob_0000.mlf\"\n",
    "mapping_filepath = \"state_ctc.list\"\n",
    "try:\n",
    "    train_feature_stream = C.io.HTKFeatureDeserializer(\n",
    "    C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = train_feature_filepath)))\n",
    "    train_label_stream = C.io.HTKMLFDeserializer(\n",
    "    mapping_filepath, C.io.StreamDefs(speech_label = C.io.StreamDef(shape = label_dimension, mlf = train_label_filepath)), True)\n",
    "    train_data_reader = C.io.MinibatchSource([train_feature_stream, train_label_stream], frame_mode = False)\n",
    "    train_input_map = {feature: train_data_reader.streams.speech_feature, label: train_data_reader.streams.speech_label}\n",
    "except RuntimeError:\n",
    "    print (\"ERROR: not able to read features or labels\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model creation\n",
    "\n",
    "In this block we first normalize the features and define a model with LSTM Layers. We normalize the input features to zero mean and unit variance by subtracting the mean vector and multiplying by [inverse](https://en.wikipedia.org/wiki/Multiplicative_inverse) standard deviation, which are stored in separate files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "feature_mean = np.fromfile(os.path.join(\"GlobalStats\", \"mean.363\"), dtype=float, count=feature_dimension)\n",
    "feature_inverse_stddev = np.fromfile(os.path.join(\"GlobalStats\", \"var.363\"), dtype=float, count=feature_dimension)\n",
    "\n",
    "feature_normalized = (feature - feature_mean) * feature_inverse_stddev\n",
    "\n",
    "with C.default_options(activation=C.sigmoid):\n",
    "    z = C.layers.Sequential([\n",
    "        C.layers.For(range(3), lambda: C.layers.Recurrence(C.layers.LSTM(1024))),\n",
    "        C.layers.Dense(label_dimension)\n",
    "    ])(feature_normalized)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Define training hyperparameters\n",
    "\n",
    "CTC criteria (loss) function is implemented by combination of the `labels_to_graph` and `forward_backward` functions. These functions are designed to generalize forward-backward viterbi-like functions which are very common in sequential modelling problems, e.g. speech or handwriting. `labels_to_graph` is designed to convert the input label sequence into graph representation suitable for particular forward-backward procedure, and `forward_backward` function performs the procedure itself. Currently, these functions only support CTC, and it's their default configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mbsize = 1024\n",
    "mbs_per_epoch = 10\n",
    "max_epochs = 5\n",
    "\n",
    "criteria = C.forward_backward(C.labels_to_graph(label), z, blankTokenId=132, delayConstraint=3)\n",
    "err = C.edit_distance_error(z, label, squashInputs=True, tokensToIgnore=[132])\n",
    "# Learning rate parameter schedule per sample: \n",
    "# Use 0.01 for the first 3 epochs, followed by 0.001 for the remaining\n",
    "lr = C.learning_parameter_schedule_per_sample([(3, .01), (1,.001)])\n",
    "mm = C.momentum_schedule([(1000, 0.9), (0, 0.99)], mbsize)\n",
    "learner = C.momentum_sgd(z.parameters, lr, mm)\n",
    "trainer = C.Trainer(z, (criteria, err), learner)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 21255301 parameters in 11 parameter tensors.\n",
      "Trained on a total of 8428 frames\n",
      "Finished Epoch[1 of 5]: [Training] loss = 3.720116 * 8428, metric = 100.00% * 8428 25.106s (335.7 samples/s);\n",
      "Trained on a total of 17094 frames\n",
      "Finished Epoch[2 of 5]: [Training] loss = 3.513460 * 8666, metric = 98.07% * 8666 21.176s (409.2 samples/s);\n",
      "Trained on a total of 25662 frames\n",
      "Finished Epoch[3 of 5]: [Training] loss = 3.498874 * 8568, metric = 98.23% * 8568 21.978s (389.8 samples/s);\n",
      "Trained on a total of 35282 frames\n",
      "Finished Epoch[4 of 5]: [Training] loss = 3.512962 * 9620, metric = 98.23% * 9620 22.159s (434.1 samples/s);\n",
      "Trained on a total of 43890 frames\n",
      "Finished Epoch[5 of 5]: [Training] loss = 3.508142 * 8608, metric = 98.12% * 8608 19.864s (433.3 samples/s);\n"
     ]
    }
   ],
   "source": [
    "C.logging.log_number_of_parameters(z)\n",
    "progress_printer = C.logging.progress_print.ProgressPrinter(tag='Training', num_epochs = max_epochs)\n",
    "\n",
    "for epoch in range(max_epochs):\n",
    "    for mb in range(mbs_per_epoch):\n",
    "        minibatch = train_data_reader.next_minibatch(mbsize, input_map = train_input_map)\n",
    "        trainer.train_minibatch(minibatch)\n",
    "        progress_printer.update_with_trainer(trainer, with_metric = True)\n",
    "\n",
    "    print('Trained on a total of ' + str(trainer.total_number_of_samples_seen) + ' frames')\n",
    "    progress_printer.epoch_summary(with_metric = True)\n",
    "\n",
    "# Uncomment to save the model\n",
    "# z.save('CTC_' + str(max_epochs) + 'epochs_' + str(mbsize) + 'mbsize_' + str(mbs_per_epoch) + 'mbs.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Evaluate "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_feature_filepath = \"glob_0000.write.scp\"\n",
    "test_feature_stream = C.io.HTKFeatureDeserializer(\n",
    "    C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = test_feature_filepath)))\n",
    "test_data_reader = C.io.MinibatchSource([test_feature_stream, train_label_stream], frame_mode = False)\n",
    "test_input_map = {feature: test_data_reader.streams.speech_feature, label: test_data_reader.streams.speech_label}\n",
    "\n",
    "num_test_minibatches = 2\n",
    "test_result = 0.0\n",
    "for i in range(num_test_minibatches):\n",
    "    test_minibatch = test_data_reader.next_minibatch(mbsize, input_map = test_input_map)\n",
    "    eval_error = trainer.test_minibatch(test_minibatch)\n",
    "    test_result = test_result + eval_error\n",
    "\n",
    "# Average of evaluation errors of all test minibatches\n",
    "round(test_result / num_test_minibatches,2)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}