CNTK/Tutorials/CNTK_201A_CIFAR-10_DataLoad...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CNTK 201A Part A: CIFAR-10 Data Loader\n",
    "\n",
    "This tutorial will show how to prepare image data sets for use with deep learning algorithms in CNTK. The CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html) is a popular dataset for image classification, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is a labeled subset of the [80 million tiny images](http://people.csail.mit.edu/torralba/tinyimages/) dataset.\n",
    "\n",
    "The CIFAR-10 dataset is not included in the CNTK distribution but can be easily downloaded and converted to CNTK-supported format   \n",
    "\n",
    "CNTK 201A tutorial is divided into two parts:\n",
    "- Part A: Familiarizes you with the CIFAR-10 data and converts them into CNTK supported format. This data will be used later in the tutorial for image classification tasks.\n",
    "- Part B: We will introduce image understanding tutorials.\n",
    "\n",
    "If you are curious about how well computers can perform on CIFAR-10 today, Rodrigo Benenson maintains a [blog](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130) on the state-of-the-art performance of various algorithms.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "from PIL import Image\n",
    "import getopt\n",
    "import numpy as np\n",
    "import pickle as cp\n",
    "import os\n",
    "import shutil\n",
    "import struct\n",
    "import sys\n",
    "import tarfile\n",
    "import xml.etree.cElementTree as et\n",
    "import xml.dom.minidom\n",
    "\n",
    "try: \n",
    "    from urllib.request import urlretrieve \n",
    "except ImportError: \n",
    "    from urllib import urlretrieve\n",
    "\n",
    "# Config matplotlib for inline plotting\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data download\n",
    "\n",
    "The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. \n",
    "There are 50,000 training images and 10,000 test images. The 10 classes are: airplane, automobile, bird, \n",
    "cat, deer, dog, frog, horse, ship, and truck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# CIFAR Image data\n",
    "imgSize = 32\n",
    "numFeature = imgSize * imgSize * 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first setup a few helper functions to download the CIFAR data. The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python \"pickled\" object produced with cPickle. To prepare the input data for use in CNTK we use three oprations:\n",
    "> `readBatch`: Unpack the pickle files\n",
    "\n",
    "> `loadData`: Compose the data into single train and test objects\n",
    "\n",
    "> `saveTxt`: As the name suggests, saves the label and the features into text files for both training and testing. \n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def readBatch(src):\n",
    "    with open(src, 'rb') as f:\n",
    "        if sys.version_info[0] < 3: \n",
    "            d = cp.load(f) \n",
    "        else:\n",
    "            d = cp.load(f, encoding='latin1')\n",
    "        data = d['data']\n",
    "        feat = data\n",
    "    res = np.hstack((feat, np.reshape(d['labels'], (len(d['labels']), 1))))\n",
    "    return res.astype(np.int)\n",
    "\n",
    "def loadData(src):\n",
    "    print ('Downloading ' + src)\n",
    "    fname, h = urlretrieve(src, './delete.me')\n",
    "    print ('Done.')\n",
    "    try:\n",
    "        print ('Extracting files...')\n",
    "        with tarfile.open(fname) as tar:\n",
    "            tar.extractall()\n",
    "        print ('Done.')\n",
    "        print ('Preparing train set...')\n",
    "        trn = np.empty((0, numFeature + 1), dtype=np.int)\n",
    "        for i in range(5):\n",
    "            batchName = './cifar-10-batches-py/data_batch_{0}'.format(i + 1)\n",
    "            trn = np.vstack((trn, readBatch(batchName)))\n",
    "        print ('Done.')\n",
    "        print ('Preparing test set...')\n",
    "        tst = readBatch('./cifar-10-batches-py/test_batch')\n",
    "        print ('Done.')\n",
    "    finally:\n",
    "        os.remove(fname)\n",
    "    return (trn, tst)\n",
    "\n",
    "def saveTxt(filename, ndarray):\n",
    "    with open(filename, 'w') as f:\n",
    "        labels = list(map(' '.join, np.eye(10, dtype=np.uint).astype(str)))\n",
    "        for row in ndarray:\n",
    "            row_str = row.astype(str)\n",
    "            label_str = labels[row[-1]]\n",
    "            feature_str = ' '.join(row_str[:-1])\n",
    "            f.write('|labels {} |features {}\\n'.format(label_str, feature_str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to saving the images in the text format, we would save the images in PNG format. In addition we also compute the mean of the image. `saveImage` and `saveMean` are two functions used for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def saveImage(fname, data, label, mapFile, regrFile, pad, **key_parms):\n",
    "    # data in CIFAR-10 dataset is in CHW format.\n",
    "    pixData = data.reshape((3, imgSize, imgSize))\n",
    "    if ('mean' in key_parms):\n",
    "        key_parms['mean'] += pixData\n",
    "\n",
    "    if pad > 0:\n",
    "        pixData = np.pad(pixData, ((0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=128) \n",
    "\n",
    "    img = Image.new('RGB', (imgSize + 2 * pad, imgSize + 2 * pad))\n",
    "    pixels = img.load()\n",
    "    for x in range(img.size[0]):\n",
    "        for y in range(img.size[1]):\n",
    "            pixels[x, y] = (pixData[0][y][x], pixData[1][y][x], pixData[2][y][x])\n",
    "    img.save(fname)\n",
    "    mapFile.write(\"%s\\t%d\\n\" % (fname, label))\n",
    "    \n",
    "    # compute per channel mean and store for regression example\n",
    "    channelMean = np.mean(pixData, axis=(1,2))\n",
    "    regrFile.write(\"|regrLabels\\t%f\\t%f\\t%f\\n\" % (channelMean[0]/255.0, channelMean[1]/255.0, channelMean[2]/255.0))\n",
    "    \n",
    "def saveMean(fname, data):\n",
    "    root = et.Element('opencv_storage')\n",
    "    et.SubElement(root, 'Channel').text = '3'\n",
    "    et.SubElement(root, 'Row').text = str(imgSize)\n",
    "    et.SubElement(root, 'Col').text = str(imgSize)\n",
    "    meanImg = et.SubElement(root, 'MeanImg', type_id='opencv-matrix')\n",
    "    et.SubElement(meanImg, 'rows').text = '1'\n",
    "    et.SubElement(meanImg, 'cols').text = str(imgSize * imgSize * 3)\n",
    "    et.SubElement(meanImg, 'dt').text = 'f'\n",
    "    et.SubElement(meanImg, 'data').text = ' '.join(['%e' % n for n in np.reshape(data, (imgSize * imgSize * 3))])\n",
    "\n",
    "    tree = et.ElementTree(root)\n",
    "    tree.write(fname)\n",
    "    x = xml.dom.minidom.parse(fname)\n",
    "    with open(fname, 'w') as f:\n",
    "        f.write(x.toprettyxml(indent = '  '))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`saveTrainImages` and `saveTestImages` are simple wrapper functions to iterate through the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def saveTrainImages(filename, foldername):\n",
    "    if not os.path.exists(foldername):\n",
    "        os.makedirs(foldername)\n",
    "    data = {}\n",
    "    dataMean = np.zeros((3, imgSize, imgSize)) # mean is in CHW format.\n",
    "    with open('train_map.txt', 'w') as mapFile:\n",
    "        with open('train_regrLabels.txt', 'w') as regrFile:\n",
    "            for ifile in range(1, 6):\n",
    "                with open(os.path.join('./cifar-10-batches-py', 'data_batch_' + str(ifile)), 'rb') as f:\n",
    "                    if sys.version_info[0] < 3: \n",
    "                        data = cp.load(f)\n",
    "                    else: \n",
    "                        data = cp.load(f, encoding='latin1')\n",
    "                    for i in range(10000):\n",
    "                        fname = os.path.join(os.path.abspath(foldername), ('%05d.png' % (i + (ifile - 1) * 10000)))\n",
    "                        saveImage(fname, data['data'][i, :], data['labels'][i], mapFile, regrFile, 4, mean=dataMean)\n",
    "    dataMean = dataMean / (50 * 1000)\n",
    "    saveMean('CIFAR-10_mean.xml', dataMean)\n",
    "\n",
    "def saveTestImages(filename, foldername):\n",
    "    if not os.path.exists(foldername):\n",
    "      os.makedirs(foldername)\n",
    "    with open('test_map.txt', 'w') as mapFile:\n",
    "        with open('test_regrLabels.txt', 'w') as regrFile:\n",
    "            with open(os.path.join('./cifar-10-batches-py', 'test_batch'), 'rb') as f:\n",
    "                if sys.version_info[0] < 3: \n",
    "                    data = cp.load(f)\n",
    "                else: \n",
    "                    data = cp.load(f, encoding='latin1')\n",
    "                for i in range(10000):\n",
    "                    fname = os.path.join(os.path.abspath(foldername), ('%05d.png' % i))\n",
    "                    saveImage(fname, data['data'][i, :], data['labels'][i], mapFile, regrFile, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# URLs for the train image and labels data\n",
    "url_cifar_data = 'http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'\n",
    "\n",
    "# Paths for saving the text files\n",
    "data_dir = './data/CIFAR-10/'\n",
    "train_filename = data_dir + '/Train_cntk_text.txt'\n",
    "test_filename = data_dir + '/Test_cntk_text.txt'\n",
    "\n",
    "train_img_directory = data_dir + '/Train'\n",
    "test_img_directory = data_dir + '/Test'\n",
    "\n",
    "root_dir = os.getcwd()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\n",
      "Done.\n",
      "Extracting files...\n",
      "Done.\n",
      "Preparing train set...\n",
      "Done.\n",
      "Preparing test set...\n",
      "Done.\n",
      "Writing train text file...\n",
      "Done.\n",
      "Writing test text file...\n",
      "Done.\n",
      "Converting train data to png images...\n",
      "Done.\n",
      "Converting test data to png images...\n",
      "Done.\n"
     ]
    }
   ],
   "source": [
    "if not os.path.exists(data_dir):\n",
    "    os.makedirs(data_dir)\n",
    "\n",
    "try:\n",
    "    os.chdir(data_dir)   \n",
    "    trn, tst= loadData('http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz')\n",
    "    print ('Writing train text file...')\n",
    "    saveTxt(r'./Train_cntk_text.txt', trn)\n",
    "    print ('Done.')\n",
    "    print ('Writing test text file...')\n",
    "    saveTxt(r'./Test_cntk_text.txt', tst)\n",
    "    print ('Done.')\n",
    "    print ('Converting train data to png images...')\n",
    "    saveTrainImages(r'./Train_cntk_text.txt', 'train')\n",
    "    print ('Done.')\n",
    "    print ('Converting test data to png images...')\n",
    "    saveTestImages(r'./Test_cntk_text.txt', 'test')\n",
    "    print ('Done.')\n",
    "finally:\n",
    "    os.chdir(\"../..\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}