Converting Middlesex example to Markdown
README.md

@@ -11,7 +11,7 @@ Deep neural networks (DNNs) are extraordinarily versatile artificial intelligence
Unfortunately, DNNs are also among the most time- and resource-intensive machine learning models. Whereas linear regression results can be computed in negligible time, applying a DNN to a single file of interest may take tens or hundreds of milliseconds -- a processing rate on the order of 1,000 files per minute. Many business use cases require faster throughput. Fortunately, DNNs can be applied in a parallel and scalable fashion when evaluation is performed on Spark clusters.
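
For readers who want the shape of that idea in code, here is a minimal sketch (not the repository's exact code): `load_model`, the model path, and the ADLS URL are placeholders, and `sc` is the Spark notebook's SparkContext.

```python
def score_partition(files):
    # Load the trained DNN once per partition, then reuse it for every image.
    model = load_model('/tmp/models/trained.dnn')    # placeholder loader and path
    for path, image_bytes in files:                  # binaryFiles yields (path, bytes) pairs
        yield path, model.eval(image_bytes)          # placeholder eval call

image_rdd = sc.binaryFiles('adl://myadls.azuredatalakestore.net/images/*/*.png')
results = image_rdd.mapPartitions(score_partition).collect()
```
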
This guide repository demonstrates how trained DNNs produced with two common deep learning frameworks, Microsoft's Cognitive Toolkit and Google's Tensorflow, can be operationalized on Spark to score a large image set. Files stored on Azure Data Lake Store, Microsoft's HDFS-based cloud storage resource, are processed in parallel by workers on the Spark cluster. The guide follows a single use case, described below.

This guide repository demonstrates how trained DNNs produced with two common deep learning frameworks, Microsoft's Cognitive Toolkit and Google's Tensorflow, can be operationalized on Spark to score a large image set. Files stored on Azure Data Lake Store, Microsoft's HDFS-based cloud storage resource, are processed in parallel by workers on the Spark cluster. The guide follows a specific example use case, described below.

## Land use classification from aerial imagery

@@ -24,6 +24,15 @@ In this guide, we develop a classifier that can predict how a parcel of land has

Sample images and ground-truth labels are fortunately available in abundance for this use case. We use aerial imagery provided by the U.S. [National Agriculture Imagery Program](https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/), and land use labels from the [National Land Cover Database](https://www.mrlc.gov/). NLCD labels are published roughly every five years, while NAIP data are collected more frequently: a trained land use classifier can be used to infer land use at all aerial imaging timepoints.

## Model training and validation

We applied transfer learning to retrain the final layers of existing Tensorflow (ResNet) and CNTK (AlexNet) models for classification of 1-meter resolution NAIP aerial images of 224 meter x 224 meter regions selected from across the United States. It is possible for regions of this size to have multiple land uses (NLCD labels are provided at 30-meter resolution): to avoid confusion, we used only regions with relatively homogeneous land use for model training and validation. We created balanced training and validation sets containing aerial images in six major land use categories (Developed, Cultivated, Forest, Shrub, Barren, and Herbaceous) from non-neighboring counties and collection years. Our retrained models achieved a classification accuracy of ~80% on these six categories, with the majority of errors occurring between different types of undeveloped land. By further grouping our images into "Developed," "Cultivated," and "Undeveloped" classes, we were able to improve the classification accuracy to roughly 95% in our validation set.
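
As a toy illustration of this regrouping (made-up labels, not the repository's data or code), collapsing the undeveloped categories turns between-undeveloped confusions into correct predictions:

```python
import pandas as pd

# Made-up per-image results drawn from the six training categories.
results = pd.DataFrame({'true_label':      ['Developed', 'Forest', 'Shrub', 'Cultivated'],
                        'predicted_label': ['Developed', 'Shrub',  'Shrub', 'Cultivated']})

# Collapse the four undeveloped categories into a single class.
regroup = {'Developed': 'Developed', 'Cultivated': 'Cultivated',
           'Forest': 'Undeveloped', 'Shrub': 'Undeveloped',
           'Barren': 'Undeveloped', 'Herbaceous': 'Undeveloped'}

acc_6 = (results['true_label'] == results['predicted_label']).mean()
acc_3 = (results['true_label'].map(regroup) == results['predicted_label'].map(regroup)).mean()
print('6-class accuracy: {:.0%}; 3-class accuracy: {:.0%}'.format(acc_6, acc_3))  # 75%; 100%
```
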
## Model application: predicting land use in 2016

The trained models were applied to aerial images tiling Middlesex County, MA (home of Microsoft's Boston-area office).

## Contributing and Adapting

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

@@ -621,8 +621,8 @@

"- **Developed**: 21, 22, 23, 24\n",
"- **Forest**: 41, 42, 43, 90\n",
"- **Barren**: 31\n",
"- **Shrubland**: 51, 52 (note: our training dataset did not include any examples of 51)\n",
"- **Grassland/Wetlands**: 71, 72, 73, 74, 95 (note: our training dataset did not include any examples of 72, 73, 74)\n",
"- **Shrubland**: 51, 52\n",
"- **Herbaceous**: 71, 72, 73, 74, 95\n",
"- **Cultivated**: 81, 82"
]
},

Binary data
img/data_overview/mediumnaip.png
[Image diff viewer: before 935 KiB, after 626 KiB; several other image files were also updated (545 KiB, 626 KiB, 610 KiB, 623 KiB, 224 KiB, 158 KiB, and 30 KiB after).]
@@ -0,0 +1,52 @@

# Middlesex County Land Use Prediction

This notebook illustrates how trained Cognitive Toolkit (CNTK) and TensorFlow models can be applied to predict current land usage from recent aerial imagery. For more detail on image set creation, model training, and Spark cluster deployment, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

<img src="./img/data_overview/middlesex_ma.png" />

## Image preparation and labeling

We have used National Land Cover Database (NLCD) data for our ground truth labels during model training and evaluation. The most recent NLCD dataset was published in 2011, but aerial images from the National Agriculture Imagery Program (NAIP) are available for 2016. Our trained models therefore allow us to bridge a five-year data gap by predicting land use in 2016.

To demonstrate this approach, we extracted a set of 65,563 images tiling Middlesex County, MA (home to Microsoft's New England Research and Development Center) at one-meter resolution from 2010 and 2016 NAIP data as [described previously](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/image_set_preparation.ipynb). Note that unlike the image set used in training and evaluation, some of these images have ambiguous land use types: for example, they may depict the boundary between a forest and developed land. These images were then scored with [trained CNTK and Tensorflow land use classification models](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/model_training.ipynb) applied in [parallel fashion using Spark](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/scoring_on_spark.ipynb). Both models performed similarly; results for the CNTK model are shown.

For those unfamiliar with the region, we include below an aerial view of an 80 km x 70 km region covering the county. The Greater Boston Area is centered along the ESE border of the county and extends through all but the northernmost regions.
<img src="./img/data_overview/mediumnaip_white.png" width="500px" />

## Visualizing land use

To visualize the results, we represent the labels of each 224 m x 224 m tile with a single color-coded pixel:
- Red represents developed regions (NLCD codes 21-24; see [legend](https://www.mrlc.gov/nlcd11_leg.php))
- White represents cultivated regions (NLCD codes 81-82)
- Green represents undeveloped and uncultivated regions (all other NLCD codes)
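
A minimal sketch of this color coding (the `tile_codes` array below is a made-up stand-in for the real grid of per-tile labels):

```python
import numpy as np

tile_codes = np.array([[21, 41],
                       [82, 52]])   # made-up 2 x 2 grid of per-tile NLCD codes

def tile_color(nlcd_code):
    if 21 <= nlcd_code <= 24:       # developed
        return (255, 0, 0)          # red
    elif nlcd_code in (81, 82):     # cultivated
        return (255, 255, 255)      # white
    return (0, 128, 0)              # undeveloped/uncultivated: green

pixels = np.array([[tile_color(c) for c in row] for row in tile_codes], dtype=np.uint8)
# The result is an RGB array with one pixel per tile, e.g. with PIL:
# Image.fromarray(pixels).save('land_use_map.png')
```
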
Below left, the plurality NLCD 2011 label is shown for each tile. (NLCD data is provided at 30-meter resolution, so any tile may contain multiple land use labels.) The predicted labels for each tile in 2010 (most directly comparable to the NLCD labels) and 2016 (most recent available) are shown at center and right, respectively.
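
A minimal sketch of how a tile's plurality label can be computed (the grid of NLCD cells is made up; a 224 m tile spans roughly 7-8 of the 30-meter NLCD cells on each side):

```python
from collections import Counter

import numpy as np

# Made-up 30-meter NLCD codes covering one 224 m x 224 m tile.
nlcd_cells = np.random.choice([21, 41, 82], size=(8, 8))

# The tile's single label is the most common code among its cells.
plurality_label = Counter(nlcd_cells.ravel()).most_common(1)[0][0]
```
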
<img src="./img/middlesex/true_and_predicted_labels.png"/>

We found a striking correspondence between true and predicted labels at both timepoints. An uptick in the fraction of developed land was observed between 2010 and 2016 (see table below), but we believe this change is attributable in large part to the impact of image coloration and vegetation differences (e.g. browning in drought conditions) on labeling. Some systematic errors are noticeable in the predictions, including the apparent mislabeling of some highways as cultivated land (white lines in the 2016 image).

| |No. developed tiles (%) |No. cultivated tiles (%) |No. undeveloped tiles (%) |
|--- |--- |--- |--- |
|NLCD 2011 labels |28,537 (43.5%) |2,337 (3.6%) |34,689 (52.9%) |
|2010 predicted labels |27,584 (42.1%) |941 (1.4%) |37,038 (56.4%) |
|2016 predicted labels |28,911 (44.1%) |4,011 (6.1%) |32,641 (49.8%) |
For the purposes of mapping and quantifying land use, it may be preferable to discount isolated patches of differing land use. For example, an urban park may not be considered undeveloped land for the purposes of habitat conservation, and construction of a rural homestead may not indicate substantial development in an otherwise cultivated region. We note that isolated tiles of land use can be removed by applying a 3x3 plurality-voting filter (with added weight for the center tile's own predicted label) to the raw predictions:
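
One possible implementation of such a filter is sketched below; the text does not specify the exact center weight used for the maps shown here, so `center_weight=2` is an assumption:

```python
from collections import Counter

import numpy as np

def smooth_labels(labels, center_weight=2):
    '''3x3 plurality vote around each tile, counting the center tile center_weight times.'''
    padded = np.pad(labels, 1, mode='edge')    # repeat border tiles at the map edges
    smoothed = np.empty_like(labels)
    for i in range(labels.shape[0]):
        for j in range(labels.shape[1]):
            votes = list(padded[i:i + 3, j:j + 3].ravel())
            votes += [labels[i, j]] * (center_weight - 1)   # extra votes for the center tile
            smoothed[i, j] = Counter(votes).most_common(1)[0][0]
    return smoothed
```
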
<img src="./img/middlesex/true_and_predicted_labels_smoothened.png"/>

## Identifying newly developed regions

The ability to programmatically identify new development and cultivation in remote areas may be useful to government agencies that regulate housing and commerce, e.g. to identify tax evasion. By comparing our 2016 predicted labels to the 2011 NLCD tile labels, we were able to identify ~400 tiles putatively undergoing new development in the last five years. A few examples (including bordering tiles for context) are shown below:

<img src="./img/middlesex/33308.png"/>
<img src="./img/middlesex/36083.png"/>
<img src="./img/middlesex/47331.png"/>
In some cases, our land use classifier was sensitive enough to identify the development of single properties within a tile:

<img src="./img/middlesex/20655.png"/>
<img src="./img/middlesex/37002.png"/>

A visual comparison of the ~400 candidate tiles in 2010 vs. 2016 NAIP images reveals that roughly one-third have truly been developed; the false positives may reflect differences in lighting and drought conditions between the 2016 images and the training data.
@@ -0,0 +1,109 @@

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Middlesex County Land Use Prediction\n",
"\n",
"This notebook illustrates how trained Cognitive Toolkit (CNTK) and TensorFlow models can be applied to predict current land usage from recent aerial imagery. For more detail on image set creation, model training, and Spark cluster deployment, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.\n",
"\n",
"<img src=\"./img/data_overview/middlesex_ma.png\" />\n",
"\n",
"## Image preparation and labeling\n",
"\n",
"We have used National Land Cover Database (NLCD) data for our ground truth labels during model training and evaluation. The most recent NLCD dataset was published in 2011, but aerial images from the National Agriculture Imagery Program (NAIP) are available for 2016. Our trained models therefore allow us to bridge a five-year data gap by predicting land use in 2016.\n",
"\n",
"To demonstrate this approach, we extracted a set of 65,563 images tiling Middlesex County, MA (home to Microsoft's New England Research and Development Center) at one-meter resolution from 2010 and 2016 NAIP data as [described previously](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/image_set_preparation.ipynb). Note that unlike the image set used in training and evaluation, some of these images have ambiguous land use types: for example, they may depict the boundary between a forest and developed land. These images were then scored with [trained CNTK and Tensorflow land use classification models](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/model_training.ipynb) applied in [parallel fashion using Spark](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/scoring_on_spark.ipynb). Both models performed similarly; results for the CNTK model are shown.\n",
"\n",
"For those unfamiliar with the region, we include below an aerial view of an 80 km x 70 km region covering the county. The Greater Boston Area is centered along the ESE border of the county and extends through all but the northernmost regions.\n",
"<img src=\"./img/data_overview/mediumnaip_white.png\" width=\"500px\" />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing land use\n",
"\n",
"To visualize the results, we represent the labels of each 224 m x 224 m tile with a single color-coded pixel:\n",
"- Red represents developed regions (NLCD codes 21-24; see [legend](https://www.mrlc.gov/nlcd11_leg.php))\n",
"- White represents cultivated regions (NLCD codes 81-82)\n",
"- Green represents undeveloped and uncultivated regions (all other NLCD codes)\n",
"\n",
"At left, the plurality NLCD 2011 label is shown for each tile. (NLCD data is provided at 30-meter resolution, so any tile may contain multiple land use labels.) The predicted labels for each tile in 2010 (most directly comparable to the NLCD labels) and 2016 (most recent available) are shown at center and right, respectively."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./img/middlesex/true_and_predicted_labels.png\"/>\n",
"\n",
"We found a striking correspondence between true and predicted labels at both timepoints. An uptick in the fraction of developed land was observed between 2010 and 2016 (see table below), but we believe this change is attributable in large part to the impact of image coloration and vegetation differences (e.g. browning in drought conditions) on labeling. Some systematic errors are noticeable in the predictions, including the apparent mislabeling of some highways as cultivated land (white lines in the 2016 image).\n",
"\n",
"| \t|No. developed tiles (%) \t|No. cultivated tiles (%) \t|No. undeveloped tiles (%) \t|\n",
"|---\t|---\t|---\t|---\t|\n",
"|NLCD 2011 labels \t|28,537 (43.5%) \t|2,337 (3.6%) \t|34,689 (52.9%) \t|\n",
"|2010 predicted labels \t|27,584 (42.1%) \t|941 (1.4%) \t|37,038 (56.4%) \t|\n",
"|2016 predicted labels \t|28,911 (44.1%) \t|4,011 (6.1%) \t|32,641 (49.8%) \t|\n",
"\n",
"For the purposes of mapping and quantifying land use, it may be preferable to discount isolated patches of differing land use. For example, an urban park may not be considered undeveloped land for the purposes of habitat conservation, and construction of a rural homestead may not indicate substantial development in an otherwise cultivated region. We note that isolated tiles of land use can be removed by applying a 3x3 plurality-voting filter (with added weight for the center tile's own predicted label) to the raw predictions:\n",
"\n",
"<img src=\"./img/middlesex/true_and_predicted_labels_smoothened.png\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Identifying newly developed regions\n",
"\n",
"The ability to programmatically identify new development and cultivation in remote areas may be useful to government agencies that regulate housing and commerce, e.g. to identify tax evasion. By comparing our 2016 predicted labels to the 2011 NLCD tile labels, we were able to identify ~400 tiles putatively undergoing new development in the last five years. A few examples (including bordering tiles for context) are shown below:\n",
"\n",
"<img src=\"./img/middlesex/33308.png\"/>\n",
"<img src=\"./img/middlesex/36083.png\"/>\n",
"<img src=\"./img/middlesex/47331.png\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In some cases, our land use classifier was sensitive enough to identify the development of single properties within a tile:\n",
"\n",
"<img src=\"./img/middlesex/20655.png\"/>\n",
"<img src=\"./img/middlesex/37002.png\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A visual comparison of the ~400 candidate tiles in 2010 vs. 2016 NAIP images reveals that roughly one-third have truly been developed; the false positives may reflect differences in lighting and drought conditions between the 2016 images and the training data."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:python35]",
"language": "python",
"name": "conda-env-python35-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}

@@ -5,19 +5,19 @@

cntk_home="/usr/hdp/current"
cd $cntk_home
curl "https://cntk.ai/BinaryDrop/CNTK-2-0-beta10-0-Linux-64bit-CPU-Only.tar.gz" | tar xzf -
curl "https://cntk.ai/BinaryDrop/CNTK-2-0-beta12-0-Linux-64bit-CPU-Only.tar.gz" | tar xzf -
cd ./cntk/Scripts/install/linux
sed -i "s#"ANACONDA_PREFIX=\"\$HOME/anaconda3\""#"ANACONDA_PREFIX=\"\/usr/bin/anaconda\""#g" install-cntk.sh
sed -i "s#"\$HOME/anaconda3"#"\$ANACONDA_PREFIX"#g" install-cntk.sh
./install-cntk.sh --py-version 35

sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install pillow
sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install tensorflow
sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install tensorflow==0.12.1

#sudo rm -rf /tmp/resnet
sudo mkdir /tmp/resnet
cd /tmp/resnet
wget https://mawahstorage.blob.core.windows.net/models/tf_improvedpreprocessing.zip -P /tmp/resnet
unzip /tmp/resnet/tf_improvedpreprocessing.zip
wget https://mawahstorage.blob.core.windows.net/models/resnet20_237_improvedpreprocessing.dnn -P /tmp/resnet
sudo chmod -R 777 /tmp/resnet
#sudo rm -rf /tmp/models
sudo mkdir /tmp/models
cd /tmp/models
wget https://mawahstorage.blob.core.windows.net/models/tf.zip -P /tmp/models
unzip /tmp/models/tf.zip
wget https://mawahstorage.blob.core.windows.net/models/withcrops_50.dnn -P /tmp/models
sudo chmod -R 777 /tmp/models

@@ -15,7 +15,8 @@

"  - [Installing Cognitive Toolkit and Tensorflow](#install)\n",
"- [Image scoring with PySpark](#pyspark)\n",
"  - [Cognitive Toolkit](#cntk)\n",
"  - [TensorFlow](#tf)"
"  - [TensorFlow](#tf)\n",
"- [Improving runtime by scaling cluster size](#scaling)"
]
},
{

@@ -86,7 +87,7 @@

" 1. When the run completes, click \"Done\".\n",
" 1. Click the \"Next\" button at the bottom of the pane.\n",
"1. In the \"Summary\" section of the \"New HDInsight cluster\" pane:\n",
" 1. If desired, you can edit the cluster size settings to choose node counts/sizes based on your budget and time constraints. This tutorial can be completed using a cluster with **4** worker nodes and a node size of **D12 v2** (for both worker and head nodes). For more information, please see the [cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters) and [VM](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes#dv2-series) size guides.\n",
" 1. If desired, you can edit the cluster size settings to choose node counts/sizes based on your budget and time constraints. This tutorial can be completed using a cluster with **10** worker nodes and a node size of **D12 v2** (for both worker and head nodes). For more information, please see the [cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters) and [VM](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes#dv2-series) size guides.\n",
"1. Click the \"Create\" button at the bottom of the pane.\n",
"\n",
"#### Checking cluster deployment status\n",

@@ -108,12 +109,8 @@

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"cell_type": "raw",
"metadata": {},
"source": [
"local_image_dir = 'E:\\\\combined\\\\test'\n",
"blob_account_name = ''\n",

@@ -154,21 +151,21 @@

"\n",
"cntk_home=\"/usr/hdp/current\"\n",
"cd $cntk_home\n",
"curl \"https://cntk.ai/BinaryDrop/CNTK-2-0-beta10-0-Linux-64bit-CPU-Only.tar.gz\" | tar xzf -\n",
"curl \"https://cntk.ai/BinaryDrop/CNTK-2-0-beta12-0-Linux-64bit-CPU-Only.tar.gz\" | tar xzf -\n",
"cd ./cntk/Scripts/install/linux \n",
"sed -i \"s#\"ANACONDA_PREFIX=\\\"\\$HOME/anaconda3\\\"\"#\"ANACONDA_PREFIX=\\\"\\/usr/bin/anaconda\\\"\"#g\" install-cntk.sh\n",
"sed -i \"s#\"\\$HOME/anaconda3\"#\"\\$ANACONDA_PREFIX\"#g\" install-cntk.sh\n",
"./install-cntk.sh --py-version 35\n",
"\n",
"sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install pillow\n",
"sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install tensorflow\n",
"sudo /usr/bin/anaconda/envs/cntk-py35/bin/pip install tensorflow==0.12.1\n",
"\n",
"sudo mkdir /tmp/resnet\n",
"cd /tmp/resnet\n",
"wget https://mawahstorage.blob.core.windows.net/models/tf.zip -P /tmp/resnet\n",
"unzip /tmp/resnet/tf.zip\n",
"wget https://mawahstorage.blob.core.windows.net/models/resnet20_237_improvedpreprocessing.dnn -P /tmp/resnet\n",
"sudo chmod -R 777 /tmp/resnet"
"sudo mkdir /tmp/models\n",
"cd /tmp/models\n",
"wget https://mawahstorage.blob.core.windows.net/models/tf.zip -P /tmp/models\n",
"unzip /tmp/models/tf.zip\n",
"wget https://mawahstorage.blob.core.windows.net/models/withcrops_50.dnn -P /tmp/models\n",
"sudo chmod -R 777 /tmp/models"
]
},
{

@@ -244,7 +241,7 @@

"data": {
"text/html": [
"<table>\n",
"<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>5</td><td>application_1486565576928_0061</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"http://hn1-mawaht.h2celvvrkkmuregxlpjd0qjune.cx.internal.cloudapp.net:8088/proxy/application_1486565576928_0061/\">Link</a></td><td><a target=\"_blank\" href=\"http://10.0.0.16:30060/node/containerlogs/container_1486565576928_0061_01_000001/livy\">Link</a></td><td>✔</td></tr></table>"
"<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>0</td><td>application_1488464300441_0016</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"http://hn0-mawaht.anfilysuiuiuxdfcwwmiqq1sme.cx.internal.cloudapp.net:8088/proxy/application_1488464300441_0016/\">Link</a></td><td><a target=\"_blank\" href=\"http://10.0.0.12:30060/node/containerlogs/container_1488464300441_0016_01_000001/livy\">Link</a></td><td>✔</td></tr></table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"

@@ -275,44 +272,12 @@

"    return(int(os.path.basename(folder)))\n",
"\n",
"adls_name = 'mawahtensorflow'\n",
"adls_folder = 'test'\n",
"n_workers = 4\n",
"local_tmp_dir = '/tmp/resnet'\n",
"adls_folder = 'balancedtest'\n",
"n_workers = 10\n",
"local_tmp_dir = '/tmp/models'\n",
"\n",
"dataset_dir = 'adl://{}.azuredatalakestore.net/{}'.format(adls_name, adls_folder)\n",
"image_rdd = sc.binaryFiles('{}/*/*.png'.format(dataset_dir), minPartitions=n_workers).coalesce(n_workers)\n",
"\n",
"# Define correspondence of NLCD ids to labels of the trained model\n",
"nlcd_id_to_group = {21: 'Developed',\n",
"                    22: 'Developed',\n",
"                    23: 'Developed',\n",
"                    24: 'Developed',\n",
"                    11: 'Water/Wetlands',\n",
"                    12: 'Water/Wetlands',\n",
"                    95: 'Water/Wetlands',\n",
"                    41: 'Forest',\n",
"                    42: 'Forest',\n",
"                    43: 'Forest',\n",
"                    90: 'Forest',\n",
"                    31: 'Barren',\n",
"                    51: 'Shrubland',\n",
"                    52: 'Shrubland',\n",
"                    71: 'Grassland',\n",
"                    72: 'Grassland',\n",
"                    73: 'Grassland',\n",
"                    74: 'Grassland',\n",
"                    81: 'Cultivated',\n",
"                    82: 'Cultivated'}\n",
"group_to_label = {'Shrubland': 0,\n",
"                  'Forest': 1,\n",
"                  'Cultivated': 2,\n",
"                  'Barren': 3,\n",
"                  'Water/Wetlands': 4,\n",
"                  'Grassland': 5,\n",
"                  'Developed': 6}\n",
"\n",
"nlcd_id_to_label = {key: group_to_label[nlcd_id_to_group[key]] for key in nlcd_id_to_group.keys()}\n",
"nlcd_id_to_label_bc = sc.broadcast(nlcd_id_to_label)"
"image_rdd = sc.binaryFiles('{}/*/*.png'.format(dataset_dir), minPartitions=n_workers).coalesce(n_workers)"
]
},
{

@@ -327,7 +292,7 @@

},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 2,
"metadata": {
"collapsed": true
},

@@ -335,7 +300,7 @@

"source": [
"from cntk import load_model\n",
"\n",
"cntk_model_filepath = '{}/resnet20_237_improvedpreprocessing.dnn'.format(local_tmp_dir)\n",
"cntk_model_filepath = '{}/withcrops_50.dnn'.format(local_tmp_dir)\n",
"cntk_model_filepath_bc = sc.broadcast(cntk_model_filepath)\n",
"sc.addFile(cntk_model_filepath)"
]

@@ -349,36 +314,30 @@

},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cntk_get_preprocessed_image(my_file):\n",
"    ''' Perform reshaping '''\n",
"    image_data = np.array(Image.open(my_file), dtype=np.float32)\n",
"    image_data = np.ascontiguousarray(np.transpose(image_data, (2, 0, 1)))\n",
"def cntk_get_preprocessed_image(filename):\n",
"    ''' Perform transposition and RGB -> BGR permutation '''\n",
"    image_data = np.array(Image.open(filename), dtype=np.float32)\n",
"    bgr_image = image_data[:, :, ::-1]\n",
"    image_data = np.ascontiguousarray(np.transpose(bgr_image, (2,0,1)))\n",
"    return(image_data)\n",
"\n",
"def argsoftmax(x):\n",
"    ''' Apply softmax, then return the best label '''\n",
"    exponentiated = np.exp(x)\n",
"    softmax = exponentiated / exponentiated.sum(axis=0)\n",
"    return(np.argmax(softmax))\n",
"\n",
"def cntk_run_worker(files):\n",
"    ''' Scoring script run by each worker '''\n",
"    cntk_model_filepath = cntk_model_filepath_bc.value\n",
"    loaded_model = load_model(SparkFiles.get(cntk_model_filepath))\n",
"    nlcd_id_to_label = nlcd_id_to_label_bc.value\n",
"    \n",
"    # Iterate through the files. The first value in each tuple is the file name; the second is the image data\n",
"    for file in files:\n",
"        preprocessed_image = cntk_get_preprocessed_image(BytesIO(file[1]))\n",
"        dnn_output = loaded_model.eval({loaded_model.arguments[0]: [preprocessed_image]})\n",
"        true_label = nlcd_id_to_label[get_nlcd_id(file[0])]\n",
"        yield (file[0], true_label, argsoftmax(np.squeeze(dnn_output)))"
"        true_label = get_nlcd_id(file[0])\n",
"        yield (file[0], true_label, np.argmax(np.squeeze(dnn_output)))"
]
},
{

@@ -390,17 +349,26 @@

},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scored 11760 images\n",
"0:04:13.002895"
]
}
],
"source": [
"labeled_images = image_rdd.mapPartitions(cntk_run_worker)\n",
"\n",
"start = pd.datetime.now()\n",
"cntk_results = labeled_images.collect()\n",
"print('Scored {} images'.format(len(results)))\n",
"print('Scored {} images'.format(len(cntk_results)))\n",
"stop = pd.datetime.now()\n",
"print(stop - start)"
]

@@ -409,21 +377,47 @@

"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluate the model's performance"
"#### Evaluate the model's performance\n",
"\n",
"We first report the model's raw overall accuracy. We then calculate the overall accuracy when all undeveloped land types are grouped under the same label. (We will use the latter grouping in a subsequent notebook to simplify result interpretation.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"When using all six categories, correctly predicted 9077 of 11760 images (77.19%)\n",
"After regrouping land use categories, correctly predicted 10816 of 11760 images (91.97%)"
]
}
],
"source": [
"def group_undeveloped_land_types(original_label):\n",
"    if original_label in [3, 5]: # developed and cultivated land types\n",
"        return(original_label)\n",
"    else:\n",
"        return(6) # new grouped label for all undeveloped land types\n",
"\n",
"cntk_df = pd.DataFrame(cntk_results, columns=['filename', 'true_label', 'predicted_label'])\n",
"num_correct = sum(cntk_df['true_label'] == cntk_df['predicted_label'])\n",
"num_total = len(cntk_results)\n",
"print('Correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct, num_total, 100 * num_correct / num_total))"
"print('When using all six categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,\n",
"                                                                                             num_total,\n",
"                                                                                             100 * num_correct / num_total))\n",
"\n",
"cntk_df['true_label_regrouped'] = cntk_df['true_label'].apply(group_undeveloped_land_types)\n",
"cntk_df['predicted_label_regrouped'] = cntk_df['predicted_label'].apply(group_undeveloped_land_types)\n",
"num_correct = sum(cntk_df['true_label_regrouped'] == cntk_df['predicted_label_regrouped'])\n",
"print('After regrouping land use categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,\n",
"                                                                                                    num_total,\n",
"                                                                                                    100 * num_correct / num_total))\n"
]
},
{

@@ -445,7 +439,7 @@

},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"metadata": {
"collapsed": false
},

@@ -470,7 +464,7 @@

},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"metadata": {
"collapsed": true
},

@@ -505,13 +499,11 @@

"    return(preprocessing_fn)\n",
"\n",
"def tf_run_worker(files):\n",
"    nlcd_id_to_label = nlcd_id_to_label_bc.value\n",
"    model_dir = model_dir_bc.value\n",
"    class_count = 7\n",
"    results = []\n",
"    \n",
"    with tf.Graph().as_default():\n",
"        network_fn = get_network_fn(num_classes=class_count, is_training=False)\n",
"        network_fn = get_network_fn(num_classes=6, is_training=False)\n",
"        image_preprocessing_fn = get_preprocessing()\n",
"        \n",
"        current_image = tf.placeholder(tf.uint8, shape=(224, 224, 3))\n",

@@ -530,7 +522,7 @@

"            for file in files:\n",
"                imported_image_np = np.asarray(Image.open(BytesIO(file[1])), dtype=np.uint8)\n",
"                result = sess.run(predictions, feed_dict={current_image: imported_image_np})\n",
"                true_label = nlcd_id_to_label[get_nlcd_id(file[0])]\n",
"                true_label = get_nlcd_id(file[0])\n",
"                results.append([file[0], true_label, result[0]])\n",
"        finally:\n",
"            coord.request_stop()\n",

@@ -547,11 +539,20 @@

},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scored 11760 images\n",
"0:04:58.742548"
]
}
],
"source": [
"labeled_images_tf = image_rdd.mapPartitions(tf_run_worker)\n",
"\n",

@@ -566,49 +567,79 @@

"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluate the model's performance"
"#### Evaluate the model's performance\n",
"\n",
"We first report the model's raw overall accuracy. We also report the overall accuracy when all undeveloped land types are grouped under the same label. (We will use the latter grouping in a subsequent notebook to simplify result interpretation.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"When using all six categories, correctly predicted 9611 of 11760 images (81.73%)\n",
"After regrouping land use categories, correctly predicted 10788 of 11760 images (91.73%)"
]
}
],
"source": [
"def group_undeveloped_land_types(original_label):\n",
"    if original_label in [3, 5]: # developed and cultivated land types\n",
"        return(original_label)\n",
"    else:\n",
"        return(6)\n",
"\n",
"tf_df = pd.DataFrame(results_tf, columns=['filename', 'true_label', 'predicted_label'])\n",
"num_correct = sum(tf_df['true_label'] == tf_df['predicted_label'])\n",
"num_total = len(results_tf)\n",
"print('Correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct, num_total, 100 * num_correct / num_total))"
"print('When using all six categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,\n",
"                                                                                             num_total,\n",
"                                                                                             100 * num_correct / num_total))\n",
"\n",
"tf_df['true_label_regrouped'] = tf_df['true_label'].apply(group_undeveloped_land_types)\n",
"tf_df['predicted_label_regrouped'] = tf_df['predicted_label'].apply(group_undeveloped_land_types)\n",
"num_correct = sum(tf_df['true_label_regrouped'] == tf_df['predicted_label_regrouped'])\n",
"print('After regrouping land use categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,\n",
"                                                                                                    num_total,\n",
"                                                                                                    100 * num_correct / num_total))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We found that our trained Tensorflow model correctly predicted 102,971 of 131,999 images (78.01%) in our class-imbalanced test set. Scoring took 4h55m on a cluster with 4 D21v2 worker nodes. Data transfer accounts for the majority of this processing time."
"<a name=\"scaling\"></a>\n",
"## Improving runtime by scaling cluster size\n",
"\n",
"To demonstrate the scalability achievable through scoring on Spark, we performed the image labeling task illustrated above on Spark clusters with two D13 V2 head nodes and a varying number of D4 V2 worker nodes. Runtimes for the scoring task were recorded with the timing functionality included above. We predicted that mean runtimes would be inversely proportional to the number of worker nodes available for scoring. (The number of images assigned to each worker is inversely proportional to the number of workers.) Mean runtimes for both CNTK and Tensorflow are reported below:\n",
"\n",
"<img src=\"./img/scoring/scaling.png\" width=\"600 px\"/>\n",
"\n",
"The best-fit power law relation for our CNTK results has an exponent of -0.97, consistent with the expected inverse proportional relationship between runtime and worker node count. For our Tensorflow model, we observed a less dramatic decrease in runtime as the number of worker nodes increased (best-fit power law exponent: -0.74) and patterned residuals suggesting that fixed time contributions (e.g. time required for model loading) dominate runtime when the number of worker nodes is large. (Recall that the transfer learning models that we created in CNTK and Tensorflow are not directly comparable -- the former is an AlexNet model, the latter a 50-layer ResNet -- so these results should not be interpreted to suggest a difference in efficiency between Tensorflow and CNTK.)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:python35]",
"language": "python",
"name": "conda-env-python35-py"
"display_name": "PySpark3",
"language": "",
"name": "pyspark3kernel"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"name": "python",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
"name": "pyspark3",
"pygments_lexer": "python3"
}
},
"nbformat": 4,