Docs for data exploration notebook, handle numpy array labels for t-sne
Parent: c99f55c38e
Commit: 4c90354a5e
@@ -1,5 +1,28 @@
{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using TensorWatch for Data Exploration\n",
"\n",
"In this tutorial, we will show how to quickly use TensorWatch to explore data with a dimensionality reduction technique called [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). We plan to implement many other techniques in the future (please feel free to [contribute](https://github.com/microsoft/tensorwatch/blob/master/CONTRIBUTING.md)!)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing regim\n",
"This tutorial uses a small Python package called `regim`. It has a few utility classes for quickly working with PyTorch datasets. To install `regim`, simply run:\n",
"\n",
"```\n",
"pip install regim\n",
"```\n",
"\n",
"With that done, let's do our imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -8,42 +31,80 @@
},
"outputs": [],
"source": [
"import tensorwatch as tw"
"import tensorwatch as tw\n",
"from regim import DataUtils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we will get the MNIST dataset. The `regim` package has a `DataUtils` class that lets us get the entire MNIST dataset without a train/test split, reshaping each image into a vector of 784 values instead of a 28x28 matrix."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from regim import *\n",
"ds = DataUtils.mnist_datasets(linearize=True, train_test=False)\n",
"ds = DataUtils.sample_by_class(ds, k=500, shuffle=True, as_np=True, no_test=True)"
"ds = DataUtils.mnist_datasets(linearize=True, train_test=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MNIST dataset is big enough that t-SNE can take a long time on slow computers, so we will apply t-SNE to a sample of the dataset. The `regim` package has a utility method that lets us take `k` random samples from each class. We also set `as_np=True` to convert the images from PyTorch tensors to numpy arrays. The `no_test=True` parameter says that we don't want to split our data into train and test sets. The return value is a tuple of two numpy arrays, one containing the input images and the other the labels."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [],
"source": [
"components = tw.get_tsne_components(ds)"
"inputs, labels = DataUtils.sample_by_class(ds, k=50, shuffle=True, as_np=True, no_test=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to supply this dataset to TensorWatch; in just one line we can get the lower-dimensional components. The `get_tsne_components` method takes a tuple of inputs and labels. The optional parameters `features_col=0` and `labels_col=1` tell it which members of the tuple are the input features and the ground-truth labels. Another optional parameter, `n_components=3`, says that we should generate 3 components for each data point."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [],
"source": [
"components = tw.get_tsne_components((inputs, labels))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a 3D component for each data point in our dataset, let's plot it! For this purpose, we use the `ArrayStream` class from TensorWatch, which allows you to convert any iterable into a TensorWatch stream. We then supply this stream to the `Visualizer` class, asking it to use the `tsne` visualization type, which is just a fancy 3D scatter plot."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0a2f434c08cd4f3483a3d818713a7ad6",
"model_id": "f86416164bb04e729ea16cb71a666a70",
"version_major": 2,
"version_minor": 0
},
@@ -60,9 +121,25 @@
"source": [
|
||||
"comp_stream = tw.ArrayStream(components)\n",
|
||||
"vis = tw.Visualizer(comp_stream, vis_type='tsne', \n",
|
||||
" hover_images=ds[0], hover_image_reshape=(28,28))\n",
|
||||
" hover_images=inputs, hover_image_reshape=(28,28))\n",
|
||||
"vis.show()"
|
||||
]
|
||||
},
|
||||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that as you hover the mouse over each point on the graph, you will see the MNIST image associated with that point. How does that work? We are supplying the input images to the `hover_images` parameter. As each of our images is a 1-dimensional array of 784 values, we also set the `hover_image_reshape` parameter to reshape it to 28x28. You can customize the hover behavior by attaching a new function to the vis.[hover_fn](https://github.com/microsoft/tensorwatch/blob/master/tensorwatch/plotly/embeddings_plot.py#L28) member."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questions?\n",
"\n",
"File a [GitHub issue](https://github.com/microsoft/tensorwatch/issues/new) and let us know if we can improve this tutorial."
]
}
],
"metadata": {
@@ -3,6 +3,7 @@
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import numpy as np

def _standardize_data(data, col, whitten, flatten):
    if col is not None:
@@ -29,7 +30,10 @@ def get_tsne_components(data, features_col=0, labels_col=1, whitten=True, n_comp
        labels = data[labels_col]
        for i, item in enumerate(comps):
            # low, high, annotation, text, color
            item.extend((None, None, None, str(labels[i]), labels[i]))
            label = labels[i]
            if isinstance(labels, np.ndarray):
                label = label.item()
            item.extend((None, None, None, str(int(label)), label))
        return comps
    return tsne_results
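The numpy handling added above is needed because indexing a numpy array yields numpy scalar types rather than plain Python numbers, so the label is converted with `.item()` before being formatted and stored. A minimal sketch of that behavior (the array contents here are made up for illustration):

```
import numpy as np

labels = np.array([3, 1, 4])   # e.g. labels returned by sample_by_class(..., as_np=True)

label = labels[0]
print(type(label))             # a numpy scalar such as <class 'numpy.int64'>, not a plain int

label = label.item()           # convert the numpy scalar to a native Python number
print(type(label))             # <class 'int'>
print(str(int(label)))         # '3' -- the text stored alongside the t-SNE component
```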