NimbusML-Samples/samples/2.2 [Text] Sentiment Analys...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis 2 - Data Streaming with FileDataStream\n",
    "\n",
    "In this example, we develop a similar model as in the tutorial for [Twitter Data 1](https://docs.microsoft.com/en-us/nimbusml/tutorials/b_a-sentiment-analysis-1-data-loading-with-pandas). Instead of loading data in pandas, we load the data with nimbusml and the model can be simply trained using the input file name. Instead of saving the whole dataset in memory, nimbusml processes the data by passing a DataFileStream in the training/testing process to achieve exponentially speed up."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NimbusML Schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, instead of using pandas.load_csv to load the input files, we use DataFileStream to train the model. A FileDataStream can be generated from a csv file directly. The file schema indicates the format of the input file, i.e. with/without header, sep = ',' or '\\t', the data type for each columns, and all the related information. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train-twitter.gen-sample.tsv\n",
      "test-twitter.gen-sample.tsv\n"
     ]
    }
   ],
   "source": [
    "# Getting input file path from package\n",
    "import os\n",
    "from nimbusml.datasets import get_dataset\n",
    "from nimbusml import FileDataStream \n",
    "\n",
    "train_file = get_dataset('gen_twittertrain').as_filepath()\n",
    "test_file = get_dataset('gen_twittertest').as_filepath()\n",
    "\n",
    "print(os.path.basename(train_file))\n",
    "print(os.path.basename(test_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataSchema([DataColumn(name='Sentiment', type='TX', pos=0),\n",
       "    DataColumn(name='Text', type='TX', pos=1),\n",
       "    DataColumn(name='Label', type='I8', pos=2)], header=True,\n",
       "    sep='\\t')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generating file schema\n",
    "data_stream_train = FileDataStream.read_csv(train_file, sep='\\t')\n",
    "data_stream_test = FileDataStream.read_csv(test_file, sep='\\t')\n",
    "data_stream_train.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interpretation is:\n",
    "\n",
    "    1. There are three columns in the dataset\n",
    "    2. \"name='Sentiment', type='TX', pos=0\" indicates that the the new column \"Sentiment\" is of type \"TX\" (text) and it is from the column 0 in the origin dataset.\n",
    "    3. \"name='Text', type='TX', pos=1\" indicates that the new column \"Text\" is of type \"TX\" (text) and it is from the column 1 in the origin dataset.\n",
    "    4. \"name='Label', type='I8', pos=2\" indicates that the new column \"Label\" is of type \"I8\" (integer), and it is from the column 2 in the origin dataset.\n",
    "    5. \"header=True\" indicates that the data has a header\n",
    "    6. \"sep = \\t\" means that the dataset is seperated by \"\\t\".\n",
    "\n",
    "The pipeline is created in the same way as in the previous tutorial:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Training Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pipeline can be developed as in the previous tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from IPython.display import display\n",
    "from nimbusml.feature_extraction.text import NGramFeaturizer\n",
    "from nimbusml.feature_extraction.text.extractor import Ngram\n",
    "from nimbusml.linear_model import AveragedPerceptronBinaryClassifier\n",
    "from nimbusml.decomposition import PcaTransformer\n",
    "from nimbusml import Pipeline, FileDataStream"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In NimbusML, for transforms, the user can specify the input column names for each operator to be executed on. If not, all the columns from the previous operator or the origin dataset will be used. In Tutorial 2.2, the column syntax of nimbusml will be discussed in more details. \n",
    "\n",
    "                                               columns = {\"features\":\"Text\"]}\n",
    "indicates that the operator will use as input columns [\"Text\"] and the output will be saved to column \"features\". \n",
    "\n",
    "                                               columns = [\"features\"]\n",
    "indicates that the operator will use as input columns [\"features\"] and the output will be saved to the same column (overwrite the origin columns).\n",
    "\n",
    "For learners, the user can specify the \"roles\" for different columns, such as feature, weight, label, group_id, etc.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "t0 = time.time()\n",
    "ppl = Pipeline([\n",
    "                NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'TfIdf',\n",
    "                                                             ngram_length=2),\n",
    "                                char_feature_extractor=Ngram(weighting = 'Tf',\n",
    "                                                             ngram_length=3),\n",
    "                                columns = {\"features\": \"Text\"}), \n",
    "                PcaTransformer(rank = 80, columns = \"features\"),\n",
    "                AveragedPerceptronBinaryClassifier(l2_regularization=0.3,\n",
    "                                                   number_of_iterations=3,\n",
    "                                                   feature = [\"features\"],\n",
    "                                                   label = \"Label\"),\n",
    "               ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of fitting the model using a pandas dataframe, we use the FileDataStream created by the file name, such as 'input_data/train-twitter.sample.tsv' as a pointer to the input data source. \n",
    "\n",
    "Then the model can be fitted based on the FileDataStream. The data file will be loaded using NimbusML data loader in a streamline, and the FileDataStream is passed in the pipeline. In general, this process is much faster than loading the whole dataset into memory beforehand, such as using pandas, since all the processes have been optimized to boost the performance.\n",
    "\n",
    "The pipeline can be trained as following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.\n",
      "Training calibrator.\n",
      "Elapsed time: 00:00:01.4548474\n",
      "Training time: 5.37\n"
     ]
    }
   ],
   "source": [
    "ppl.fit(data_stream_train)\n",
    "\n",
    "print(\"Training time: \"  + str(round(time.time() - t0, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the pipeline can be trained using .test() function using the FileDataStream based on the test_file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Performance metrics: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AUC</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>Positive precision</th>\n",
       "      <th>Positive recall</th>\n",
       "      <th>Negative precision</th>\n",
       "      <th>Negative recall</th>\n",
       "      <th>Log-loss</th>\n",
       "      <th>Log-loss reduction</th>\n",
       "      <th>Test-set entropy (prior Log-Loss/instance)</th>\n",
       "      <th>F1 Score</th>\n",
       "      <th>AUPRC</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.573503</td>\n",
       "      <td>0.662791</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.662791</td>\n",
       "      <td>1</td>\n",
       "      <td>1.079578</td>\n",
       "      <td>-0.170753</td>\n",
       "      <td>0.922123</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.361565</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        AUC  Accuracy  Positive precision  Positive recall  \\\n",
       "0  0.573503  0.662791                   0                0   \n",
       "\n",
       "   Negative precision  Negative recall  Log-loss  Log-loss reduction  \\\n",
       "0            0.662791                1  1.079578           -0.170753   \n",
       "\n",
       "   Test-set entropy (prior Log-Loss/instance)  F1 Score     AUPRC  \n",
       "0                                    0.922123       NaN  0.361565  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Individual scores: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PredictedLabel</th>\n",
       "      <th>Score</th>\n",
       "      <th>Probability</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.236972</td>\n",
       "      <td>0.071740</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.172740</td>\n",
       "      <td>0.356047</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.218403</td>\n",
       "      <td>0.120105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.219571</td>\n",
       "      <td>0.116377</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.181157</td>\n",
       "      <td>0.299347</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PredictedLabel     Score  Probability\n",
       "0               0 -0.236972     0.071740\n",
       "1               0 -0.172740     0.356047\n",
       "2               0 -0.218403     0.120105\n",
       "3               0 -0.219571     0.116377\n",
       "4               0 -0.181157     0.299347"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total runtime: 9\n"
     ]
    }
   ],
   "source": [
    "metrics, scores = ppl.test(data_stream_test, output_scores = True)\n",
    "print(\"Performance metrics: \")\n",
    "display(metrics)\n",
    "print(\"Individual scores: \")\n",
    "\n",
    "display(scores[0:5]) #using the file stream, hard to visualize intermediate output and origin dataset\n",
    "print(\"Total runtime: \" + str(round(time.time() - t0)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}