NimbusML-Samples/samples/2.1 [Text] Sentiment Analys...

706 строки
24 KiB
Plaintext
Исходник Постоянная ссылка Ответственный История

Этот файл содержит неоднозначные символы Юникода!

Этот файл содержит неоднозначные символы Юникода, которые могут быть перепутаны с другими в текущей локали. Если это намеренно, можете спокойно проигнорировать это предупреждение. Используйте кнопку Экранировать, чтобы подсветить эти символы.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentiment Analysis 1 - Data Loading with Pandas\n",
"\n",
"In this example, we develop a binary classifier using the manually generated Twitter data to detect the sentiment of each tweet. For example, \"This is awesome!\" will be a positive one and \"I am sad\" will be negative. The input data is the text and we use NimbusML NGramFeaturizer to extract numeric features and input them to a AveragedPerceptron classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"from IPython.display import display\n",
"from nimbusml.datasets import get_dataset\n",
"from nimbusml.feature_extraction.text import NGramFeaturizer\n",
"from nimbusml.feature_extraction.text.extractor import Ngram\n",
"from nimbusml.linear_model import AveragedPerceptronBinaryClassifier\n",
"from nimbusml.decomposition import PcaTransformer\n",
"from nimbusml import Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train data file path: train-twitter.gen-sample.tsv\n",
"Test data file path: test-twitter.gen-sample.tsv\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sentiment</th>\n",
" <th>Text</th>\n",
" <th>Label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Negative</td>\n",
" <td>Oh you are hurting me</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Positive</td>\n",
" <td>So long</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Positive</td>\n",
" <td>Ths sofa is comfortable</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Negative</td>\n",
" <td>The place suck. No?</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Positive</td>\n",
" <td>@fakeid &amp;quot;Chillin&amp;quot; I love it!!</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sentiment Text Label\n",
"0 Negative Oh you are hurting me 0\n",
"1 Positive So long 1\n",
"2 Positive Ths sofa is comfortable 1\n",
"3 Negative The place suck. No? 0\n",
"4 Positive @fakeid &quot;Chillin&quot; I love it!! 1"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load data from package\n",
"trainDataFile = get_dataset('gen_twittertrain').as_filepath()\n",
"testDataFile = get_dataset('gen_twittertest').as_filepath()\n",
"print(\"Train data file path: \" + str(os.path.basename(trainDataFile)))\n",
"print(\"Test data file path: \" + str(os.path.basename(testDataFile)))\n",
"\n",
"trainData = pd.read_csv(trainDataFile, sep = \"\\t\")\n",
"testData = pd.read_csv(testDataFile, sep = \"\\t\")\n",
"\n",
"trainData.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the \"Text\" column as the input feature and the \"Sentiment\" column as the label column (after converting to numeric)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. The word counts are then normalized using term frequency-inverse document frequency (TF-IDF) method. \n",
"\n",
"In NimbusML, the user can specify the input column names for each operator to be executed on. If not, all the columns from the previous operator or the origin dataset will be used. In Tutorial 2.2, the column syntax of nimbusml will be discussed in more details.\n",
"\n",
"For text featurizer, since the output has multiple columns, for visualization, the names for those will become \"output_col_name.[word sequence] \" to represent the count for word sequence [word sequence] after normalization. In this example, we train the model with only one column, column \"Text\"."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"featurizer = NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'TfIdf'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can call .fit_transform() to train the featurizer."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(71, 1007)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Text.Char.&lt;␂&gt;|o|h</th>\n",
" <th>Text.Char.o|h|&lt;␠&gt;</th>\n",
" <th>Text.Char.h|&lt;␠&gt;|y</th>\n",
" <th>Text.Char.&lt;␠&gt;|y|o</th>\n",
" <th>Text.Char.y|o|u</th>\n",
" <th>Text.Char.o|u|&lt;␠&gt;</th>\n",
" <th>Text.Char.u|&lt;␠&gt;|a</th>\n",
" <th>Text.Char.&lt;␠&gt;|a|r</th>\n",
" <th>Text.Char.a|r|e</th>\n",
" <th>Text.Char.r|e|&lt;␠&gt;</th>\n",
" <th>...</th>\n",
" <th>Text.Word.upset</th>\n",
" <th>Text.Word.vocation</th>\n",
" <th>Text.Word.flight</th>\n",
" <th>Text.Word.late</th>\n",
" <th>Text.Word.again</th>\n",
" <th>Text.Word.commute</th>\n",
" <th>Text.Word.died</th>\n",
" <th>Text.Word.cancer</th>\n",
" <th>Text.Word.oh,</th>\n",
" <th>Text.Word.finally</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>0.218218</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 1007 columns</p>\n",
"</div>"
],
"text/plain": [
" Text.Char.<␂>|o|h Text.Char.o|h|<␠> Text.Char.h|<␠>|y Text.Char.<␠>|y|o \\\n",
"0 0.218218 0.218218 0.218218 0.218218 \n",
"1 0.000000 0.000000 0.000000 0.000000 \n",
"2 0.000000 0.000000 0.000000 0.000000 \n",
"3 0.000000 0.000000 0.000000 0.000000 \n",
"4 0.000000 0.000000 0.000000 0.000000 \n",
"\n",
" Text.Char.y|o|u Text.Char.o|u|<␠> Text.Char.u|<␠>|a Text.Char.<␠>|a|r \\\n",
"0 0.218218 0.218218 0.218218 0.218218 \n",
"1 0.000000 0.000000 0.000000 0.000000 \n",
"2 0.000000 0.000000 0.000000 0.000000 \n",
"3 0.000000 0.000000 0.000000 0.000000 \n",
"4 0.000000 0.000000 0.000000 0.000000 \n",
"\n",
" Text.Char.a|r|e Text.Char.r|e|<␠> ... Text.Word.upset \\\n",
"0 0.218218 0.218218 ... 0.0 \n",
"1 0.000000 0.000000 ... 0.0 \n",
"2 0.000000 0.000000 ... 0.0 \n",
"3 0.000000 0.000000 ... 0.0 \n",
"4 0.000000 0.000000 ... 0.0 \n",
"\n",
" Text.Word.vocation Text.Word.flight Text.Word.late Text.Word.again \\\n",
"0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
"\n",
" Text.Word.commute Text.Word.died Text.Word.cancer Text.Word.oh, \\\n",
"0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
"\n",
" Text.Word.finally \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
"[5 rows x 1007 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_transformed = featurizer.fit_transform(trainData[\"Text\"].to_frame()) # Using one column as input\n",
"print(text_transformed.shape)\n",
"text_transformed.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that, all the columns are the generated features from the original \"Text\" column. Based on those features, we can train a binary classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binary Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The user can use the transformed data as the input to the binary classifier using .fit(X,Y). \n",
"\n",
" ag = AveragedPerceptronBinaryClassifier()\n",
" ag.fit(text_transformed, 1 * (trainData[\"Sentiment\"] == \"Positive\"))\n",
" \n",
"The user can also use NimbusML pipeline to train the featurizer and the learner together."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.\n",
"Training calibrator.\n",
"Elapsed time: 00:00:00.7693293\n",
"Training time: 1.0\n"
]
}
],
"source": [
"t0 = time.time()\n",
"\n",
"ppl = Pipeline([\n",
" NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'Tf')), \n",
" PcaTransformer(rank = 100),\n",
" AveragedPerceptronBinaryClassifier(l2_regularization=0.4,\n",
" number_of_iterations=5),\n",
" ])\n",
"\n",
"ppl.fit(trainData[\"Text\"], trainData[\"Label\"]) #will replace with series if supported\n",
"\n",
"print(\"Training time: \" + str(round(time.time() - t0, 2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the NimbusML pipeline, we can call ppl.test(test_X,test_Y)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Performance metrics: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AUC</th>\n",
" <th>Accuracy</th>\n",
" <th>Positive precision</th>\n",
" <th>Positive recall</th>\n",
" <th>Negative precision</th>\n",
" <th>Negative recall</th>\n",
" <th>Log-loss</th>\n",
" <th>Log-loss reduction</th>\n",
" <th>Test-set entropy (prior Log-Loss/instance)</th>\n",
" <th>F1 Score</th>\n",
" <th>AUPRC</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.621295</td>\n",
" <td>0.662791</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.662791</td>\n",
" <td>1</td>\n",
" <td>1.006758</td>\n",
" <td>-0.091782</td>\n",
" <td>0.922123</td>\n",
" <td>NaN</td>\n",
" <td>0.529486</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AUC Accuracy Positive precision Positive recall \\\n",
"0 0.621295 0.662791 0 0 \n",
"\n",
" Negative precision Negative recall Log-loss Log-loss reduction \\\n",
"0 0.662791 1 1.006758 -0.091782 \n",
"\n",
" Test-set entropy (prior Log-Loss/instance) F1 Score AUPRC \n",
"0 0.922123 NaN 0.529486 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Individual scores: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PredictedLabel</th>\n",
" <th>Score</th>\n",
" <th>Probability</th>\n",
" <th>OriginText</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>-0.128473</td>\n",
" <td>0.135174</td>\n",
" <td>@faketwitterid I am sad</td>\n",
" <td>Negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>-0.125328</td>\n",
" <td>0.149867</td>\n",
" <td>@wakeup_you It is a very simple twit I created</td>\n",
" <td>Negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>-0.114177</td>\n",
" <td>0.212648</td>\n",
" <td>@anotherfakeid I would love to see the latest ...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>-0.164016</td>\n",
" <td>0.038579</td>\n",
" <td>Oh my ladygaga! I haven't played tennis for 2 ...</td>\n",
" <td>Negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>-0.091164</td>\n",
" <td>0.394444</td>\n",
" <td>I am heading on a road trip and taking a few d...</td>\n",
" <td>Positive</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PredictedLabel Score Probability \\\n",
"0 0 -0.128473 0.135174 \n",
"1 0 -0.125328 0.149867 \n",
"2 0 -0.114177 0.212648 \n",
"3 0 -0.164016 0.038579 \n",
"4 0 -0.091164 0.394444 \n",
"\n",
" OriginText Sentiment \n",
"0 @faketwitterid I am sad Negative \n",
"1 @wakeup_you It is a very simple twit I created Negative \n",
"2 @anotherfakeid I would love to see the latest ... Positive \n",
"3 Oh my ladygaga! I haven't played tennis for 2 ... Negative \n",
"4 I am heading on a road trip and taking a few d... Positive "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total runtime: 4.03\n"
]
}
],
"source": [
"metrics, scores = ppl.test(testData[\"Text\"], testData[\"Label\"], output_scores = True) #replace with series \n",
"print(\"Performance metrics: \")\n",
"display(metrics)\n",
"print(\"Individual scores: \")\n",
"\n",
"# Append origin text to the score\n",
"scores[\"OriginText\"] = testData[\"Text\"]\n",
"scores[\"Sentiment\"] = testData[\"Sentiment\"]\n",
"\n",
"display(scores[0:5])\n",
"print(\"Total runtime: \" + str(round(time.time() - t0, 2)))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}