{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document (QnA) Matching Data Science Process\n",
"\n",
"## Part 3: TF-IDF and Cosine Similarity - Match Q to Q, then Link to A\n",
"\n",
"### Overview\n",
"\n",
"__Part 3__ of the series shows the process of matching Questions to previously seen Questions, which link to their corresponding Answers, based on the _Cosine Similarity_ of the Questions' _Term Frequency-Inverse Document Frequency (TF-IDF)_ matrix. We believe the answer that resolved the previously seen question could also resolve this new question.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/experiment2_overview.PNG?token=APoO9vukygoptytwwbh67JZ1YSFKG7COks5Ynhv0wA%3D%3D\">\n",
"\n",
"Note: This notebook series are built under Python 3.5 and NLTK 3.2.2."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required Python modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import math\n",
"import numpy as np\n",
"from numpy import linalg as LA\n",
"from azure.storage import CloudStorageAccount\n",
"from IPython.display import display\n",
"\n",
"# suppress all warnings\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read trainQ and testQ into DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/trainQwithTokens.tsv'\n",
"testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/testQwithTokens.tsv'\n",
"\n",
"trainQ = pd.read_csv(trainQ_url, sep='\\t', index_col='Id', encoding='latin1')\n",
"testQ = pd.read_csv(testQ_url, sep='\\t', index_col='Id', encoding='latin1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Tokens to IDs Hash\n",
"\n",
"For each token in the entire vocabulary, we assign it an unique ID."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# get Token to ID mapping: {Token: tokenId}\n",
"def tokens_to_ids(tokens, featureHash):\n",
" token2IdHash = {}\n",
" for i in range(len(tokens)):\n",
" tokenList = tokens.iloc[i].split(',')\n",
" if featureHash is None:\n",
" for t in tokenList:\n",
" if t not in token2IdHash.keys():\n",
" token2IdHash[t] = len(token2IdHash)\n",
" else:\n",
" for t in tokenList:\n",
" if t not in token2IdHash.keys() and t in list(featureHash.keys()):\n",
" token2IdHash[t] = len(token2IdHash)\n",
" \n",
" return token2IdHash"
]
},
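{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional sanity check, the toy cell below (hypothetical data, in the same comma-separated format as the `Tokens` column) shows that IDs are assigned in first-seen order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy check on hypothetical tokens: IDs are assigned in first-seen order\n",
"toyTokens = pd.Series(['how,to,read,file', 'read,csv,file'])\n",
"tokens_to_ids(toyTokens, None)\n",
"# expected: {'how': 0, 'to': 1, 'read': 2, 'file': 3, 'csv': 4}"
]
},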
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"token2IdHashInit = tokens_to_ids(trainQ['Tokens'], None)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of unique tokens in the TrainQ: 4977\n"
]
}
],
"source": [
"print(\"Total number of unique tokens in the TrainQ: \" + str(len(token2IdHashInit)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Count Matrix for Each Token in Each Question"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def count_matrix(frame, token2IdHash, uniqueAnswerId):\n",
" # create am empty matrix with the shape of:\n",
" # num_row = num of unique tokens\n",
" # num_column = num of unique answerIds (N_wA) or num of questions\n",
" # rowIdx = token2IdHash.values()\n",
" # colIdx = index of uniqueAnswerId (N_wA) or index of questions\n",
" num_row = len(token2IdHash)\n",
" if uniqueAnswerId is not None: # get N_wA\n",
" num_column = len(uniqueAnswerId)\n",
" else: \n",
" num_column = len(frame)\n",
" countMatrix = np.empty(shape=(num_row, num_column))\n",
"\n",
" # loop through each question in the frame to fill in the countMatrix with corresponding counts\n",
" for i in range(len(frame)):\n",
" tokens = frame['Tokens'].iloc[i].split(',')\n",
" if uniqueAnswerId is not None: # get N_wA\n",
" answerId = frame['AnswerId'].iloc[i]\n",
" colIdx = uniqueAnswerId.index(answerId)\n",
" else:\n",
" colIdx = i\n",
" \n",
" for t in tokens:\n",
" if t in token2IdHash.keys():\n",
" rowIdx = token2IdHash[t]\n",
" countMatrix[rowIdx, colIdx] += 1\n",
"\n",
" return countMatrix"
]
},
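{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the matrix layout concrete, the optional toy check below (hypothetical data) builds the count matrix for two mini questions over a three-token vocabulary: rows are tokens, columns are questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy check for count_matrix on hypothetical data\n",
"toyFrame = pd.DataFrame({'Tokens': ['a,b,a', 'b,c']})\n",
"toyHash = {'a': 0, 'b': 1, 'c': 2}\n",
"count_matrix(toyFrame, toyHash, None)\n",
"# expected:\n",
"# [[2., 0.],   # 'a' appears twice in question 0\n",
"#  [1., 1.],   # 'b' appears once in each question\n",
"#  [0., 1.]]   # 'c' appears once in question 1"
]
},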
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# calculate the count matrix of all training questions.\n",
"N_wQ = count_matrix(trainQ, token2IdHashInit, None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute IDF Vector\n",
"\n",
"Considering all tokens observed in the training questions and answers, we compute their Inverse Document Frequency based on the below formula.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/idf.PNG?token=APoO9qf3WptQgUPVRQJuOt4cobf56-Y3ks5YnhLSwA%3D%3D\">"
]
},
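{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the image above does not render, the formula, matching the `get_idf` implementation below, is\n",
"\n",
"$$\\mathrm{idf}(w) = \\log\\frac{N}{D_w}$$\n",
"\n",
"where $N$ is the total number of training questions and $D_w$ is the number of questions in which token $w$ appears."
]
},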
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_idf(N_wQ):\n",
" # N is total number of documents in the corpus\n",
" # N_V is the number of tokens in the vocabulary\n",
" N_V, N = N_wQ.shape\n",
" # D is the number of documents where the token w appears\n",
" D = np.empty(shape=(0, N_V))\n",
" for i in range(N_V):\n",
" D = np.append(D, len(np.nonzero(N_wQ[i, ])[0]))\n",
" return np.log(N/D)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"idf = get_idf(N_wQ)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate Normalized TF of Each Word w in Training and Test Sets\n",
"\n",
"Each document d is typically represented by a feature vector x that represents the contents of d. Because different documents can have different lengths, it can be useful to apply L1 normalmalized feature vector x. Therefore, a normalized Term Frequency matrix can be obtained based on the below formula.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tf.PNG?token=APoO9tMyEVzqoUJYT9ALcdF3_BryHHEVks5YnIQywA%3D%3D\">"
]
},
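{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the image above does not render, the formula, matching the `normalize_tf` implementation below, is\n",
"\n",
"$$x_w = \\frac{N_{w,d}}{\\sum_{v \\in V} N_{v,d}}$$\n",
"\n",
"where $N_{w,d}$ is the count of token $w$ in document $d$, and the denominator is the total token count of $d$."
]
},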
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def normalize_tf(frame, token2IdHash):\n",
" N_w = count_matrix(frame, token2IdHash, uniqueAnswerId=None)\n",
" # get the column sum of the count matrix\n",
" N_W = np.sum(N_w, axis=0)\n",
" \n",
" # find the index where N_WQ is zero\n",
" zeroIdx = np.where(N_W == 0)[0]\n",
" \n",
" # if N_W is zero, then the x_w for that particular question/answer would be zero.\n",
" # for a simple calculation, we convert the N_WQ to 1 in those cases so the demoninator is not zero. \n",
" if len(zeroIdx) > 0:\n",
" N_W[zeroIdx] = 1\n",
" \n",
" # x_w = P_wd = count(w)/sum(count(i in V))\n",
" x_w = N_w / N_W\n",
" \n",
" return x_w"
]
},
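{
"cell_type": "markdown",
"metadata": {},
"source": [
"Continuing the earlier toy check (hypothetical data), each column of the normalized TF matrix should sum to 1, since every count is divided by its question's total token count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy check for normalize_tf: each column sums to 1\n",
"normalize_tf(pd.DataFrame({'Tokens': ['a,b,a', 'b,c']}), {'a': 0, 'b': 1, 'c': 2})\n",
"# expected (approximately):\n",
"# [[0.667, 0. ],\n",
"#  [0.333, 0.5],\n",
"#  [0.   , 0.5]]"
]
},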
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# calculate tf matrix on trainQ and testQ independently.\n",
"x_wTest = normalize_tf(testQ, token2IdHashInit)\n",
"x_wTrain = normalize_tf(trainQ, token2IdHashInit)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute TF-IDF Matrix\n",
"\n",
"By knowing the Term Frequency (TF) matrix and Inverse Document Frequency (IDF) vector, we can simply compute TF-IDF matrix by multiplying them together.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tfidf.PNG?token=APoO9gw3rPhLusbG3if65TuVZNAnyqTCks5YnhWPwA%3D%3D\">"
]
},
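{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the image above does not render, the formula, matching the computation below, is\n",
"\n",
"$$\\mathrm{tfidf}(w, d) = x_{w,d} \\cdot \\mathrm{idf}(w)$$"
]
},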
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tfidfTest = (x_wTest.T * idf).T\n",
"tfidfTrain = (x_wTrain.T * idf).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate Cosine Similarity between Training and Test Sets\n",
"\n",
"For each question in the Test set, we compute its Consine Similarity against all questions in the Training set. This similarity is a score that we use to consider how similar a question and an answer is. \n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/cosine.PNG?token=APoO9lOcKkP_7tFDh7p6KZXXVmokwLbGks5YnhY-wA%3D%3D\">"
]
},
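{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the image above does not render, the formula, matching the `cosine_similarity` implementation below, is\n",
"\n",
"$$\\mathrm{sim}(q_1, q_2) = \\frac{q_1 \\cdot q_2}{\\lVert q_1 \\rVert \\, \\lVert q_2 \\rVert}$$\n",
"\n",
"where $q_1$ and $q_2$ are the TF-IDF column vectors of two questions."
]
},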
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def consine_similarity(tfidfL, tfidfR):\n",
" # calculate the dot product of two tfidf arrays\n",
" N = np.dot(tfidfL.T, tfidfR)\n",
" # calculate the norm of each tfidf array\n",
" normL = LA.norm(tfidfL, axis = 0)\n",
" normR = LA.norm(tfidfR, axis = 0)\n",
" similarity = (N.T/normL).T/normR\n",
" \n",
" return similarity"
]
},
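{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check on toy vectors (hypothetical data), identical columns should score 1.0 and orthogonal columns 0.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy check: columns are documents\n",
"toyL = np.array([[1.0, 0.0], [0.0, 1.0]])  # two orthogonal unit documents\n",
"toyR = np.array([[1.0], [0.0]])            # identical to the first column of toyL\n",
"cosine_similarity(toyL, toyR)\n",
"# expected: [[1.], [0.]]"
]
},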
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# calculate similarity scores of each question in Test set against all training questions. \n",
"simScores = consine_similarity(tfidfTest, tfidfTrain)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Rank the Cosine Similarity and Calcualte Average Rank \n",
"\n",
"We use two evaluation matrices to test our model performance. For each question in the test set, we calculate a Cosine Similarity score against each question in the training set. For that test question, we then calculate the average Cosine Similarities per AnswerId. Based on the average similarities, we rank the answers to calculate Average Rank and Top 10 Percentage in the Test set using the below formula:\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D\">"
]
},
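{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the image above does not render, the two metrics, matching the evaluation code below, are\n",
"\n",
"$$\\mathrm{AverageRank} = \\frac{1}{|T|} \\sum_{q \\in T} \\mathrm{rank}(q) \\qquad \\mathrm{Top10} = \\frac{|\\{q \\in T : \\mathrm{rank}(q) \\le 10\\}|}{|T|}$$\n",
"\n",
"where $T$ is the test set and $\\mathrm{rank}(q)$ is the position of question $q$'s true AnswerId in the similarity-sorted answer list."
]
},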
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# for each question in test set, calculate the average cosine similarities of training questions per answer group\n",
"def average_similarity(frame, simScores, uniqueAnswerId):\n",
" # create an empty matrix to store the average similarities per answer group\n",
" # shape = (num_testQ, num_uniqueAnswerId)\n",
" avgSimScores = np.empty(shape = (simScores.shape[0], len(uniqueAnswerId)))\n",
"\n",
" # get indices of all answerIds\n",
" answerIdxs = []\n",
"\n",
" for i in np.array(frame['AnswerId']):\n",
" if i not in uniqueAnswerId:\n",
" continue\n",
" else:\n",
" answerIdxs.append(uniqueAnswerId.index(i))\n",
" \n",
" # for each question in test set, calculate the average similarity per answer group\n",
" for j in range(len(avgSimScores)):\n",
" avgSimScores[j, ] = np.bincount(answerIdxs, weights = simScores[j, ]) / np.bincount(answerIdxs)\n",
" \n",
" return avgSimScores"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# sort the similarity scores in descending order and map them to the corresponding AnswerId in Answer set\n",
"def rank(frame, scores, uniqueAnswerId):\n",
" frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])\n",
" \n",
" rankList = []\n",
" for i in range(len(frame)):\n",
" rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)\n",
" frame['Rank'] = rankList\n",
" \n",
" return frame"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# get unique answerId in ascending order.\n",
"uniqueAnswerId = list(np.unique(trainQ['AnswerId']))\n",
"# get average cosine similarity.\n",
"avgSimScores = average_similarity(trainQ, simScores, uniqueAnswerId)\n",
"# calculate the rank of each question in Test set.\n",
"testQ = rank(testQ, avgSimScores, uniqueAnswerId)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average of rank: 70.0\n",
"Total number of questions in test set: 3671\n",
"Total number of answers: 1275\n",
"Total number of unique features: 4977\n",
"Percentage of questions find answers in top 10: 0.385\n"
]
}
],
"source": [
"# average of rank\n",
"print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))\n",
"print('Total number of questions in test set: ' + str(len(testQ)))\n",
"print('Total number of answers: ' + str(len(uniqueAnswerId)))\n",
"print('Total number of unique features: ' + str(len(token2IdHashInit)))\n",
"print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check some results.\n",
"# uncomment it to execute.\n",
"# testQ.query('Rank <= 3')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results and Conclusion of Part 2 and 3\n",
"\n",
"As we have experimented two approaches in Part 2 and 3, we notice that using the answer of a similar question to resolve a new question is a better approach (Part 3). There are substantially more words in common among questions than that between questions and answers. \n",
"\n",
"We also experiment a different phrases learning apporach by leveraging the [phrase detection functions in Gensim package](https://radimrehurek.com/gensim/models/phrases.html). The results are less favorable than our custom approach described in Part 1.\n",
"\n",
"Here is a comparison matrix of two phrases learning approaches and two experiments (part 2 and part 3).\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tfidf_results.PNG?token=APoO9phH-e-G9nS_23ej8qjyXP1V6cAdks5Yni46wA%3D%3D\">"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}