diff --git a/sentence_similarity/notebooks/02-model/baseline_deep_dive.ipynb b/sentence_similarity/notebooks/02-model/baseline_deep_dive.ipynb index 83daeb6..d258082 100644 --- a/sentence_similarity/notebooks/02-model/baseline_deep_dive.ipynb +++ b/sentence_similarity/notebooks/02-model/baseline_deep_dive.ipynb @@ -15,7 +15,7 @@ "\n", "Producing a baseline model is crucial for evaluating your model's performance on any machine learning problem. A baseline model is a basic solution to a machine learning problem that serves as a point of reference for comparing other models to. The baseline model's performance gives us an indication of how much better our models can perform relative to a naive approach. \n", "\n", - "Let's say we are building a sentence similarity model where our training set contains pairs of sentences and we want to predict how similiar these sentences are on a scale from 1-5. We could spend months producing a complex machine learning solution to this problem and ultimately get a mean squared error (MSE) of 0.4. But is this result good or bad? There is no way of knowing without comparing it with some baseline performance. For our baseline model, we could predict the mean sentence similarity of sentence pairs in our training set (called the _zero rule_) and get a MSE of 0.2. So our model beats the baseline! If our model's performance was worse than the baseline, this might give us pause to consider using different features, models, evaluation metrics etc. It is crucial that the choice of baseline model be tailored to a data science problem based on buisness goals and the specific modeling task." + "Let's say we are building a sentence similarity model where our training set contains pairs of sentences and we want to predict how similiar these sentences are on a scale from 1-5. We could spend months producing a complex machine learning solution to this problem and ultimately get a mean squared error (MSE) of 0.3. But is this result good or bad? There is no way of knowing without comparing it with some baseline performance. For our baseline model, we could predict the mean sentence similarity of sentence pairs in our training set (called the _zero rule_) and get a MSE of 0.35. So our model is worse than the baseline which indicates that we may want to consider using different features, models, evaluation metrics, etc. It is crucial that the choice of baseline model be tailored to a data science problem based on buisness goals and the specific modeling task." ] }, { @@ -24,7 +24,7 @@ "source": [ "### What are good baselines for sentence similarity?\n", "\n", - "For sentence similarity problems, we have two sub-tasks: 1) First, we need to produce a representation of each sentence in the sentence pair. This representation is called an **embedding**. This embedding allows us to represent a sentence with numbers versus words. Specifically, we're learning an n-dimensional vector of numbers that can represent the given sentence. 2) Second, we need to compute the similarity between these two sentence embeddings.\n", + "For sentence similarity problems, we have two sub-tasks: 1) First, we need to produce a vector representation of each sentence in the sentence pair, known as an **embedding**. 2) Second, we need to compute the similarity between these two sentence embeddings.\n", "\n", "For producing representations of sentences, there are some common baseline approaches: \n", "1. 
Create word embeddings for each word in a sentence\n", @@ -72,7 +72,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 44, "metadata": {}, "outputs": [], "source": [ @@ -114,32 +114,33 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "sys.path.append(\"../../../\") ## set the environment path\n", "BASE_DATA_PATH = \"../../../data\"\n", "\n", - "from utils_nlp.dataset.stsbenchmark import STSBenchmark" + "from utils_nlp.dataset.preprocess import to_lowercase, to_spacy_tokens\n", + "from utils_nlp.dataset import stsbenchmark" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 46, "metadata": { "scrolled": true }, "outputs": [], "source": [ - "# Initializing this instance runs the downloader and extractor behind the scenes, then convert to dataframe\n", - "stsTrain = STSBenchmark(\"train\", base_data_path=BASE_DATA_PATH).as_dataframe()\n", - "stsTest = STSBenchmark(\"test\", base_data_path=BASE_DATA_PATH).as_dataframe()" + "# Produce a pandas dataframe for the training and test sets\n", + "stsTrain = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")\n", + "stsTest = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"test\")" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 47, "metadata": {}, "outputs": [ { @@ -158,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 48, "metadata": {}, "outputs": [ { @@ -278,7 +279,7 @@ "9 A man is playing a trumpet. " ] }, - "execution_count": 5, + "execution_count": 48, "metadata": {}, "output_type": "execute_result" } @@ -298,7 +299,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Our baseline models will expect that each sentence is represented by a list of **tokens**. Tokens are linguistic units like words, punctuation marks, numbers, etc. We'll use the nltk package which is popular for performing tokenization." + "Our baseline models will expect that each sentence is represented by a list of **tokens**. Tokens are linguistic units like words, punctuation marks, numbers, etc. We'll use our util functions which utilize the spaCy package, a popular package for performing tokenization." ] }, { @@ -310,7 +311,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 49, "metadata": {}, "outputs": [], "source": [ @@ -323,26 +324,30 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# train preprocessing\n", "df_low = to_lowercase(stsTrain) # covert all text to lowercase\n", "sts_tokenize = to_spacy_tokens(df_low) # tokenize normally\n", - "sts_train = rm_spacy_stopwords(sts_tokenize) # tokenize with removal of stopwords" + "sts_train_stop = rm_spacy_stopwords(sts_tokenize) # tokenize with removal of stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now each row in our dataframe contains the 2 original sentences as well as a column for each sentence's tokenization with stop words and a column for each sentence's tokenization without stop words." 
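For readers unfamiliar with the preprocessing utilities imported above, the sketch below illustrates what per-sentence spaCy tokenization and stop-word removal roughly look like. It is a minimal illustration, not the implementation of `to_spacy_tokens` or `rm_spacy_stopwords`; the `en_core_web_sm` model name and the exact stop-word list are assumptions here.

```python
# Minimal sketch of spaCy tokenization and stop-word filtering for one sentence.
# Illustrative only, not the repo's to_spacy_tokens / rm_spacy_stopwords code.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("a girl is styling her hair.")

tokens = [token.text for token in doc]                    # full tokenization
tokens_no_stop = [t.text for t in doc if not t.is_stop]   # stop words removed

print(tokens)          # e.g. ['a', 'girl', 'is', 'styling', 'her', 'hair', '.']
print(tokens_no_stop)  # e.g. ['girl', 'styling', 'hair', '.']
```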
+ "Now each row in our dataframe contains: \n", + "- The similarity score of the sentence pair\n", + "- The 2 original sentences from our datasets \n", + "- A column for each sentence's tokenization with stop words \n", + "- A column for each sentence's tokenization without stop words" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 51, "metadata": {}, "outputs": [ { @@ -474,168 +479,25 @@ "4 [man, seated, playing, cello, .] " ] }, - "execution_count": 8, + "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "sts_train.head(5)" + "sts_train_stop.head(5)" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 52, "metadata": {}, "outputs": [], "source": [ - "# Repeat process to perform preprocessing for test test\n", + "# Repeat process to perform preprocessing for test set\n", "df_low = to_lowercase(stsTest)\n", "sts_tokenize = to_spacy_tokens(df_low)\n", - "sts_test = rm_spacy_stopwords(sts_tokenize)" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
scoresentence1sentence2sentence1_tokenssentence2_tokenssentence1_tokens_stopsentence2_tokens_stop
02.5a girl is styling her hair.a girl is brushing her hair.[a, girl, is, styling, her, hair, .][a, girl, is, brushing, her, hair, .][girl, styling, hair, .][girl, brushing, hair, .]
13.6a group of men play soccer on the beach.a group of boys are playing soccer on the beach.[a, group, of, men, play, soccer, on, the, bea...[a, group, of, boys, are, playing, soccer, on,...[group, men, play, soccer, beach, .][group, boys, playing, soccer, beach, .]
25.0one woman is measuring another woman's ankle.a woman measures another woman's ankle.[one, woman, is, measuring, another, woman, 's...[a, woman, measures, another, woman, 's, ankle...[woman, measuring, woman, ankle, .][woman, measures, woman, ankle, .]
34.2a man is cutting up a cucumber.a man is slicing a cucumber.[a, man, is, cutting, up, a, cucumber, .][a, man, is, slicing, a, cucumber, .][man, cutting, cucumber, .][man, slicing, cucumber, .]
41.5a man is playing a harp.a man is playing a keyboard.[a, man, is, playing, a, harp, .][a, man, is, playing, a, keyboard, .][man, playing, harp, .][man, playing, keyboard, .]
\n", - "
" - ], - "text/plain": [ - " score sentence1 \\\n", - "0 2.5 a girl is styling her hair. \n", - "1 3.6 a group of men play soccer on the beach. \n", - "2 5.0 one woman is measuring another woman's ankle. \n", - "3 4.2 a man is cutting up a cucumber. \n", - "4 1.5 a man is playing a harp. \n", - "\n", - " sentence2 \\\n", - "0 a girl is brushing her hair. \n", - "1 a group of boys are playing soccer on the beach. \n", - "2 a woman measures another woman's ankle. \n", - "3 a man is slicing a cucumber. \n", - "4 a man is playing a keyboard. \n", - "\n", - " sentence1_tokens \\\n", - "0 [a, girl, is, styling, her, hair, .] \n", - "1 [a, group, of, men, play, soccer, on, the, bea... \n", - "2 [one, woman, is, measuring, another, woman, 's... \n", - "3 [a, man, is, cutting, up, a, cucumber, .] \n", - "4 [a, man, is, playing, a, harp, .] \n", - "\n", - " sentence2_tokens \\\n", - "0 [a, girl, is, brushing, her, hair, .] \n", - "1 [a, group, of, boys, are, playing, soccer, on,... \n", - "2 [a, woman, measures, another, woman, 's, ankle... \n", - "3 [a, man, is, slicing, a, cucumber, .] \n", - "4 [a, man, is, playing, a, keyboard, .] \n", - "\n", - " sentence1_tokens_stop \\\n", - "0 [girl, styling, hair, .] \n", - "1 [group, men, play, soccer, beach, .] \n", - "2 [woman, measuring, woman, ankle, .] \n", - "3 [man, cutting, cucumber, .] \n", - "4 [man, playing, harp, .] \n", - "\n", - " sentence2_tokens_stop \n", - "0 [girl, brushing, hair, .] \n", - "1 [group, boys, playing, soccer, beach, .] \n", - "2 [woman, measures, woman, ankle, .] \n", - "3 [man, slicing, cucumber, .] \n", - "4 [man, playing, keyboard, .] " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sts_test.head(5)" + "sts_test_stop = rm_spacy_stopwords(sts_tokenize)" ] }, { @@ -654,7 +516,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 53, "metadata": {}, "outputs": [], "source": [ @@ -684,17 +546,17 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# note that we need to calculate these values on our training set so that we don't \"peek at\" our test set until test time\n", - "document_frequencies, num_documents = get_document_frequency(sts_train)" + "document_frequencies, num_documents = get_document_frequency(sts_train_stop)" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 55, "metadata": {}, "outputs": [ { @@ -703,7 +565,7 @@ "11498" ] }, - "execution_count": 13, + "execution_count": 55, "metadata": {}, "output_type": "execute_result" } @@ -728,7 +590,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 56, "metadata": {}, "outputs": [], "source": [ @@ -756,19 +618,33 @@ "### What is Word2Vec?\n", "Word2vec is a predictive model for learning word embeddings from text. Word embeddings are learned such that words that share common contexts in the corpus will be close together in the vector space. There are two different model architectures that can be used to produce word2vec embeddings: continuous bag-of-words (CBOW) or continuous skip-gram. The former uses a window of surrounding words (the \"context\") to predict the current word and the latter uses the current word to predict the surrounding context words. See this [tutorial](https://www.guru99.com/word-embedding-word2vec.html#3) on word2vec for more detailed background on the model.\n", "\n", - "For our purposes, we use pretrained word2vec word embeddings. 
These embeddings were trained on a Google News corpus and provide 300-dimensional embeddings (vectors) for 3 million English words. See this link for the original source (https://code.google.com/archive/p/word2vec/) and see the code below to load in these word embeddings." + "For our purposes, we use pretrained word2vec word embeddings. These embeddings were trained on a Google News corpus and provide 300-dimensional embeddings (vectors) for 3 million English words. See this link for the original source (https://code.google.com/archive/p/word2vec/) and see the code below to load these word embeddings." ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "1.65GB [01:00, 27.3MB/s] \n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], "source": [ - "PATH_TO_WORD2VEC = os.path.expanduser(\"GoogleNews-vectors-negative300.bin\")\n", - "word2vec = gensim.models.KeyedVectors.load_word2vec_format(\n", - " PATH_TO_WORD2VEC, binary=True\n", - ")" + "from utils_nlp.pretrained_embeddings import word2vec\n", + "\n", + "word2vec_model = word2vec.load_pretrained_vectors(dir_path=BASE_DATA_PATH)" ] }, { @@ -777,7 +653,7 @@ "source": [ "### What is TF-IDF?\n", "\n", - "TF-IDF or term frequency-inverse document frequency is a weighting scheme intended to measure how important a word is to the document (or sentence in our case) within the broader corpus (our dataset). The weight \" increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus\" ([tutorial link](http://www.tfidf.com/)). When we're averaging together many different word vectors to get a sentence embedding, it makes sense to give stronger weight to words that are more distinct relative to the corpus and that have a high frequency in the sentence. The TF-IDF weights capture this intution, with the weight increasing as term frequency increases and/or as the inverse document frequency increases.\n", + "TF-IDF or term frequency-inverse document frequency is a weighting scheme intended to measure how important a word is to the document (or sentence in our case) within the broader corpus (our dataset). The weight \"increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus\" ([tutorial link](http://www.tfidf.com/)). When we're averaging together many different word vectors to get a sentence embedding, it makes sense to give stronger weight to words that are more distinct relative to the corpus and that have a high frequency in the sentence. 
The TF-IDF weights capture this intution, with the weight increasing as term frequency increases and/or as the inverse document frequency increases.\n", "\n", "For a term $t$ in sentence $s$ in corpus $c$, then the TF-IDF weight is \n", "$$w_{t,s} = TF_{t,s} * \\log{\\frac{N}{df_t}}$$\n", @@ -791,7 +667,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -839,7 +715,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -874,28 +750,29 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ - "def average_word_embedding_cosine_similarity(df, embedding_model, stop_words=True):\n", + "def average_word_embedding_cosine_similarity(df, embedding_model, rm_stopwords=False):\n", " \"\"\"Calculate the cosine similarity between TF-IDF weighted averaged embeddings\n", " \n", " Args:\n", " df (pandas dataframe): dataframe as provided by the nlp_utils\n", " embedding_model (gensim model): word embedding model\n", - " stop_words (bool): whether to use stop words (True) or remove them (False)\n", + " rm_stopwords (bool): whether to use stop words (True) or remove them (False)\n", " \n", " Returns:\n", " list: predicted values for sentence similarity of test set examples\n", " \"\"\"\n", " predictions = []\n", - " if stop_words:\n", - " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", - " else:\n", + " if rm_stopwords:\n", " tokenized_sentences = zip(\n", " df[\"sentence1_tokens_stop\"], df[\"sentence2_tokens_stop\"]\n", " )\n", + " else:\n", + " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", + "\n", "\n", " for (sentence1, sentence2) in tokenized_sentences:\n", " embedding1 = average_sentence_embedding(sentence1, embedding_model)\n", @@ -909,16 +786,16 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# get predictions using average word2vec embeddings both with and without stop words\n", "baselines[\"Word2vec Cosine\"] = average_word_embedding_cosine_similarity(\n", - " sts_test, word2vec, stop_words=False\n", + " sts_test_stop, word2vec_model, rm_stopwords=True\n", ")\n", "baselines[\"Word2vec Cosine with Stop Words\"] = average_word_embedding_cosine_similarity(\n", - " sts_test, word2vec, stop_words=True\n", + " sts_test_stop, word2vec_model, rm_stopwords=False\n", ")" ] }, @@ -960,28 +837,28 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ - "def word_embedding_WMD(df, embedding_model, stop_words=True):\n", + "def word_embedding_WMD(df, embedding_model, rm_stopwords=False):\n", " \"\"\"Calculate Word Mover's Distance between two sentences using embeddings\n", " \n", " Args:\n", " df (pandas dataframe): dataframe as provided by the nlp_utils\n", " embedding_model (gensim model): word embedding model\n", - " stop_words (bool): whether to use stop words (True) or remove them (False)\n", + " rm_stopwords (bool): whether to use stop words (True) or remove them (False)\n", " \n", " Returns:\n", " list: predicted values for sentence similarity of test set examples\n", " \"\"\"\n", " predictions = []\n", - " if stop_words:\n", - " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", - " else:\n", + " if rm_stopwords:\n", " tokenized_sentences = zip(\n", " 
df[\"sentence1_tokens_stop\"], df[\"sentence2_tokens_stop\"]\n", " )\n", + " else:\n", + " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", "\n", " for (sentence1, sentence2) in tokenized_sentences:\n", " # throw away tokens that are not in embeddings model\n", @@ -997,14 +874,14 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# get predictions using word2vec embeddings and WMD both with and without stop words\n", - "baselines[\"Word2vec WMD\"] = word_embedding_WMD(sts_test, word2vec, stop_words=False)\n", + "baselines[\"Word2vec WMD\"] = word_embedding_WMD(sts_test_stop, word2vec_model, rm_stopwords=True)\n", "baselines[\"Word2vec WMD with Stop Words\"] = word_embedding_WMD(\n", - " sts_test, word2vec, stop_words=True\n", + " sts_test_stop, word2vec_model, rm_stopwords=False\n", ")" ] }, @@ -1032,7 +909,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -1043,18 +920,24 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 27, "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "(2196017, 300)" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: 'glove.840B.300d.txt'", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# we need to download the GLoVe file and convert it to word2vec format, this takes a bit of time\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[0mPATH_TO_GLOVE\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mos\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexpanduser\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"glove.840B.300d.txt\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mglove2word2vec\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mPATH_TO_GLOVE\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtmp_file\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\envs\\azureml\\lib\\site-packages\\gensim\\scripts\\glove2word2vec.py\u001b[0m in \u001b[0;36mglove2word2vec\u001b[1;34m(glove_input_file, word2vec_output_file)\u001b[0m\n\u001b[0;32m 102\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 103\u001b[0m \"\"\"\n\u001b[1;32m--> 104\u001b[1;33m \u001b[0mnum_lines\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_dims\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mget_glove_info\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mglove_input_file\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 105\u001b[0m \u001b[0mlogger\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"converting %i vectors from %s to %s\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_lines\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mglove_input_file\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[0mword2vec_output_file\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 106\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0msmart_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mword2vec_output_file\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'wb'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mfout\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\envs\\azureml\\lib\\site-packages\\gensim\\scripts\\glove2word2vec.py\u001b[0m in \u001b[0;36mget_glove_info\u001b[1;34m(glove_file_name)\u001b[0m\n\u001b[0;32m 79\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 80\u001b[0m \"\"\"\n\u001b[1;32m---> 81\u001b[1;33m \u001b[1;32mwith\u001b[0m \u001b[0msmart_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mglove_file_name\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 82\u001b[0m \u001b[0mnum_lines\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msum\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 83\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0msmart_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mglove_file_name\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\envs\\azureml\\lib\\site-packages\\smart_open\\smart_open_lib.py\u001b[0m in \u001b[0;36msmart_open\u001b[1;34m(uri, mode, **kw)\u001b[0m\n\u001b[0;32m 398\u001b[0m \u001b[0mtransport_params\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 399\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 400\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0muri\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mignore_ext\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mignore_extension\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtransport_params\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mtransport_params\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mscrubbed_kwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 401\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 402\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\envs\\azureml\\lib\\site-packages\\smart_open\\smart_open_lib.py\u001b[0m in \u001b[0;36mopen\u001b[1;34m(uri, mode, buffering, encoding, errors, newline, closefd, opener, ignore_ext, transport_params)\u001b[0m\n\u001b[0;32m 298\u001b[0m \u001b[0mbuffering\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mbuffering\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 299\u001b[0m \u001b[0mencoding\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mencoding\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 300\u001b[1;33m 
\u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 301\u001b[0m )\n\u001b[0;32m 302\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfobj\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\envs\\azureml\\lib\\site-packages\\smart_open\\smart_open_lib.py\u001b[0m in \u001b[0;36m_shortcut_open\u001b[1;34m(uri, mode, ignore_ext, buffering, encoding, errors)\u001b[0m\n\u001b[0;32m 457\u001b[0m \u001b[1;31m#\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 458\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0msix\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mPY3\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 459\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0m_builtin_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mparsed_uri\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0muri_path\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbuffering\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mbuffering\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mopen_kwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 460\u001b[0m \u001b[1;32melif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mopen_kwargs\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 461\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0m_builtin_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mparsed_uri\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0muri_path\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbuffering\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mbuffering\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'glove.840B.300d.txt'" + ] } ], "source": [ @@ -1081,10 +964,10 @@ "source": [ "# get predictions using glove embeddings and cosine similarity both with and without stop words\n", "baselines[\"GLoVe Cosine\"] = average_word_embedding_cosine_similarity(\n", - " sts_test, glove, stop_words=False\n", + " sts_test_stop, glove, rm_stopwords=True\n", ")\n", "baselines[\"GLoVe Cosine with Stop Words\"] = average_word_embedding_cosine_similarity(\n", - " sts_test, glove, stop_words=True\n", + " sts_test_stop, glove, rm_stopwords=False\n", ")" ] }, @@ -1109,9 +992,9 @@ "outputs": [], "source": [ "# get predictions using glove embeddings and WMD both with and without stop words\n", - "baselines[\"GLoVe WMD\"] = word_embedding_WMD(sts_test, glove, stop_words=False)\n", + "baselines[\"GLoVe WMD\"] = word_embedding_WMD(sts_test_stop, glove, rm_stopwords=True)\n", "baselines[\"GLoVe WMD with Stop Words\"] = word_embedding_WMD(\n", - " sts_test, glove, stop_words=True\n", + " sts_test_stop, glove, rm_stopwords=False\n", ")" ] }, @@ -1161,10 +1044,10 @@ "source": [ "# get predictions using fastText embeddings and cosine similarity both with and without stop words\n", "baselines[\"fastText Cosine\"] = average_word_embedding_cosine_similarity(\n", - " sts_test, fastText, stop_words=False\n", + " sts_test_stop, fastText, rm_stopwords=True\n", ")\n", "baselines[\"fastText Cosine with Stop Words\"] = 
average_word_embedding_cosine_similarity(\n", - " sts_test, fastText, stop_words=True\n", + " sts_test_stop, fastText, rm_stopwords=False\n", ")" ] }, @@ -1182,9 +1065,9 @@ "outputs": [], "source": [ "# get predictions using fastText embeddings and WMD both with and without stop words\n", - "baselines[\"fastText WMD\"] = word_embedding_WMD(sts_test, fastText, stop_words=False)\n", + "baselines[\"fastText WMD\"] = word_embedding_WMD(sts_test_stop, fastText, rm_stopwords=True)\n", "baselines[\"fastText WMD with Stop Words\"] = word_embedding_WMD(\n", - " sts_test, fastText, stop_words=True\n", + " sts_test_stop, fastText, rm_stopwords=False\n", ")" ] }, @@ -1208,74 +1091,1244 @@ "source": [ "### Bag of Words\n", "\n", - "The most basic approach for document embeddings is called Bag-of-Words. This method requires first determines the vocabulary across the entire corpus and then, for each document, creatwa a vector containing the number of times each vocabulary word appeared in the given document. These vectors are obviously very sparse and typical bag of words implementations ignore terms whose document frequency is less than some threshold in order to reduce sparsity. We also often ignore stop words as they add little semantic information. " + "The most basic approach for document embeddings is called Bag-of-Words. This method first determines the vocabulary across the entire corpus and then, for each document, creates a vector containing the number of times each vocabulary word appeared in the given document. These vectors are obviously very sparse and typical bag of words implementations ignore terms whose document frequency is less than some threshold in order to reduce sparsity. We also often ignore stop words as they add little semantic information. 
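As a concrete illustration of this bag-of-words idea, the sketch below builds count vectors with scikit-learn's `CountVectorizer`. The `min_df` threshold and the built-in English stop-word list are illustrative, assumed settings; the notebook itself moves straight to the TF-IDF variant in the next cell.

```python
# Illustrative bag-of-words sketch with scikit-learn (not the TF-IDF code used below).
# min_df=2 drops terms that appear in fewer than two documents; stop_words="english"
# removes common function words. Both settings are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "a girl is styling her hair.",
    "a girl is brushing her hair.",
    "a man is playing a harp.",
]

vectorizer = CountVectorizer(min_df=2, stop_words="english")
bow = vectorizer.fit_transform(corpus)   # sparse (n_documents x n_vocabulary) count matrix

print(sorted(vectorizer.vocabulary_))    # vocabulary surviving the min_df / stop-word filters
print(bow.toarray())                     # one count vector per sentence
```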
" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 106, "metadata": {}, "outputs": [], "source": [ - "def tfidf_cosine_similarity(df, stop_words=True):\n", + "def tfidf_cosine_similarity(df, rm_stopwords=False):\n", " \"\"\"Calculate cosine similarity between TF-IDF document embeddings\n", " \n", " Args:\n", " df (pandas dataframe): dataframe as provided by the nlp_utils\n", - " stop_words (bool): whether to use stop words (True) or remove them (False)\n", + " rm_stopwords (bool): whether to remove stop words or not\n", " \n", " Returns:\n", " list: predicted values for sentence similarity of test set examples\n", " \"\"\"\n", - " if stop_words:\n", - " tf = TfidfVectorizer(\n", - " input=\"content\",\n", - " analyzer=\"word\",\n", - " min_df=0,\n", - " stop_words=\"english\",\n", - " sublinear_tf=True,\n", - " )\n", - " else:\n", - " tf = TfidfVectorizer(\n", - " input=\"content\",\n", - " analyzer=\"word\",\n", - " min_df=0,\n", - " stop_words=None,\n", - " sublinear_tf=True,\n", - " )\n", + " stop_word_param = 'english' if rm_stopwords else None\n", + " \n", + " tf = TfidfVectorizer(\n", + " input=\"content\",\n", + " analyzer=\"word\",\n", + " min_df=0,\n", + " stop_words=stop_word_param,\n", + " sublinear_tf=True,\n", + " )\n", "\n", - " all_sentences = list(df[\"sentence1\"].append(df[\"sentence2\"]))\n", - " tfidf_matrix = tf.fit_transform(all_sentences)\n", - "\n", - " n = 0\n", - " predictions = []\n", - " for (sentence1, sentence2) in zip(df[\"sentence1\"], df[\"sentence2\"]):\n", - " sentence1_idx = n\n", - " sentence2_idx = len(sts_test[\"sentence1\"]) + n\n", - " sentence1_tfidf = list(tfidf_matrix.getrow(sentence1_idx).toarray()[0])\n", - " sentence2_tfidf = list(tfidf_matrix.getrow(sentence2_idx).toarray()[0])\n", - " if sum(sentence1_tfidf) == 0 or sum(sentence2_tfidf) == 0:\n", - " predictions.append(0)\n", - " else:\n", - " predictions.append(\n", - " calculate_cosine_similarity(sentence1_tfidf, sentence2_tfidf)\n", - " )\n", - " n += 1\n", - " return predictions" + " all_sentences = df[[\"sentence1\", \"sentence2\"]]\n", + " corpus = all_sentences.values.flatten().tolist()\n", + " tfidf_matrix = np.array(tf.fit_transform(corpus).todense())\n", + " \n", + " df['sentence1_tfidf'] = df.apply(lambda x: tfidf_matrix[2*x.name,:], axis=1)\n", + " df['sentence2_tfidf'] = df.apply(lambda x: tfidf_matrix[2*x.name+1,:], axis=1)\n", + " df['predictions'] = df.apply(lambda x: calculate_cosine_similarity(x.sentence1_tfidf, x.sentence2_tfidf) if \n", + " (sum(x.sentence1_tfidf) != 0 and sum(x.sentence2_tfidf) != 0) else 0,axis=1)\n", + " return df['predictions'].tolist()" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 76, "metadata": {}, "outputs": [], "source": [ - "baselines[\"TF-IDF Cosine\"] = tfidf_cosine_similarity(sts_test, stop_words=False)\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
014
125
236
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 4\n", + "1 2 5\n", + "2 3 6" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})\n", + "ex" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abindexsent1sent2
014001
125123
236245
\n", + "
" + ], + "text/plain": [ + " a b index sent1 sent2\n", + "0 1 4 0 0 1\n", + "1 2 5 1 2 3\n", + "2 3 6 2 4 5" + ] + }, + "execution_count": 100, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ex['sent1'] = ex.apply(lambda x: 2*x.name,axis=1)\n", + "ex['sent2'] = ex.apply(lambda x: 2*x.name+1,axis=1)\n", + "\n", + "ex" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "baselines[\"TF-IDF Cosine\"] = tfidf_cosine_similarity(sts_test_stop, rm_stopwords=True)\n", "baselines[\"TF-IDF Cosine with Stop Words\"] = tfidf_cosine_similarity(\n", - " sts_test, stop_words=True\n", + " sts_test_stop, rm_stopwords=False\n", ")" ] }, + { + "cell_type": "code", + "execution_count": 108, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.5154669367957403,\n", + " 0.6238376709093325,\n", + " 0.5983062335986293,\n", + " 0.7102615298532333,\n", + " 0.3351548713228747,\n", + " 0.4586765374374362,\n", + " 0.7453648441043782,\n", + " 0.4332143047353778,\n", + " 0.6274081602186135,\n", + " 0.4102968379263213,\n", + " 0.4102968379263213,\n", + " 0.5253589511249759,\n", + " 0.17651528643337078,\n", + " 0.5499522389851709,\n", + " 0.3787193514643298,\n", + " 0.4332143047353778,\n", + " 0.5423224039397149,\n", + " 0.7519050657939645,\n", + " 0.30617844695350493,\n", + " 0.12920385409392754,\n", + " 0.5115388235096275,\n", + " 0.12571551261385527,\n", + " 0.671208418075113,\n", + " 0.6147994882540094,\n", + " 0.3085863437907099,\n", + " 0.8665043113349088,\n", + " 0.3499638723495143,\n", + " 0.4485303934478104,\n", + " 0.14412070093774276,\n", + " 0.7223992863241857,\n", + " 0.42977927537233007,\n", + " 0.6687393429045153,\n", + " 0.09883942492880626,\n", + " 0.8006601194016243,\n", + " 0.3915728034097792,\n", + " 0.12959796544910263,\n", + " 0.07074855491798504,\n", + " 0.24392238095706376,\n", + " 0.5797213695434924,\n", + " 0.1455830629990571,\n", + " 0.5774415169951077,\n", + " 0.7562038442807758,\n", + " 0.7519050657939645,\n", + " 0.14623990642389495,\n", + " 0.271411211816421,\n", + " 0.10679246383512553,\n", + " 0.12436018172562091,\n", + " 0.24282578702512114,\n", + " 0.649298946993787,\n", + " 0.48951256029075396,\n", + " 0.22269264312418413,\n", + " 0.13418067046494309,\n", + " 0.1475892907985462,\n", + " 0.11919014738377365,\n", + " 0.11758015372101038,\n", + " 0.0,\n", + " 0.5851543274311135,\n", + " 0.1308592565135699,\n", + " 0.7699878532034286,\n", + " 0.3204871058453277,\n", + " 1.0,\n", + " 0.0,\n", + " 0.5113876993851364,\n", + " 0.11301699334082715,\n", + " 0.4844516290250571,\n", + " 0.08881150230557444,\n", + " 0.0,\n", + " 0.5991676539776296,\n", + " 0.0,\n", + " 0.42156705698583274,\n", + " 0.5269248364657431,\n", + " 0.3655237377185544,\n", + " 0.0,\n", + " 0.6344119096586092,\n", + " 0.4153702218957458,\n", + " 0.0,\n", + " 0.17659756230874368,\n", + " 0.0,\n", + " 0.5773711547137133,\n", + " 0.3400513696315266,\n", + " 0.0,\n", + " 0.1831492472287829,\n", + " 0.23854126483382898,\n", + " 0.1511207255960434,\n", + " 0.14677017202961828,\n", + " 0.10406317742927851,\n", + " 0.6201409333213322,\n", + " 0.34867041329567816,\n", + " 0.5558631416993667,\n", + " 0.23888620151049622,\n", + " 0.0,\n", + " 0.8132717568057986,\n", + " 0.0,\n", + " 0.5131764335105033,\n", + " 0.08112645666290907,\n", + " 0.0,\n", + " 0.40984286310643414,\n", + " 0.0,\n", + " 0.15163144931355121,\n", + " 0.11562148003468309,\n", + " 0.0,\n", + " 0.05819882110600594,\n", + " 
0.4710984118834234,\n", + " 0.5542331231354969,\n", + " 0.7191415608695886,\n", + " 0.3410871827572157,\n", + " 0.7386519015711646,\n", + " 0.6307458151677343,\n", + " 0.4723052628959741,\n", + " 0.5677336572409835,\n", + " 0.5363224977280127,\n", + " 0.6316583298793843,\n", + " 0.5962024113137894,\n", + " 0.6792949324840446,\n", + " 0.5007135556817937,\n", + " 0.6480747898496585,\n", + " 0.2673073284349996,\n", + " 0.6032449647223925,\n", + " 0.18796309414600543,\n", + " 0.7821756390339755,\n", + " 0.5526488375164476,\n", + " 0.565435157154873,\n", + " 0.5323569163048228,\n", + " 0.6165617133797364,\n", + " 0.5288543425424144,\n", + " 0.612552821716356,\n", + " 0.40981021123878547,\n", + " 0.8152230679662728,\n", + " 0.5349892172155912,\n", + " 0.7227741709423376,\n", + " 0.44535307964767035,\n", + " 0.7335916608467525,\n", + " 0.5219136535289064,\n", + " 0.5586310799918579,\n", + " 0.5985145021081398,\n", + " 0.5382221617352677,\n", + " 0.5880451250267861,\n", + " 0.34701572024027594,\n", + " 0.7615972746341745,\n", + " 0.6217087120235364,\n", + " 0.6562729790958813,\n", + " 0.564568429578883,\n", + " 0.7467645246032611,\n", + " 0.5088259536847448,\n", + " 0.38737964351750953,\n", + " 0.5061355183858008,\n", + " 0.5462005583632086,\n", + " 0.6059840012048179,\n", + " 0.641149740248289,\n", + " 0.2592717713379644,\n", + " 0.7107322732626838,\n", + " 0.8516988474931363,\n", + " 0.44198994732909314,\n", + " 0.5458936951931888,\n", + " 0.5033306169957211,\n", + " 0.2515293168803814,\n", + " 0.5947836969882396,\n", + " 0.5748299095373731,\n", + " 0.6082228431227119,\n", + " 0.44248204691962334,\n", + " 0.525599280438097,\n", + " ...]" + ] + }, + "execution_count": 108, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "baselines[\"TF-IDF Cosine with Stop Words\"]" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1320,8 +2373,8 @@ " return id_dict[row]\n", "\n", "\n", - "sts_test[\"qid1\"] = sts_test[\"sentence1\"].apply(assign_id)\n", - "sts_test[\"qid2\"] = sts_test[\"sentence2\"].apply(assign_id)" + "sts_test_stop[\"qid1\"] = sts_test_stop[\"sentence1\"].apply(assign_id)\n", + "sts_test_stop[\"qid2\"] = sts_test_stop[\"sentence2\"].apply(assign_id)" ] }, { @@ -1330,7 +2383,7 @@ "metadata": {}, "outputs": [], "source": [ - "def doc2vec_cosine(df, stop_words=True):\n", + "def doc2vec_cosine(df, rm_stopwords=False):\n", " \"\"\"Calculate cosine similarity between each sentence pair using Doc2Vec embeddings\n", " \n", " Args:\n", @@ -1341,12 +2394,12 @@ " list: predicted values for sentence similarity of test set examples\n", " \"\"\"\n", " predictions = []\n", - " if stop_words:\n", - " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", - " else:\n", + " if rm_stopwords:\n", " tokenized_sentences = zip(\n", " df[\"sentence1_tokens_stop\"], df[\"sentence2_tokens_stop\"]\n", " )\n", + " else:\n", + " tokenized_sentences = zip(df[\"sentence1_tokens\"], df[\"sentence2_tokens\"])\n", "\n", " labeled_questions = []\n", " sentences = list(tokenized_sentences)\n", @@ -1386,8 +2439,8 @@ "metadata": {}, "outputs": [], "source": [ - "baselines[\"Doc2vec Cosine\"] = doc2vec_cosine(sts_test, stop_words=False)\n", - "baselines[\"Doc2vec Cosine with Stop Words\"] = doc2vec_cosine(sts_test, stop_words=True)" + "baselines[\"Doc2vec Cosine\"] = doc2vec_cosine(sts_test_stop, rm_stopwords=True)\n", + "baselines[\"Doc2vec Cosine with Stop Words\"] = doc2vec_cosine(sts_test_stop, rm_stopwords=False)" ] }, { @@ -1410,7 +2463,7 @@ }, { "cell_type": 
"code", - "execution_count": 36, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ @@ -1430,36 +2483,22 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Word2vec Cosine 0.6337760059182685\n", - "Word2vec Cosine with Stop Words 0.647674307797345\n", - "Word2vec WMD 0.6578256301323717\n", - "Word2vec WMD with Stop Words 0.5697910628727219\n", - "GLoVe Cosine 0.642064729899729\n", - "GLoVe Cosine with Stop Words 0.5639670964748242\n", - "GLoVe WMD 0.6272339050920003\n", - "GLoVe WMD with Stop Words 0.48560149551724\n", - "fastText Cosine 0.6288780924569854\n", - "fastText Cosine with Stop Words 0.5958470751204787\n", - "fastText WMD 0.5275208457920849\n", - "fastText WMD with Stop Words 0.44198752510004097\n", "TF-IDF Cosine 0.7034695168223283\n", - "TF-IDF Cosine with Stop Words 0.6683811410442564\n", - "Doc2vec Cosine 0.5246890826752258\n", - "Doc2vec Cosine with Stop Words 0.36591781773894305\n" + "TF-IDF Cosine with Stop Words 0.6683811410442564\n" ] } ], "source": [ "# Get metrics on predictions from all models\n", "for model in baselines:\n", - " print(model, pearson_correlation(sts_test, baselines[model]))" + " print(model, pearson_correlation(sts_test_stop, baselines[model]))" ] }, { diff --git a/sentence_similarity/notebooks/02-model/embedding_trainer.ipynb b/sentence_similarity/notebooks/02-model/embedding_trainer.ipynb index 2c695ae..fcb233e 100644 --- a/sentence_similarity/notebooks/02-model/embedding_trainer.ipynb +++ b/sentence_similarity/notebooks/02-model/embedding_trainer.ipynb @@ -33,7 +33,8 @@ "outputs": [], "source": [ "import gensim\n", - "import sys" + "import sys\n", + "import os" ] }, { @@ -50,15 +51,18 @@ "outputs": [], "source": [ "sys.path.append(\"../../../\") ## set the environment path\n", - "\n", "BASE_DATA_PATH = \"../../../data\"\n", + "SAVE_FILES_PATH = BASE_DATA_PATH + \"/trained_word_embeddings/\"\n", "\n", - "from utils_nlp.dataset.stsbenchmark import STSBenchmark\n", + "if not os.path.exists(SAVE_FILES_PATH):\n", + " os.makedirs(SAVE_FILES_PATH)\n", + " \n", "from utils_nlp.dataset.preprocess import (\n", " to_lowercase,\n", " to_spacy_tokens,\n", " rm_spacy_stopwords,\n", - ")" + ")\n", + "from utils_nlp.dataset import stsbenchmark" ] }, { @@ -67,8 +71,8 @@ "metadata": {}, "outputs": [], "source": [ - "# Initializing this instance runs the downloader and extractor behind the scenes, then convert to dataframe\n", - "stsTrain = STSBenchmark(\"train\", base_data_path=BASE_DATA_PATH).as_dataframe()" + "# Produce a pandas dataframe for the training set\n", + "stsTrain = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")" ] }, { @@ -180,31 +184,23 @@ "name": "stdout", "output_type": "stream", "text": [ - "Embedding for apple: [-1.30064473e-01 1.84295833e-01 -1.53965428e-01 -9.69498605e-02\n", - " 4.99420874e-02 -1.23197936e-01 7.28140250e-02 -4.12699208e-02\n", - " 2.47626036e-01 -2.69805547e-04 -7.65557750e-04 2.08947986e-01\n", - " 7.81186996e-03 5.42742060e-03 5.25087006e-02 2.47807354e-01\n", - " -2.48165410e-02 9.91394650e-03 3.54040265e-02 -2.14830145e-01\n", - " 2.24868301e-02 1.52286962e-01 1.85761824e-01 2.33249858e-01\n", - " -1.46878466e-01 -7.60829672e-02 4.50950442e-03 1.15145534e-01\n", - " -9.11297649e-02 6.20169528e-02 -5.24968617e-02 -8.68254527e-02\n", - " -1.77496113e-04 8.58828798e-02 1.19839951e-01 2.51445977e-04\n", - " -3.06774918e-02 2.70280894e-03 -9.14655998e-02 5.54770082e-02\n", - " 
6.70319721e-02 -1.10063367e-01 -9.94274616e-02 -1.62537303e-02\n", - " 1.07709818e-01 -1.17890313e-01 -1.68436840e-02 2.67276943e-01\n", - " 1.66485235e-02 -1.05556019e-01 8.72049183e-02 -2.79379219e-01\n", - " -7.61673898e-02 -1.26047105e-01 -2.10570037e-01 1.06335968e-01\n", - " -1.13933079e-01 8.91806409e-02 2.40348503e-02 1.27991261e-02\n", - " -9.80987865e-03 -8.29416886e-02 -1.05351470e-01 9.63128060e-02\n", - " -1.32907405e-01 -5.90794981e-02 -1.05936542e-01 5.24872467e-02\n", - " -1.62810262e-04 1.90204114e-03 -1.07438803e-01 -1.86693370e-02\n", - " -1.74428806e-01 -2.69948710e-02 -4.38663997e-02 -4.28975448e-02\n", - " 9.05705541e-02 -2.10348725e-01 -1.16732195e-01 3.60293575e-02\n", - " -2.08853818e-02 2.63118356e-01 1.76015347e-01 1.23300500e-01\n", - " -3.50267850e-02 -4.52703685e-02 -1.70624122e-01 -3.28516886e-02\n", - " 5.28835841e-02 8.53991881e-02 -8.47622007e-02 2.25594401e-01\n", - " -1.77075803e-01 -5.37518365e-03 9.42931976e-03 1.78159177e-02\n", - " -7.26433992e-02 -3.52309011e-02 -1.68363556e-01 2.79879309e-02]\n", + "Embedding for apple: [-0.09213913 -0.02462959 -0.11255068 0.11652157 -0.18142793 -0.17555593\n", + " 0.07121698 0.086779 -0.03097944 -0.01890221 -0.04537104 -0.10696206\n", + " 0.02276987 0.08645772 0.09701958 -0.22489007 0.03993007 -0.0748188\n", + " 0.0185363 -0.257262 0.06551826 0.01579769 -0.18179104 -0.22390445\n", + " -0.06907904 -0.08859113 0.00603421 -0.01953833 -0.0306666 -0.20717207\n", + " -0.07466035 -0.10690664 -0.06131361 -0.0747569 -0.03541371 -0.02307771\n", + " -0.04890924 0.09401437 0.14955166 0.03299814 -0.20348735 0.1091179\n", + " -0.05915498 0.07897269 -0.0392515 -0.1337506 0.16920352 0.00084969\n", + " 0.09151786 -0.07067705 -0.00130636 -0.00040609 -0.09070218 -0.05848758\n", + " 0.01417456 0.12759478 0.06773403 -0.03618362 0.05180905 -0.03987553\n", + " 0.15119544 0.1374909 -0.2100861 -0.12180148 -0.01784294 0.09922534\n", + " -0.01852375 0.2757332 -0.07551172 0.06188574 -0.0189024 0.08390908\n", + " 0.06324708 -0.02126443 0.07884526 -0.06014811 -0.1291807 0.03968196\n", + " -0.00395843 -0.05398612 0.25687164 0.06331551 -0.07450255 -0.12246329\n", + " -0.1481028 0.11168568 -0.24994832 -0.05962377 0.04101507 0.06981998\n", + " 0.02528387 0.1725297 0.10974599 0.12216322 -0.16961183 0.0819602\n", + " 0.15518941 0.12973912 0.09754901 -0.0033999 ]\n", "\n", "First 30 vocabulary words: ['a', 'plane', 'is', 'taking', 'off', '.', 'man', 'playing', 'large', 'flute', 'spreading', 'cheese', 'on', 'pizza', 'three', 'men', 'are', 'the', 'some', 'fighting']\n" ] @@ -218,8 +214,8 @@ "print(\"\\nFirst 30 vocabulary words:\", list(word2vec_model.wv.vocab)[:20])\n", "\n", "# 3. Save the word embeddings. 
We can save as binary format (to save space) or ASCII format\n", - "word2vec_model.wv.save_word2vec_format(\"word2vec_model\", binary=True) # binary format\n", - "word2vec_model.wv.save_word2vec_format(\"word2vec_model\", binary=False) # ASCII format" + "word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+\"word2vec_model\", binary=True) # binary format\n", + "word2vec_model.wv.save_word2vec_format(SAVE_FILES_PATH+\"word2vec_model\", binary=False) # ASCII format" ] }, { @@ -276,23 +272,31 @@ "name": "stdout", "output_type": "stream", "text": [ - "Embedding for apple: [-0.19466035 0.02329457 0.11905755 0.43202105 0.29234868 -0.4173747\n", - " -0.42871934 -0.587514 -0.24620762 -0.30886024 -0.04068367 0.20132142\n", - " -0.1593995 -0.34693947 -0.05454068 0.21118519 0.20061074 0.33920124\n", - " 0.13465068 -0.16492505 -0.01792471 0.3517471 -0.42507643 -0.14185262\n", - " 0.6766511 -0.35682997 0.38852996 0.08338872 -0.16927068 0.00101932\n", - " 0.01033709 -0.00513317 -0.15251048 -0.07668231 0.02508747 -0.16725563\n", - " 0.13578647 0.5188022 0.4219404 -0.29186445 -0.35036987 0.04769979\n", - " -0.23967543 -0.03550959 -0.4072291 0.4920213 0.30146047 -0.569966\n", - " 0.12033249 -0.24960376 -0.20398642 -0.37427858 0.04139522 0.28986236\n", - " -0.31172943 0.7363574 -0.43040937 0.24302956 -0.2891899 -0.12707426\n", - " -0.26763597 -0.3471016 0.08912586 -0.20722611 0.1529707 0.39230242\n", - " -0.23503402 -0.00332095 -0.04347242 -0.00989339 0.08801552 -0.36916256\n", - " -0.13720557 0.40390077 -0.21936806 -0.10426865 -0.18858872 0.15547332\n", - " -0.3519439 0.00505178 0.1029634 -0.00991125 0.41537017 -0.10500967\n", - " 0.43521944 0.26955605 -0.23591378 0.14193945 0.08484828 0.57761383\n", - " -0.31014645 0.63834554 -0.15213463 -0.46310434 0.10502262 -0.03921723\n", - " 0.21358919 -0.17636251 0.14675795 0.15879233]\n", + "Embedding for apple: [-2.1927688e-01 2.9813698e-02 6.7616858e-02 3.6836052e-01\n", + " 2.9166859e-01 -4.3027815e-01 -4.3850473e-01 -5.5472869e-01\n", + " -2.4860071e-01 -2.8481758e-01 -8.5550338e-02 2.0373566e-01\n", + " -8.8941768e-02 -3.5824496e-01 -7.3820040e-02 1.9162497e-01\n", + " 1.9164029e-01 3.2222369e-01 1.7169371e-01 -1.8063694e-01\n", + " -2.5478544e-02 3.8527763e-01 -4.4661409e-01 -1.9077049e-01\n", + " 6.3831955e-01 -3.4981030e-01 3.6546609e-01 7.3591776e-02\n", + " -1.7809562e-01 -3.0694399e-02 -6.5486156e-04 2.8458415e-02\n", + " -1.4853548e-01 -1.1247496e-01 2.6613681e-02 -1.5886196e-01\n", + " 1.0738261e-01 5.2269661e-01 4.1452998e-01 -2.4978566e-01\n", + " -3.6866227e-01 4.5613028e-02 -2.5554851e-01 -2.9870963e-02\n", + " -3.4256181e-01 4.1204464e-01 3.3703518e-01 -5.3163689e-01\n", + " 2.7413066e-02 -3.2481736e-01 -2.1018679e-01 -3.5171476e-01\n", + " 5.6522321e-02 3.2140371e-01 -3.0404109e-01 7.3594677e-01\n", + " -4.7126335e-01 2.5894231e-01 -2.6430738e-01 -1.1617108e-01\n", + " -2.7015641e-01 -3.2107431e-01 8.0991395e-02 -1.8977067e-01\n", + " 1.6966967e-01 3.6855596e-01 -2.0167376e-01 -1.6917199e-02\n", + " -4.0029153e-02 8.3818562e-02 8.8887364e-02 -3.4052727e-01\n", + " -1.5159512e-01 4.2969501e-01 -1.8632193e-01 -4.8835874e-02\n", + " -1.9202119e-01 1.5949497e-01 -3.4046504e-01 4.6990579e-03\n", + " 9.2628546e-02 1.6060786e-02 3.8600260e-01 -8.4986687e-02\n", + " 4.4739038e-01 2.1059968e-01 -1.9877617e-01 1.8113001e-01\n", + " 9.4012588e-02 5.5849826e-01 -3.2842401e-01 6.3832772e-01\n", + " -1.1614193e-01 -4.4778910e-01 1.4173931e-01 -2.4079295e-02\n", + " 1.8156306e-01 -1.9836307e-01 1.4190227e-01 1.5471222e-01]\n", "\n", "First 30 vocabulary 
words: ['a', 'plane', 'is', 'taking', 'off', '.', 'man', 'playing', 'large', 'flute', 'spreading', 'cheese', 'on', 'pizza', 'three', 'men', 'are', 'the', 'some', 'fighting']\n" ] @@ -306,8 +310,8 @@ "print(\"\\nFirst 30 vocabulary words:\", list(fastText_model.wv.vocab)[:20])\n", "\n", "# 3. Save the word embeddings. We can save as binary format (to save space) or ASCII format\n", - "fastText_model.wv.save_word2vec_format(\"fastText_model\", binary=True) # binary format\n", - "fastText_model.wv.save_word2vec_format(\"fastText_model\", binary=False) # ASCII format" + "fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+\"fastText_model\", binary=True) # binary format\n", + "fastText_model.wv.save_word2vec_format(SAVE_FILES_PATH+\"fastText_model\", binary=False) # ASCII format" ] }, { @@ -359,7 +363,7 @@ "outputs": [], "source": [ "#save our corpus as tokens delimited by spaces with new line characters in between sentences\n", - "with open('sentences.txt', 'w', encoding='utf8') as file:\n", + "with open(BASE_DATA_PATH+'/clean/stsbenchmark/training-corpus-cleaned.txt', 'w', encoding='utf8') as file:\n", " for sent in sentences:\n", " file.write(\" \".join(sent) + \"\\n\")" ] @@ -375,7 +379,7 @@ "2. max-vocab: upper bound on the number of vocabulary words to keep\n", "3. verbose: 0, 1, or 2 (default)\n", "\n", - "Then provide the path to the text file we created in Step 0 (<\"sentences.txt\">) followed by a file path that we'll save the vocabulary to (\"glove/build/vocab.txt\")" + "Then provide the path to the text file we created in Step 0 followed by a file path that we'll save the vocabulary to " ] }, { @@ -397,7 +401,7 @@ } ], "source": [ - "!\"glove/build/vocab_count\" -min-count 5 -verbose 2 <\"sentences.txt\"> \"glove/build/vocab.txt\"" + "!\"glove/build/vocab_count\" -min-count 5 -verbose 2 <\"../../../data/clean/stsbenchmark/training-corpus-cleaned.txt\"> \"../../../data/trained_word_embeddings/vocab.txt\"" ] }, { @@ -414,7 +418,7 @@ "5. memory: soft limit for memory consumption, default 4\n", "6. max-product: limit the size of dense co-occurrence array by specifying the max product (integer) of the frequency counts of the two co-occurring words\n", "\n", - "Then provide the path to the text file we created in Step 0 (<\"sentences.txt\">) followed by a file path that we'll save the co-occurrences to (\"glove/build/cooccurrence.bin\")" + "Then provide the path to the text file we created in Step 0 followed by a file path that we'll save the co-occurrences to" ] }, { @@ -431,7 +435,7 @@ "context: symmetric\n", "max product: 13752509\n", "overflow length: 38028356\n", - "Reading vocab from file \"glove/build/vocab.txt\"...loaded 3166 words.\n", + "Reading vocab from file \"../../../data/trained_word_embeddings/vocab.txt\"...loaded 3166 words.\n", "Building lookup table...table contains 10023557 elements.\n", "Processing token: 0100000Processed 129989 tokens.\n", "Writing cooccurrences to disk.......2 files in total.\n", @@ -441,7 +445,7 @@ } ], "source": [ - "!\"glove/build/cooccur\" -memory 4 -vocab-file \"glove/build/vocab.txt\" -verbose 2 -window-size 15 <\"sentences.txt\"> \"glove/build/cooccurrence.bin\"" + "!\"glove/build/cooccur\" -memory 4 -vocab-file \"../../../data/trained_word_embeddings/vocab.txt\" -verbose 2 -window-size 15 <\"../../../data/clean/stsbenchmark/training-corpus-cleaned.txt\"> \"../../../data/trained_word_embeddings/cooccurrence.bin\"" ] }, { @@ -455,7 +459,7 @@ "2. memory: soft limit for memory consumption, default 4\n", "3. 
array-size: limit to the length of the buffer which stores chunks of data to shuffle before writing to disk\n", "\n", - "Then provide the path to the co-occurrence file we created in Step 2 (<\"glove/build/cooccurrence.bin\">) followed by a file path that we'll save the shuffled co-occurrences to (\"glove/build/cooccurrence.shuf.bin\")" + "Then provide the path to the co-occurrence file we created in Step 2 followed by a file path that we'll save the shuffled co-occurrences to" ] }, { @@ -477,7 +481,7 @@ } ], "source": [ - "!\"glove/build/shuffle\" -memory 4 -verbose 2 <\"glove/build/cooccurrence.bin\"> \"glove/build/cooccurrence.shuf.bin\"" + "!\"glove/build/shuffle\" -memory 4 -verbose 2 <\"../../../data/trained_word_embeddings/cooccurrence.bin\"> \"../../../data/trained_word_embeddings/cooccurrence.shuf.bin\"" ] }, { @@ -515,28 +519,28 @@ "vocab size: 3166\n", "x_max: 10.000000\n", "alpha: 0.750000\n", - "04/29/19 - 01:26.33PM, iter: 001, cost: 0.098453\n", - "04/29/19 - 01:26.33PM, iter: 002, cost: 0.084751\n", - "04/29/19 - 01:26.33PM, iter: 003, cost: 0.074604\n", - "04/29/19 - 01:26.33PM, iter: 004, cost: 0.071038\n", - "04/29/19 - 01:26.33PM, iter: 005, cost: 0.067709\n", - "04/29/19 - 01:26.33PM, iter: 006, cost: 0.064181\n", - "04/29/19 - 01:26.33PM, iter: 007, cost: 0.059996\n", - "04/29/19 - 01:26.33PM, iter: 008, cost: 0.055268\n", - "04/29/19 - 01:26.33PM, iter: 009, cost: 0.050708\n", - "04/29/19 - 01:26.33PM, iter: 010, cost: 0.046754\n", - "04/29/19 - 01:26.33PM, iter: 011, cost: 0.043402\n", - "04/29/19 - 01:26.33PM, iter: 012, cost: 0.040575\n", - "04/29/19 - 01:26.33PM, iter: 013, cost: 0.038056\n", - "04/29/19 - 01:26.33PM, iter: 014, cost: 0.035843\n", - "04/29/19 - 01:26.33PM, iter: 015, cost: 0.033807\n" + "04/30/19 - 10:33.02AM, iter: 001, cost: 0.098433\n", + "04/30/19 - 10:33.02AM, iter: 002, cost: 0.084675\n", + "04/30/19 - 10:33.02AM, iter: 003, cost: 0.074585\n", + "04/30/19 - 10:33.02AM, iter: 004, cost: 0.071048\n", + "04/30/19 - 10:33.02AM, iter: 005, cost: 0.067768\n", + "04/30/19 - 10:33.02AM, iter: 006, cost: 0.064212\n", + "04/30/19 - 10:33.02AM, iter: 007, cost: 0.060040\n", + "04/30/19 - 10:33.02AM, iter: 008, cost: 0.055310\n", + "04/30/19 - 10:33.02AM, iter: 009, cost: 0.050727\n", + "04/30/19 - 10:33.02AM, iter: 010, cost: 0.046803\n", + "04/30/19 - 10:33.02AM, iter: 011, cost: 0.043456\n", + "04/30/19 - 10:33.02AM, iter: 012, cost: 0.040570\n", + "04/30/19 - 10:33.02AM, iter: 013, cost: 0.038074\n", + "04/30/19 - 10:33.02AM, iter: 014, cost: 0.035818\n", + "04/30/19 - 10:33.02AM, iter: 015, cost: 0.033807\n" ] } ], "source": [ - "!\"glove/build/glove\" -save-file \"glove/build/GloVe_vectors\" -threads 8 -input-file \\\n", - "\"glove/build/cooccurrence.shuf.bin\" -x-max 10 -iter 15 -vector-size 50 -binary 2 \\\n", - "-vocab-file \"glove/build/vocab.txt\" -verbose 2" + "!\"glove/build/glove\" -save-file \"../../../data/trained_word_embeddings/GloVe_vectors\" -threads 8 -input-file \\\n", + "\"../../../data/trained_word_embeddings/cooccurrence.shuf.bin\" -x-max 10 -iter 15 -vector-size 50 -binary 2 \\\n", + "-vocab-file \"../../../data/trained_word_embeddings/vocab.txt\" -verbose 2" ] }, { @@ -561,7 +565,7 @@ "source": [ "#load in the saved word vectors\n", "glove_wv = {}\n", - "with open(\"glove/build/GloVe_vectors.txt\", encoding='utf-8') as f:\n", + "with open(\"../../../data/trained_word_embeddings/GloVe_vectors.txt\", encoding='utf-8') as f:\n", " for line in f:\n", " split_line = line.split(\" \")\n", " glove_wv[split_line[0]] = [float(i) for i 
in split_line[1:]]" @@ -576,7 +580,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Embedding for apple: [-0.015303, -0.0512, -0.011988, 0.429914, 0.246523, 0.009762, 0.153154, -0.178636, 0.061723, 0.108515, -0.166807, -0.033258, -0.046394, 0.081953, -0.209458, 0.194758, 0.179153, 0.23262, -0.118717, -0.053151, -0.018892, -0.037714, -0.067396, 0.057499, 0.179459, 0.004552, -0.203058, 0.243629, -0.294976, 0.123971, 0.368613, 0.190665, -0.16738, -0.0599, 0.119195, -0.030108, -0.254778, -0.007862, -0.036998, 0.060919, -0.210459, 0.293917, 0.045603, -0.01104, 0.075651, -0.120635, -0.133497, -0.372606, -0.152981, 0.009014]\n", + "Embedding for apple: [0.007199, -0.055337, -0.048813, 0.463647, 0.233898, -0.020051, 0.18876, -0.19439, 0.014477, 0.122465, -0.145506, -0.056616, -0.076315, 0.051205, -0.197457, 0.197818, 0.191692, 0.259758, -0.088431, -0.101713, -0.024687, -0.083431, -0.056415, 0.08024, 0.150831, 0.030778, -0.176252, 0.291561, -0.298596, 0.111546, 0.385694, 0.184508, -0.133928, 0.007924, 0.088849, 0.016869, -0.195535, 0.002015, -0.053591, 0.043867, -0.195157, 0.270429, -0.003891, -0.033436, 0.077898, -0.083324, -0.135095, -0.419319, -0.140611, 0.000322]\n", "\n", "First 30 vocabulary words: ['.', 'a', 'the', 'in', ',', 'is', 'to', 'of', 'and', 'on', 'man', '-', \"'s\", 'with', 'for', 'at', 'woman', 'are', 'that', 'two']\n" ]