CodeSearchNet/notebooks/ExploreData.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Exploration\n",
    "\n",
    "This notebook explores the pre-processed data, and shows some basic statistics that may be useful.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "pd.set_option('max_colwidth',300)\n",
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Preview The Dataset\n",
    "    \n",
    "Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data.  While the full dataset can be automatically downloaded with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3.  \n",
    "\n",
    "The s3 links follow this pattern:\n",
    "\n",
    "> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip\n",
    "\n",
    "For example, the link for the `python` is:\n",
    "\n",
    "> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
    "\n",
    "First we download and decompress this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2019-06-14 01:05:08--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
      "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.77\n",
      "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.77|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 218813834 (209M) [application/zip]\n",
      "Saving to: ‘python.zip’\n",
      "\n",
      "python.zip          100%[===================>] 208.68M  63.9MB/s    in 3.3s    \n",
      "\n",
      "2019-06-14 01:05:11 (63.9 MB/s) - ‘python.zip’ saved [218813834/218813834]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Archive:  python.zip\n",
      "   creating: python/\n",
      "   creating: python/final/\n",
      "   creating: python/final/jsonl/\n",
      "   creating: python/final/jsonl/valid/\n",
      "  inflating: python/final/jsonl/valid/python_valid_0.jsonl.gz  \n",
      "   creating: python/final/jsonl/test/\n",
      "  inflating: python/final/jsonl/test/python_test_0.jsonl.gz  \n",
      "   creating: python/final/jsonl/train/\n",
      "  inflating: python/final/jsonl/train/python_train_7.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_13.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_1.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_4.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_5.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_8.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_11.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_10.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_3.jsonl.gz  \n",
      "  inflating: python/final/jsonl/train/python_train_2.jsonl.gz  \n"
     ]
    }
   ],
   "source": [
    "!unzip python.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# decompress this gzip file\n",
    "!gzip -d python/final/jsonl/test/python_test_0.jsonl.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'{\"repo\": \"soimort/you-get\", \"path\": \"src/you_get/extractors/youtube.py\", \"func_name\": \"YouTube.get_vid_from_url\", \"original_string\": \"def get_vid_from_url(url):\\\\n        \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n        \\\\\"\\\\\"\\\\\"\\\\n        return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n          parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n          parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"language\": \"python\", \"code\": \"def get_vid_from_url(url):\\\\n        \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n        \\\\\"\\\\\"\\\\\"\\\\n        return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n          match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n          parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n          parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"code_tokens\": [\"def\", \"get_vid_from_url\", \"(\", \"url\", \")\", \":\", \"return\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtu\\\\\\\\.be/([^?/]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/v/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'v\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'u\\'\", \")\", \",\", \"\\'v\\'\", \")\"], \"docstring\": \"Extracts video ID from URL.\", \"docstring_tokens\": [\"Extracts\", \"video\", \"ID\", \"from\", \"URL\", \".\"], \"sha\": \"b746ac01c9f39de94cac2d56f665285b0523b974\", \"url\": \"https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143\", \"partition\": \"test\"}\\n'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:\n",
    "    sample_file = f.readlines()\n",
    "sample_file[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can utilize the fact that each line in the file is valid json, and display the first row in a more human readable form:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'code': 'def get_vid_from_url(url):\\n'\n",
      "         '        \"\"\"Extracts video ID from URL.\\n'\n",
      "         '        \"\"\"\\n'\n",
      "         \"        return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
      "         \"          match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \\\\\\n\"\n",
      "         \"          match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
      "         \"          match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \\\\\\n\"\n",
      "         \"          parse_query_param(url, 'v') or \\\\\\n\"\n",
      "         \"          parse_query_param(parse_query_param(url, 'u'), 'v')\",\n",
      " 'code_tokens': ['def',\n",
      "                 'get_vid_from_url',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ')',\n",
      "                 ':',\n",
      "                 'return',\n",
      "                 'match1',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"r'youtu\\\\.be/([^?/]+)'\",\n",
      "                 ')',\n",
      "                 'or',\n",
      "                 'match1',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"r'youtube\\\\.com/embed/([^/?]+)'\",\n",
      "                 ')',\n",
      "                 'or',\n",
      "                 'match1',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"r'youtube\\\\.com/v/([^/?]+)'\",\n",
      "                 ')',\n",
      "                 'or',\n",
      "                 'match1',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"r'youtube\\\\.com/watch/([^/?]+)'\",\n",
      "                 ')',\n",
      "                 'or',\n",
      "                 'parse_query_param',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"'v'\",\n",
      "                 ')',\n",
      "                 'or',\n",
      "                 'parse_query_param',\n",
      "                 '(',\n",
      "                 'parse_query_param',\n",
      "                 '(',\n",
      "                 'url',\n",
      "                 ',',\n",
      "                 \"'u'\",\n",
      "                 ')',\n",
      "                 ',',\n",
      "                 \"'v'\",\n",
      "                 ')'],\n",
      " 'docstring': 'Extracts video ID from URL.',\n",
      " 'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],\n",
      " 'func_name': 'YouTube.get_vid_from_url',\n",
      " 'language': 'python',\n",
      " 'original_string': 'def get_vid_from_url(url):\\n'\n",
      "                    '        \"\"\"Extracts video ID from URL.\\n'\n",
      "                    '        \"\"\"\\n'\n",
      "                    \"        return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
      "                    \"          match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \"\n",
      "                    '\\\\\\n'\n",
      "                    \"          match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
      "                    \"          match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \"\n",
      "                    '\\\\\\n'\n",
      "                    \"          parse_query_param(url, 'v') or \\\\\\n\"\n",
      "                    \"          parse_query_param(parse_query_param(url, 'u'), \"\n",
      "                    \"'v')\",\n",
      " 'partition': 'test',\n",
      " 'path': 'src/you_get/extractors/youtube.py',\n",
      " 'repo': 'soimort/you-get',\n",
      " 'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',\n",
      " 'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'}\n"
     ]
    }
   ],
   "source": [
    "pprint(json.loads(sample_file[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Definitions of each of the above fields are located in the  in the README.md file in the root of this repository."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Exploring The Full Dataset\n",
    "\n",
    "You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training data is located in `/resources/data`, which contains approximately 3.2 Million code, comment pairs across the train, validation, and test partitions.  You can learn more about the directory structure and associated files by viewing `/resources/README.md`.\n",
    "\n",
    "The preprocessed data re stored in [json lines](http://jsonlines.org/) format.  First, we can get a list of all these files for further inspection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))\n",
    "java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))\n",
    "go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))\n",
    "php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))\n",
    "javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))\n",
    "ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))\n",
    "all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of files: 78\n"
     ]
    }
   ],
   "source": [
    "print(f'Total number of files: {len(all_files):,}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns_long_list = ['repo', 'path', 'url', 'code', \n",
    "                     'code_tokens', 'docstring', 'docstring_tokens', \n",
    "                     'language', 'partition']\n",
    "\n",
    "columns_short_list = ['code_tokens', 'docstring_tokens', \n",
    "                      'language', 'partition']\n",
    "\n",
    "def jsonl_list_to_dataframe(file_list, columns=columns_long_list):\n",
    "    \"\"\"Load a list of jsonl.gz files into a pandas DataFrame.\"\"\"\n",
    "    return pd.concat([pd.read_json(f, \n",
    "                                   orient='records', \n",
    "                                   compression='gzip',\n",
    "                                   lines=True)[columns] \n",
    "                      for f in file_list], sort=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the python dataset looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "pydf = jsonl_list_to_dataframe(python_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>repo</th>\n",
       "      <th>path</th>\n",
       "      <th>url</th>\n",
       "      <th>code</th>\n",
       "      <th>code_tokens</th>\n",
       "      <th>docstring</th>\n",
       "      <th>docstring_tokens</th>\n",
       "      <th>language</th>\n",
       "      <th>partition</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>soimort/you-get</td>\n",
       "      <td>src/you_get/extractors/youtube.py</td>\n",
       "      <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143</td>\n",
       "      <td>def get_vid_from_url(url):\\n        \"\"\"Extracts video ID from URL.\\n        \"\"\"\\n        return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n          match1(url, r'youtube\\.com/watch/...</td>\n",
       "      <td>[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...</td>\n",
       "      <td>Extracts video ID from URL.</td>\n",
       "      <td>[Extracts, video, ID, from, URL, .]</td>\n",
       "      <td>python</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>soimort/you-get</td>\n",
       "      <td>src/you_get/extractors/miomio.py</td>\n",
       "      <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51</td>\n",
       "      <td>def sina_xml_to_url_list(xml_data):\\n    \"\"\"str-&gt;list\\n    Convert XML to URL List.\\n    From Biligrab.\\n    \"\"\"\\n    rawurl = []\\n    dom = parseString(xml_data)\\n    for node in dom.getElementsByTagName('durl'):\\n        url = node.getElementsByTagName('url')[0]\\n        rawurl.append(url.chil...</td>\n",
       "      <td>[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...</td>\n",
       "      <td>str-&gt;list\\n    Convert XML to URL List.\\n    From Biligrab.</td>\n",
       "      <td>[str, -, &gt;, list, Convert, XML, to, URL, List, ., From, Biligrab, .]</td>\n",
       "      <td>python</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>soimort/you-get</td>\n",
       "      <td>src/you_get/extractors/fc2video.py</td>\n",
       "      <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17</td>\n",
       "      <td>def makeMimi(upid):\\n    \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n    Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n    L110\"\"\"\\n    strSeed = \"gGddgPfeaf_gzyr\"\\n    prehash = upid + \"_\" + strSeed\\n    return md5(prehash.encode('utf-8')).hexdigest()</td>\n",
       "      <td>[def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]</td>\n",
       "      <td>From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n    Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n    L110</td>\n",
       "      <td>[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]</td>\n",
       "      <td>python</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              repo                                path  \\\n",
       "0  soimort/you-get   src/you_get/extractors/youtube.py   \n",
       "1  soimort/you-get    src/you_get/extractors/miomio.py   \n",
       "2  soimort/you-get  src/you_get/extractors/fc2video.py   \n",
       "\n",
       "                                                                                                                            url  \\\n",
       "0  https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143   \n",
       "1     https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51   \n",
       "2   https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                          code  \\\n",
       "0  def get_vid_from_url(url):\\n        \"\"\"Extracts video ID from URL.\\n        \"\"\"\\n        return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n          match1(url, r'youtube\\.com/watch/...   \n",
       "1  def sina_xml_to_url_list(xml_data):\\n    \"\"\"str->list\\n    Convert XML to URL List.\\n    From Biligrab.\\n    \"\"\"\\n    rawurl = []\\n    dom = parseString(xml_data)\\n    for node in dom.getElementsByTagName('durl'):\\n        url = node.getElementsByTagName('url')[0]\\n        rawurl.append(url.chil...   \n",
       "2            def makeMimi(upid):\\n    \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n    Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n    L110\"\"\"\\n    strSeed = \"gGddgPfeaf_gzyr\"\\n    prehash = upid + \"_\" + strSeed\\n    return md5(prehash.encode('utf-8')).hexdigest()   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                   code_tokens  \\\n",
       "0  [def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...   \n",
       "1  [def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...   \n",
       "2                                                                                                                                [def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]   \n",
       "\n",
       "                                                                                                                                  docstring  \\\n",
       "0                                                                                                               Extracts video ID from URL.   \n",
       "1                                                                               str->list\\n    Convert XML to URL List.\\n    From Biligrab.   \n",
       "2  From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n    Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n    L110   \n",
       "\n",
       "                                                                                                                                                                                 docstring_tokens  \\\n",
       "0                                                                                                                                                             [Extracts, video, ID, from, URL, .]   \n",
       "1                                                                                                                            [str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .]   \n",
       "2  [From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]   \n",
       "\n",
       "  language partition  \n",
       "0   python      test  \n",
       "1   python      test  \n",
       "2   python      test  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pydf.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example trying to retrieve a codesnippet using the docstring.).  You can find more information regarding the definition of the above columns in the README of this repo. \n",
    "\n",
    "Next, we will read in all of the data for a limited subset of these columns into memory so we can compute summary statistics.  **Warning:** This step takes ~ 20 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_df = jsonl_list_to_dataframe(all_files, columns_short_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Row Counts\n",
    "\n",
    "By Partition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "train    1880853\n",
       "test      100529\n",
       "valid      89154\n",
       "Name: partition, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.partition.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "php           578118\n",
       "java          496688\n",
       "python        457461\n",
       "go            346365\n",
       "javascript    138625\n",
       "ruby           53279\n",
       "Name: language, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.language.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By Partition & Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "partition  language  \n",
       "test       go             14291\n",
       "           java           26909\n",
       "           javascript      6483\n",
       "           php            28391\n",
       "           python         22176\n",
       "           ruby            2279\n",
       "train      go            317832\n",
       "           java          454451\n",
       "           javascript    123889\n",
       "           php           523712\n",
       "           python        412178\n",
       "           ruby           48791\n",
       "valid      go             14242\n",
       "           java           15328\n",
       "           javascript      8253\n",
       "           php            26015\n",
       "           python         23107\n",
       "           ruby            2209\n",
       "Name: code_tokens, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_df.groupby(['partition', 'language'])['code_tokens'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Token Lengths By Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_df['code_len'] = all_df.code_tokens.apply(lambda x: len(x))\n",
    "all_df['query_len'] = all_df.docstring_tokens.apply(lambda x: len(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Code Length Percentile By Language\n",
    "\n",
    "For example, the 80th percentile length for python tokens is 72"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>code_len</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">go</th>\n",
       "      <th>0.50</th>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>138.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>217.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>319.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">java</th>\n",
       "      <th>0.50</th>\n",
       "      <td>66.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>104.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>142.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>224.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>331.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
       "      <th>0.50</th>\n",
       "      <td>91.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>144.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>194.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>301.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>448.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">php</th>\n",
       "      <th>0.50</th>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>123.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>162.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>243.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>347.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">python</th>\n",
       "      <th>0.50</th>\n",
       "      <td>72.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>114.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>155.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>237.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>341.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
       "      <th>0.50</th>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>68.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>88.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>125.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>174.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 code_len\n",
       "language                 \n",
       "go         0.50      61.0\n",
       "           0.70     100.0\n",
       "           0.80     138.0\n",
       "           0.90     217.0\n",
       "           0.95     319.0\n",
       "java       0.50      66.0\n",
       "           0.70     104.0\n",
       "           0.80     142.0\n",
       "           0.90     224.0\n",
       "           0.95     331.0\n",
       "javascript 0.50      91.0\n",
       "           0.70     144.0\n",
       "           0.80     194.0\n",
       "           0.90     301.0\n",
       "           0.95     448.0\n",
       "php        0.50      81.0\n",
       "           0.70     123.0\n",
       "           0.80     162.0\n",
       "           0.90     243.0\n",
       "           0.95     347.0\n",
       "python     0.50      72.0\n",
       "           0.70     114.0\n",
       "           0.80     155.0\n",
       "           0.90     237.0\n",
       "           0.95     341.0\n",
       "ruby       0.50      48.0\n",
       "           0.70      68.6\n",
       "           0.80      88.0\n",
       "           0.90     125.0\n",
       "           0.95     174.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(code_len_summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Query Length Percentile By Language\n",
    "\n",
    "For example, the 80th percentile length for python tokens is 19"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>query_len</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>language</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">go</th>\n",
       "      <th>0.50</th>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>19.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>28.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>49.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>92.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">java</th>\n",
       "      <th>0.50</th>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>18.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>39.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
       "      <th>0.50</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>47.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">php</th>\n",
       "      <th>0.50</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>12.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">python</th>\n",
       "      <th>0.50</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
       "      <th>0.50</th>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>36.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>49.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 query_len\n",
       "language                  \n",
       "go         0.50       12.0\n",
       "           0.70       19.0\n",
       "           0.80       28.0\n",
       "           0.90       49.0\n",
       "           0.95       92.0\n",
       "java       0.50       11.0\n",
       "           0.70       18.0\n",
       "           0.80       25.0\n",
       "           0.90       39.0\n",
       "           0.95       61.0\n",
       "javascript 0.50       10.0\n",
       "           0.70       15.0\n",
       "           0.80       21.0\n",
       "           0.90       33.0\n",
       "           0.95       47.0\n",
       "php        0.50        7.0\n",
       "           0.70       10.0\n",
       "           0.80       12.0\n",
       "           0.90       17.0\n",
       "           0.95       24.0\n",
       "python     0.50       10.0\n",
       "           0.70       15.0\n",
       "           0.80       20.0\n",
       "           0.90       33.0\n",
       "           0.95       48.0\n",
       "ruby       0.50       11.0\n",
       "           0.70       17.0\n",
       "           0.80       24.0\n",
       "           0.90       36.0\n",
       "           0.95       49.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(query_len_summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Query Length All Languages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>query_len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0.50</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.70</th>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.80</th>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.90</th>\n",
       "      <td>32.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0.95</th>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      query_len\n",
       "0.50       10.0\n",
       "0.70       15.0\n",
       "0.80       20.0\n",
       "0.90       32.0\n",
       "0.95       50.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])\n",
    "display(pd.DataFrame(query_len_summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}