{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Exploration\n",
"\n",
"This notebook explores the preprocessed data and shows some basic statistics that may be useful."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
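"import json\n",
"from pathlib import Path\n",
"from pprint import pprint\n",
"\n",
"import pandas as pd\n",
"from IPython.display import display\n",
"\n",
"# Show up to 300 characters per column when rendering DataFrames\n",
"pd.set_option('display.max_colwidth', 300)"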
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Preview The Dataset\n",
" \n",
"Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data. While the full dataset can be automatically downloaded with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3. \n",
"\n",
"The S3 links follow this pattern:\n",
"\n",
"> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip\n",
"\n",
"For example, the link for `python` is:\n",
"\n",
"> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
"\n",
"First we download and decompress this dataset:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2019-06-14 01:05:08-- https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.77\n",
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.77|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 218813834 (209M) [application/zip]\n",
"Saving to: python.zip\n",
"\n",
"python.zip 100%[===================>] 208.68M 63.9MB/s in 3.3s \n",
"\n",
"2019-06-14 01:05:11 (63.9 MB/s) - python.zip saved [218813834/218813834]\n",
"\n"
]
}
],
"source": [
"!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: python.zip\n",
" creating: python/\n",
" creating: python/final/\n",
" creating: python/final/jsonl/\n",
" creating: python/final/jsonl/valid/\n",
" inflating: python/final/jsonl/valid/python_valid_0.jsonl.gz \n",
" creating: python/final/jsonl/test/\n",
" inflating: python/final/jsonl/test/python_test_0.jsonl.gz \n",
" creating: python/final/jsonl/train/\n",
" inflating: python/final/jsonl/train/python_train_7.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_6.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_12.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_13.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_0.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_1.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_4.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_5.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_9.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_8.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_11.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_10.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_3.jsonl.gz \n",
" inflating: python/final/jsonl/train/python_train_2.jsonl.gz \n"
]
}
],
"source": [
"!unzip python.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# decompress this gzip file\n",
"!gzip -d python/final/jsonl/test/python_test_0.jsonl.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in the file and display the first row. The data is stored in [JSON Lines](http://jsonlines.org/) format."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'{\"repo\": \"soimort/you-get\", \"path\": \"src/you_get/extractors/youtube.py\", \"func_name\": \"YouTube.get_vid_from_url\", \"original_string\": \"def get_vid_from_url(url):\\\\n \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n \\\\\"\\\\\"\\\\\"\\\\n return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"language\": \"python\", \"code\": \"def get_vid_from_url(url):\\\\n \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n \\\\\"\\\\\"\\\\\"\\\\n return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"code_tokens\": [\"def\", \"get_vid_from_url\", \"(\", \"url\", \")\", \":\", \"return\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtu\\\\\\\\.be/([^?/]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/v/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'v\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'u\\'\", \")\", \",\", \"\\'v\\'\", \")\"], \"docstring\": \"Extracts video ID from URL.\", \"docstring_tokens\": [\"Extracts\", \"video\", \"ID\", \"from\", \"URL\", \".\"], \"sha\": \"b746ac01c9f39de94cac2d56f665285b0523b974\", \"url\": \"https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143\", \"partition\": \"test\"}\\n'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:\n",
" sample_file = f.readlines()\n",
"sample_file[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line in the file is valid JSON, so we can parse the first row and display it in a more human-readable form:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'code': 'def get_vid_from_url(url):\\n'\n",
" ' \"\"\"Extracts video ID from URL.\\n'\n",
" ' \"\"\"\\n'\n",
" \" return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
" \" match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \\\\\\n\"\n",
" \" match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
" \" match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \\\\\\n\"\n",
" \" parse_query_param(url, 'v') or \\\\\\n\"\n",
" \" parse_query_param(parse_query_param(url, 'u'), 'v')\",\n",
" 'code_tokens': ['def',\n",
" 'get_vid_from_url',\n",
" '(',\n",
" 'url',\n",
" ')',\n",
" ':',\n",
" 'return',\n",
" 'match1',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"r'youtu\\\\.be/([^?/]+)'\",\n",
" ')',\n",
" 'or',\n",
" 'match1',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"r'youtube\\\\.com/embed/([^/?]+)'\",\n",
" ')',\n",
" 'or',\n",
" 'match1',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"r'youtube\\\\.com/v/([^/?]+)'\",\n",
" ')',\n",
" 'or',\n",
" 'match1',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"r'youtube\\\\.com/watch/([^/?]+)'\",\n",
" ')',\n",
" 'or',\n",
" 'parse_query_param',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"'v'\",\n",
" ')',\n",
" 'or',\n",
" 'parse_query_param',\n",
" '(',\n",
" 'parse_query_param',\n",
" '(',\n",
" 'url',\n",
" ',',\n",
" \"'u'\",\n",
" ')',\n",
" ',',\n",
" \"'v'\",\n",
" ')'],\n",
" 'docstring': 'Extracts video ID from URL.',\n",
" 'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],\n",
" 'func_name': 'YouTube.get_vid_from_url',\n",
" 'language': 'python',\n",
" 'original_string': 'def get_vid_from_url(url):\\n'\n",
" ' \"\"\"Extracts video ID from URL.\\n'\n",
" ' \"\"\"\\n'\n",
" \" return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
" \" match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \"\n",
" '\\\\\\n'\n",
" \" match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
" \" match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \"\n",
" '\\\\\\n'\n",
" \" parse_query_param(url, 'v') or \\\\\\n\"\n",
" \" parse_query_param(parse_query_param(url, 'u'), \"\n",
" \"'v')\",\n",
" 'partition': 'test',\n",
" 'path': 'src/you_get/extractors/youtube.py',\n",
" 'repo': 'soimort/you-get',\n",
" 'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',\n",
" 'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'}\n"
]
}
],
"source": [
"pprint(json.loads(sample_file[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Definitions of each of the above fields are located in the README.md file in the root of this repository."
]
},
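{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, decompressing the file first is not strictly required. The cell below is a minimal sketch, assuming only the standard-library `gzip` and `json` modules, that reads records directly from a `.jsonl.gz` file. The helper name `iter_jsonl_gz` and its `limit` parameter are hypothetical and not part of this repository. It is applied to the validation file, which the steps above left compressed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"from itertools import islice\n",
"\n",
"def iter_jsonl_gz(path, limit=None):\n",
"    \"\"\"Yield one dict per line of a gzipped JSON Lines file (hypothetical helper).\"\"\"\n",
"    # 'rt' opens the gzip stream in text mode so each line is a str\n",
"    with gzip.open(path, 'rt', encoding='utf-8') as f:\n",
"        # islice with limit=None iterates over every line\n",
"        for line in islice(f, limit):\n",
"            yield json.loads(line)\n",
"\n",
"# Peek at the field names of the first record without decompressing the file on disk\n",
"first_record = next(iter_jsonl_gz('python/final/jsonl/valid/python_valid_0.jsonl.gz', limit=1))\n",
"sorted(first_record.keys())"
]
},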
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Exploring The Full Dataset\n",
"\n",
"You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training data is located in `/resources/data`, which contains approximately 2 million (code, comment) pairs across the train, validation, and test partitions (2,070,536 rows in the counts computed below). You can learn more about the directory structure and associated files by viewing `/resources/README.md`.\n",
"\n",
"The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format. First, we can get a list of all these files for further inspection:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))\n",
"java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))\n",
"go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))\n",
"php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))\n",
"javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))\n",
"ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))\n",
"all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of files: 78\n"
]
}
],
"source": [
"print(f'Total number of files: {len(all_files):,}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"columns_long_list = ['repo', 'path', 'url', 'code', \n",
" 'code_tokens', 'docstring', 'docstring_tokens', \n",
" 'language', 'partition']\n",
"\n",
"columns_short_list = ['code_tokens', 'docstring_tokens', \n",
" 'language', 'partition']\n",
"\n",
"def jsonl_list_to_dataframe(file_list, columns=columns_long_list):\n",
" \"\"\"Load a list of jsonl.gz files into a pandas DataFrame.\"\"\"\n",
" return pd.concat([pd.read_json(f, \n",
" orient='records', \n",
" compression='gzip',\n",
" lines=True)[columns] \n",
" for f in file_list], sort=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what the Python dataset looks like:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"pydf = jsonl_list_to_dataframe(python_files)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>repo</th>\n",
" <th>path</th>\n",
" <th>url</th>\n",
" <th>code</th>\n",
" <th>code_tokens</th>\n",
" <th>docstring</th>\n",
" <th>docstring_tokens</th>\n",
" <th>language</th>\n",
" <th>partition</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>soimort/you-get</td>\n",
" <td>src/you_get/extractors/youtube.py</td>\n",
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143</td>\n",
" <td>def get_vid_from_url(url):\\n \"\"\"Extracts video ID from URL.\\n \"\"\"\\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/watch/...</td>\n",
" <td>[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...</td>\n",
" <td>Extracts video ID from URL.</td>\n",
" <td>[Extracts, video, ID, from, URL, .]</td>\n",
" <td>python</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>soimort/you-get</td>\n",
" <td>src/you_get/extractors/miomio.py</td>\n",
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51</td>\n",
" <td>def sina_xml_to_url_list(xml_data):\\n \"\"\"str-&gt;list\\n Convert XML to URL List.\\n From Biligrab.\\n \"\"\"\\n rawurl = []\\n dom = parseString(xml_data)\\n for node in dom.getElementsByTagName('durl'):\\n url = node.getElementsByTagName('url')[0]\\n rawurl.append(url.chil...</td>\n",
" <td>[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...</td>\n",
" <td>str-&gt;list\\n Convert XML to URL List.\\n From Biligrab.</td>\n",
" <td>[str, -, &gt;, list, Convert, XML, to, URL, List, ., From, Biligrab, .]</td>\n",
" <td>python</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>soimort/you-get</td>\n",
" <td>src/you_get/extractors/fc2video.py</td>\n",
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17</td>\n",
" <td>def makeMimi(upid):\\n \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110\"\"\"\\n strSeed = \"gGddgPfeaf_gzyr\"\\n prehash = upid + \"_\" + strSeed\\n return md5(prehash.encode('utf-8')).hexdigest()</td>\n",
" <td>[def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]</td>\n",
" <td>From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110</td>\n",
" <td>[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]</td>\n",
" <td>python</td>\n",
" <td>test</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" repo path \\\n",
"0 soimort/you-get src/you_get/extractors/youtube.py \n",
"1 soimort/you-get src/you_get/extractors/miomio.py \n",
"2 soimort/you-get src/you_get/extractors/fc2video.py \n",
"\n",
" url \\\n",
"0 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143 \n",
"1 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51 \n",
"2 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17 \n",
"\n",
" code \\\n",
"0 def get_vid_from_url(url):\\n \"\"\"Extracts video ID from URL.\\n \"\"\"\\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/watch/... \n",
"1 def sina_xml_to_url_list(xml_data):\\n \"\"\"str->list\\n Convert XML to URL List.\\n From Biligrab.\\n \"\"\"\\n rawurl = []\\n dom = parseString(xml_data)\\n for node in dom.getElementsByTagName('durl'):\\n url = node.getElementsByTagName('url')[0]\\n rawurl.append(url.chil... \n",
"2 def makeMimi(upid):\\n \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110\"\"\"\\n strSeed = \"gGddgPfeaf_gzyr\"\\n prehash = upid + \"_\" + strSeed\\n return md5(prehash.encode('utf-8')).hexdigest() \n",
"\n",
" code_tokens \\\n",
"0 [def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '... \n",
"1 [def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r... \n",
"2 [def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )] \n",
"\n",
" docstring \\\n",
"0 Extracts video ID from URL. \n",
"1 str->list\\n Convert XML to URL List.\\n From Biligrab. \n",
"2 From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110 \n",
"\n",
" docstring_tokens \\\n",
"0 [Extracts, video, ID, from, URL, .] \n",
"1 [str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .] \n",
"2 [From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110] \n",
"\n",
" language partition \n",
"0 python test \n",
"1 python test \n",
"2 python test "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pydf.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example, trying to retrieve a code snippet using its docstring). You can find more information regarding the definitions of the above columns in the README of this repo.\n",
"\n",
"Next, we will read all of the data for a limited subset of these columns into memory so we can compute summary statistics. **Warning:** This step takes ~20 minutes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"all_df = jsonl_list_to_dataframe(all_files, columns_short_list)"
]
},
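{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary statistics that follow only need the number of tokens per record, not the tokens themselves. As a rough sketch, the lengths could instead be collected while streaming each file, so that the token lists are never all held in memory at once. The helper name `token_length_frame` is hypothetical and not part of this repository, and its speed and memory use have not been measured here. The rest of this notebook continues to use `all_df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"\n",
"import pandas as pd\n",
"\n",
"def token_length_frame(file_list):\n",
"    \"\"\"Build a DataFrame of token counts from jsonl.gz files (hypothetical helper).\"\"\"\n",
"    rows = []\n",
"    for path in file_list:\n",
"        with gzip.open(path, 'rt', encoding='utf-8') as f:\n",
"            for line in f:\n",
"                record = json.loads(line)\n",
"                # Keep only the grouping keys and the two token counts\n",
"                rows.append((record['language'],\n",
"                             record['partition'],\n",
"                             len(record['code_tokens']),\n",
"                             len(record['docstring_tokens'])))\n",
"    return pd.DataFrame(rows, columns=['language', 'partition', 'code_len', 'query_len'])\n",
"\n",
"# Example usage, assuming `all_files` from the cell above:\n",
"# lengths_df = token_length_frame(all_files)"
]
},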
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Row Counts\n",
"\n",
"By Partition"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"train 1880853\n",
"test 100529\n",
"valid 89154\n",
"Name: partition, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_df.partition.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By Language"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"php 578118\n",
"java 496688\n",
"python 457461\n",
"go 346365\n",
"javascript 138625\n",
"ruby 53279\n",
"Name: language, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_df.language.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By Partition & Language"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"partition language \n",
"test go 14291\n",
" java 26909\n",
" javascript 6483\n",
" php 28391\n",
" python 22176\n",
" ruby 2279\n",
"train go 317832\n",
" java 454451\n",
" javascript 123889\n",
" php 523712\n",
" python 412178\n",
" ruby 48791\n",
"valid go 14242\n",
" java 15328\n",
" javascript 8253\n",
" php 26015\n",
" python 23107\n",
" ruby 2209\n",
"Name: code_tokens, dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_df.groupby(['partition', 'language'])['code_tokens'].count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Token Lengths By Language"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Number of tokens in each code snippet and in each docstring (used as the query)\n",
"all_df['code_len'] = all_df.code_tokens.apply(len)\n",
"all_df['query_len'] = all_df.docstring_tokens.apply(len)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Code Length Percentile By Language\n",
"\n",
"For example, the 80th percentile of code length for Python is 155 tokens."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>code_len</th>\n",
" </tr>\n",
" <tr>\n",
" <th>language</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">go</th>\n",
" <th>0.50</th>\n",
" <td>61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>138.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>217.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>319.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">java</th>\n",
" <th>0.50</th>\n",
" <td>66.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>104.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>142.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>224.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>331.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
" <th>0.50</th>\n",
" <td>91.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>144.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>194.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>301.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>448.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">php</th>\n",
" <th>0.50</th>\n",
" <td>81.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>123.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>162.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>243.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>347.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">python</th>\n",
" <th>0.50</th>\n",
" <td>72.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>114.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>155.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>237.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>341.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
" <th>0.50</th>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>68.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>88.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>125.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>174.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" code_len\n",
"language \n",
"go 0.50 61.0\n",
" 0.70 100.0\n",
" 0.80 138.0\n",
" 0.90 217.0\n",
" 0.95 319.0\n",
"java 0.50 66.0\n",
" 0.70 104.0\n",
" 0.80 142.0\n",
" 0.90 224.0\n",
" 0.95 331.0\n",
"javascript 0.50 91.0\n",
" 0.70 144.0\n",
" 0.80 194.0\n",
" 0.90 301.0\n",
" 0.95 448.0\n",
"php 0.50 81.0\n",
" 0.70 123.0\n",
" 0.80 162.0\n",
" 0.90 243.0\n",
" 0.95 347.0\n",
"python 0.50 72.0\n",
" 0.70 114.0\n",
" 0.80 155.0\n",
" 0.90 237.0\n",
" 0.95 341.0\n",
"ruby 0.50 48.0\n",
" 0.70 68.6\n",
" 0.80 88.0\n",
" 0.90 125.0\n",
" 0.95 174.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])\n",
"display(pd.DataFrame(code_len_summary))"
]
},
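{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fractional value for `ruby` at the 0.70 quantile (68.6) arises because `quantile` interpolates linearly between neighbouring observations by default. The cell below is a small sketch on made-up toy values (not taken from the dataset) showing how the `interpolation` argument changes the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.Series([1, 2, 3, 4])  # made-up values, for illustration only\n",
"\n",
"# The 0.7 quantile falls between the third and fourth sorted values (3 and 4)\n",
"print(toy.quantile(0.7))                          # default linear interpolation: a value between 3 and 4\n",
"print(toy.quantile(0.7, interpolation='lower'))   # snaps down to the nearest observed value\n",
"print(toy.quantile(0.7, interpolation='higher'))  # snaps up to the nearest observed value"
]
},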
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Query Length Percentile By Language\n",
"\n",
"For example, the 80th percentile of query (docstring) length for Python is 20 tokens."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>query_len</th>\n",
" </tr>\n",
" <tr>\n",
" <th>language</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">go</th>\n",
" <th>0.50</th>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>28.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>49.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>92.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">java</th>\n",
" <th>0.50</th>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>18.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>39.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>47.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">php</th>\n",
" <th>0.50</th>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">python</th>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
" <th>0.50</th>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>36.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>49.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" query_len\n",
"language \n",
"go 0.50 12.0\n",
" 0.70 19.0\n",
" 0.80 28.0\n",
" 0.90 49.0\n",
" 0.95 92.0\n",
"java 0.50 11.0\n",
" 0.70 18.0\n",
" 0.80 25.0\n",
" 0.90 39.0\n",
" 0.95 61.0\n",
"javascript 0.50 10.0\n",
" 0.70 15.0\n",
" 0.80 21.0\n",
" 0.90 33.0\n",
" 0.95 47.0\n",
"php 0.50 7.0\n",
" 0.70 10.0\n",
" 0.80 12.0\n",
" 0.90 17.0\n",
" 0.95 24.0\n",
"python 0.50 10.0\n",
" 0.70 15.0\n",
" 0.80 20.0\n",
" 0.90 33.0\n",
" 0.95 48.0\n",
"ruby 0.50 11.0\n",
" 0.70 17.0\n",
" 0.80 24.0\n",
" 0.90 36.0\n",
" 0.95 49.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])\n",
"display(pd.DataFrame(query_len_summary))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Query Length All Languages"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>query_len</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>32.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>50.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" query_len\n",
"0.50 10.0\n",
"0.70 15.0\n",
"0.80 20.0\n",
"0.90 32.0\n",
"0.95 50.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])\n",
"display(pd.DataFrame(query_len_summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}