{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Data Exploration\n",
"\n",
"This notebook explores the preprocessed data and shows some basic statistics that may be useful."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import json\n",
"from pathlib import Path\n",
"from pprint import pprint\n",
"\n",
"import pandas as pd\n",
"\n",
"# Show up to 300 characters per column so code and docstrings are not cut off too early\n",
"pd.set_option('display.max_colwidth', 300)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Part 1: Preview The Dataset\n",
"\n",
"Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data. While the full dataset can be downloaded automatically with the `/script/setup` script located in this repo, we can alternatively download a subset of the data from S3.\n",
"\n",
"The S3 links follow this pattern:\n",
"\n",
"> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip\n",
"\n",
"For example, the link for the `python` subset is:\n",
"\n",
"> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
"\n",
"First, we download and decompress this dataset:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--2019-06-14 01:05:08-- https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip\n",
|
||
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.77\n",
|
||
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.77|:443... connected.\n",
|
||
"HTTP request sent, awaiting response... 200 OK\n",
|
||
"Length: 218813834 (209M) [application/zip]\n",
|
||
"Saving to: ‘python.zip’\n",
|
||
"\n",
|
||
"python.zip 100%[===================>] 208.68M 63.9MB/s in 3.3s \n",
|
||
"\n",
|
||
"2019-06-14 01:05:11 (63.9 MB/s) - ‘python.zip’ saved [218813834/218813834]\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Archive: python.zip\n",
|
||
" creating: python/\n",
|
||
" creating: python/final/\n",
|
||
" creating: python/final/jsonl/\n",
|
||
" creating: python/final/jsonl/valid/\n",
|
||
" inflating: python/final/jsonl/valid/python_valid_0.jsonl.gz \n",
|
||
" creating: python/final/jsonl/test/\n",
|
||
" inflating: python/final/jsonl/test/python_test_0.jsonl.gz \n",
|
||
" creating: python/final/jsonl/train/\n",
|
||
" inflating: python/final/jsonl/train/python_train_7.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_6.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_12.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_13.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_0.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_1.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_4.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_5.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_9.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_8.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_11.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_10.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_3.jsonl.gz \n",
|
||
" inflating: python/final/jsonl/train/python_train_2.jsonl.gz \n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!unzip python.zip"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# decompress this gzip file\n",
|
||
"!gzip -d python/final/jsonl/test/python_test_0.jsonl.gz"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Read in the file and display the first row. The data is stored in [JSON Lines](http://jsonlines.org/) format."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'{\"repo\": \"soimort/you-get\", \"path\": \"src/you_get/extractors/youtube.py\", \"func_name\": \"YouTube.get_vid_from_url\", \"original_string\": \"def get_vid_from_url(url):\\\\n \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n \\\\\"\\\\\"\\\\\"\\\\n return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"language\": \"python\", \"code\": \"def get_vid_from_url(url):\\\\n \\\\\"\\\\\"\\\\\"Extracts video ID from URL.\\\\n \\\\\"\\\\\"\\\\\"\\\\n return match1(url, r\\'youtu\\\\\\\\.be/([^?/]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/v/([^/?]+)\\') or \\\\\\\\\\\\n match1(url, r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\') or \\\\\\\\\\\\n parse_query_param(url, \\'v\\') or \\\\\\\\\\\\n parse_query_param(parse_query_param(url, \\'u\\'), \\'v\\')\", \"code_tokens\": [\"def\", \"get_vid_from_url\", \"(\", \"url\", \")\", \":\", \"return\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtu\\\\\\\\.be/([^?/]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/embed/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/v/([^/?]+)\\'\", \")\", \"or\", \"match1\", \"(\", \"url\", \",\", \"r\\'youtube\\\\\\\\.com/watch/([^/?]+)\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'v\\'\", \")\", \"or\", \"parse_query_param\", \"(\", \"parse_query_param\", \"(\", \"url\", \",\", \"\\'u\\'\", \")\", \",\", \"\\'v\\'\", \")\"], \"docstring\": \"Extracts video ID from URL.\", \"docstring_tokens\": [\"Extracts\", \"video\", \"ID\", \"from\", \"URL\", \".\"], \"sha\": 
\"b746ac01c9f39de94cac2d56f665285b0523b974\", \"url\": \"https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143\", \"partition\": \"test\"}\\n'"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:\n",
"    sample_file = f.readlines()\n",
"sample_file[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Because each line in the file is valid JSON, we can parse the first row and display it in a more human-readable form:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"{'code': 'def get_vid_from_url(url):\\n'\n",
|
||
" ' \"\"\"Extracts video ID from URL.\\n'\n",
|
||
" ' \"\"\"\\n'\n",
|
||
" \" return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
|
||
" \" match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \\\\\\n\"\n",
|
||
" \" match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
|
||
" \" match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \\\\\\n\"\n",
|
||
" \" parse_query_param(url, 'v') or \\\\\\n\"\n",
|
||
" \" parse_query_param(parse_query_param(url, 'u'), 'v')\",\n",
|
||
" 'code_tokens': ['def',\n",
|
||
" 'get_vid_from_url',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ')',\n",
|
||
" ':',\n",
|
||
" 'return',\n",
|
||
" 'match1',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"r'youtu\\\\.be/([^?/]+)'\",\n",
|
||
" ')',\n",
|
||
" 'or',\n",
|
||
" 'match1',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"r'youtube\\\\.com/embed/([^/?]+)'\",\n",
|
||
" ')',\n",
|
||
" 'or',\n",
|
||
" 'match1',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"r'youtube\\\\.com/v/([^/?]+)'\",\n",
|
||
" ')',\n",
|
||
" 'or',\n",
|
||
" 'match1',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"r'youtube\\\\.com/watch/([^/?]+)'\",\n",
|
||
" ')',\n",
|
||
" 'or',\n",
|
||
" 'parse_query_param',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"'v'\",\n",
|
||
" ')',\n",
|
||
" 'or',\n",
|
||
" 'parse_query_param',\n",
|
||
" '(',\n",
|
||
" 'parse_query_param',\n",
|
||
" '(',\n",
|
||
" 'url',\n",
|
||
" ',',\n",
|
||
" \"'u'\",\n",
|
||
" ')',\n",
|
||
" ',',\n",
|
||
" \"'v'\",\n",
|
||
" ')'],\n",
|
||
" 'docstring': 'Extracts video ID from URL.',\n",
|
||
" 'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],\n",
|
||
" 'func_name': 'YouTube.get_vid_from_url',\n",
|
||
" 'language': 'python',\n",
|
||
" 'original_string': 'def get_vid_from_url(url):\\n'\n",
|
||
" ' \"\"\"Extracts video ID from URL.\\n'\n",
|
||
" ' \"\"\"\\n'\n",
|
||
" \" return match1(url, r'youtu\\\\.be/([^?/]+)') or \\\\\\n\"\n",
|
||
" \" match1(url, r'youtube\\\\.com/embed/([^/?]+)') or \"\n",
|
||
" '\\\\\\n'\n",
|
||
" \" match1(url, r'youtube\\\\.com/v/([^/?]+)') or \\\\\\n\"\n",
|
||
" \" match1(url, r'youtube\\\\.com/watch/([^/?]+)') or \"\n",
|
||
" '\\\\\\n'\n",
|
||
" \" parse_query_param(url, 'v') or \\\\\\n\"\n",
|
||
" \" parse_query_param(parse_query_param(url, 'u'), \"\n",
|
||
" \"'v')\",\n",
|
||
" 'partition': 'test',\n",
|
||
" 'path': 'src/you_get/extractors/youtube.py',\n",
|
||
" 'repo': 'soimort/you-get',\n",
|
||
" 'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',\n",
|
||
" 'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"pprint(json.loads(sample_file[0]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Definitions of each of the above fields are located in the README.md file in the root of this repository."
|
||
]
|
||
},
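{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the remaining `.jsonl.gz` files do not have to be decompressed on disk first. The cell below is a minimal sketch, not part of the original workflow, that reads the first record straight from the compressed validation file with Python's standard `gzip` module. It assumes `python/final/jsonl/valid/python_valid_0.jsonl.gz` from the `unzip` listing above is still present and still compressed; `valid_path` and `first_record` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"\n",
"# Assumed path: taken from the unzip listing above; adjust if your layout differs\n",
"valid_path = 'python/final/jsonl/valid/python_valid_0.jsonl.gz'\n",
"\n",
"# 'rt' opens the gzip stream in text mode, so each line is one JSON document\n",
"with gzip.open(valid_path, 'rt', encoding='utf-8') as f:\n",
"    first_record = json.loads(next(f))\n",
"\n",
"# List the field names carried by the record\n",
"sorted(first_record)"
]
},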
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Part 2: Exploring The Full Dataset\n",
|
||
"\n",
|
||
"You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The dataset is located in `/resources/data`, which contains approximately 2.1 million (code, comment) pairs across the train, validation, and test partitions. You can learn more about the directory structure and associated files by viewing `/resources/README.md`.\n",
"\n",
"The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format. First, we can get a list of all these files for further inspection:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"scrolled": false
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))\n",
"java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))\n",
"go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))\n",
"php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))\n",
"javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))\n",
"ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))\n",
"all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Total number of files: 78\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(f'Total number of files: {len(all_files):,}')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"columns_long_list = ['repo', 'path', 'url', 'code',\n",
"                     'code_tokens', 'docstring', 'docstring_tokens',\n",
"                     'language', 'partition']\n",
"\n",
"columns_short_list = ['code_tokens', 'docstring_tokens',\n",
"                      'language', 'partition']\n",
"\n",
"def jsonl_list_to_dataframe(file_list, columns=columns_long_list):\n",
"    \"\"\"Load a list of jsonl.gz files into a pandas DataFrame.\"\"\"\n",
"    return pd.concat([pd.read_json(f,\n",
"                                   orient='records',\n",
"                                   compression='gzip',\n",
"                                   lines=True)[columns]\n",
"                      for f in file_list], sort=False)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"This is what the python dataset looks like:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"pydf = jsonl_list_to_dataframe(python_files)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>repo</th>\n",
|
||
" <th>path</th>\n",
|
||
" <th>url</th>\n",
|
||
" <th>code</th>\n",
|
||
" <th>code_tokens</th>\n",
|
||
" <th>docstring</th>\n",
|
||
" <th>docstring_tokens</th>\n",
|
||
" <th>language</th>\n",
|
||
" <th>partition</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>soimort/you-get</td>\n",
|
||
" <td>src/you_get/extractors/youtube.py</td>\n",
|
||
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143</td>\n",
|
||
" <td>def get_vid_from_url(url):\\n \"\"\"Extracts video ID from URL.\\n \"\"\"\\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/watch/...</td>\n",
|
||
" <td>[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...</td>\n",
|
||
" <td>Extracts video ID from URL.</td>\n",
|
||
" <td>[Extracts, video, ID, from, URL, .]</td>\n",
|
||
" <td>python</td>\n",
|
||
" <td>test</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>soimort/you-get</td>\n",
|
||
" <td>src/you_get/extractors/miomio.py</td>\n",
|
||
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51</td>\n",
|
||
" <td>def sina_xml_to_url_list(xml_data):\\n \"\"\"str->list\\n Convert XML to URL List.\\n From Biligrab.\\n \"\"\"\\n rawurl = []\\n dom = parseString(xml_data)\\n for node in dom.getElementsByTagName('durl'):\\n url = node.getElementsByTagName('url')[0]\\n rawurl.append(url.chil...</td>\n",
|
||
" <td>[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...</td>\n",
|
||
" <td>str->list\\n Convert XML to URL List.\\n From Biligrab.</td>\n",
|
||
" <td>[str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .]</td>\n",
|
||
" <td>python</td>\n",
|
||
" <td>test</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>soimort/you-get</td>\n",
|
||
" <td>src/you_get/extractors/fc2video.py</td>\n",
|
||
" <td>https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17</td>\n",
|
||
" <td>def makeMimi(upid):\\n \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110\"\"\"\\n strSeed = \"gGddgPfeaf_gzyr\"\\n prehash = upid + \"_\" + strSeed\\n return md5(prehash.encode('utf-8')).hexdigest()</td>\n",
|
||
" <td>[def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]</td>\n",
|
||
" <td>From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110</td>\n",
|
||
" <td>[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]</td>\n",
|
||
" <td>python</td>\n",
|
||
" <td>test</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" repo path \\\n",
|
||
"0 soimort/you-get src/you_get/extractors/youtube.py \n",
|
||
"1 soimort/you-get src/you_get/extractors/miomio.py \n",
|
||
"2 soimort/you-get src/you_get/extractors/fc2video.py \n",
|
||
"\n",
|
||
" url \\\n",
|
||
"0 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143 \n",
|
||
"1 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51 \n",
|
||
"2 https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17 \n",
|
||
"\n",
|
||
" code \\\n",
|
||
"0 def get_vid_from_url(url):\\n \"\"\"Extracts video ID from URL.\\n \"\"\"\\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\\n match1(url, r'youtube\\.com/watch/... \n",
|
||
"1 def sina_xml_to_url_list(xml_data):\\n \"\"\"str->list\\n Convert XML to URL List.\\n From Biligrab.\\n \"\"\"\\n rawurl = []\\n dom = parseString(xml_data)\\n for node in dom.getElementsByTagName('durl'):\\n url = node.getElementsByTagName('url')[0]\\n rawurl.append(url.chil... \n",
|
||
"2 def makeMimi(upid):\\n \"\"\"From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110\"\"\"\\n strSeed = \"gGddgPfeaf_gzyr\"\\n prehash = upid + \"_\" + strSeed\\n return md5(prehash.encode('utf-8')).hexdigest() \n",
|
||
"\n",
|
||
" code_tokens \\\n",
|
||
"0 [def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '... \n",
|
||
"1 [def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r... \n",
|
||
"2 [def, makeMimi, (, upid, ), :, strSeed, =, \"gGddgPfeaf_gzyr\", prehash, =, upid, +, \"_\", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )] \n",
|
||
"\n",
|
||
" docstring \\\n",
|
||
"0 Extracts video ID from URL. \n",
|
||
"1 str->list\\n Convert XML to URL List.\\n From Biligrab. \n",
|
||
"2 From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\\n L110 \n",
|
||
"\n",
|
||
" docstring_tokens \\\n",
|
||
"0 [Extracts, video, ID, from, URL, .] \n",
|
||
"1 [str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .] \n",
|
||
"2 [From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110] \n",
|
||
"\n",
|
||
" language partition \n",
|
||
"0 python test \n",
|
||
"1 python test \n",
|
||
"2 python test "
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"pydf.head(3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which represent a parallel corpus that can be used for interesting tasks like information retrieval (for example, trying to retrieve a code snippet using the docstring). You can find more information regarding the definition of the above columns in the README of this repo.\n",
"\n",
"Next, we will read all of the data for a limited subset of these columns into memory so we can compute summary statistics. **Warning:** This step takes about 20 minutes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"all_df = jsonl_list_to_dataframe(all_files, columns_short_list)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Summary Statistics"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Row Counts\n",
|
||
"\n",
|
||
"By Partition"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"train 1880853\n",
|
||
"test 100529\n",
|
||
"valid 89154\n",
|
||
"Name: partition, dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"all_df.partition.value_counts()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By Language"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"php 578118\n",
|
||
"java 496688\n",
|
||
"python 457461\n",
|
||
"go 346365\n",
|
||
"javascript 138625\n",
|
||
"ruby 53279\n",
|
||
"Name: language, dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"all_df.language.value_counts()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By Partition & Language"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"partition language \n",
|
||
"test go 14291\n",
|
||
" java 26909\n",
|
||
" javascript 6483\n",
|
||
" php 28391\n",
|
||
" python 22176\n",
|
||
" ruby 2279\n",
|
||
"train go 317832\n",
|
||
" java 454451\n",
|
||
" javascript 123889\n",
|
||
" php 523712\n",
|
||
" python 412178\n",
|
||
" ruby 48791\n",
|
||
"valid go 14242\n",
|
||
" java 15328\n",
|
||
" javascript 8253\n",
|
||
" php 26015\n",
|
||
" python 23107\n",
|
||
" ruby 2209\n",
|
||
"Name: code_tokens, dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"all_df.groupby(['partition', 'language'])['code_tokens'].count()"
|
||
]
|
||
},
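{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same counts can also be obtained without holding every row in memory. The following is a minimal sketch under stated assumptions: `count_rows_streaming` is a hypothetical helper (not part of this repository), and every record is assumed to carry the `partition` and `language` keys seen in the sample record. It reuses the `all_files` list built earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"from collections import Counter\n",
"\n",
"def count_rows_streaming(file_list):\n",
"    \"\"\"Count rows per (partition, language), reading one line at a time.\"\"\"\n",
"    counts = Counter()\n",
"    for path in file_list:\n",
"        # Each line of the decompressed stream is one JSON record\n",
"        with gzip.open(path, 'rt', encoding='utf-8') as f:\n",
"            for line in f:\n",
"                record = json.loads(line)\n",
"                counts[(record['partition'], record['language'])] += 1\n",
"    return counts\n",
"\n",
"# all_files is the list of .jsonl.gz paths defined earlier in this notebook\n",
"count_rows_streaming(all_files)"
]
},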
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Token Lengths By Language"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"all_df['code_len'] = all_df.code_tokens.apply(len)\n",
"all_df['query_len'] = all_df.docstring_tokens.apply(len)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Code Length Percentile By Language\n",
"\n",
"For example, the 80th percentile length for Python code tokens is 155."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th>code_len</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>language</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">go</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>61.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>100.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>138.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>217.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>319.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">java</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>66.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>104.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>142.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>224.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>331.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>91.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>144.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>194.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>301.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>448.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">php</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>81.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>123.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>162.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>243.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>347.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">python</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>72.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>114.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>155.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>237.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>341.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
|
||
" <th>0.50</th>\n",
|
||
" <td>48.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.70</th>\n",
|
||
" <td>68.6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.80</th>\n",
|
||
" <td>88.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.90</th>\n",
|
||
" <td>125.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0.95</th>\n",
|
||
" <td>174.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" code_len\n",
|
||
"language \n",
|
||
"go 0.50 61.0\n",
|
||
" 0.70 100.0\n",
|
||
" 0.80 138.0\n",
|
||
" 0.90 217.0\n",
|
||
" 0.95 319.0\n",
|
||
"java 0.50 66.0\n",
|
||
" 0.70 104.0\n",
|
||
" 0.80 142.0\n",
|
||
" 0.90 224.0\n",
|
||
" 0.95 331.0\n",
|
||
"javascript 0.50 91.0\n",
|
||
" 0.70 144.0\n",
|
||
" 0.80 194.0\n",
|
||
" 0.90 301.0\n",
|
||
" 0.95 448.0\n",
|
||
"php 0.50 81.0\n",
|
||
" 0.70 123.0\n",
|
||
" 0.80 162.0\n",
|
||
" 0.90 243.0\n",
|
||
" 0.95 347.0\n",
|
||
"python 0.50 72.0\n",
|
||
" 0.70 114.0\n",
|
||
" 0.80 155.0\n",
|
||
" 0.90 237.0\n",
|
||
" 0.95 341.0\n",
|
||
"ruby 0.50 48.0\n",
|
||
" 0.70 68.6\n",
|
||
" 0.80 88.0\n",
|
||
" 0.90 125.0\n",
|
||
" 0.95 174.0"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])\n",
|
||
"display(pd.DataFrame(code_len_summary))"
|
||
]
|
||
},
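{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some entries in the table above are not whole numbers (for example, 68.6 for ruby at the 0.70 quantile) even though token counts are integers. This is because pandas' `quantile` interpolates linearly between neighbouring values by default. The toy example below uses made-up numbers and is only an illustration of that behaviour; `toy_lengths` is an illustrative name, and passing `interpolation='nearest'` is one way to get an observed value instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up lengths, purely for illustration\n",
"toy_lengths = pd.Series([1, 2, 3, 4])\n",
"\n",
"# Default (linear) interpolation yields a value between 2 and 3\n",
"print(toy_lengths.quantile(0.5))\n",
"\n",
"# 'nearest' returns one of the observed values instead\n",
"print(toy_lengths.quantile(0.5, interpolation='nearest'))"
]
},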
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Query Length Percentile By Language\n",
"\n",
"For example, the 80th percentile length for Python docstring tokens is 20."
|
||
]
|
||
},
|
||
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>query_len</th>\n",
" </tr>\n",
" <tr>\n",
" <th>language</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">go</th>\n",
" <th>0.50</th>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>28.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>49.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>92.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">java</th>\n",
" <th>0.50</th>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>18.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>39.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>61.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">javascript</th>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>47.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">php</th>\n",
" <th>0.50</th>\n",
" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">python</th>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">ruby</th>\n",
" <th>0.50</th>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>36.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>49.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" query_len\n",
"language \n",
"go 0.50 12.0\n",
" 0.70 19.0\n",
" 0.80 28.0\n",
" 0.90 49.0\n",
" 0.95 92.0\n",
"java 0.50 11.0\n",
" 0.70 18.0\n",
" 0.80 25.0\n",
" 0.90 39.0\n",
" 0.95 61.0\n",
"javascript 0.50 10.0\n",
" 0.70 15.0\n",
" 0.80 21.0\n",
" 0.90 33.0\n",
" 0.95 47.0\n",
"php 0.50 7.0\n",
" 0.70 10.0\n",
" 0.80 12.0\n",
" 0.90 17.0\n",
" 0.95 24.0\n",
"python 0.50 10.0\n",
" 0.70 15.0\n",
" 0.80 20.0\n",
" 0.90 33.0\n",
" 0.95 48.0\n",
"ruby 0.50 11.0\n",
" 0.70 17.0\n",
" 0.80 24.0\n",
" 0.90 36.0\n",
" 0.95 49.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])\n",
"display(pd.DataFrame(query_len_summary))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Query Length Percentile Across All Languages"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>query_len</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0.50</th>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.70</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.80</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.90</th>\n",
" <td>32.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0.95</th>\n",
" <td>50.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" query_len\n",
"0.50 10.0\n",
"0.70 15.0\n",
"0.80 20.0\n",
"0.90 32.0\n",
"0.95 50.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])\n",
"display(pd.DataFrame(query_len_summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}