OpenWPM/openwpm/storage/parquet_schema.py

"""
Arrow schema for our ArrowProvider.py
IF YOU CHANGE THIS FILE ALSO CHANGE schema.sql and test_values.py
AND Schema-Documentation.md
"""
import pyarrow as pa
PQ_SCHEMAS = dict()
fields = [
    pa.field("task_id", pa.int64(), nullable=False),
    pa.field("manager_params", pa.string(), nullable=False),
    pa.field("openwpm_version", pa.string(), nullable=False),
    pa.field("browser_version", pa.string(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
]
PQ_SCHEMAS["task"] = pa.schema(fields)
fields = [
    pa.field("browser_id", pa.uint32(), nullable=False),
    pa.field("task_id", pa.int64(), nullable=False),
    pa.field("browser_params", pa.string(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
]
PQ_SCHEMAS["crawl"] = pa.schema(fields)
# site_visits
fields = [
    pa.field("visit_id", pa.int64(), nullable=False),
    pa.field("browser_id", pa.uint32(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("site_url", pa.string(), nullable=False),
    pa.field("site_rank", pa.uint32()),
]
PQ_SCHEMAS["site_visits"] = pa.schema(fields)
# crawl_history
fields = [
    pa.field("browser_id", pa.uint32(), nullable=False),
    pa.field("visit_id", pa.int64(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("command", pa.string()),
    pa.field("arguments", pa.string()),
    pa.field("retry_number", pa.int8()),
    pa.field("command_status", pa.string()),
    pa.field("error", pa.string()),
    pa.field("traceback", pa.string()),
    pa.field("duration", pa.int64()),
]
PQ_SCHEMAS["crawl_history"] = pa.schema(fields)
# http_requests
fields = [
    pa.field("incognito", pa.int32()),
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("event_ordinal", pa.int64()),
    pa.field("window_id", pa.int64()),
    pa.field("tab_id", pa.int64()),
    pa.field("frame_id", pa.int64()),
    pa.field("url", pa.string(), nullable=False),
    pa.field("top_level_url", pa.string()),
    pa.field("parent_frame_id", pa.int64()),
    pa.field("frame_ancestors", pa.string()),
    pa.field("method", pa.string(), nullable=False),
    pa.field("referrer", pa.string(), nullable=False),
    pa.field("headers", pa.string(), nullable=False),
    pa.field("request_id", pa.int64(), nullable=False),
    pa.field("is_XHR", pa.bool_()),
    pa.field("is_third_party_channel", pa.bool_()),
    pa.field("is_third_party_to_top_window", pa.bool_()),
    pa.field("triggering_origin", pa.string()),
    pa.field("loading_origin", pa.string()),
    pa.field("loading_href", pa.string()),
    pa.field("req_call_stack", pa.string()),
    pa.field("resource_type", pa.string(), nullable=False),
    pa.field("post_body", pa.string()),
    pa.field("post_body_raw", pa.string()),
    pa.field("time_stamp", pa.string(), nullable=False),
]
PQ_SCHEMAS["http_requests"] = pa.schema(fields)
# http_responses
fields = [
    pa.field("incognito", pa.int32()),
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("event_ordinal", pa.int64()),
    pa.field("window_id", pa.int64()),
    pa.field("tab_id", pa.int64()),
    pa.field("frame_id", pa.int64()),
    pa.field("url", pa.string(), nullable=False),
    pa.field("method", pa.string(), nullable=False),
    pa.field("response_status", pa.int64()),
    pa.field("response_status_text", pa.string(), nullable=False),
    pa.field("is_cached", pa.bool_(), nullable=False),
    pa.field("headers", pa.string(), nullable=False),
    pa.field("request_id", pa.int64(), nullable=False),
    pa.field("location", pa.string(), nullable=False),
    pa.field("time_stamp", pa.string(), nullable=False),
    pa.field("content_hash", pa.string()),
]
PQ_SCHEMAS["http_responses"] = pa.schema(fields)
# http_redirects
fields = [
    pa.field("incognito", pa.int32()),
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("old_request_url", pa.string()),
    pa.field("old_request_id", pa.string()),
    pa.field("new_request_url", pa.string()),
    pa.field("new_request_id", pa.string()),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("event_ordinal", pa.int64()),
    pa.field("window_id", pa.int64()),
    pa.field("tab_id", pa.int64()),
    pa.field("frame_id", pa.int64()),
    pa.field("response_status", pa.int64()),
    pa.field("response_status_text", pa.string(), nullable=False),
    pa.field("headers", pa.string()),
    pa.field("time_stamp", pa.string(), nullable=False),
]
PQ_SCHEMAS["http_redirects"] = pa.schema(fields)
# javascript
fields = [
    pa.field("incognito", pa.int32()),
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("event_ordinal", pa.int64()),
    pa.field("page_scoped_event_ordinal", pa.int64()),
    pa.field("window_id", pa.int64()),
    pa.field("tab_id", pa.int64()),
    pa.field("frame_id", pa.int64()),
    pa.field("script_url", pa.string()),
    pa.field("script_line", pa.string()),
    pa.field("script_col", pa.string()),
    pa.field("func_name", pa.string()),
    pa.field("script_loc_eval", pa.string()),
    pa.field("document_url", pa.string()),
    pa.field("top_level_url", pa.string()),
    pa.field("call_stack", pa.string()),
    pa.field("symbol", pa.string()),
    pa.field("operation", pa.string()),
    pa.field("value", pa.string()),
    pa.field("arguments", pa.string()),
    pa.field("time_stamp", pa.string(), nullable=False),
]
PQ_SCHEMAS["javascript"] = pa.schema(fields)
# javascript_cookies
fields = [
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("event_ordinal", pa.int64()),
    pa.field("record_type", pa.string()),
    pa.field("change_cause", pa.string()),
    pa.field("expiry", pa.string()),
    pa.field("is_http_only", pa.bool_()),
    pa.field("is_host_only", pa.bool_()),
    pa.field("is_session", pa.bool_()),
    pa.field("host", pa.string()),
    pa.field("is_secure", pa.bool_()),
    pa.field("name", pa.string()),
    pa.field("path", pa.string()),
    pa.field("value", pa.string()),
    pa.field("same_site", pa.string()),
    pa.field("first_party_domain", pa.string()),
    pa.field("store_id", pa.string()),
    pa.field("time_stamp", pa.string()),
]
PQ_SCHEMAS["javascript_cookies"] = pa.schema(fields)
# navigations
fields = [
    pa.field("incognito", pa.int32()),
    pa.field("browser_id", pa.uint32()),
    pa.field("visit_id", pa.int64()),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("extension_session_uuid", pa.string()),
    pa.field("process_id", pa.int64()),
    pa.field("window_id", pa.int64()),
    pa.field("tab_id", pa.int64()),
    pa.field("tab_opener_tab_id", pa.int64()),
    pa.field("frame_id", pa.int64()),
    pa.field("parent_frame_id", pa.int64()),
    pa.field("window_width", pa.int64()),
    pa.field("window_height", pa.int64()),
    pa.field("window_type", pa.string()),
    pa.field("tab_width", pa.int64()),
    pa.field("tab_height", pa.int64()),
    pa.field("tab_cookie_store_id", pa.string()),
    pa.field("uuid", pa.string()),
    pa.field("url", pa.string()),
    pa.field("transition_qualifiers", pa.string()),
    pa.field("transition_type", pa.string()),
    pa.field("before_navigate_event_ordinal", pa.int64()),
    pa.field("before_navigate_time_stamp", pa.string()),
    pa.field("committed_event_ordinal", pa.int64()),
    pa.field("time_stamp", pa.string()),
]
PQ_SCHEMAS["navigations"] = pa.schema(fields)
# callstacks
fields = [
    pa.field("visit_id", pa.int64(), nullable=False),
    pa.field("request_id", pa.int64(), nullable=False),
    pa.field("browser_id", pa.uint32(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
    pa.field("call_stack", pa.string()),
]
PQ_SCHEMAS["callstacks"] = pa.schema(fields)
# incomplete_visits
fields = [
    pa.field("visit_id", pa.int64(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
]
PQ_SCHEMAS["incomplete_visits"] = pa.schema(fields)
# dns_responses
fields = [
    pa.field("request_id", pa.int64(), nullable=False),
    pa.field("browser_id", pa.uint32(), nullable=False),
    pa.field("visit_id", pa.int64(), nullable=False),
    pa.field("hostname", pa.string()),
    pa.field("addresses", pa.string()),
    pa.field("canonical_name", pa.string()),
    pa.field("is_TRR", pa.bool_()),
    pa.field("time_stamp", pa.string(), nullable=False),
    pa.field("instance_id", pa.uint32(), nullable=False),
]
PQ_SCHEMAS["dns_responses"] = pa.schema(fields)
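
# The module docstring asks that schema.sql be kept in step with these
# schemas. The block below is a minimal, standard-library-only sketch of how
# such a check might look for a single table. It is not part of OpenWPM:
# HYPOTHETICAL_DDL is an invented stand-in for one statement in schema.sql
# (the real file is not shown here and may differ), and sql_columns and
# schema_drift are illustrative helper names. The field names are copied by
# hand from PQ_SCHEMAS["callstacks"] above so the sketch runs without pyarrow.

```python
import sqlite3
from typing import List, Sequence, Tuple

# Field names copied from PQ_SCHEMAS["callstacks"] in this file.
PARQUET_FIELDS = ["visit_id", "request_id", "browser_id", "instance_id", "call_stack"]

# Hypothetical stand-in for one CREATE TABLE statement in schema.sql.
# This is an assumption for illustration, not the project's actual DDL.
HYPOTHETICAL_DDL = """
CREATE TABLE callstacks (
    visit_id INTEGER NOT NULL,
    request_id INTEGER NOT NULL,
    browser_id INTEGER NOT NULL,
    call_stack TEXT
);
"""


def sql_columns(ddl: str, table: str) -> List[str]:
    """Return the column names SQLite reports for `table` after running `ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    finally:
        conn.close()


def schema_drift(
    parquet_fields: Sequence[str], sql_cols: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Return (names only in the Parquet schema, names only in the SQL table)."""
    parquet, sql = set(parquet_fields), set(sql_cols)
    return sorted(parquet - sql), sorted(sql - parquet)


only_parquet, only_sql = schema_drift(
    PARQUET_FIELDS, sql_columns(HYPOTHETICAL_DDL, "callstacks")
)
print(only_parquet, only_sql)  # prints: ['instance_id'] []
```

# With pyarrow installed, the hand-copied list could presumably be replaced
# by PQ_SCHEMAS["callstacks"].names, so the check reads the schema directly.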