# push-to-bigquery

Push JSON documents to BigQuery
## Overview

Manages the table schema while pushing arbitrary JSON documents to Google's BigQuery.
## Benefits

- Allows numerous independent processes to insert data into a "table" while avoiding the per-table BigQuery insert limits
- Expands the schema to fit the JSON documents provided; this includes handling changes in datatype, deeply nested JSON arrays, and multidimensional arrays
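As an illustration, the two documents below could both be inserted into the same table: the schema would expand to accept the nested structures, and the change of `id` from integer to string would be absorbed rather than rejected (the field names and values are invented for this sketch):

```python
# Hypothetical documents; field names are invented for illustration.
doc_v1 = {"id": 1, "tags": ["a", "b"]}

doc_v2 = {
    "id": "1b",                        # datatype changed from int to str
    "tags": [["a", "b"], ["c"]],       # multidimensional array
    "meta": {"runs": [{"step": 1}]},   # deeply nested JSON array
}
```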
## Details

### Installation

Clone from GitHub and install the dependencies listed in `requirements.txt`.
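For example (the repository URL below is an assumption; substitute the actual GitHub location of this project):

```bash
# Hypothetical URL; replace with the real repository location
git clone https://github.com/klahnakoski/push-to-bigquery.git
cd push-to-bigquery
pip install -r requirements.txt
```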
### Configuration
- `account_info` - The BigQuery Service Account Info
- `dataset` - The BigQuery dataset to place tables in
- `table` - The name of the BigQuery table to fill
- `partition` - BigQuery can partition a table based on a `TIMESTAMP` field
  - `field` - The `TIMESTAMP` field used to determine the partitions
  - `expire` - Age of a partition when it is removed, as determined by the `field` value
- `cluster` - Array of field names used to sort the table
- `id`
  - `field` - The field used to determine document uniqueness
  - `version` - Version number used to decide which document takes precedence when removing duplicates: the largest is chosen. Unix timestamps work well.
- `top_level_fields` - Map from full path name to top-level field name. BigQuery demands you include the partition and cluster fields if they are not already top-level.
- `sharded` - If `true`, then multiple tables (aka "shards") are allowed, but they must be merged before becoming part of the primary table
- `read_only` - Set to `false` if you are planning to add records
- `schema` - Fields that must exist
#### Example config file
```json
{
    "account_info": {
        "$ref": "file:///e:/moz-fx-dev-ekyle-treeherder-a838a7718652.json"
    },
    "dataset": "treeherder",
    "table": "jobs",
    "top_level_fields": {
        "job.id": "_job_id",
        "last_modified": "_last_modified",
        "action.request_time": "_request_time"
    },
    "partition": {
        "field": "action.request_time",
        "expire": "2year"
    },
    "id": {
        "field": "job.id",
        "version": "last_modified"
    },
    "cluster": [
        "job.id",
        "last_modified"
    ],
    "sharded": true
}
```
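One way to load this file is shown below (a minimal sketch, assuming the file is saved as `config.json`; note that plain `json.load` will not resolve the `$ref` indirection under `account_info`, so the project may supply its own config loader for that):

```python
import json

# Minimal sketch: read the configuration shown above.
# Assumes the file is saved as config.json in the repo root.
with open("config.json") as f:
    config = json.load(f)
```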
## Usage

```python
from push_to_bigquery import bigquery  # import path assumed from the repo layout

container = bigquery.Dataset(config)           # attach to the configured dataset
index = container.get_or_create_table(config)  # find or create the table, with schema
index.extend(documents)                        # insert JSON documents; the schema expands as needed
```
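For instance, with the example config above, each document would carry the `job.id`, `last_modified`, and `action.request_time` paths (the values here are invented):

```python
# Invented sample documents matching the example config's paths.
documents = [
    {
        "job": {"id": 1234},
        "last_modified": 1577836800,             # also the "version": larger wins on dedup
        "action": {"request_time": 1577836800},  # drives partitioning
    },
]
```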