# push-to-bigquery

Push JSON documents to BigQuery
## Overview

Manages the table schema while pushing arbitrary JSON documents to Google's BigQuery.
## Benefits

- Allows numerous independent processes to insert data into a "table" while avoiding the per-table BigQuery insert limits
- Expands the schema to fit the JSON documents provided; this includes handling changes in datatype, deeply nested JSON arrays, and multidimensional arrays
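As an illustration, the two documents below could both be inserted into the same table: the schema would expand to accept the nested structures, and the change of `id` from integer to string would be absorbed rather than rejected (the field names and values are invented for this sketch):

```python
# Hypothetical documents; field names are invented for illustration.
doc_v1 = {"id": 1, "tags": ["a", "b"]}

doc_v2 = {
    "id": "1b",                        # datatype changed from int to str
    "tags": [["a", "b"], ["c"]],       # multidimensional array
    "meta": {"runs": [{"step": 1}]},   # deeply nested JSON array
}
```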
## Details

### Installation

Clone from GitHub and install the dependencies listed in `requirements.txt`.
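For example (the repository URL below is an assumption; substitute the actual GitHub location of this project):

```bash
# Hypothetical URL; replace with the real repository location
git clone https://github.com/klahnakoski/push-to-bigquery.git
cd push-to-bigquery
pip install -r requirements.txt
```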
### Configuration
- `account_info` - The BigQuery Service Account Info
- `dataset` - The BigQuery dataset to place tables in
- `table` - The name of the BigQuery table to fill
- `partition` - BigQuery can partition a table based on a `TIMESTAMP` field
  - `field` - The `TIMESTAMP` field used to determine the partitions
  - `expire` - Age of a partition when it is removed, as determined by the `field` value
- `cluster` - Array of field names used to sort the table
- `id`
  - `field` - The field used to determine document uniqueness
  - `version` - Version number used to decide which document takes precedence when removing duplicates: the largest is chosen. Unix timestamps work well.
- `top_level_fields` - Map from full path name to top-level field name. BigQuery demands you include the partition and cluster fields if they are not already top-level.
- `sharded` - If `true`, then multiple tables (aka "shards") are allowed, but they must be merged before becoming part of the primary table
- `read_only` - Set to `false` if you are planning to add records
- `schema` - Fields that must exist
#### Example config file
```json
{
    "account_info": {
        "$ref": "file:///e:/moz-fx-dev-ekyle-treeherder-a838a7718652.json"
    },
    "dataset": "treeherder",
    "table": "jobs",
    "top_level_fields": {
        "job.id": "_job_id",
        "last_modified": "_last_modified",
        "action.request_time": "_request_time"
    },
    "partition": {
        "field": "action.request_time",
        "expire": "2year"
    },
    "id": {
        "field": "job.id",
        "version": "last_modified"
    },
    "cluster": [
        "job.id",
        "last_modified"
    ],
    "sharded": true
}
```
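One way to load this file is shown below (a minimal sketch, assuming the file is saved as `config.json`; note that plain `json.load` will not resolve the `$ref` indirection under `account_info`, so the project may supply its own config loader for that):

```python
import json

# Minimal sketch: read the configuration shown above.
# Assumes the file is saved as config.json in the repo root.
with open("config.json") as f:
    config = json.load(f)
```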
## Usage

```python
from push_to_bigquery import bigquery  # import path assumed from the repo layout

container = bigquery.Dataset(config)           # attach to the configured dataset
index = container.get_or_create_table(config)  # find or create the table, with schema
index.extend(documents)                        # insert JSON documents; the schema expands as needed
```
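For instance, with the example config above, each document would carry the `job.id`, `last_modified`, and `action.request_time` paths (the values here are invented):

```python
# Invented sample documents matching the example config's paths.
documents = [
    {
        "job": {"id": 1234},
        "last_modified": 1577836800,             # also the "version": larger wins on dedup
        "action": {"request_time": 1577836800},  # drives partitioning
    },
]
```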