DEPRECATED - Push JSON documents to BigQuery

push-to-bigquery

Push JSON documents to BigQuery

Overview

Manages the schema while pushing arbitrary JSON documents to Google's BigQuery

Benefits

  1. Allows numerous independent processes to insert data into a "table" while avoiding the per-table BigQuery insert limits
  2. Expands the schema to fit the JSON documents provided; this includes handling datatype changes, deeply nested JSON arrays, and multidimensional arrays
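For instance, the following (hypothetical) documents differ in shape and even in datatype for the same field, yet could all target the same table:

```python
# Hypothetical documents; each insert may introduce a new shape, so the
# table schema must expand to accommodate every one of them.
documents = [
    {"name": "run-1", "result": "pass"},           # result is a string
    {"name": "run-2", "result": {"code": 1}},      # datatype change: result is now an object
    {"name": "run-3", "steps": [[1, 2], [3, 4]]},  # multidimensional array
]
```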

Details

Installation

Clone from GitHub

Configuration

  • account_info - The BigQuery Service Account Info
  • dataset - The BigQuery dataset to place tables
  • table - The name of the BigQuery table to fill
  • partition - BigQuery can partition a table based on a TIMESTAMP field
    • field - The TIMESTAMP field to determine the partitions
    • expire - Age of partition when it is removed, as determined by the field value
  • cluster - array of field names used to sort the table
  • id - how documents are identified for deduplication
    • field - field used to determine document uniqueness
    • version - field whose value decides which document takes precedence when removing duplicates: the largest wins. A Unix timestamp works well.
  • top_level_fields - map from full path name to top-level field name; BigQuery requires the partition and cluster fields to be top-level if they are not already
  • sharded - if true, then multiple tables (aka "shards") are allowed, but they must be merged before becoming part of the primary table
  • read_only - set to false if you plan to add records
  • schema - fields that must exist
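The id settings drive deduplication: among documents sharing the same identity field, the one with the largest version value takes precedence. A minimal sketch of that rule, using job.id as the identity field and last_modified as the version (illustrative only, not the library's code):

```python
# Two versions of the same logical document (same job.id);
# the one with the larger last_modified should win.
docs = [
    {"job": {"id": 1}, "last_modified": 1578400000, "state": "running"},
    {"job": {"id": 1}, "last_modified": 1578500000, "state": "completed"},
]

best = {}
for d in docs:
    key = d["job"]["id"]
    # keep the document with the largest version value for each key
    if key not in best or d["last_modified"] > best[key]["last_modified"]:
        best[key] = d
```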

Example config file

{
    "account_info": {
        "$ref": "file:///e:/moz-fx-dev-ekyle-treeherder-a838a7718652.json"
    },
    "dataset": "treeherder",
    "table": "jobs",
    "top_level_fields": {
        "job.id": "_job_id",
        "last_modified": "_last_modified",
        "action.request_time": "_request_time"
    },
    "partition": {
        "field": "action.request_time",
        "expire": "2year"
    },
    "id":{
        "field": "job.id",
        "version": "last_modified"
    },
    "cluster": [
        "job.id",
        "last_modified"
    ],
    "sharded": true
}
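The top_level_fields mapping in the config above copies nested values into top-level columns so BigQuery can partition and cluster on them. A rough sketch of the idea (lift is a hypothetical helper, not part of this library):

```python
# Sketch only: copy each mapped nested value to a top-level column,
# leaving the original nested structure in place.
def lift(doc, mapping):
    out = dict(doc)
    for path, name in mapping.items():
        value = doc
        for step in path.split("."):  # walk the dotted path
            if not isinstance(value, dict):
                value = None
                break
            value = value.get(step)
        if value is not None:
            out[name] = value
    return out

top_level_fields = {"job.id": "_job_id", "last_modified": "_last_modified"}
row = lift({"job": {"id": 42}, "last_modified": 1578500000}, top_level_fields)
# row gains "_job_id" and "_last_modified" as top-level columns
```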

Usage

    container = bigquery.Dataset(config)           # attach to the configured dataset
    index = container.get_or_create_table(config)  # ensure the (possibly sharded) table exists
    index.extend(documents)                        # insert documents, expanding the schema as needed