BigQuery ETL
This repository contains Mozilla Data Team's:
- Derived ETL jobs that do not require a custom container
- User-defined functions (UDFs)
- Airflow DAGs for scheduled bigquery-etl queries
- Tools for query & UDF deployment, management and scheduling
For more information, see https://mozilla.github.io/bigquery-etl/
Quick Start
Apple Silicon (M1) user suggestion
Enable Rosetta mode for your terminal BEFORE installing the tools below from that terminal; it will save you a lot of headaches. For tips on maintaining parallel stacks of Python and Homebrew running with and without Rosetta, see blog posts from Thinknum and Sixty North.
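To confirm which mode a terminal session is actually running in, a minimal sketch along the following lines may help. It assumes macOS exposes the `sysctl.proc_translated` key on Apple Silicon machines; the `rosetta_status` function name is made up for illustration and is not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of bigquery-etl): report whether the current
# shell runs under Rosetta translation.
rosetta_status() {
  if [ "$(uname -s)" != "Darwin" ]; then
    echo "not macOS"
    return 0
  fi
  # sysctl.proc_translated is 1 under Rosetta and 0 for a native arm64
  # process; the key is absent on Intel Macs, so errors are swallowed.
  case "$(sysctl -n sysctl.proc_translated 2>/dev/null || true)" in
    1) echo "rosetta" ;;
    0) echo "native arm64" ;;
    *) echo "intel or unknown" ;;
  esac
}

rosetta_status
```

If this prints `native arm64` on an M1 machine, the terminal is not in Rosetta mode yet.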
Pre-requisites
- Homebrew (not required, but useful for Mac) - Follow the instructions here to install Homebrew on your Mac.
- Python 3.8+ - (see this guide for instructions if you're on a Mac and haven't installed anything other than the default system Python).
- Java JDK 11+ - (required for some functionality, e.g. AdoptOpenJDK) with `$JAVA_HOME` set.
- Maven - (needed for downloading jar dependencies). Available via your package manager in most Linux distributions and from Homebrew on Mac, or you can install it yourself by downloading a binary and following Maven's install instructions.
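The prerequisites above can be checked before going further with a rough sketch like the one below. The `version_ge` and `check_prereqs` helpers are hypothetical names, not tools shipped with this repository. The script only verifies that `java` and `mvn` are on the `PATH` and that `$JAVA_HOME` is set; it does not attempt to confirm that the JDK is version 11 or newer.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when MAJOR.MINOR version $1 is >= $2.
version_ge() {
  vg_have_major=${1%%.*}
  vg_have_minor=${1#*.}
  vg_have_minor=${vg_have_minor%%.*}
  vg_want_major=${2%%.*}
  vg_want_minor=${2#*.}
  vg_want_minor=${vg_want_minor%%.*}
  if [ "$vg_have_major" -ne "$vg_want_major" ]; then
    [ "$vg_have_major" -gt "$vg_want_major" ]
  else
    [ "$vg_have_minor" -ge "$vg_want_minor" ]
  fi
}

# Hypothetical helper: print one line per prerequisite, return 1 if any fails.
check_prereqs() {
  missing=0
  if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ge "$py_version" 3.8; then
      echo "ok: python3 $py_version"
    else
      echo "too old: python3 $py_version (need 3.8+)"
      missing=1
    fi
  else
    echo "missing: python3"
    missing=1
  fi
  for tool in java mvn; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  if [ -n "${JAVA_HOME:-}" ]; then
    echo "ok: JAVA_HOME=$JAVA_HOME"
  else
    echo "missing: JAVA_HOME is not set"
    missing=1
  fi
  return "$missing"
}

check_prereqs || echo "Some prerequisites need attention."
```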
GCP CLI tools
- For Mozilla Employees or Contributors (not in Data Engineering) - Set up GCP command line tools, as described on docs.telemetry.mozilla.org. Note that some functionality (e.g. writing UDFs or backfilling queries) may not be allowed.
- For Data Engineering - In addition to setting up the command line tools, you will want to log in to `shared-prod` if making changes to production systems. Run `gcloud auth login --update-adc --project=moz-fx-data-shared-prod` (if you have not run it previously).
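If it is unclear whether the login has been run before, one possible check is to ask `gcloud` for an application default credentials token. This is only a sketch: the `adc_status` function is a hypothetical name, and finding credentials does not prove they belong to the right account or project.

```shell
#!/bin/sh
# Hypothetical helper: report whether application default credentials (ADC)
# are available to gcloud on this machine.
adc_status() {
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found: set up the GCP command line tools first"
  elif gcloud auth application-default print-access-token >/dev/null 2>&1; then
    echo "application default credentials found"
  else
    echo "no application default credentials: run the gcloud auth login command"
  fi
}

adc_status
```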
Installing bqetl
- Clone the repository
git clone git@github.com:mozilla/bigquery-etl.git
cd bigquery-etl
- Install the `bqetl` command line tool
./bqetl bootstrap
- Install standard pre-commit hooks
venv/bin/pre-commit install
- Download java dependencies
mvn dependency:copy-dependencies
# specify `<(echo mozilla-bigquery-etl)` to retain bqetl from `./bqetl bootstrap`
venv/bin/pip-sync --pip-args=--no-deps requirements.txt java-requirements.txt <(echo mozilla-bigquery-etl)
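After the steps above, a quick check of the files they are expected to leave behind can catch a skipped step. The sketch below rests on assumptions: `verify_install` is a hypothetical helper, `.git/hooks/pre-commit` is where `pre-commit install` normally writes its hook, and `target/dependency` is Maven's default output directory for `dependency:copy-dependencies` (the repository's `pom.xml` may configure a different one).

```shell
#!/bin/sh
# Hypothetical helper: check that the install steps left their usual
# artifacts inside the repository directory given as $1 (default: .).
verify_install() {
  vi_repo=${1:-.}
  vi_status=0
  # Paths are assumptions; target/dependency is Maven's default output dir.
  for vi_path in venv/bin/pre-commit .git/hooks/pre-commit target/dependency; do
    if [ -e "$vi_repo/$vi_path" ]; then
      echo "ok: $vi_path"
    else
      echo "missing: $vi_path"
      vi_status=1
    fi
  done
  return "$vi_status"
}

verify_install . || echo "Re-run the install step for each missing path."
```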
Finally, if you are using Visual Studio Code, you may also wish to use our recommended defaults:
cp .vscode/settings.json.default .vscode/settings.json
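Note that a plain `cp` overwrites any `.vscode/settings.json` you already have. A more cautious variant, sketched here with a hypothetical `copy_vscode_defaults` helper, copies the defaults only when no settings file exists yet.

```shell
#!/bin/sh
# Hypothetical helper: copy settings.json.default to settings.json in the
# directory given as $1 (default: .vscode) without clobbering existing settings.
copy_vscode_defaults() {
  cv_dir=${1:-.vscode}
  if [ -e "$cv_dir/settings.json" ]; then
    echo "kept existing $cv_dir/settings.json"
  elif [ -e "$cv_dir/settings.json.default" ]; then
    cp "$cv_dir/settings.json.default" "$cv_dir/settings.json"
    echo "created $cv_dir/settings.json"
  else
    echo "no defaults found in $cv_dir"
    return 1
  fi
}

copy_vscode_defaults .vscode || true
```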
You should now be set up to start working in the repo! For many tasks, the easiest way to do this is to use `bqetl`. You may also want to read up on common workflows.