Collection of dockerized ETL jobs managed by data engineering.


Docker ETL

This repo is a collection of dockerized ETL jobs, kept in one place to make the source code of scheduled ETL easier to discover. It also contains tools that automate the common steps of creating and scheduling an ETL job: defining a Docker image, setting up CI, and generating language boilerplate. The primary use of this repo is to create Dockerized jobs that are pushed to GCR so they can be scheduled via the Airflow GKE pod operator.

Project Structure

Jobs

Each job is located in its own directory under jobs/, e.g. the contents of a job named my-job would go into jobs/my-job.

All job directories should have a Dockerfile, a ci_job.yaml, a ci_workflow.yaml, and a README.md in the root directory. ci_job.yaml and ci_workflow.yaml contain the yaml structure that will be placed in the - jobs: and - workflows: sections of the CircleCI config.yml respectively.
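The required-file rule above can be checked mechanically. The following is a minimal sketch, not part of this repo's tooling; the function name missing_job_files is hypothetical, and the list of required files is taken directly from the paragraph above.

```python
from pathlib import Path

# Files the README says every job directory should contain.
REQUIRED_JOB_FILES = ("Dockerfile", "ci_job.yaml", "ci_workflow.yaml", "README.md")


def missing_job_files(job_dir):
    """Return the required file names absent from job_dir, in sorted order.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are valid.
    """
    job_dir = Path(job_dir)
    return sorted(
        name for name in REQUIRED_JOB_FILES if not (job_dir / name).is_file()
    )
```

Calling missing_job_files("jobs/my-job") returns an empty list when the directory is complete, which makes it easy to loop over every directory in jobs/ and report the incomplete ones.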

Templates

Templates for job creation and the CI config file are located in templates/.

The CI config template is in .circleci/config.template.yml. Modify this file rather than .circleci/config.yml, which is generated from it.

Each job template is located in a directory in templates/ that is the name of the template, e.g. a python template is in templates/python/. Within the directory of a template is a directory named job/ that contains all the contents that will be copied when the template is used. Other files in the directory of a particular template are used for job creation, e.g. ci_job.template.yaml.

Example Directory Structure:

+--docker-etl/
|  +--jobs/
|     +--example-python-1/
|        +--ci_job.yaml
|        +--ci_workflow.yaml
|        +--Dockerfile
|        +--README.md
|        +--script
|  +--templates/
|     +--python/
|        +--job/
|           +--module/
|           +--tests/
|           +--Dockerfile
|           +--README.md
|           +--requirements.txt
|        +--ci_job.template.yaml
|        +--ci_workflow.template.yaml

Development

The tools in this repository are intended for Python 3.8+.

To install dependencies:

pip install -r requirements.txt

This project uses pip-tools to pin dependencies. New dependencies go in requirements.in and pip-compile is used to generate requirements.txt:

pip install pip-tools
pip-compile --generate-hashes requirements.in

To run tests:

pytest --flake8 --black tests/

Adding a new job

To add a new job:

./script/create_job --job-name example-job --template python

job-name is the name of the directory that will be created in jobs/.

template is an optional argument that will populate the created directory with the contents of a template. If no template is given, a directory with only the required files is created.
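To make the two arguments concrete, here is a rough sketch of the copy step described above. It is not the implementation of script/create_job: the function name create_job_from_template is hypothetical, using "default" when no template is given is an assumption, and the creation of ci_job.yaml and ci_workflow.yaml from the ci_*.template.yaml files is left out entirely.

```python
import shutil
from pathlib import Path


def create_job_from_template(repo_root, job_name, template="default"):
    """Copy templates/<template>/job/ into jobs/<job_name>/ and return the new path.

    Illustrative only: the real script/create_job may behave differently.
    Falling back to a template named "default" is an assumption.
    """
    repo_root = Path(repo_root)
    source = repo_root / "templates" / template / "job"
    target = repo_root / "jobs" / job_name
    if not source.is_dir():
        raise FileNotFoundError(f"template has no job/ directory: {source}")
    if target.exists():
        raise FileExistsError(f"job already exists: {target}")
    # copytree creates the target (and any missing parents) and copies recursively.
    shutil.copytree(source, target)
    return target
```

Refusing to overwrite an existing jobs/<job_name> directory is a design choice of this sketch, made so that re-running the command cannot clobber a job that has already been edited.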

Available Templates:

Template name | Description
default       | Base directory with readme, Dockerfile, and CI config files
python        | Simple Python module with unit test and lint config

Modifying the CI config

This repo uses CircleCI, which only allows a single global config file. To simplify adding jobs to CI and removing them from it, the config file is generated using templates. This means the config.yml in .circleci/ should not be modified directly.

Generate .circleci/config.yml:

./script/update_ci_config

To change parts of the config that are not specific to an ETL job (e.g. to add a command), edit templates/config.template.yml and re-generate the output config.

Each job has a ci_job.yaml and a ci_workflow.yaml which define the steps that will go into the jobs and workflow sections of the CircleCI config. Any changes to these files should be followed by updating the global config via script/update_ci_config. When a job is created, the CI files are created based on the ci_*.template.yaml files in the template directory.
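The generation step can be pictured as collecting each job's fragments and splicing them into the global template. The sketch below illustrates that idea only and is not how script/update_ci_config is actually written: the function names and the __JOBS__ and __WORKFLOWS__ placeholders are hypothetical, and the real script may use a template engine instead of plain string replacement.

```python
import textwrap
from pathlib import Path


def collect_fragments(jobs_root, filename, indent="  "):
    """Concatenate <job>/<filename> from every job directory, indented for nesting."""
    parts = []
    # Sort job directories so the generated config is stable between runs.
    for job_dir in sorted(p for p in Path(jobs_root).iterdir() if p.is_dir()):
        fragment = job_dir / filename
        if fragment.is_file():
            text = fragment.read_text().rstrip("\n")
            parts.append(textwrap.indent(text, indent))
    return "\n".join(parts)


def render_ci_config(template_text, jobs_root):
    """Fill the hypothetical __JOBS__ / __WORKFLOWS__ placeholders in template_text."""
    return (
        template_text
        .replace("__JOBS__", collect_fragments(jobs_root, "ci_job.yaml"))
        .replace("__WORKFLOWS__", collect_fragments(jobs_root, "ci_workflow.yaml"))
    )
```

Because the output is rebuilt from the per-job files every time, removing a job directory and re-running the generator is enough to drop that job from CI in this model.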

Adding a template

To add a new template, create a new directory in templates/ with the name of the template. This directory must have a ci_job.template.yaml, a ci_workflow.template.yaml, and a job/ directory which contains all the files that will be copied to any job that uses this template.
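Before using a new template, it can help to confirm that it has the layout described above. This is a small sketch under the assumption that only the presence of the three entries matters; template_problems is a hypothetical helper, not something provided by this repo.

```python
from pathlib import Path

# Files the README says every template directory must contain.
REQUIRED_TEMPLATE_FILES = ("ci_job.template.yaml", "ci_workflow.template.yaml")


def template_problems(template_dir):
    """Return a list of human-readable problems; an empty list means the layout is complete."""
    template_dir = Path(template_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_TEMPLATE_FILES
        if not (template_dir / name).is_file()
    ]
    # job/ holds everything that is copied into a job created from this template.
    if not (template_dir / "job").is_dir():
        problems.append("missing directory: job/")
    return problems
```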