Airflow configuration for Telemetry

Telemetry-Airflow

CircleCI Python 3.10 License: MPL 2.0 Code style: black

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow".

Some links relevant to users and developers of WTMO:

  • The dags directory in this repository contains some custom DAG definitions
  • Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl
  • The Data SRE team maintains a WTMO Developer Guide (behind SSO)

Writing DAGs

See Airflow's Best Practices guide for help writing DAGs.

⚠ Note: How to import DAGs and modules ⚠

Modules should be imported from the project directory, such as from dags.my_dag import load_data rather than from my_dag import load_data.

In Airflow, the dags, config, and plugins folders are automatically added to the PYTHONPATH to ensure they can be imported and accessed by Airflow's execution environment.

However, this default configuration can cause problems when running unit tests located in the tests directory. Since the PYTHONPATH includes the dags directory, but not the project directory itself, the unit tests will not be able to import code from the dags directory. This limitation restricts the ability to test the DAGs effectively within the project structure. It is also generally expected that imports should work from the project directory rather than from any of its subdirectories. For this reason, telemetry-airflow's Dockerfile adds the project directory to PYTHONPATH.
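The import convention above can be illustrated with a small, self-contained sketch. It builds a throwaway project in a temporary directory (the `dags/my_dag.py` module and its `load_data` function are hypothetical stand-ins taken from the example import above, and the empty `__init__.py` is only there to keep the demo unambiguous), then imports the module with only the project root on the path, which is roughly what adding the project directory to PYTHONPATH achieves.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_project() -> Path:
    """Create a throwaway project with a hypothetical dags/my_dag.py module."""
    root = Path(tempfile.mkdtemp())
    (root / "dags").mkdir()
    # Demo-only package marker; the real repository layout may differ.
    (root / "dags" / "__init__.py").write_text("")
    (root / "dags" / "my_dag.py").write_text(
        "def load_data():\n    return 'loaded'\n"
    )
    return root


def import_load_data(project_root: Path):
    """Import dags.my_dag with the project root (not dags/) on sys.path."""
    sys.path.insert(0, str(project_root))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("dags.my_dag")
        return module.load_data
    finally:
        # Restore sys.path so the demo has no lasting side effects.
        sys.path.remove(str(project_root))


project_root = build_demo_project()
load_data = import_load_data(project_root)
result = load_data()
```

Because the import is rooted at the project directory, the same statement should resolve both inside the container and from a test file under tests, provided the project directory is on the path in both places.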

Prerequisites

This app is built and deployed with Docker and docker-compose. Dependencies are managed with pip-compile from pip-tools.

You'll also need to install PostgreSQL to build the database container.

Installing dependencies locally

⚠ Make sure you use the right Python version. Refer to the Dockerfile for the currently supported Python version. ⚠

You can install the project dependencies locally to run tests with Pytest. We use the official Airflow constraints file to simplify Airflow dependency management. Install dependencies locally using the following command:

make pip-install-local

Updating Python dependencies

Add new Python dependencies to requirements.in, then run make pip-install-local.

Build Container

Build the Airflow image with:

make build

Local Deployment

To deploy the Airflow container on the docker engine, with its required dependencies, run:

make build
make up

macOS

Assuming you're using Docker Desktop for macOS, start the Docker service, click the Docker icon in the menu bar, open Preferences, and set the available memory to 4 GB.

Testing

Adding dummy credentials

Tasks often require credentials to access external services. For example, one may choose to store API keys in an Airflow connection or variable. These variables are sure to exist in production but are often not mirrored locally for logistical reasons. Providing a dummy variable is the preferred way to keep the local development environment up to date.

Update the resources/dev_variables.env and resources/dev_connections.env with appropriate strings to prevent broken workflows.
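As a rough aid for keeping those files current, the sketch below checks a list of required keys against env-file content. It assumes the files hold plain KEY=VALUE lines and that variables follow Airflow's AIRFLOW_VAR_ environment naming convention; the format of the actual files should be checked first, and every key name shown here is hypothetical.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    entries = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            entries[key.strip()] = value.strip()
    return entries


def missing_keys(entries, required):
    """Return required keys that are absent or have an empty value."""
    return sorted(key for key in required if not entries.get(key))


# Hypothetical content standing in for resources/dev_variables.env.
sample = [
    "# dummy values for local development",
    "AIRFLOW_VAR_EXAMPLE_API_KEY=dummy",
    "AIRFLOW_VAR_EXAMPLE_BUCKET=",
]

entries = parse_env_lines(sample)
missing = missing_keys(
    entries,
    [
        "AIRFLOW_VAR_EXAMPLE_API_KEY",
        "AIRFLOW_VAR_EXAMPLE_BUCKET",
        "AIRFLOW_VAR_EXAMPLE_PROJECT",
    ],
)
```

To check a real file, the same functions could be fed `open(path).read().splitlines()` in place of the sample list.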

Usage

You can now connect to your local Airflow web console at http://localhost:8080/.

All DAGs are paused by default for local instances and our staging instance of Airflow. In order to submit a DAG via the UI, you'll need to toggle the DAG from "Off" to "On". You'll likely want to toggle the DAG back to "Off" as soon as your desired task starts running.

Testing GKE Jobs (including BigQuery-etl changes)

See https://go.corp.mozilla.com/wtmodev for more details.

make build && make up
make gke

When done:
make clean-gke

From there, connect to Airflow and enable your job.

Testing Dataproc Jobs

Dataproc jobs run on a self-contained Dataproc cluster, created by Airflow.

To test these jobs, you'll need a sandbox account and corresponding service account. For information on creating that, see "Testing GKE Jobs". Your service account will need Dataproc and GCS permissions (and BigQuery, if you're connecting to it). Note: Dataproc requires "Dataproc/Dataproc Worker" as well as Compute Admin permissions. You'll need to ensure that the Dataproc API is enabled in your sandbox project.

Ensure that your Dataproc job has a configurable project to write to. In the DAG, set the project based on the deployment environment; see the ltv.py job for an example.
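One way to make the project configurable is to pick it from the environment at DAG-definition time. The sketch below is only an illustration of that idea, not a copy of what ltv.py does: the environment variable names (`EXAMPLE_DEPLOY_ENV`, `EXAMPLE_SANDBOX_PROJECT`) and the project IDs are hypothetical placeholders.

```python
import os


def resolve_project(env=None, prod_project="example-prod-project"):
    """Pick the GCP project to write to based on the deployment environment.

    All variable names and project IDs here are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    if env.get("EXAMPLE_DEPLOY_ENV", "dev") == "prod":
        return prod_project
    # In development, require an explicit sandbox project so a local run
    # can never fall through to the production project by accident.
    sandbox = env.get("EXAMPLE_SANDBOX_PROJECT")
    if not sandbox:
        raise ValueError("Set EXAMPLE_SANDBOX_PROJECT for local runs")
    return sandbox


prod = resolve_project({"EXAMPLE_DEPLOY_ENV": "prod"})
dev = resolve_project({"EXAMPLE_SANDBOX_PROJECT": "my-sandbox"})
```

Defaulting to the development branch when the variable is unset is a deliberate choice in this sketch: a misconfigured local environment then fails loudly instead of writing to production.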

From there, run the following:

make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS google_cloud_airflow_dataproc

You can then connect to Airflow locally. Enable your DAG and verify that it runs correctly.

Production Setup

This repository is structured to be deployed using the official Airflow Helm Chart. See the Production Guide for best practices.

Debugging

Some useful docker tricks for development and debugging:

make clean

# Remove any leftover docker volumes:
docker volume rm $(docker volume ls -qf dangling=true)

# Purge docker volumes (helps with postgres container failing to start)
# Careful as this will purge all local volumes not used by at least one container.
docker volume prune