.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

.. _concepts:
|
|
|
|
Concepts
|
|
########
|
|
|
|
The Airflow platform is a tool for describing, executing, and monitoring
|
|
workflows.
|
|
|
|
Core Ideas
|
|
''''''''''
|
|
|
|
DAGs
|
|
====
|
|
|
|
In Airflow, a ``DAG`` -- or a Directed Acyclic Graph -- is a collection of all
|
|
the tasks you want to run, organized in a way that reflects their relationships
|
|
and dependencies.
|
|
|
|
A DAG is defined in a Python script, which represents the DAG's structure (tasks
and their dependencies) as code.
|
|
|
|
For example, a simple DAG could consist of three tasks: A, B, and C. It could
|
|
say that A has to run successfully before B can run, but C can run anytime. It
|
|
could say that task A times out after 5 minutes, and B can be restarted up to 5
|
|
times in case it fails. It might also say that the workflow will run every night
|
|
at 10pm, but shouldn't start until a certain date.
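
A minimal sketch of what such a DAG could look like in code (the ids, the nightly cron
expression and the retry/timeout values below are illustrative; ``DummyOperator`` stands in
for real work, since the DAG itself only encodes orchestration):

.. code-block:: python

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    with DAG('example_abc',
             schedule_interval='0 22 * * *',           # run every night at 10pm...
             start_date=datetime(2020, 1, 1)) as dag:  # ...but not before this date
        a = DummyOperator(task_id='a', execution_timeout=timedelta(minutes=5))
        b = DummyOperator(task_id='b', retries=5)
        c = DummyOperator(task_id='c')   # no dependency, can run anytime

        a >> b   # A has to run successfully before B
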
|
|
|
|
In this way, a DAG describes *how* you want to carry out your workflow; but
|
|
notice that we haven't said anything about *what* we actually want to do! A, B,
|
|
and C could be anything. Maybe A prepares data for B to analyze while C sends an
|
|
email. Or perhaps A monitors your location so B can open your garage door while
|
|
C turns on your house lights. The important thing is that the DAG isn't
|
|
concerned with what its constituent tasks do; its job is to make sure that
|
|
whatever they do happens at the right time, or in the right order, or with the
|
|
right handling of any unexpected issues.
|
|
|
|
DAGs are defined in standard Python files that are placed in Airflow's
|
|
``DAG_FOLDER``. Airflow will execute the code in each file to dynamically build
|
|
the ``DAG`` objects. You can have as many DAGs as you want, each describing an
|
|
arbitrary number of tasks. In general, each one should correspond to a single
|
|
logical workflow.
|
|
|
|
.. note:: When searching for DAGs, Airflow only considers python files
    that contain the strings "airflow" and "DAG" by default. To consider
    all python files instead, disable the ``DAG_DISCOVERY_SAFE_MODE``
    configuration flag.
|
|
|
|
Scope
|
|
-----
|
|
|
|
Airflow will load any ``DAG`` object it can import from a DAG file. Critically,
|
|
that means the DAG must appear in ``globals()``. Consider the following two
|
|
DAGs. Only ``dag_1`` will be loaded; the other one only appears in a local
|
|
scope.
|
|
|
|
.. code-block:: python

    dag_1 = DAG('this_dag_will_be_discovered')

    def my_function():
        dag_2 = DAG('but_this_dag_will_not')

    my_function()
|
|
|
|
Sometimes this can be put to good use. For example, a common pattern with
|
|
``SubDagOperator`` is to define the subdag inside a function so that Airflow
|
|
doesn't try to load it as a standalone DAG.
|
|
|
|
.. _default-args:
|
|
|
|
Default Arguments
|
|
-----------------
|
|
|
|
If a dictionary of ``default_args`` is passed to a DAG, it will apply them to
|
|
any of its operators. This makes it easy to apply a common parameter to many operators without having to type it many times.
|
|
|
|
.. code-block:: python

    default_args = {
        'start_date': datetime(2016, 1, 1),
        'owner': 'airflow'
    }

    dag = DAG('my_dag', default_args=default_args)
    op = DummyOperator(task_id='dummy', dag=dag)
    print(op.owner)  # airflow
|
|
|
|
Context Manager
|
|
---------------
|
|
|
|
*Added in Airflow 1.8*
|
|
|
|
DAGs can be used as context managers to automatically assign new operators to that DAG.
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
        op = DummyOperator(task_id='op')

    op.dag is dag  # True
|
|
|
|
.. _concepts:functional_dags:
|
|
|
|
Functional DAGs
|
|
---------------
|
|
|
|
DAGs can be defined using functional abstractions. Outputs and inputs are sent between tasks using
|
|
:ref:`XCom values <concepts:xcom>`. In addition, you can wrap functions as tasks using the
|
|
:ref:`task decorator <concepts:task_decorator>`. Airflow will also automatically add dependencies between
|
|
tasks to ensure that XCom messages are available when operators are executed.
|
|
|
|
Example DAG with functional abstraction:
|
|
|
|
.. code-block:: python

    with DAG(
        'send_server_ip', default_args=default_args, schedule_interval=None
    ) as dag:

        # Using default connection as it's set to httpbin.org by default
        get_ip = SimpleHttpOperator(
            task_id='get_ip', endpoint='get', method='GET', xcom_push=True
        )

        @dag.task(multiple_outputs=True)
        def prepare_email(raw_json: str) -> Dict[str, str]:
            external_ip = json.loads(raw_json)['origin']
            return {
                'subject': f'Server connected from {external_ip}',
                'body': f'Seems like today your server executing Airflow is connected from the external IP {external_ip}<br>'
            }

        email_info = prepare_email(get_ip.output)

        send_email = EmailOperator(
            task_id='send_email',
            to='example@example.com',
            subject=email_info['subject'],
            html_content=email_info['body']
        )
|
|
|
|
.. _concepts:dagruns:
|
|
|
|
DAG Runs
|
|
========
|
|
|
|
A DAG run is an instantiation of a DAG, containing task instances that run for a specific ``execution_date``.
|
|
|
|
A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger.
|
|
Multiple DAG runs may be running at once for a particular DAG, each of them having a different ``execution_date``.
|
|
For example, we might currently have two DAG runs that are in progress for 2016-01-01 and 2016-01-02 respectively.
|
|
|
|
.. _concepts:execution_date:
|
|
|
|
execution_date
|
|
--------------
|
|
|
|
The ``execution_date`` is the *logical* date and time which the DAG Run, and its task instances, are running for.
|
|
|
|
This allows task instances to process data for the desired *logical* date & time.
|
|
While a task_instance or DAG run might have an *actual* start date of now,
their *logical* date might be 3 months ago because we are busy reloading something.
|
|
|
|
In the prior example the ``execution_date`` was 2016-01-01 for the first DAG Run and 2016-01-02 for the second.
|
|
|
|
A DAG run and all task instances created within it are instanced with the same ``execution_date``, so
|
|
that logically you can think of a DAG run as simulating the DAG running all of its tasks at some
|
|
previous date & time specified by the ``execution_date``.
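
As a small sketch of what this means in practice, the callable below processes data for the
*logical* date it receives from the task context rather than "today" (this assumes Airflow 2.0
style automatic context injection for ``PythonOperator`` -- older versions need
``provide_context=True`` -- and a ``dag`` object defined as in the snippets above):

.. code-block:: python

    from airflow.operators.python import PythonOperator

    def export_partition(ds, **kwargs):
        # ``ds`` is the execution_date rendered as YYYY-MM-DD.
        # During a backfill the wall-clock date is "now", but ``ds`` is the logical date.
        print(f"exporting the partition for logical date {ds}")

    export = PythonOperator(
        task_id='export_partition',
        python_callable=export_partition,
        dag=dag,
    )
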
|
|
|
|
.. _concepts:tasks:
|
|
|
|
Tasks
|
|
=====
|
|
|
|
A Task defines a unit of work within a DAG; it is represented as a node in the DAG graph, and it is written in Python.
|
|
|
|
Each task is an implementation of an Operator, for example a ``PythonOperator`` to execute some Python code,
|
|
or a ``BashOperator`` to run a Bash command.
|
|
|
|
The task implements an operator by defining specific values for that operator,
|
|
such as a Python callable in the case of ``PythonOperator`` or a Bash command in the case of ``BashOperator``.
|
|
|
|
Relations between Tasks
|
|
-----------------------
|
|
|
|
Consider the following DAG with two tasks.
|
|
Each task is a node in our DAG, and there is a dependency from task_1 to task_2:
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
        task_1 = DummyOperator(task_id='task_1')
        task_2 = DummyOperator(task_id='task_2')
        task_1 >> task_2  # Define dependencies
|
|
|
|
We can say that task_1 is *upstream* of task_2, and conversely task_2 is *downstream* of task_1.
|
|
When a DAG Run is created, task_1 will start running and task_2 waits for task_1 to complete successfully before it may start.
|
|
|
|
.. _concepts:task_decorator:
|
|
|
|
Python task decorator
|
|
---------------------
|
|
|
|
The Airflow ``task`` decorator converts any Python function to an Airflow operator.
The decorated function can be called once to set the arguments and keyword arguments for operator execution.
|
|
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2020, 5, 15)) as dag:

        @dag.task
        def hello_world():
            print('hello world!')


        # Also...
        from airflow.decorators import task


        @task
        def hello_name(name: str):
            print(f'hello {name}!')


        hello_name('Airflow users')
|
|
|
|
The task decorator captures returned values and sends them to the :ref:`XCom backend <concepts:xcom>`. By default, the returned
value is saved as a single XCom value. You can set the ``multiple_outputs`` keyword argument to ``True`` to unroll dictionaries,
lists or tuples into separate XCom values. This can be used with regular operators to create
|
|
:ref:`functional DAGs <concepts:functional_dags>`.
|
|
|
|
Calling a decorated function returns an ``XComArg`` instance. You can use it to set templated fields on downstream
|
|
operators.
|
|
|
|
You can call a decorated function more than once in a DAG. The decorated function will automatically generate
|
|
a unique ``task_id`` for each generated operator.
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2020, 5, 15)) as dag:

        @dag.task
        def update_user(user_id: int):
            ...

        # Avoid generating this list dynamically to keep DAG topology stable between DAG runs
        for user_id in user_ids:
            update_user(user_id)

        # This will generate an operator for each user_id
|
|
|
|
Task ids are generated by appending a number at the end of the original task id. For the above example, the DAG will have
|
|
the following task ids: ``[update_user, update_user__1, update_user__2, ... update_user__n]``.
|
|
|
|
Task Instances
|
|
==============
|
|
|
|
A task instance represents a specific run of a task and is characterized as the
|
|
combination of a DAG, a task, and a point in time (``execution_date``). Task instances
|
|
also have an indicative state, which could be "running", "success", "failed", "skipped", "up
|
|
for retry", etc.
|
|
|
|
Tasks are defined in DAGs, and both are written in Python code to define what you want to do.
|
|
Task Instances belong to DAG Runs, have an associated ``execution_date``, and are instantiated, runnable entities.
|
|
|
|
Relations between Task Instances
|
|
--------------------------------
|
|
|
|
Again consider the following tasks, defined for some DAG:
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
        task_1 = DummyOperator(task_id='task_1')
        task_2 = DummyOperator(task_id='task_2')
        task_1 >> task_2  # Define dependencies
|
|
|
|
When we enable this DAG, the scheduler creates several DAG Runs - one with ``execution_date`` of 2016-01-01,
|
|
one with ``execution_date`` of 2016-01-02, and so on up to the current date.
|
|
|
|
Each DAG Run will contain a task_1 Task Instance and a task_2 Task instance. Both Task Instances will
|
|
have ``execution_date`` equal to the DAG Run's ``execution_date``, and each task_2 will be *downstream* of
|
|
(depends on) its task_1.
|
|
|
|
We can also say that task_1 for 2016-01-01 is the *previous* task instance of the task_1 for 2016-01-02.
|
|
Or that the DAG Run for 2016-01-01 is the *previous* DAG Run to the DAG Run of 2016-01-02.
|
|
Here, *previous* refers to the logical past/prior ``execution_date``, that runs independently of other runs,
|
|
and *upstream* refers to a dependency within the same run and having the same ``execution_date``.
|
|
|
|
.. note::
|
|
The Airflow documentation sometimes refers to *previous* instead of *upstream* in places, and vice-versa.
|
|
If you find any occurrences of this, please help us improve by contributing some corrections!
|
|
|
|
Task Lifecycle
|
|
==============
|
|
|
|
A task goes through various stages from start to completion. In the Airflow UI
|
|
(graph and tree views), these stages are displayed by a color representing each
|
|
stage:
|
|
|
|
.. image:: img/task_stages.png
|
|
|
|
The complete lifecycle of the task looks like this:
|
|
|
|
.. image:: img/task_lifecycle_diagram.png
|
|
|
|
The happy flow consists of the following stages:
|
|
|
|
1. No status (scheduler created empty task instance)
|
|
2. Scheduled (scheduler determined task instance needs to run)
|
|
3. Queued (scheduler sent task to executor to run on the queue)
|
|
4. Running (worker picked up a task and is now running it)
|
|
5. Success (task completed)
|
|
|
|
There is also a visual difference between scheduled and manually triggered
DAGs/tasks:
|
|
|
|
.. image:: img/task_manual_vs_scheduled.png
|
|
|
|
The DAGs/tasks with a black border are scheduled runs, whereas the non-bordered
|
|
DAGs/tasks are manually triggered, i.e. by ``airflow dags trigger``.
|
|
|
|
.. _concepts:operators:
|
|
|
|
Operators
|
|
=========
|
|
|
|
While DAGs describe *how* to run a workflow, ``Operators`` determine what
|
|
actually gets done by a task.
|
|
|
|
An operator describes a single task in a workflow. Operators are usually (but
|
|
not always) atomic, meaning they can stand on their own and don't need to share
|
|
resources with any other operators. The DAG will make sure that operators run in
|
|
the correct order; other than those dependencies, operators generally
|
|
run independently. In fact, they may run on two completely different machines.
|
|
|
|
This is a subtle but very important point: in general, if two operators need to
|
|
share information, like a filename or small amount of data, you should consider
|
|
combining them into a single operator. If it absolutely can't be avoided,
|
|
Airflow does have a feature for operator cross-communication called XCom that is
described in the section :ref:`XComs <concepts:xcom>`.
|
|
|
|
Airflow provides operators for many common tasks, including:
|
|
|
|
- :class:`~airflow.operators.bash.BashOperator` - executes a bash command
|
|
- :class:`~airflow.operators.python.PythonOperator` - calls an arbitrary Python function
|
|
- :class:`~airflow.providers.email.operators.email.EmailOperator` - sends an email
|
|
- :class:`~airflow.providers.http.operators.http.SimpleHttpOperator` - sends an HTTP request
|
|
- :class:`~airflow.providers.mysql.operators.mysql.MySqlOperator`,
  :class:`~airflow.providers.sqlite.operators.sqlite.SqliteOperator`,
  :class:`~airflow.providers.postgres.operators.postgres.PostgresOperator`,
  :class:`~airflow.providers.microsoft.mssql.operators.mssql.MsSqlOperator`,
  :class:`~airflow.providers.oracle.operators.oracle.OracleOperator`,
  :class:`~airflow.providers.jdbc.operators.jdbc.JdbcOperator`, etc. - executes a SQL command
- ``Sensor`` - an Operator that waits (polls) for a certain time, file, database row, S3 key, etc...
|
|
|
|
In addition to these basic building blocks, there are many more specific
|
|
operators: :class:`~airflow.providers.docker.operators.docker.DockerOperator`,
|
|
:class:`~airflow.providers.apache.hive.operators.hive.HiveOperator`, :class:`~airflow.providers.amazon.aws.operators.s3_file_transform.S3FileTransformOperator`,
|
|
:class:`~airflow.providers.mysql.transfers.presto_to_mysql.PrestoToMySqlOperator`,
|
|
:class:`~airflow.providers.slack.operators.slack.SlackAPIOperator`... you get the idea!
|
|
|
|
Operators are only loaded by Airflow if they are assigned to a DAG.
|
|
|
|
.. seealso::
|
|
- :ref:`List Airflow operators <pythonapi:operators>`
|
|
- :doc:`How-to guides for some Airflow operators<howto/operator/index>`.
|
|
|
|
DAG Assignment
|
|
--------------
|
|
|
|
*Added in Airflow 1.8*
|
|
|
|
Operators do not have to be assigned to DAGs immediately (previously ``dag`` was
|
|
a required argument). However, once an operator is assigned to a DAG, it can not
|
|
be transferred or unassigned. DAG assignment can be done explicitly when the
|
|
operator is created, through deferred assignment, or even inferred from other
|
|
operators.
|
|
|
|
.. code-block:: python
|
|
|
|
dag = DAG('my_dag', start_date=datetime(2016, 1, 1))
|
|
|
|
# sets the DAG explicitly
|
|
explicit_op = DummyOperator(task_id='op1', dag=dag)
|
|
|
|
# deferred DAG assignment
|
|
deferred_op = DummyOperator(task_id='op2')
|
|
deferred_op.dag = dag
|
|
|
|
# inferred DAG assignment (linked operators must be in the same DAG)
|
|
inferred_op = DummyOperator(task_id='op3')
|
|
inferred_op.set_upstream(deferred_op)
|
|
|
|
|
|
Bitshift Composition
|
|
--------------------
|
|
|
|
*Added in Airflow 1.8*
|
|
|
|
We recommend setting operator relationships with the bitshift operators rather than ``set_upstream()``
and ``set_downstream()``.
|
|
|
|
Traditionally, operator relationships are set with the ``set_upstream()`` and
|
|
``set_downstream()`` methods. In Airflow 1.8, this can be done with the Python
|
|
bitshift operators ``>>`` and ``<<``. The following four statements are all
|
|
functionally equivalent:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> op2
|
|
op1.set_downstream(op2)
|
|
|
|
op2 << op1
|
|
op2.set_upstream(op1)
|
|
|
|
When using the bitshift to compose operators, the relationship is set in the
|
|
direction that the bitshift operator points. For example, ``op1 >> op2`` means
|
|
that ``op1`` runs first and ``op2`` runs second. Multiple operators can be
|
|
composed -- keep in mind the chain is executed left-to-right and the rightmost
|
|
object is always returned. For example:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> op2 >> op3 << op4
|
|
|
|
is equivalent to:
|
|
|
|
.. code-block:: python
|
|
|
|
op1.set_downstream(op2)
|
|
op2.set_downstream(op3)
|
|
op3.set_upstream(op4)
|
|
|
|
We can put this all together to build a simple pipeline:
|
|
|
|
.. code-block:: python

    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
        (
            DummyOperator(task_id='dummy_1')
            >> BashOperator(
                task_id='bash_1',
                bash_command='echo "HELLO!"')
            >> PythonOperator(
                task_id='python_1',
                python_callable=lambda: print("GOODBYE!"))
        )
|
|
|
|
Bitshift can also be used with lists. For example:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> [op2, op3] >> op4
|
|
|
|
is equivalent to:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> op2 >> op4
|
|
op1 >> op3 >> op4
|
|
|
|
and equivalent to:
|
|
|
|
.. code-block:: python
|
|
|
|
op1.set_downstream([op2, op3])
|
|
op4.set_upstream([op2, op3])
|
|
|
|
|
|
Relationship Builders
|
|
---------------------
|
|
|
|
*Moved in Airflow 2.0*
|
|
|
|
In Airflow 2.0 those two methods moved from ``airflow.utils.helpers`` to ``airflow.models.baseoperator``.
|
|
|
|
The ``chain`` and ``cross_downstream`` functions provide easier ways to set relationships
between operators in specific situations.
|
|
|
|
When setting a relationship between two lists,
|
|
if we want all operators in one list to be upstream to all operators in the other,
|
|
we cannot use a single bitshift composition. Instead we have to split one of the lists:
|
|
|
|
.. code-block:: python
|
|
|
|
[op1, op2, op3] >> op4
|
|
[op1, op2, op3] >> op5
|
|
[op1, op2, op3] >> op6
|
|
|
|
``cross_downstream`` handles this kind of list relationship more easily:
|
|
|
|
.. code-block:: python
|
|
|
|
cross_downstream([op1, op2, op3], [op4, op5, op6])
|
|
|
|
When setting single-direction relationships across many operators, we can
concatenate them with bitshift composition:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> op2 >> op3 >> op4 >> op5
|
|
|
|
This can be accomplished using ``chain``:
|
|
|
|
.. code-block:: python
|
|
|
|
chain(op1, op2, op3, op4, op5)
|
|
|
|
even without naming each operator explicitly:
|
|
|
|
.. code-block:: python

    chain(*[DummyOperator(task_id='op' + str(i), dag=dag) for i in range(1, 6)])
|
|
|
|
``chain`` can handle a list of operators
|
|
|
|
.. code-block:: python
|
|
|
|
chain(op1, [op2, op3], op4)
|
|
|
|
is equivalent to:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> [op2, op3] >> op4
|
|
|
|
When ``chain`` sets relationships between two lists of operators, they must have the same size.
|
|
|
|
.. code-block:: python
|
|
|
|
chain(op1, [op2, op3], [op4, op5], op6)
|
|
|
|
is equivalent to:
|
|
|
|
.. code-block:: python
|
|
|
|
op1 >> [op2, op3]
|
|
op2 >> op4
|
|
op3 >> op5
|
|
[op4, op5] >> op6
|
|
|
|
|
|
Workflows
|
|
=========
|
|
|
|
You're now familiar with the core building blocks of Airflow.
|
|
Some of the concepts may sound very similar, but the vocabulary can
|
|
be conceptualized like this:
|
|
|
|
- DAG: The work (tasks), and the order in which
  work should take place (dependencies), written in Python.
- DAG Run: An instance of a DAG for a particular logical date and time.
- Operator: A class that acts as a template for carrying out some work.
- Task: Defines work by implementing an operator, written in Python.
- Task Instance: An instance of a task - that has been assigned to a DAG and has a
  state associated with a specific DAG run (i.e. for a specific execution_date).
- execution_date: The logical date and time for a DAG Run and its Task Instances.
|
|
|
|
By combining ``DAGs`` and ``Operators`` to create ``TaskInstances``, you can
|
|
build complex workflows.
|
|
|
|
Additional Functionality
|
|
''''''''''''''''''''''''
|
|
|
|
In addition to the core Airflow objects, there are a number of more complex
|
|
features that enable behaviors like limiting simultaneous access to resources,
|
|
cross-communication, conditional execution, and more.
|
|
|
|
Hooks
|
|
=====
|
|
|
|
Hooks are interfaces to external platforms and databases like Hive, S3,
|
|
MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when
|
|
possible, and act as a building block for operators. They also use
|
|
the ``airflow.models.connection.Connection`` model to retrieve hostnames
|
|
and authentication information. Hooks keep authentication code and
|
|
information out of pipelines, centralized in the metadata database.
|
|
|
|
Hooks are also very useful on their own: in standalone Python scripts, in
``PythonOperator`` callables, and in interactive environments
like iPython or Jupyter Notebook.
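
As a brief sketch of using a hook directly (the connection id and SQL below are illustrative
and assume a Postgres connection already defined in the metadata database):

.. code-block:: python

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def count_rows():
        # The hook resolves host, login and password from the 'postgres_default' Connection
        hook = PostgresHook(postgres_conn_id='postgres_default')
        return hook.get_first('SELECT COUNT(*) FROM my_table')[0]
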
|
|
|
|
.. seealso::
|
|
:ref:`List Airflow hooks <pythonapi:hooks>`
|
|
|
|
Pools
|
|
=====
|
|
|
|
Some systems can get overwhelmed when too many processes hit them at the same
|
|
time. Airflow pools can be used to **limit the execution parallelism** on
|
|
arbitrary sets of tasks. The list of pools is managed in the UI
|
|
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
|
|
it a number of worker slots. Tasks can then be associated with
|
|
one of the existing pools by using the ``pool`` parameter when
|
|
creating tasks (i.e., instantiating operators).
|
|
|
|
.. code-block:: python

    aggregate_db_message_job = BashOperator(
        task_id='aggregate_db_message_job',
        execution_timeout=timedelta(hours=3),
        pool='ep_data_pipeline_db_msg_agg',
        bash_command=aggregate_db_message_job_cmd,
        dag=dag)
    aggregate_db_message_job.set_upstream(wait_for_empty_queue)
|
|
|
|
The ``pool`` parameter can
|
|
be used in conjunction with ``priority_weight`` to define priorities
|
|
in the queue, and which tasks get executed first as slots open up in the
|
|
pool. The default ``priority_weight`` is ``1``, and can be bumped to any
|
|
number. When sorting the queue to evaluate which task should be executed
|
|
next, we use the ``priority_weight``, summed up with all of the
|
|
``priority_weight`` values from tasks downstream from this task. You can
|
|
use this to bump a specific important task and the whole path to that task
|
|
gets prioritized accordingly.
|
|
|
|
Tasks will be scheduled as usual while the slots fill up. Once capacity is
|
|
reached, runnable tasks get queued and their state will show as such in the
|
|
UI. As slots free up, queued tasks start running based on the
|
|
``priority_weight`` (of the task and its descendants).
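
A short sketch of combining ``pool`` and ``priority_weight`` on a task (the pool name and the
weight value are illustrative and assume the pool has already been created in the UI):

.. code-block:: python

    important_agg = BashOperator(
        task_id='important_agg',
        bash_command='echo "run me first when a slot frees up"',
        pool='ep_data_pipeline_db_msg_agg',
        priority_weight=10,   # jumps ahead of weight-1 tasks waiting on the same pool
        dag=dag)
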
|
|
|
|
Pools are not thread-safe: with more than one scheduler in LocalExecutor mode,
you cannot guarantee that a task will not be scheduled even if the pool is already full.

Note that if tasks are not given a pool, they are assigned to a default
pool ``default_pool``. ``default_pool`` is initialized with 128 slots and
can be changed through the UI or CLI (though it cannot be removed).
|
|
|
|
To combine Pools with SubDAGs see the `SubDAGs`_ section.
|
|
|
|
.. _concepts-connections:
|
|
|
|
Connections
|
|
===========
|
|
|
|
The information needed to connect to external systems is stored in the Airflow metastore database and can be
|
|
managed in the UI (``Menu -> Admin -> Connections``). A ``conn_id`` is defined there, and hostname / login /
|
|
password / schema information attached to it. Airflow pipelines retrieve centrally-managed connections
|
|
information by specifying the relevant ``conn_id``.
|
|
|
|
You may add more than one connection with the same ``conn_id``. When there is more than one connection
with the same ``conn_id``, the :py:meth:`~airflow.hooks.base_hook.BaseHook.get_connection` method on
:py:class:`~airflow.hooks.base_hook.BaseHook` will choose one connection randomly. This can be used to
provide basic load balancing and fault tolerance, when used in conjunction with retries.
|
|
|
|
Airflow also provides a mechanism to store connections outside the database, e.g. in :ref:`environment variables <environment_variables_secrets_backend>`.
|
|
Additional sources may be enabled, e.g. :ref:`AWS SSM Parameter Store <ssm_parameter_store_secrets>`, or you may
|
|
:ref:`roll your own secrets backend <roll_your_own_secrets_backend>`.
|
|
|
|
Many hooks have a default ``conn_id``, where operators using that hook do not
|
|
need to supply an explicit connection ID. For example, the default
|
|
``conn_id`` for the :class:`~airflow.providers.postgres.hooks.postgres.PostgresHook` is
|
|
``postgres_default``.
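
As a sketch of how a ``conn_id`` is consumed in pipeline code (the connection id below is
illustrative; both the hook and the operator only reference it by name and Airflow resolves
the credentials from the metastore or a secrets backend):

.. code-block:: python

    from airflow.providers.postgres.hooks.postgres import PostgresHook
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    # a hook resolves host / login / password / schema for us
    hook = PostgresHook(postgres_conn_id='my_reporting_db')

    # operators typically accept the same id through their *_conn_id parameter
    create_table = PostgresOperator(
        task_id='create_table',
        postgres_conn_id='my_reporting_db',
        sql='CREATE TABLE IF NOT EXISTS report (id SERIAL PRIMARY KEY)',
        dag=dag)
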
|
|
|
|
See :doc:`howto/connection/index` for details on creating and managing connections.
|
|
|
|
Queues
|
|
======
|
|
|
|
When using the CeleryExecutor, the Celery queues that tasks are sent to
|
|
can be specified. ``queue`` is an attribute of BaseOperator, so any
|
|
task can be assigned to any queue. The default queue for the environment
|
|
is defined in the ``airflow.cfg``'s ``celery -> default_queue``. This defines
|
|
the queue that tasks get assigned to when not specified, as well as which
|
|
queue Airflow workers listen to when started.
|
|
|
|
Workers can listen to one or multiple queues of tasks. When a worker is
|
|
started (using the command ``airflow celery worker``), a set of comma-delimited
|
|
queue names can be specified (e.g. ``airflow celery worker -q spark``). This worker
|
|
will then only pick up tasks wired to the specified queue(s).
|
|
|
|
This can be useful if you need specialized workers, either from a
|
|
resource perspective (for say very lightweight tasks where one worker
|
|
could take thousands of tasks without a problem), or from an environment
|
|
perspective (you want a worker running from within the Spark cluster
|
|
itself because it needs a very specific environment and security rights).
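
A minimal sketch of pinning a task to such a dedicated queue (the queue name is illustrative;
a worker started with ``airflow celery worker -q spark`` would pick this task up):

.. code-block:: python

    spark_job = BashOperator(
        task_id='spark_job',
        bash_command='spark-submit my_job.py',
        queue='spark',   # only workers listening on the 'spark' queue will run this task
        dag=dag)
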
|
|
|
|
.. _concepts:xcom:
|
|
|
|
XComs
|
|
=====
|
|
|
|
XComs let tasks exchange messages, allowing more nuanced forms of control and
|
|
shared state. The name is an abbreviation of "cross-communication". XComs are
|
|
principally defined by a key, value, and timestamp, but also track attributes
|
|
like the task/DAG that created the XCom and when it should become visible. Any
|
|
object that can be pickled can be used as an XCom value, so users should make
|
|
sure to use objects of appropriate size.
|
|
|
|
XComs can be "pushed" (sent) or "pulled" (received). When a task pushes an
|
|
XCom, it makes it generally available to other tasks. Tasks can push XComs at
|
|
any time by calling the ``xcom_push()`` method. In addition, if a task returns
|
|
a value (either from its Operator's ``execute()`` method, or from a
|
|
PythonOperator's ``python_callable`` function), then an XCom containing that
|
|
value is automatically pushed.
|
|
|
|
Tasks call ``xcom_pull()`` to retrieve XComs, optionally applying filters
|
|
based on criteria like ``key``, source ``task_ids``, and source ``dag_id``. By
|
|
default, ``xcom_pull()`` filters for the keys that are automatically given to
|
|
XComs when they are pushed by being returned from execute functions (as
|
|
opposed to XComs that are pushed manually).
|
|
|
|
If ``xcom_pull`` is passed a single string for ``task_ids``, then the most
|
|
recent XCom value from that task is returned; if a list of ``task_ids`` is
|
|
passed, then a corresponding list of XCom values is returned.
|
|
|
|
.. code-block:: python

    # inside a PythonOperator called 'pushing_task'
    def push_function():
        return value

    # inside another PythonOperator
    def pull_function(task_instance):
        value = task_instance.xcom_pull(task_ids='pushing_task')
|
|
|
|
When specifying arguments that are part of the context, they will be
|
|
automatically passed to the function.
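
Pushes and pulls can also use explicit keys. A short sketch (the key name and values are
illustrative, and mirror the templated pull shown below):

.. code-block:: python

    # inside a PythonOperator called 'foo'
    def push_function(task_instance):
        task_instance.xcom_push(key='table_name', value='my_table')

    # inside another PythonOperator
    def pull_function(task_instance):
        table_name = task_instance.xcom_pull(task_ids='foo', key='table_name')
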
|
|
|
|
It is also possible to pull XCom directly in a template, here's an example
|
|
of what this may look like:
|
|
|
|
.. code-block:: jinja
|
|
|
|
SELECT * FROM {{ task_instance.xcom_pull(task_ids='foo', key='table_name') }}
|
|
|
|
Note that XComs are similar to `Variables`_, but are specifically designed
|
|
for inter-task communication rather than global settings.
|
|
|
|
Custom XCom backend
|
|
-------------------
|
|
|
|
It is possible to change how ``XCom`` serializes and deserializes tasks' results.
To do this, change the ``xcom_backend`` parameter in the Airflow config. The provided value should
point to a class that is a subclass of :class:`~airflow.models.xcom.BaseXCom`. To alter the serialization /
deserialization mechanism, the custom class should override the ``serialize_value`` and ``deserialize_value``
methods.
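
A minimal sketch of such a backend (the class name and the JSON round trip are illustrative;
``xcom_backend`` in the Airflow config would then point at this class):

.. code-block:: python

    import json

    from airflow.models.xcom import BaseXCom

    class JsonStringXComBackend(BaseXCom):
        @staticmethod
        def serialize_value(value):
            # store every result as a JSON string instead of the default serialization
            return json.dumps(value).encode('utf-8')

        @staticmethod
        def deserialize_value(result):
            # ``result`` is the stored XCom row; its ``value`` column holds the serialized bytes
            return json.loads(result.value)
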
|
|
|
|
.. _concepts:variables:
|
|
|
|
Variables
|
|
=========
|
|
|
|
Variables are a generic way to store and retrieve arbitrary content or
|
|
settings as a simple key value store within Airflow. Variables can be
|
|
listed, created, updated and deleted from the UI (``Admin -> Variables``),
|
|
code or CLI. In addition, json settings files can be bulk uploaded through
|
|
the UI. While your pipeline code definition and most of your constants
|
|
and variables should be defined in code and stored in source control,
|
|
it can be useful to have some variables or configuration items
|
|
accessible and modifiable through the UI.
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
from airflow.models import Variable
|
|
foo = Variable.get("foo")
|
|
bar = Variable.get("bar", deserialize_json=True)
|
|
baz = Variable.get("baz", default_var=None)
|
|
|
|
The second call assumes ``json`` content and will be deserialized into
|
|
``bar``. Note that ``Variable`` is a sqlalchemy model and can be used
|
|
as such. The third call uses the ``default_var`` parameter with the value
|
|
``None``, which either returns an existing value or ``None`` if the variable
|
|
isn't defined. The get function will throw a ``KeyError`` if the variable
|
|
doesn't exist and no default is provided.
|
|
|
|
You can use a variable from a jinja template with the syntax:
|
|
|
|
.. code-block:: bash
|
|
|
|
echo {{ var.value.<variable_name> }}
|
|
|
|
or if you need to deserialize a json object from the variable:
|
|
|
|
.. code-block:: bash
|
|
|
|
echo {{ var.json.<variable_name> }}
|
|
|
|
Storing Variables in Environment Variables
|
|
------------------------------------------
|
|
|
|
.. versionadded:: 1.10.10
|
|
|
|
Airflow Variables can also be created and managed using Environment Variables. The environment variable
|
|
naming convention is :envvar:`AIRFLOW_VAR_{VARIABLE_NAME}`, all uppercase.
|
|
So if your variable key is ``FOO`` then the variable name should be ``AIRFLOW_VAR_FOO``.
|
|
|
|
For example,
|
|
|
|
.. code-block:: bash
|
|
|
|
export AIRFLOW_VAR_FOO=BAR
|
|
|
|
# To use JSON, store them as JSON strings
|
|
export AIRFLOW_VAR_FOO_BAZ='{"hello":"world"}'
|
|
|
|
You can use them in your DAGs as:
|
|
|
|
.. code-block:: python
|
|
|
|
from airflow.models import Variable
|
|
foo = Variable.get("foo")
|
|
foo_json = Variable.get("foo_baz", deserialize_json=True)
|
|
|
|
.. note::
|
|
|
|
    Single underscores surround ``VAR``. This is in contrast with the way ``airflow.cfg``
    parameters are stored, where double underscores surround the config section name.
    Variables set using Environment Variables would not appear in the Airflow UI but you will
    be able to use them in your DAG file.
|
|
|
|
Branching
|
|
=========
|
|
|
|
Sometimes you need a workflow to branch, or only go down a certain path
|
|
based on an arbitrary condition which is typically related to something
|
|
that happened in an upstream task. One way to do this is by using the
|
|
``BranchPythonOperator``.
|
|
|
|
The ``BranchPythonOperator`` is much like the PythonOperator except that it
|
|
expects a ``python_callable`` that returns a task_id (or list of task_ids). The
|
|
task_id returned is followed, and all of the other paths are skipped.
|
|
The task_id returned by the Python function has to reference a task
|
|
directly downstream from the BranchPythonOperator task.
|
|
|
|
Note that when a path is a downstream task of the returned task (list), it will
|
|
not be skipped:
|
|
|
|
.. image:: img/branch_note.png
|
|
|
|
Paths of the branching task are ``branch_a``, ``join`` and ``branch_b``. Since
|
|
``join`` is a downstream task of ``branch_a``, it will be excluded from the skipped
|
|
tasks when ``branch_a`` is returned by the Python callable.
|
|
|
|
The ``BranchPythonOperator`` can also be used with XComs allowing branching
|
|
context to dynamically decide what branch to follow based on upstream tasks.
|
|
For example:
|
|
|
|
.. code-block:: python

    def branch_func(ti):
        xcom_value = int(ti.xcom_pull(task_ids='start_task'))
        if xcom_value >= 5:
            return 'continue_task'
        else:
            return 'stop_task'

    start_op = BashOperator(
        task_id='start_task',
        bash_command="echo 5",
        xcom_push=True,
        dag=dag)

    branch_op = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_func,
        dag=dag)

    continue_op = DummyOperator(task_id='continue_task', dag=dag)
    stop_op = DummyOperator(task_id='stop_task', dag=dag)

    start_op >> branch_op >> [continue_op, stop_op]
|
|
|
|
If you wish to implement your own operators with branching functionality, you
|
|
can inherit from :class:`~airflow.operators.branch_operator.BaseBranchOperator`,
|
|
which behaves similarly to ``BranchPythonOperator`` but expects you to provide
|
|
an implementation of the method ``choose_branch``. As with the callable for
|
|
``BranchPythonOperator``, this method should return the ID of a downstream task,
|
|
or a list of task IDs, which will be run, and all others will be skipped.
|
|
|
|
.. code-block:: python

    class MyBranchOperator(BaseBranchOperator):
        def choose_branch(self, context):
            """
            Run an extra branch on the first day of the month
            """
            if context['execution_date'].day == 1:
                return ['daily_task_id', 'monthly_task_id']
            else:
                return 'daily_task_id'
|
|
|
|
|
|
SubDAGs
|
|
=======
|
|
|
|
SubDAGs are perfect for repeating patterns. Defining a function that returns a
|
|
DAG object is a nice design pattern when using Airflow.
|
|
|
|
Airbnb uses the *stage-check-exchange* pattern when loading data. Data is staged
|
|
in a temporary table, after which data quality checks are performed against
|
|
that table. Once the checks all pass the partition is moved into the production
|
|
table.
|
|
|
|
As another example, consider the following DAG:
|
|
|
|
.. image:: img/subdag_before.png
|
|
|
|
We can combine all of the parallel ``task-*`` operators into a single SubDAG,
|
|
so that the resulting DAG resembles the following:
|
|
|
|
.. image:: img/subdag_after.png
|
|
|
|
Note that SubDAG operators should contain a factory method that returns a DAG
|
|
object. This will prevent the SubDAG from being treated like a separate DAG in
|
|
the main UI. For example:
|
|
|
|
.. exampleinclude:: ../airflow/example_dags/subdags/subdag.py
|
|
:language: python
|
|
:start-after: [START subdag]
|
|
:end-before: [END subdag]
|
|
|
|
This SubDAG can then be referenced in your main DAG file:
|
|
|
|
.. exampleinclude:: ../airflow/example_dags/example_subdag_operator.py
|
|
:language: python
|
|
:start-after: [START example_subdag_operator]
|
|
:end-before: [END example_subdag_operator]
|
|
|
|
You can zoom into a SubDagOperator from the graph view of the main DAG to show
|
|
the tasks contained within the SubDAG:
|
|
|
|
.. image:: img/subdag_zoom.png
|
|
|
|
Some other tips when using SubDAGs:
|
|
|
|
- by convention, a SubDAG's ``dag_id`` should be prefixed by its parent and
|
|
a dot. As in ``parent.child``
|
|
- share arguments between the main DAG and the SubDAG by passing arguments to
|
|
the SubDAG operator (as demonstrated above)
|
|
- SubDAGs must have a schedule and be enabled. If the SubDAG's schedule is
|
|
set to ``None`` or ``@once``, the SubDAG will succeed without having done
|
|
anything
|
|
- clearing a SubDagOperator also clears the state of the tasks within
|
|
- marking success on a SubDagOperator does not affect the state of the tasks
|
|
within
|
|
- refrain from using ``depends_on_past=True`` in tasks within the SubDAG as
|
|
this can be confusing
|
|
- it is possible to specify an executor for the SubDAG. It is common to use
|
|
the SequentialExecutor if you want to run the SubDAG in-process and
|
|
effectively limit its parallelism to one. Using LocalExecutor can be
|
|
problematic as it may over-subscribe your worker, running multiple tasks in
|
|
a single slot
|
|
|
|
See ``airflow/example_dags`` for a demonstration.
|
|
|
|
Note that Airflow pools are not honored by ``SubDagOperator``. Hence resources could be
consumed by SubDagOperator tasks regardless of any pool limits you configure.
|
|
|
|
SLAs
|
|
====
|
|
|
|
Service Level Agreements, or time by which a task or DAG should have
|
|
succeeded, can be set at a task level as a ``timedelta``. If
|
|
one or many instances have not succeeded by that time, an alert email is sent
|
|
detailing the list of tasks that missed their SLA. The event is also recorded
|
|
in the database and made available in the web UI under ``Browse->SLA Misses``
|
|
where events can be analyzed and documented.
|
|
|
|
SLAs can be configured for scheduled tasks by using the ``sla`` parameter.
|
|
In addition to sending alerts to the addresses specified in a task's ``email`` parameter,
|
|
the ``sla_miss_callback`` specifies an additional ``Callable``
|
|
object to be invoked when the SLA is not met.
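
A short sketch of attaching an SLA and a miss callback (the one-hour window and the callback
body are illustrative; ``sla_miss_callback`` is set on the DAG, while ``sla`` is set per task):

.. code-block:: python

    def sla_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
        # invoked by the scheduler with details of the tasks that missed their SLA
        print(f"SLA missed on DAG {dag.dag_id}:\n{task_list}")

    dag = DAG('sla_dag', default_args=default_args, sla_miss_callback=sla_alert)

    slow_task = BashOperator(
        task_id='slow_task',
        bash_command='sleep 30',
        sla=timedelta(hours=1),            # alert if not finished within an hour
        email=['example@example.com'],
        dag=dag)
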
|
|
|
|
If you don't want to check SLAs, you can disable SLA checking globally (for all DAGs) by
setting ``check_slas = False`` under the ``[core]`` section in the ``airflow.cfg`` file:
|
|
|
|
.. code-block:: ini
|
|
|
|
[core]
|
|
check_slas = False
|
|
|
|
.. note::
|
|
For information on the email configuration, see :doc:`howto/email-config`
|
|
|
|
.. _concepts/trigger_rule:
|
|
|
|
Trigger Rules
|
|
=============
|
|
|
|
Though the normal workflow behavior is to trigger tasks when all their
|
|
directly upstream tasks have succeeded, Airflow allows for more complex
|
|
dependency settings.
|
|
|
|
All operators have a ``trigger_rule`` argument which defines the rule by which
|
|
the generated task gets triggered. The default value for ``trigger_rule`` is
|
|
``all_success`` and can be defined as "trigger this task when all directly
|
|
upstream tasks have succeeded". All other rules described here are based
|
|
on direct parent tasks and are values that can be passed to any operator
|
|
while creating tasks:
|
|
|
|
* ``all_success``: (default) all parents have succeeded
|
|
* ``all_failed``: all parents are in a ``failed`` or ``upstream_failed`` state
|
|
* ``all_done``: all parents are done with their execution
|
|
* ``one_failed``: fires as soon as at least one parent has failed, it does not wait for all parents to be done
|
|
* ``one_success``: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
|
|
* ``none_failed``: all parents have not failed (``failed`` or ``upstream_failed``) i.e. all parents have succeeded or been skipped
|
|
* ``none_failed_or_skipped``: all parents have not failed (``failed`` or ``upstream_failed``) and at least one parent has succeeded.
|
|
* ``none_skipped``: no parent is in a ``skipped`` state, i.e. all parents are in a ``success``, ``failed``, or ``upstream_failed`` state
|
|
* ``dummy``: dependencies are just for show, trigger at will
|
|
|
|
Note that these can be used in conjunction with ``depends_on_past`` (boolean)
|
|
that, when set to ``True``, keeps a task from getting triggered if the
|
|
previous schedule for the task hasn't succeeded.
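
A brief sketch of setting a trigger rule on a task, here combined with ``depends_on_past``
(the task and the choice of rule are illustrative):

.. code-block:: python

    cleanup = BashOperator(
        task_id='cleanup',
        bash_command='echo "runs once all parents are done, whatever their state"',
        trigger_rule='all_done',
        depends_on_past=True,   # also wait for this task's previous scheduled run to succeed
        dag=dag)
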
|
|
|
|
One must be aware of the interaction between trigger rules and skipped tasks
at the schedule level. Skipped tasks will cascade through the trigger rules
``all_success`` and ``all_failed``, but not through ``all_done``, ``one_failed``, ``one_success``,
``none_failed``, ``none_failed_or_skipped``, ``none_skipped`` and ``dummy``.
|
|
|
|
For example, consider the following DAG:
|
|
|
|
.. code-block:: python

    # dags/branch_without_trigger.py
    import datetime as dt

    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python import BranchPythonOperator

    dag = DAG(
        dag_id='branch_without_trigger',
        schedule_interval='@once',
        start_date=dt.datetime(2019, 2, 28)
    )

    run_this_first = DummyOperator(task_id='run_this_first', dag=dag)
    branching = BranchPythonOperator(
        task_id='branching', dag=dag,
        python_callable=lambda: 'branch_a'
    )

    branch_a = DummyOperator(task_id='branch_a', dag=dag)
    follow_branch_a = DummyOperator(task_id='follow_branch_a', dag=dag)

    branch_false = DummyOperator(task_id='branch_false', dag=dag)

    join = DummyOperator(task_id='join', dag=dag)

    run_this_first >> branching
    branching >> branch_a >> follow_branch_a >> join
    branching >> branch_false >> join
|
|
|
|
In the case of this DAG, ``join`` is downstream of ``follow_branch_a``
|
|
and ``branch_false``. The ``join`` task will show up as skipped
|
|
because its ``trigger_rule`` is set to ``all_success`` by default and
|
|
skipped tasks will cascade through ``all_success``.
|
|
|
|
.. image:: img/branch_without_trigger.png
|
|
|
|
By setting ``trigger_rule`` to ``none_failed_or_skipped`` in ``join`` task,
|
|
|
|
.. code-block:: python

    # dags/branch_with_trigger.py
    ...
    join = DummyOperator(task_id='join', dag=dag, trigger_rule='none_failed_or_skipped')
    ...
|
|
|
|
The ``join`` task will be triggered as soon as
``branch_false`` has been skipped (a valid completion state) and
``follow_branch_a`` has succeeded, because skipped tasks **will not**
cascade through ``none_failed_or_skipped``.
|
|
|
|
.. image:: img/branch_with_trigger.png
|
|
|
|
Latest Run Only
|
|
===============
|
|
|
|
Standard workflow behavior involves running a series of tasks for a
|
|
particular date/time range. Some workflows, however, perform tasks that
|
|
are independent of run time but need to be run on a schedule, much like a
|
|
standard cron job. In these cases, backfills or running jobs missed during
|
|
a pause just wastes CPU cycles.
|
|
|
|
For situations like this, you can use the ``LatestOnlyOperator`` to skip
|
|
tasks that are not being run during the most recent scheduled run for a
|
|
DAG. The ``LatestOnlyOperator`` skips all direct downstream tasks, if the time
|
|
right now is not between its ``execution_time`` and the next scheduled
|
|
``execution_time`` or the DagRun has been externally triggered.
|
|
|
|
For example, consider the following DAG:
|
|
|
|
.. exampleinclude:: ../airflow/example_dags/example_latest_only_with_trigger.py
|
|
:language: python
|
|
:start-after: [START example]
|
|
:end-before: [END example]
|
|
|
|
In the case of this DAG, the task ``task1`` is directly downstream of
|
|
``latest_only`` and will be skipped for all runs except the latest.
|
|
``task2`` is entirely independent of ``latest_only`` and will run in all
|
|
scheduled periods. ``task3`` is downstream of ``task1`` and ``task2`` and
|
|
because of the default ``trigger_rule`` being ``all_success`` will receive
|
|
a cascaded skip from ``task1``. ``task4`` is downstream of ``task1`` and
|
|
``task2``, but it will not be skipped, since its ``trigger_rule`` is set to
|
|
``all_done``.
|
|
|
|
.. image:: img/latest_only_with_trigger.png
|
|
|
|
|
|
Zombies & Undeads
|
|
=================
|
|
|
|
Task instances die all the time, usually as part of their normal life cycle,
|
|
but sometimes unexpectedly.
|
|
|
|
Zombie tasks are characterized by the absence
|
|
of a heartbeat (emitted by the job periodically) and a ``running`` status
|
|
in the database. They can occur when a worker node can't reach the database,
|
|
when Airflow processes are killed externally, or when a node gets rebooted
|
|
for instance. Zombie killing is performed periodically by the scheduler's
|
|
process.
|
|
|
|
Undead processes are characterized by the existence of a process and a matching
|
|
heartbeat, but Airflow isn't aware of this task as ``running`` in the database.
|
|
This mismatch typically occurs as the state of the database is altered,
|
|
most likely by deleting rows in the "Task Instances" view in the UI.
|
|
Tasks are instructed to verify their state as part of the heartbeat routine,
|
|
and terminate themselves upon figuring out that they are in this "undead"
|
|
state.
|
|
|
|
|
|
Cluster Policy
|
|
==============
|
|
|
|
In case you want to apply cluster-wide mutations to the Airflow tasks,
|
|
you can either mutate the task right after the DAG is loaded or
|
|
mutate the task instance before task execution.
|
|
|
|
Mutate tasks after DAG loaded
|
|
-----------------------------
|
|
|
|
To mutate the task right after the DAG is parsed, you can define
|
|
a ``policy`` function in ``airflow_local_settings.py`` that mutates the
|
|
task based on other task or DAG attributes (through ``task.dag``).
|
|
It receives a single argument as a reference to the task object and you can alter
|
|
its attributes.
|
|
|
|
For example, this function could apply a specific queue property when
|
|
using a specific operator, or enforce a task timeout policy, making sure
|
|
that no tasks run for more than 48 hours. Here's an example of what this
|
|
may look like inside your ``airflow_local_settings.py``:
|
|
|
|
|
|
.. code-block:: python

    def policy(task):
        if task.__class__.__name__ == 'HivePartitionSensor':
            task.queue = "sensor_queue"
        if task.timeout > timedelta(hours=48):
            task.timeout = timedelta(hours=48)
|
|
|
|
|
|
Please note that the cluster policy will have precedence over task
attributes defined in the DAG, meaning that if ``task.sla`` is defined
in the DAG and also mutated via cluster policy, the value set by the cluster policy will be used.
|
|
|
|
|
|
Mutate task instances before task execution
|
|
-------------------------------------------
|
|
|
|
To mutate the task instance before the task execution, you can define a
|
|
``task_instance_mutation_hook`` function in ``airflow_local_settings.py``
|
|
that mutates the task instance.
|
|
|
|
For example, this function re-routes the task to execute in a different
|
|
queue during retries:
|
|
|
|
.. code-block:: python

    def task_instance_mutation_hook(ti):
        if ti.try_number >= 1:
            ti.queue = 'retry_queue'
|
|
|
|
|
|
Where to put ``airflow_local_settings.py``?
|
|
-------------------------------------------
|
|
|
|
Add an ``airflow_local_settings.py`` file to your ``$PYTHONPATH``
or to the ``$AIRFLOW_HOME/config`` folder.
|
|
|
|
|
|
Documentation & Notes
|
|
=====================
|
|
|
|
It's possible to add documentation or notes to your DAGs & task objects that
|
|
become visible in the web interface ("Graph View" & "Tree View" for DAGs, "Task Details" for
|
|
tasks). There are a set of special task attributes that get rendered as rich
|
|
content if defined:
|
|
|
|
========== ================
|
|
attribute rendered to
|
|
========== ================
|
|
doc monospace
|
|
doc_json json
|
|
doc_yaml yaml
|
|
doc_md markdown
|
|
doc_rst reStructuredText
|
|
========== ================
|
|
|
|
Please note that for DAGs, doc_md is the only attribute interpreted.
|
|
|
|
This is especially useful if your tasks are built dynamically from
configuration files, as it allows you to expose the configuration that led
to the related tasks in Airflow.
|
|
|
|
.. code-block:: python

    """
    ### My great DAG
    """

    dag = DAG('my_dag', default_args=default_args)
    dag.doc_md = __doc__

    t = BashOperator(task_id="foo", bash_command="echo foo", dag=dag)
    t.doc_md = """\
    # Title
    Here's a [url](www.airbnb.com)
    """
|
|
|
|
This content will get rendered as markdown respectively in the "Graph View" and
|
|
"Task Details" pages.
|
|
|
|
.. _jinja-templating:
|
|
|
|
Jinja Templating
|
|
================
|
|
|
|
Airflow leverages the power of
|
|
`Jinja Templating <http://jinja.pocoo.org/docs/dev/>`_ and this can be a
|
|
powerful tool to use in combination with macros (see the :doc:`macros-ref` section).
|
|
|
|
For example, say you want to pass the execution date as an environment variable
|
|
to a Bash script using the ``BashOperator``.
|
|
|
|
.. code-block:: python

    # The execution date as YYYY-MM-DD
    date = "{{ ds }}"
    t = BashOperator(
        task_id='test_env',
        bash_command='/tmp/test.sh ',
        dag=dag,
        env={'EXECUTION_DATE': date})
|
|
|
|
Here, ``{{ ds }}`` is a macro, and because the ``env`` parameter of the
|
|
``BashOperator`` is templated with Jinja, the execution date will be available
|
|
as an environment variable named ``EXECUTION_DATE`` in your Bash script.
|
|
|
|
You can use Jinja templating with every parameter that is marked as "templated"
|
|
in the documentation. Template substitution occurs just before the pre_execute
|
|
function of your operator is called.
|
|
|
|
You can also use Jinja templating with nested fields, as long as these nested fields
|
|
are marked as templated in the structure they belong to: fields registered in
|
|
``template_fields`` property will be submitted to template substitution, like the
|
|
``path`` field in the example below:
|
|
|
|
.. code-block:: python

    class MyDataReader:
        template_fields = ['path']

        def __init__(self, my_path):
            self.path = my_path

        # [additional code here...]

    t = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data,
        op_args=[
            MyDataReader('/tmp/{{ ds }}/my_file')
        ],
        dag=dag)
|
|
|
|
.. note:: ``template_fields`` property can equally be a class variable or an
|
|
instance variable.
|
|
|
|
Deep nested fields can also be substituted, as long as all intermediate fields are
|
|
marked as template fields:
|
|
|
|
.. code-block:: python

    class MyDataTransformer:
        template_fields = ['reader']

        def __init__(self, my_reader):
            self.reader = my_reader

        # [additional code here...]

    class MyDataReader:
        template_fields = ['path']

        def __init__(self, my_path):
            self.path = my_path

        # [additional code here...]

    t = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data,
        op_args=[
            MyDataTransformer(MyDataReader('/tmp/{{ ds }}/my_file'))
        ],
        dag=dag)
|
|
|
|
You can pass custom options to the Jinja ``Environment`` when creating your DAG.
|
|
One common usage is to keep Jinja from dropping a trailing newline from a
|
|
template string:
|
|
|
|
.. code-block:: python

    my_dag = DAG(dag_id='my-dag',
                 jinja_environment_kwargs={
                     'keep_trailing_newline': True,
                     # some other jinja2 Environment options here
                 })
|
|
|
|
See `Jinja documentation <https://jinja.palletsprojects.com/en/master/api/#jinja2.Environment>`_
|
|
to find all available options.
|
|
|
|
.. _exceptions:
|
|
|
|
Exceptions
|
|
==========
|
|
|
|
Airflow defines a number of exceptions; most of these are used internally, but a few
|
|
are relevant to authors of custom operators or python callables called from ``PythonOperator``
|
|
tasks. Normally any exception raised from an ``execute`` method or python callable will either
|
|
cause a task instance to fail if it is not configured to retry or has reached its limit on
|
|
retry attempts, or to be marked as "up for retry". A few exceptions can be used when different
|
|
behavior is desired:
|
|
|
|
* ``AirflowSkipException`` can be raised to set the state of the current task instance to "skipped"
* ``AirflowFailException`` can be raised to set the state of the current task to "failed" regardless
  of whether there are any retry attempts remaining.
|
|
|
|
This example illustrates some possibilities:
|
|
|
|
.. code-block:: python

    from airflow.exceptions import AirflowFailException, AirflowSkipException

    def fetch_data():
        try:
            data = get_some_data(get_api_key())
            if not data:
                # Set state to skipped and do not retry
                # Downstream task behavior will be determined by trigger rules
                raise AirflowSkipException("No data available.")
        except Unauthorized:
            # If we retry, our api key will still be bad, so don't waste time retrying!
            # Set state to failed and move on
            raise AirflowFailException("Our api key is bad!")
        except TransientError:
            print("Looks like there was a blip.")
            # Raise the exception and let the task retry unless max attempts were reached
            raise
        handle(data)

    task = PythonOperator(task_id="fetch_data", python_callable=fetch_data, retries=10)
|
|
|
|
.. seealso::
|
|
- :ref:`List of Airflow exceptions <pythonapi:exceptions>`
|
|
|
|
|
|
Packaged DAGs
|
|
'''''''''''''
|
|
While often you will specify DAGs in a single ``.py`` file, it might sometimes
be required to combine a DAG and its dependencies. For example, you might want
to combine several DAGs so that you can version and manage them together, or you
might need an extra module that is not available
by default on the system you are running Airflow on. To allow this you can create
a zip file that contains the DAG(s) in the root of the zip file and have the extra
modules unpacked in directories.
|
|
|
|
For instance you can create a zip file that looks like this:
|
|
|
|
.. code-block:: bash
|
|
|
|
my_dag1.py
|
|
my_dag2.py
|
|
package1/__init__.py
|
|
package1/functions.py
|
|
|
|
Airflow will scan the zip file and try to load ``my_dag1.py`` and ``my_dag2.py``.
|
|
It will not go into subdirectories as these are considered to be potential
|
|
packages.
|
|
|
|
In case you would like to add module dependencies to your DAG you basically would
|
|
do the same, but then it is more suitable to use a virtualenv and pip.
|
|
|
|
.. code-block:: bash
|
|
|
|
virtualenv zip_dag
|
|
source zip_dag/bin/activate
|
|
|
|
mkdir zip_dag_contents
|
|
cd zip_dag_contents
|
|
|
|
pip install --install-option="--install-lib=$PWD" my_useful_package
|
|
cp ~/my_dag.py .
|
|
|
|
zip -r zip_dag.zip *
|
|
|
|
.. note:: the zip file will be inserted at the beginning of the module search list
    (sys.path) and as such it will be available to any other code that resides
    within the same interpreter.
|
|
|
|
.. note:: packaged dags cannot be used with pickling turned on.
|
|
|
|
.. note:: packaged dags cannot contain dynamic libraries (e.g. libz.so); these need
    to be available on the system if a module needs those. In other words only
    pure python modules can be packaged.
|
|
|
|
|
|
.airflowignore
|
|
''''''''''''''
|
|
|
|
A ``.airflowignore`` file specifies the directories or files in ``DAG_FOLDER``
|
|
or ``PLUGINS_FOLDER`` that Airflow should intentionally ignore.
|
|
Each line in ``.airflowignore`` specifies a regular expression pattern,
|
|
and directories or files whose names (not DAG id) match any of the patterns
|
|
would be ignored (under the hood, ``re.findall()`` is used to match the pattern).
|
|
Overall it works like a ``.gitignore`` file.
|
|
Use the ``#`` character to indicate a comment; all characters
|
|
on a line following a ``#`` will be ignored.
|
|
|
|
The ``.airflowignore`` file should be put in your ``DAG_FOLDER``.
For example, you can prepare a ``.airflowignore`` file with the following contents:
|
|
|
|
.. code-block::
|
|
|
|
project_a
|
|
tenant_[\d]
|
|
|
|
|
|
Then files like ``project_a_dag_1.py``, ``TESTING_project_a.py``, ``tenant_1.py``,
|
|
``project_a/dag_1.py``, and ``tenant_1/dag_1.py`` in your ``DAG_FOLDER`` would be ignored
|
|
(If a directory's name matches any of the patterns, this directory and all its subfolders
|
|
would not be scanned by Airflow at all. This improves efficiency of DAG finding).
|
|
|
|
The scope of a ``.airflowignore`` file is the directory it is in plus all its subfolders.
|
|
You can also prepare ``.airflowignore`` file for a subfolder in ``DAG_FOLDER`` and it
|
|
would only be applicable for that subfolder.
|