Various documentation spelling and grammar edits
Parent: 1da85da22e
Commit: 219803924a
@@ -1,8 +1,8 @@
 Command Line Interface
 ======================
 
-Airflow has a very rich command line interface allowing to perform
-many types of operation on a DAG, starting services and supporting
+Airflow has a very rich command line interface that allows for
+many types of operation on a DAG, starting services, and supporting
 development and testing.
 
 .. argparse::
@@ -3,18 +3,18 @@ Code / API
 
 Operators
 ---------
-Operators allows to generate a certain type of task that become a node in
+Operators allow for generation of certain types of tasks that become nodes in
 the DAG when instantiated. All operators derive from BaseOperator and
-inherit a whole lot of attributes and method that way. Refer to the
-BaseOperator documentation for more details.
+inherit many attributes and methods that way. Refer to the BaseOperator
+documentation for more details.
 
 There are 3 main types of operators:
 
-- Operators that performs an **action**, or tells another system to
+- Operators that performs an **action**, or tell another system to
   perform an action
-- **Transfer** operators move data from a system to another
-- **Sensors** are a certain type of operators that will keep running until a
-  certain criteria is met. Things like a specific file landing in HDFS or
+- **Transfer** operators move data from one system to another
+- **Sensors** are a certain type of operator that will keep running until a
+  certain criterion is met. Examples include a specific file landing in HDFS or
   S3, a partition appearing in Hive, or a specific time of the day. Sensors
   are derived from ``BaseSensorOperator`` and run a poke
   method at a specified ``poke_interval`` until it returns ``True``.
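A minimal sketch of what instantiating an operator looks like in practice; the DAG setup mirrors the tutorial elsewhere in these docs, and the command string is a made-up placeholder:

.. code:: python

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('operator_sketch', default_args={
        'owner': 'airflow', 'start_date': datetime(2015, 6, 1)})

    # An "action" operator: it tells another system (here, bash) to do something.
    run_report = BashOperator(
        task_id='run_report',
        bash_command='echo "running report for {{ ds }}"',
        dag=dag)

    # Sensors and transfer operators are instantiated the same way; only the
    # class and its parameters change (a table/partition for a Hive sensor,
    # source and destination connection ids for a transfer, and so on).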
@@ -75,10 +75,10 @@ Variable Description
                                    formatted ``{dag_id}_{task_id}_{ds}``
 ================================= ====================================
 
-Note that you can access the objects attributes and methods with simple
+Note that you can access the object's attributes and methods with simple
 dot notation. Here are some examples of what is possible:
 ``{{ task.owner }}``, ``{{ task.task_id }}``, ``{{ ti.hostname }}``, ...
-Refer to the models documentation for more information on the objects
+Refer to the models documentation for more information on the objects'
 attributes and methods.
 
 Macros
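A short sketch of the dot notation above inside a templated command; the DAG boilerplate follows the tutorial and the task itself is illustrative:

.. code:: python

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('template_sketch', default_args={
        'owner': 'airflow', 'start_date': datetime(2015, 6, 1)})

    # {{ ds }} and {{ task.owner }} are resolved by the Jinja engine at runtime.
    echo_context = BashOperator(
        task_id='echo_context',
        bash_command='echo "run date: {{ ds }}, owner: {{ task.owner }}"',
        dag=dag)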
@@ -98,7 +98,7 @@ These macros live under the ``macros`` namespace in your templates.
 Models
 ------
 
-Models are built on top of th SQLAlchemy ORM Base class, instance are
+Models are built on top of th SQLAlchemy ORM Base class, and instances are
 persisted in the database.
 
 
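As a hedged illustration of the point above, a model can be queried like any other SQLAlchemy-mapped class; the session helper and the ``state`` filter below are assumptions about the ORM layer rather than something documented on this page:

.. code:: python

    from airflow import settings
    from airflow.models import TaskInstance

    session = settings.Session()
    # Count failed task instances directly through the ORM.
    n_failed = session.query(TaskInstance).filter(
        TaskInstance.state == 'failed').count()
    session.close()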
@@ -7,20 +7,20 @@ Operators
 Operators allow for generating a certain type of task on the graph. There
 are 3 main type of operators:
 
-- **Sensor:** Waits for events to happen, it could be a file appearing
-  in HDFS, the existence of a Hive partition or for an arbitrary MySQL
-  query to return a row.
-- **Remote Execution:** Trigger an operation in a remote system, this
-  could be a HQL statement in Hive, a Pig script, a map reduce job, a
+- **Sensor:** Waits for events to happen. This could be a file appearing
+  in HDFS, the existence of a Hive partition, or waiting for an arbitrary
+  MySQL query to return a row.
+- **Remote Execution:** Triggers an operation in a remote system. This
+  could be an HQL statement in Hive, a Pig script, a map reduce job, a
   stored procedure in Oracle or a Bash script to run.
-- **Data transfers:** Move data from a system to another. Push data
+- **Data transfers:** Move data from one system to another. Push data
   from Hive to MySQL, from a local file to HDFS, from Postgres to
   Oracle, or anything of that nature.
 
 Tasks
 '''''
 
-A task represent the instantiation of an operator and becomes a node in
+A task represents the instantiation of an operator and becomes a node in
 the directed acyclic graph (DAG). The instantiation defines specific
 values when calling the abstract operator. A task could be waiting for a
 specific partition in Hive, or triggering a specific DML statement in
@@ -38,44 +38,44 @@ task instance will have a status of either "started", "retrying",
 Hooks
 '''''
 
-Hooks are interfaces to external platforms and databases like Hive, S3, MySQL,
-Postgres, HDFS, Pig and Cascading. Hooks implement a common interface when
-possible, and act as a building block for operators. They also use
+Hooks are interfaces to external platforms and databases like Hive, S3,
+MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when
+possible, and act as a building block for operators. They also use
 the ``airflow.connection.Connection`` model to retrieve hostnames
-and authentication information. Hooks keeps authentication code and
+and authentication information. Hooks keep authentication code and
 information out of pipelines, centralized in the metadata database.
 
-Hooks are also very useful on their own to use in Python scripts,
-Airflow airflow.operators.PythonOperator, and in interactive environment
+Hooks are also very useful on their own to use in Python scripts,
+Airflow airflow.operators.PythonOperator, and in interactive environments
 like iPython or Jupyter Notebook.
 
 Pools
 '''''
 
 Some systems can get overwhelmed when too many processes hit them at the same
-time. Airflow pools can be used to **limit the execution parallelism** on
-arbitrary sets of tasks. The list of pools is managed in the UI
-(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
-it a number of worker slots. Tasks can then be associated with
-one of the existing pools by using the ``pool`` parameter when
-creating tasks (instantiating operators).
+time. Airflow pools can be used to **limit the execution parallelism** on
+arbitrary sets of tasks. The list of pools is managed in the UI
+(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
+it a number of worker slots. Tasks can then be associated with
+one of the existing pools by using the ``pool`` parameter when
+creating tasks (i.e., instantiating operators).
 
 The ``pool`` parameter can
 be used in conjunction with ``priority_weight`` to define priorities
 in the queue, and which tasks get executed first as slots open up in the
-pool. The default ``priority_weight`` is of ``1``, and can be bumped to any
-number. When sorting the queue to evaluate which task should be executed
-next, we use the ``priority_weight``, summed up with of all
-the tasks ``priority_weight`` downstream from this task. This way you can
-bumped a specific important task and the whole path to that task gets
-prioritized accordingly.
+pool. The default ``priority_weight`` is ``1``, and can be bumped to any
+number. When sorting the queue to evaluate which task should be executed
+next, we use the ``priority_weight``, summed up with all of the
+``priority_weight`` values from tasks downstream from this task. You can
+use this to bump a specific important task and the whole path to that task
+gets prioritized accordingly.
 
 Tasks will be scheduled as usual while the slots fill up. Once capacity is
-reached, runnable tasks get queued and there state will show as such in the
-UI. As slots free up, queued up tasks start running based on the
+reached, runnable tasks get queued and their state will show as such in the
+UI. As slots free up, queued tasks start running based on the
 ``priority_weight`` (of the task and its descendants).
 
-Note that by default tasks aren't assigned to any pool and their
+Note that by default tasks aren't assigned to any pool and their
 execution parallelism is only limited to the executor's setting.
 
 Connections
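A minimal sketch of attaching a task to a pool and bumping its priority; the pool name ``hive_pool`` is hypothetical and would first have to be created under ``Menu -> Admin -> Pools``:

.. code:: python

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('pool_sketch', default_args={
        'owner': 'airflow', 'start_date': datetime(2015, 6, 1)})

    critical_load = BashOperator(
        task_id='critical_load',
        bash_command='echo "load the important table"',
        pool='hive_pool',       # shares that pool's worker slots with other tasks
        priority_weight=10,     # bumped from the default of 1
        dag=dag)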
@@ -83,45 +83,45 @@ Connections
 
 The connection information to external systems is stored in the Airflow
 metadata database and managed in the UI (``Menu -> Admin -> Connections``).
-A ``conn_id`` is defined there and hostname / login / password / schema
-information attached to it. Then Airflow pipelines can simply refer
-to the centrally managed ``conn_id`` without having to hard code any
-of this information anywhere.
+A ``conn_id`` is defined there and hostname / login / password / schema
+information attached to it. Airflow pipelines can simply refer to the
+centrally managed ``conn_id`` without having to hard code any of this
+information anywhere.
 
-Many connections with the same ``conn_id`` can be defined and when that
-is the case, and when the **hooks** uses the ``get_connection`` method
+Many connections with the same ``conn_id`` can be defined and when that
+is the case, and when the **hooks** uses the ``get_connection`` method
 from ``BaseHook``, Airflow will choose one connection randomly, allowing
-for some basic load balancing and some fault tolerance when used in
-conjunction with retries.
+for some basic load balancing and fault tolerance when used in conjunction
+with retries.
 
 Queues
 ''''''
 
-When using the CeleryExecutor, the celery queues that task are sent to
+When using the CeleryExecutor, the celery queues that tasks are sent to
 can be specified. ``queue`` is an attribute of BaseOperator, so any
 task can be assigned to any queue. The default queue for the environment
 is defined in the ``airflow.cfg``'s ``celery -> default_queue``. This defines
-the queue that task get assigned to when not specified, as well as which
+the queue that tasks get assigned to when not specified, as well as which
 queue Airflow workers listen to when started.
 
 Workers can listen to one or multiple queues of tasks. When a worker is
-started (using the command ``airflow worker``), a set of comma delimited
-queue names can be specified (``airflow worker -q spark``). This worker
+started (using the command ``airflow worker``), a set of comma delimited
+queue names can be specified (e.g. ``airflow worker -q spark``). This worker
 will then only pick up tasks wired to the specified queue(s).
 
-This can be useful if you need specialized workers, either from a
-resource perspective (for say very lightweight tasks where one worker
-could take thousands of task without a problem), or from an environment
-perspective (you want a worker running from within the Spark cluster
+This can be useful if you need specialized workers, either from a
+resource perspective (for say very lightweight tasks where one worker
+could take thousands of tasks without a problem), or from an environment
+perspective (you want a worker running from within the Spark cluster
 itself because it needs a very specific environment and security rights).
 
 
 Variables
 '''''''''
 
-Variables are a generic way to store and retrieve arbitrary content or
-settings as a simple key value store within Airflow. Variable can be
-listed, created, updated and deleted from the UI ``Admin -> Variables``
+Variables are a generic way to store and retrieve arbitrary content or
+settings as a simple key value store within Airflow. Variables can be
+listed, created, updated and deleted from the UI (``Admin -> Variables``)
 or from code. While your pipeline code definition and most of your constants
 and variables should be defined in code and stored in source control,
 it can be useful to have some variables or configuration items
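A small sketch of routing a task to a dedicated queue, so that only a worker started with ``airflow worker -q spark`` picks it up; the queue name and command are illustrative:

.. code:: python

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('queue_sketch', default_args={
        'owner': 'airflow', 'start_date': datetime(2015, 6, 1)})

    spark_job = BashOperator(
        task_id='spark_job',
        bash_command='echo "submit the job to the cluster here"',
        queue='spark',          # ``queue`` is an attribute of BaseOperator
        dag=dag)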
@@ -135,5 +135,5 @@ accessible and modifiable through the UI.
     bar = Variable.get("bar", deser_json=True)
 
 The second call assumes ``json`` content and will be deserialized into
-``bar``. Note that ``Variable`` is a sqlalchemy model and can be used
+``bar``. Note that ``Variable`` is a sqlalchemy model and can be used
 as such.
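For completeness, a short sketch of reading Variables from pipeline code; the keys are hypothetical and would have to exist under ``Admin -> Variables``:

.. code:: python

    from airflow.models import Variable

    foo = Variable.get("foo")                   # returned as a plain string
    bar = Variable.get("bar", deser_json=True)  # deserialized from JSON, as above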
@@ -6,7 +6,7 @@ Airflow's Documentation
 
 Airflow is a platform to programmatically author, schedule and monitor data pipelines.
 
-Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command lines utilities makes performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues when needed.
+Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
 
 ------------
 
@@ -17,10 +17,10 @@ Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The
 Principles
 ----------
 
-- **Dynamic**: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiate pipelines dynamically.
+- **Dynamic**: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
 - **Extensible**: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
-- **Elegant**: Airflow pipelines are lean and explicit. Parameterizing your scripts is built in the core of Airflow using powerful **Jinja** templating engine.
-- **Scalable**: Airflow has a modular architecture and uses a message queue to talk to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
+- **Elegant**: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful **Jinja** templating engine.
+- **Scalable**: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
 
 Content
 -------
@@ -34,6 +34,6 @@ Content
     concepts
     profiling
     cli
-    scheduler
+    scheduler
     plugins
    code
@@ -10,23 +10,23 @@ python3 (as of 2015-06).
 Extra Packages
 ''''''''''''''
 The ``airflow`` PyPI basic package only installs what's needed to get started.
-Subpackages can be installed depending on what will be useful in your
+Subpackages can be installed depending on what will be useful in your
 environment. For instance, if you don't need connectivity with Postgres,
-you won't have to go through the trouble of install the ``postgres-devel`` yum
-package, or whatever equivalent on the distribution you are using.
+you won't have to go through the trouble of installing the ``postgres-devel``
+yum package, or whatever equivalent applies on the distribution you are using.
 
-Behind the scene, we do conditional imports on operators that require
+Behind the scenes, we do conditional imports on operators that require
 these extra dependencies.
 
-Here's the list of the subpackages and that they enable:
+Here's the list of the subpackages and what they enable:
 
 +-------------+------------------------------------+---------------------------------------+
 | subpackage  | install command                    | enables                               |
 +=============+====================================+=======================================+
-| mysql       | ``pip install airflow[mysql]``     | MySQL operators and hook, support as  |
+| mysql       | ``pip install airflow[mysql]``     | MySQL operators and hook, support as  |
 |             |                                    | an Airflow backend                    |
 +-------------+------------------------------------+---------------------------------------+
-| postgres    | ``pip install airflow[postgres]``  | Postgres operators and hook, support  |
+| postgres    | ``pip install airflow[postgres]``  | Postgres operators and hook, support  |
 |             |                                    | as an Airflow backend                 |
 +-------------+------------------------------------+---------------------------------------+
 | samba       | ``pip install airflow[samba]``     | ``Hive2SambaOperator``                |
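A quick example of pulling in a couple of the extras listed above in one install; the combined-extras form follows standard pip syntax:

.. code-block:: bash

    # basic package only
    pip install airflow
    # basic package plus the MySQL and Postgres subpackages
    pip install "airflow[mysql,postgres]"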
@@ -39,7 +39,7 @@ Here's the list of the subpackages and that they enable:
 
 Setting up a Backend
 ''''''''''''''''''''
-If you want to take a real test drive of Airflow, you should consider
+If you want to take a real test drive of Airflow, you should consider
 setting up a real database backend and switching to the LocalExecutor.
 
 As Airflow was built to interact with its metadata using the great SqlAlchemy
@@ -48,7 +48,7 @@ SqlAlchemy backend. We recommend using **MySQL** or **Postgres**.
 
 Once you've setup your database to host Airflow, you'll need to alter the
 SqlAlchemy connection string located in your configuration file
-``$AIRFLOW_HOME/airflow.cfg``. You should then also change the "executor"
+``$AIRFLOW_HOME/airflow.cfg``. You should then also change the "executor"
 setting to use "LocalExecutor", an executor that can parallelize task
 instances locally.
 
@@ -59,10 +59,10 @@ instances locally.
 
 Connections
 '''''''''''
-Airflow needs to know how to connect to your environment. Information
-such as hostname, port, login and password to other systems and services is
-handled ``Admin->Connection`` section of the UI. The pipeline code you will
-author will reference the 'conn_id' of the Connection objects.
+Airflow needs to know how to connect to your environment. Information
+such as hostname, port, login and passwords to other systems and services is
+handled in the ``Admin->Connection`` section of the UI. The pipeline code you
+will author will reference the 'conn_id' of the Connection objects.
 
 .. image:: img/connections.png
 
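A hedged sketch of referencing a centrally managed ``conn_id`` from pipeline code; the connection id and the exact parameter name (``mysql_conn_id`` here) are assumptions to adapt to your own ``Admin->Connection`` entries:

.. code:: python

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import MySqlOperator

    dag = DAG('conn_sketch', default_args={
        'owner': 'airflow', 'start_date': datetime(2015, 6, 1)})

    count_rows = MySqlOperator(
        task_id='count_rows',
        mysql_conn_id='my_mysql_db',  # defined once in the UI; no credentials in code
        sql='SELECT COUNT(*) FROM some_table',
        dag=dag)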
@@ -71,23 +71,23 @@ Scaling Out
 '''''''''''
 CeleryExecutor is the way you can scale out the number of workers. For this
 to work, you need to setup a Celery backend (**RabbitMQ**, **Redis**, ...) and
-change your ``airflow.cfg`` to point the executor parameter to
+change your ``airflow.cfg`` to point the executor parameter to
 CeleryExecutor and provide the related Celery settings.
 
 For more information about setting up a Celery broker, refer to the
 exhaustive `Celery documentation on the topic <http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html>`_.
 
-To kick off a worker, you need to setup Airflow and quick off the worker
+To kick off a worker, you need to setup Airflow and kick off the worker
 subcommand
 
 .. code-block:: bash
 
     airflow worker
 
-Your worker should start picking up tasks as soon as they get fired up in
+Your worker should start picking up tasks as soon as they get fired in
 its direction.
 
-Note that you can also run "Celery Flower" a web UI build on top of Celery
+Note that you can also run "Celery Flower", a web UI built on top of Celery,
 to monitor your workers.
 
 
@@ -96,11 +96,11 @@ Web Authentication
 
 By default, all gates are opened. An easy way to restrict access
 to the web application is to do it at the network level, or by using
-ssh tunnels.
+SSH tunnels.
 
-However, it is possible to switch on
+However, it is possible to switch on
 authentication and define exactly how your users should login
-into your Airflow environment. Airflow uses ``flask_login`` and
+to your Airflow environment. Airflow uses ``flask_login`` and
 exposes a set of hooks in the ``airflow.default_login`` module. You can
 alter the content of this module by overriding it as a ``airflow_login``
 module. To do this, you would typically copy/paste ``airflow.default_login``
@@ -2,25 +2,25 @@ Plugins
 =======
 
 Airflow has a simple plugin manager built-in that can integrate external
-features at its core by simply dropping files in your
+features to its core by simply dropping files in your
 ``$AIRFLOW_HOME/plugins`` folder.
 
-The python modules in the ``plugins`` folder get imported,
-and **hooks**, **operators**, **macros**, **executors** and web **views**
+The python modules in the ``plugins`` folder get imported,
+and **hooks**, **operators**, **macros**, **executors** and web **views**
 get integrated to Airflow's main collections and become available for use.
 
 What for?
 ---------
 
-Airflow offers a generic toolbox for working with data. Different
+Airflow offers a generic toolbox for working with data. Different
 organizations have different stacks and different needs. Using Airflow
 plugins can be a way for companies to customize their Airflow installation
 to reflect their ecosystem.
 
-Plugins can be used as an easy way to write, share and activate new sets of
+Plugins can be used as an easy way to write, share and activate new sets of
 features.
 
-There's also a need for a set of more complex application to interact with
+There's also a need for a set of more complex applications to interact with
 different flavors of data and metadata.
 
 Examples:
@@ -29,18 +29,18 @@ Examples:
 * An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
 * An auditing tool, helping understand who accesses what
 * A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time
-  they should land, alert people and exposes visualization of outages
+  they should land, alert people, and expose visualizations of outages
 * ...
 
-Why build on top Airflow?
--------------------------
+Why build on top of Airflow?
+----------------------------
 
 Airflow has many components that can be reused when building an application:
 
 * A web server you can use to render your views
 * A metadata database to store your models
-* Access to your database, and knowledge of how to connect to them
-* An array of workers that can allow your application to push workload to
+* Access to your databases, and knowledge of how to connect to them
+* An array of workers that your application can push workload to
 * Airflow is deployed, you can just piggy back on it's deployment logistics
 * Basic charting capabilities, underlying libraries and abstractions
 
@@ -48,10 +48,10 @@ Airflow has many components that can be reused when building an application:
 Interface
 ---------
 
-To create a plugin you will need to derive the
+To create a plugin you will need to derive the
 ``airflow.plugins_manager.AirflowPlugin`` class and reference the objects
 you want to plug into Airflow. Here's what the class you need to derive
-look like:
+looks like:
 
 
 .. code:: python
@@ -67,7 +67,7 @@ look like:
     executors = []
     # A list of references to inject into the macros namespace
     macros = []
-    # A list of objects created from a class derived
+    # A list of objects created from a class derived
     # from flask_admin.BaseView
     admin_views = []
     # A list of Blueprint object created from flask.Blueprint
@@ -80,10 +80,10 @@ Example
 -------
 
 The code bellow defines a plugin that injects a set of dummy object
-definitions in Airflow.
+definitions in Airflow.
 
 .. code:: python
 
 
     # This is the class you derive to create a plugin
     from airflow.plugins_manager import AirflowPlugin
@@ -1,27 +1,27 @@
 Data Profiling
 ==============
 
-Part of being a productive with data is about having the right weapons to
-profile the data you are working with. Airflow provides a simple query
-interface to write sql and get results quickly, and a charting application
+Part of being productive with data is having the right weapons to
+profile the data you are working with. Airflow provides a simple query
+interface to write SQL and get results quickly, and a charting application
 letting you visualize data.
 
 Adhoc Queries
 -------------
-The adhoc query UI allows for simple SQL interaction with the database
+The adhoc query UI allows for simple SQL interactions with the database
 connections registered in Airflow.
 
 .. image:: img/adhoc.png
 
 Charts
--------------
-A simple UI built on top of flask-admin and highcharts allows to build
-data visualizations and charts easily. Fill in a form with a label, sql,
-chart type, pick a source database from your environment's connecton,
-select a few other options, and save it for later use.
+------
+A simple UI built on top of flask-admin and highcharts allows building
+data visualizations and charts easily. Fill in a form with a label, SQL,
+chart type, pick a source database from your environment's connectons,
+select a few other options, and save it for later use.
 
-You can even use the same templating and macros available when writing
-airflow pipelines, parameterizing your queries and modifying parameters
+You can even use the same templating and macros available when writing
+airflow pipelines, parameterizing your queries and modifying parameters
 directly in the URL.
 
 These charts are basic, but they're easy to create, modify and share.
@@ -1,15 +1,15 @@
 The Scheduler
 =============
 
-The Airflow scheduler monitors all tasks and all dags and schedules the
-task instances whose dependencies have been met. Behinds the scene,
-it monitors a folder for all dag objects it may contain,
+The Airflow scheduler monitors all tasks and all DAGs and schedules the
+task instances whose dependencies have been met. Behind the scenes,
+it monitors a folder for all DAG objects it may contain,
 and periodically inspects all tasks to see whether it can schedule the
 next run.
 
 The scheduler starts an instance of the executor specified in the your
-``airflow.cfg``, if it happens to be the LocalExecutor, tasks will be
-executed as subprocesses, in the case of CeleryExecutor, tasks are
+``airflow.cfg``. If it happens to be the LocalExecutor, tasks will be
+executed as subprocesses; in the case of CeleryExecutor, tasks are
 executed remotely.
 
 To start a scheduler, simply run the command:
@@ -18,25 +18,24 @@ To start a scheduler, simply run the command:
 
     airflow scheduler
 
-Note that:
+Note that:
 
-* It **won't parallelize** multiple instances of the same tasks, it always wait for the previous schedule to be done to move forward
-* It will **not fill in gaps**, it only moves forward in time from the latest task instance on that task
-* If a task instance failed and the task is set to ``depends_on_past=True``, it won't move forward from that point until the error state is cleared and runs successfully, or is marked as successful
-* If no task history exist for a task, it will attempt to run it on the task's ``start_date``
+* It **won't parallelize** multiple instances of the same tasks; it always wait for the previous schedule to be done before moving forward
+* It will **not fill in gaps**; it only moves forward in time from the latest task instance on that task
+* If a task instance failed and the task is set to ``depends_on_past=True``, it won't move forward from that point until the error state is cleared and the task runs successfully, or is marked as successful
+* If no task history exists for a task, it will attempt to run it on the task's ``start_date``
 
-Understanding this, you should be able to comprehend what is keeping your
-tasks from running or moving forward. To allow the scheduler to move forward, you may want to clear the state
-of some task instances, or mark them as successful.
+Understanding this, you should be able to comprehend what is keeping your
+tasks from running or moving forward. To allow the scheduler to move forward, you may want to clear the state of some task instances, or mark them as successful.
 
 Here are some of the ways you can **unblock tasks**:
 
-* From the UI, you can **clear** (as in delete the status of) individual task instances from the tasks instance dialog, while defining whether you want to includes the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear.
-* The CLI ``airflow clear -h`` has lots of options when it comes to clearing task instances states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (``failed``, or ``success``)
-* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or when the fix has been applied outside of Airflow for instance.
-* The ``airflow backfill`` CLI subcommand has a flag to ``--mark_success`` and allows to select subsections of the dag as well as specifying date ranges.
+* From the UI, you can **clear** (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to includes the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear.
+* The CLI command ``airflow clear -h`` has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (``failed``, or ``success``)
+* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or for instance when the fix has been applied outside of Airflow.
+* The ``airflow backfill`` CLI subcommand has a flag to ``--mark_success`` and allows selecting subsections of the DAG as well as specifying date ranges.
 
 The Airflow scheduler is designed to run as a persistent service in an
-Airflow production environment. To kick it off, all you need to do is
-execute ``airflow scheduler``. It will use the configuration specified in the
+Airflow production environment. To kick it off, all you need to do is
+execute ``airflow scheduler``. It will use the configuration specified in
 ``airflow.cfg``.
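A hedged sketch of clearing task instance states from the CLI; the exact flags can vary between versions, so check ``airflow clear -h`` before running it:

.. code-block:: bash

    # clear matching task instances over a date range so the scheduler
    # can move forward again
    airflow clear tutorial -t "templated.*" -s 2015-06-01 -e 2015-06-07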
@@ -20,12 +20,12 @@ python3 (as of 2015-06).
     # start the web server, default port is 8080
     airflow webserver -p 8080
 
-Upon running these commands, airflow will create the ``$AIRFLOW_HOME`` folder
-and lay a "airflow.cfg" files with defaults that get you going fast. You can
+Upon running these commands, Airflow will create the ``$AIRFLOW_HOME`` folder
+and lay an "airflow.cfg" file with defaults that get you going fast. You can
 inspect the file either in ``$AIRFLOW_HOME/airflow.cfg``, or through the UI in
 the ``Admin->Configuration`` menu.
 
-Out of the box, airflow uses a sqlite database, which you should outgrow
+Out of the box, Airflow uses a sqlite database, which you should outgrow
 fairly quickly since no parallelization is possible using this database
 backend. It works in conjunction with the ``SequentialExecutor`` which will
 only run task instances sequentially. While this is very limiting, it allows
@@ -2,13 +2,13 @@
 Tutorial
 ================
 
-This tutorial walks you through some of the fundamental Airflow concepts,
-objects and their usage while writing your first pipeline.
+This tutorial walks you through some of the fundamental Airflow concepts,
+objects, and their usage while writing your first pipeline.
 
 Example Pipeline definition
 ---------------------------
 
-Here is an example of a basic pipeline definition. Do not worry if this looks
+Here is an example of a basic pipeline definition. Do not worry if this looks
 complicated, a line by line explanation follows below.
 
 .. code:: python
@@ -72,22 +72,22 @@ complicated, a line by line explanation follows below.
 Importing Modules
 -----------------
 
-An Airflow pipeline is just a common Python script that happens to define
-an Airflow DAG object. Let's start by importing the libraries we will need.
+An Airflow pipeline is just a Python script that happens to define an
+Airflow DAG object. Let's start by importing the libraries we will need.
 
 .. code:: python
 
-    # The DAG object, we'll need this to instantiate a DAG
+    # The DAG object; we'll need this to instantiate a DAG
     from airflow import DAG
 
-    # Operators, we need this to operate!
+    # Operators; we need this to operate!
     from airflow.operators import BashOperator, MySqlOperator
 
 Default Arguments
 -----------------
-We're about to create a DAG and some tasks, and we have the choice to
-explicitly pass a set of arguments to each task's constructor
-(which would become redundant), or (better!) we can define a dictionary
+We're about to create a DAG and some tasks, and we have the choice to
+explicitly pass a set of arguments to each task's constructor
+(which would become redundant), or (better!) we can define a dictionary
 of default parameters that we can use when creating tasks.
 
 .. code:: python
@@ -107,16 +107,16 @@ For more information about the BaseOperator's parameters and what they do,
 refer to the :py:class:``airflow.models.BaseOperator`` documentation.
 
 Also, note that you could easily define different sets of arguments that
-would serve different purposes. An example of that would be to have
+would serve different purposes. An example of that would be to have
 different settings between a production and development environment.
 
 
 Instantiate a DAG
 -----------------
 
-We'll need a DAG object to nest our tasks into. Here we pass a string
+We'll need a DAG object to nest our tasks into. Here we pass a string
 that defines the dag_id, which serves as a unique identifier for your DAG.
-We also pass the default argument dictionary that we just define.
+We also pass the default argument dictionary that we just defined.
 
 .. code:: python
 
@@ -124,8 +124,8 @@ We also pass the default argument dictionary that we just define.
 
 Tasks
 -----
-Tasks are generated when instantiating objects from operators. An object
-instatiated from an operator is called a constructor. The first argument
+Tasks are generated when instantiating operator objects. An object
+instantiated from an operator is called a constructor. The first argument
 ``task_id`` acts as a unique identifier for the task.
 
 .. code:: python
@@ -142,10 +142,10 @@ instatiated from an operator is called a constructor. The first argument
         dag=dag)
 
 Notice how we pass a mix of operator specific arguments (``bash_command``) and
-an argument common to all operators (``email_on_failure``) inherited
+an argument common to all operators (``email_on_failure``) inherited
 from BaseOperator to the operator's constructor. This is simpler than
-passing every argument for every constructor call. Also, notice that in
-the second task we override ``email_on_failure`` parameter with ``False``.
+passing every argument for every constructor call. Also, notice that in
+the second task we override the ``email_on_failure`` parameter with ``False``.
 
 The precedence rules for a task are as follows:
 
@@ -158,17 +158,17 @@ otherwise Airflow will raise an exception.
 
 Templating with Jinja
 ---------------------
-Airflow leverages the power of
+Airflow leverages the power of
 `Jinja Templating <http://jinja.pocoo.org/docs/dev/>`_ and provides
 the pipeline author
 with a set of built-in parameters and macros. Airflow also provides
 hooks for the pipeline author to define their own parameters, macros and
 templates.
 
-This tutorial barely scratches the surfaces of what you can do
-with templating in Airflow, but the goal of this section is to let you know
-this feature exists, get you familiar with double
-curly brackets, and point to the most common template variable: ``{{ ds }}``.
+This tutorial barely scratches the surface of what you can do with
+templating in Airflow, but the goal of this section is to let you know
+this feature exists, get you familiar with double curly brackets, and
+point to the most common template variable: ``{{ ds }}``.
 
 .. code:: python
 
@@ -196,31 +196,31 @@ parameters and/or objects to your templates. Please take the time
 to understand how the parameter ``my_param`` makes it through to the template.
 
 Files can also be passed to the ``bash_command`` argument, like
-``bash_command='templated_command.sh'`` where the file location is relative to
+``bash_command='templated_command.sh'``, where the file location is relative to
 the directory containing the pipeline file (``tutorial.py`` in this case). This
 may be desirable for many reasons, like separating your script's logic and
 pipeline code, allowing for proper code highlighting in files composed in
 different languages, and general flexibility in structuring pipelines. It is
-also possible to define your ``template_searchpath`` pointing to any folder
+also possible to define your ``template_searchpath`` as pointing to any folder
 locations in the DAG constructor call.
 
 Setting up Dependencies
 -----------------------
-We have two simple tasks that do not depend on each other, here's a few ways
+We have two simple tasks that do not depend on each other. Here's a few ways
 you can define dependencies between them:
 
 .. code:: python
 
     t2.set_upstream(t1)
 
-    # This means that t2 will depend on t1
+    # This means that t2 will depend on t1
     # running successfully to run
     # It is equivalent to
     # t1.set_downstream(t2)
 
     t3.set_upstream(t1)
 
-    # all of this is equivalent to
+    # all of this is equivalent to
     # dag.set_dependencies('print_date', 'sleep')
     # dag.set_dependencies('print_date', 'templated')
 
|
|||
|
||||
Recap
|
||||
-----
|
||||
Alright, so we have a pretty basic DAG. At this point your code should look
|
||||
Alright, so we have a pretty basic DAG. At this point your code should look
|
||||
something like this:
|
||||
|
||||
.. code:: python
|
||||
|
@@ -340,24 +340,24 @@ gets rendered and executed by running this command:
     # testing templated
    airflow test tutorial templated 2015-01-01
 
-This should result in displaying a verbose log of events and ultimately
+This should result in displaying a verbose log of events and ultimately
 running your bash command and printing the result.
 
-Note that the ``airflow test`` command runs task instances locally, output
-their log to stdout (on screen), don't bother with dependencies, and
-don't communicate their state (running, success, failed, ...) to the
-database. It simply allows to test a single a task instance.
+Note that the ``airflow test`` command runs task instances locally, outputs
+their log to stdout (on screen), doesn't bother with dependencies, and
+doesn't communicate state (running, success, failed, ...) to the database.
+It simply allows testing a single task instance.
 
 Backfill
 ''''''''
 Everything looks like it's running fine so let's run a backfill.
-``backfill`` will respect your dependencies, log into files and talk to the
-database to record status. If you do have a webserver up, you'll be able to
-track the progress. ``airflow webserver`` will start a web server if you
-are interested in tracking the progress visually as you backfill progresses.
+``backfill`` will respect your dependencies, emit logs into files and talk to
+the database to record status. If you do have a webserver up, you'll be able
+to track the progress. ``airflow webserver`` will start a web server if you
+are interested in tracking the progress visually as your backfill progresses.
 
-Note that if you use ``depend_on_past=True``, individual task instances
-depends the success of the preceding task instance, except for the
+Note that if you use ``depends_on_past=True``, individual task instances
+will depend on the success of the preceding task instance, except for the
 start_date specified itself, for which this dependency is disregarded.
 
 .. code-block:: bash
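A sketch of the backfill invocation described above; the date range is illustrative:

.. code-block:: bash

    # run a backfill over a one-week window, respecting dependencies
    airflow backfill tutorial -s 2015-06-01 -e 2015-06-07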
@@ -373,7 +373,7 @@ What's Next?
 -------------
 That's it, you've written, tested and backfilled your very first Airflow
 pipeline. Merging your code into a code repository that has a master scheduler
-running on top of should get it to get triggered and run everyday.
+running against it should get it to get triggered and run every day.
 
 Here's a few things you might want to do next:
 
docs/ui.rst
@@ -8,7 +8,7 @@ can find in the Airflow UI.
 DAGs View
 .........
 List of the DAGs in your environment, and a set of shortcuts to useful pages.
-You can see exactly how many tasks succeeded, failed and are currently
+You can see exactly how many tasks succeeded, failed, or are currently
 running at a glance.
 
 ------------
@@ -20,8 +20,8 @@ running at a glance.
 
 Tree View
 .........
-A tree representation of the DAG that spans across time. If a pipeline is
-late, you can quickly see where the different steps are at and identify
+A tree representation of the DAG that spans across time. If a pipeline is
+late, you can quickly see where the different steps are and identify
 the blocking ones.
 
 ------------
@@ -32,8 +32,8 @@ the blocking ones.
 
 Graph View
 ..........
-The graph is perhaps the most comprehensive. Visualize your DAG's dependencies
-and their current status for a specific run.
+The graph view is perhaps the most comprehensive. Visualize your DAG's
+dependencies and their current status for a specific run.
 
 ------------
 
@@ -43,7 +43,7 @@ and their current status for a specific run.
 
 Gantt Chart
 ...........
-The Gantt chart lets you analyse task duration and overlap, you can quickly
+The Gantt chart lets you analyse task duration and overlap. You can quickly
 identify bottlenecks and where the bulk of the time is spent for specific
 DAG runs.
 
@@ -55,9 +55,9 @@ DAG runs.
 
 Task Duration
 .............
-The duration of your different tasks over the past N runs. This view lets
+The duration of your different tasks over the past N runs. This view lets
 you find outliers and quickly understand where the time is spent in your
-DAG, over many runs.
+DAG over many runs.
 
 
 ------------