Various documentation spelling and grammar edits

Marty Woodlee 2015-06-28 17:57:13 -05:00
Parent 1da85da22e
Commit 219803924a
11 changed files with 182 additions and 183 deletions

View file

@ -1,8 +1,8 @@
Command Line Interface
======================
Airflow has a very rich command line interface allowing to perform
many types of operation on a DAG, starting services and supporting
Airflow has a very rich command line interface that allows for
performing many types of operations on a DAG, starting services, and supporting
development and testing.
.. argparse::

View file

@ -3,18 +3,18 @@ Code / API
Operators
---------
Operators allows to generate a certain type of task that become a node in
Operators allow for generation of certain types of tasks that become nodes in
the DAG when instantiated. All operators derive from BaseOperator and
inherit a whole lot of attributes and method that way. Refer to the
BaseOperator documentation for more details.
inherit many attributes and methods that way. Refer to the BaseOperator
documentation for more details.
There are 3 main types of operators:
- Operators that performs an **action**, or tells another system to
- Operators that perform an **action**, or tell another system to
perform an action
- **Transfer** operators move data from a system to another
- **Sensors** are a certain type of operators that will keep running until a
certain criteria is met. Things like a specific file landing in HDFS or
- **Transfer** operators move data from one system to another
- **Sensors** are a certain type of operator that will keep running until a
certain criterion is met. Examples include a specific file landing in HDFS or
S3, a partition appearing in Hive, or a specific time of the day. Sensors
are derived from ``BaseSensorOperator`` and run a poke
method at a specified ``poke_interval`` until it returns ``True``.
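As an illustration, here is a minimal sketch of a custom sensor; the class,
the file check, and the import path below are assumptions and may differ
between Airflow versions:

.. code:: python

    import os

    from airflow.operators.sensors import BaseSensorOperator

    class LocalFileSensor(BaseSensorOperator):
        """Hypothetical sensor that waits until a local file exists."""

        def __init__(self, filepath, *args, **kwargs):
            super(LocalFileSensor, self).__init__(*args, **kwargs)
            self.filepath = filepath

        def poke(self, context):
            # Called every ``poke_interval`` seconds until it returns True
            return os.path.exists(self.filepath)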
@ -75,10 +75,10 @@ Variable Description
formatted ``{dag_id}_{task_id}_{ds}``
================================= ====================================
Note that you can access the objects attributes and methods with simple
Note that you can access the object's attributes and methods with simple
dot notation. Here are some examples of what is possible:
``{{ task.owner }}``, ``{{ task.task_id }}``, ``{{ ti.hostname }}``, ...
Refer to the models documentation for more information on the objects
Refer to the models documentation for more information on the objects'
attributes and methods.
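For example, these template variables could be used in a ``bash_command``;
the task below is hypothetical and assumes an existing ``dag`` object:

.. code:: python

    from airflow.operators import BashOperator

    t = BashOperator(
        task_id='print_context',
        # rendered at runtime using the objects documented above
        bash_command='echo "run date: {{ ds }}, owner: {{ task.owner }}"',
        dag=dag)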
Macros
@ -98,7 +98,7 @@ These macros live under the ``macros`` namespace in your templates.
Models
------
Models are built on top of th SQLAlchemy ORM Base class, instance are
Models are built on top of the SQLAlchemy ORM Base class, and instances are
persisted in the database.

View file

@ -7,20 +7,20 @@ Operators
Operators allow for generating a certain type of task on the graph. There
are 3 main types of operators:
- **Sensor:** Waits for events to happen, it could be a file appearing
in HDFS, the existence of a Hive partition or for an arbitrary MySQL
query to return a row.
- **Remote Execution:** Trigger an operation in a remote system, this
could be a HQL statement in Hive, a Pig script, a map reduce job, a
- **Sensor:** Waits for events to happen. This could be a file appearing
in HDFS, the existence of a Hive partition, or waiting for an arbitrary
MySQL query to return a row.
- **Remote Execution:** Triggers an operation in a remote system. This
could be an HQL statement in Hive, a Pig script, a map reduce job, a
stored procedure in Oracle or a Bash script to run.
- **Data transfers:** Move data from a system to another. Push data
- **Data transfers:** Move data from one system to another. Push data
from Hive to MySQL, from a local file to HDFS, from Postgres to
Oracle, or anything of that nature.
Tasks
'''''
A task represent the instantiation of an operator and becomes a node in
A task represents the instantiation of an operator and becomes a node in
the directed acyclic graph (DAG). The instantiation defines specific
values when calling the abstract operator. A task could be waiting for a
specific partition in Hive, or triggering a specific DML statement in
@ -38,44 +38,44 @@ task instance will have a status of either "started", "retrying",
Hooks
'''''
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL,
Postgres, HDFS, Pig and Cascading. Hooks implement a common interface when
possible, and act as a building block for operators. They also use
Hooks are interfaces to external platforms and databases like Hive, S3,
MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when
possible, and act as a building block for operators. They also use
the ``airflow.connection.Connection`` model to retrieve hostnames
and authentication information. Hooks keeps authentication code and
and authentication information. Hooks keep authentication code and
information out of pipelines, centralized in the metadata database.
Hooks are also very useful on their own to use in Python scripts,
Airflow airflow.operators.PythonOperator, and in interactive environment
Hooks are also very useful on their own to use in Python scripts,
in the Airflow ``airflow.operators.PythonOperator``, and in interactive environments
like iPython or Jupyter Notebook.
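As a minimal sketch, assuming a MySQL connection named ``mysql_default`` has
been defined in the UI (the import path and table name below are assumptions
and may vary by Airflow version):

.. code:: python

    from airflow.hooks import MySqlHook

    hook = MySqlHook(mysql_conn_id='mysql_default')
    # ``my_table`` is a hypothetical table name
    rows = hook.get_records('SELECT count(*) FROM my_table')
    print(rows)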
Pools
'''''
Some systems can get overwhelmed when too many processes hit them at the same
time. Airflow pools can be used to **limit the execution parallelism** on
arbitrary sets of tasks. The list of pools is managed in the UI
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
it a number of worker slots. Tasks can then be associated with
one of the existing pools by using the ``pool`` parameter when
creating tasks (instantiating operators).
time. Airflow pools can be used to **limit the execution parallelism** on
arbitrary sets of tasks. The list of pools is managed in the UI
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
it a number of worker slots. Tasks can then be associated with
one of the existing pools by using the ``pool`` parameter when
creating tasks (i.e., instantiating operators).
The ``pool`` parameter can
be used in conjunction with ``priority_weight`` to define priorities
in the queue, and which tasks get executed first as slots open up in the
pool. The default ``priority_weight`` is of ``1``, and can be bumped to any
number. When sorting the queue to evaluate which task should be executed
next, we use the ``priority_weight``, summed up with of all
the tasks ``priority_weight`` downstream from this task. This way you can
bumped a specific important task and the whole path to that task gets
prioritized accordingly.
pool. The default ``priority_weight`` is ``1``, and can be bumped to any
number. When sorting the queue to evaluate which task should be executed
next, we use the ``priority_weight``, summed up with all of the
``priority_weight`` values from tasks downstream from this task. You can
use this to bump a specific important task and the whole path to that task
gets prioritized accordingly.
Tasks will be scheduled as usual while the slots fill up. Once capacity is
reached, runnable tasks get queued and there state will show as such in the
UI. As slots free up, queued up tasks start running based on the
reached, runnable tasks get queued and their state will show as such in the
UI. As slots free up, queued tasks start running based on the
``priority_weight`` (of the task and its descendants).
Note that by default tasks aren't assigned to any pool and their
Note that by default tasks aren't assigned to any pool and their
execution parallelism is only limited by the executor's setting.
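For example, a task could be assigned to a hypothetical pool named
``database_pool`` (which would first have to be created in
``Menu -> Admin -> Pools``) like this:

.. code:: python

    from airflow.operators import BashOperator

    t = BashOperator(
        task_id='heavy_query',
        bash_command='echo "pretend this hits a busy database"',
        pool='database_pool',
        priority_weight=10,
        dag=dag)  # assumes an existing ``dag`` object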
Connections
@ -83,45 +83,45 @@ Connections
The connection information to external systems is stored in the Airflow
metadata database and managed in the UI (``Menu -> Admin -> Connections``).
A ``conn_id`` is defined there and hostname / login / password / schema
information attached to it. Then Airflow pipelines can simply refer
to the centrally managed ``conn_id`` without having to hard code any
of this information anywhere.
A ``conn_id`` is defined there, and hostname / login / password / schema
information is attached to it. Airflow pipelines can simply refer to the
centrally managed ``conn_id`` without having to hard code any of this
information anywhere.
Many connections with the same ``conn_id`` can be defined and when that
is the case, and when the **hooks** uses the ``get_connection`` method
Many connections with the same ``conn_id`` can be defined. When that
is the case, and when the **hooks** use the ``get_connection`` method
from ``BaseHook``, Airflow will choose one connection randomly, allowing
for some basic load balancing and some fault tolerance when used in
conjunction with retries.
for some basic load balancing and fault tolerance when used in conjunction
with retries.
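As a minimal sketch of retrieving a centrally managed connection from code
(the ``conn_id`` below is hypothetical and the ``BaseHook`` import path may
differ between Airflow versions):

.. code:: python

    from airflow.hooks.base_hook import BaseHook

    conn = BaseHook.get_connection('my_postgres')
    print(conn.host, conn.login, conn.schema)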
Queues
''''''
When using the CeleryExecutor, the celery queues that task are sent to
When using the CeleryExecutor, the celery queues that tasks are sent to
can be specified. ``queue`` is an attribute of BaseOperator, so any
task can be assigned to any queue. The default queue for the environment
is defined in the ``airflow.cfg``'s ``celery -> default_queue``. This defines
the queue that task get assigned to when not specified, as well as which
the queue that tasks get assigned to when not specified, as well as which
queue Airflow workers listen to when started.
Workers can listen to one or multiple queues of tasks. When a worker is
started (using the command ``airflow worker``), a set of comma delimited
queue names can be specified (``airflow worker -q spark``). This worker
started (using the command ``airflow worker``), a set of comma delimited
queue names can be specified (e.g. ``airflow worker -q spark``). This worker
will then only pick up tasks wired to the specified queue(s).
This can be useful if you need specialized workers, either from a
resource perspective (for say very lightweight tasks where one worker
could take thousands of task without a problem), or from an environment
perspective (you want a worker running from within the Spark cluster
This can be useful if you need specialized workers, either from a
resource perspective (say, for very lightweight tasks where one worker
could take thousands of tasks without a problem), or from an environment
perspective (you want a worker running from within the Spark cluster
itself because it needs a very specific environment and security rights).
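For example, a task could be wired to a hypothetical ``spark`` queue so that
only workers started with ``airflow worker -q spark`` pick it up:

.. code:: python

    from airflow.operators import BashOperator

    t = BashOperator(
        task_id='spark_job',
        bash_command='echo "submit the spark job here"',
        queue='spark',
        dag=dag)  # assumes an existing ``dag`` object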
Variables
'''''''''
Variables are a generic way to store and retrieve arbitrary content or
settings as a simple key value store within Airflow. Variable can be
listed, created, updated and deleted from the UI ``Admin -> Variables``
Variables are a generic way to store and retrieve arbitrary content or
settings as a simple key value store within Airflow. Variables can be
listed, created, updated and deleted from the UI (``Admin -> Variables``)
or from code. While your pipeline code definition and most of your constants
and variables should be defined in code and stored in source control,
it can be useful to have some variables or configuration items
@ -135,5 +135,5 @@ accessible and modifiable through the UI.
bar = Variable.get("bar", deser_json=True)
The second call assumes ``json`` content and will be deserialized into
``bar``. Note that ``Variable`` is a sqlalchemy model and can be used
``bar``. Note that ``Variable`` is a sqlalchemy model and can be used
as such.
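For example, pipeline code could read configuration from Variables like this
(the key names are hypothetical; ``deser_json=True`` matches the call shown
above):

.. code:: python

    from airflow.models import Variable

    environment = Variable.get("environment")
    config = Variable.get("pipeline_config", deser_json=True)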

View file

@ -6,7 +6,7 @@ Airflow's Documentation
Airflow is a platform to programmatically author, schedule and monitor data pipelines.
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command lines utilities makes performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues when needed.
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
------------
@ -17,10 +17,10 @@ Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The
Principles
----------
- **Dynamic**: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiate pipelines dynamically.
- **Dynamic**: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
- **Extensible**: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
- **Elegant**: Airflow pipelines are lean and explicit. Parameterizing your scripts is built in the core of Airflow using powerful **Jinja** templating engine.
- **Scalable**: Airflow has a modular architecture and uses a message queue to talk to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
- **Elegant**: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful **Jinja** templating engine.
- **Scalable**: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
Content
-------
@ -34,6 +34,6 @@ Content
concepts
profiling
cli
scheduler
scheduler
plugins
code

View file

@ -10,23 +10,23 @@ python3 (as of 2015-06).
Extra Packages
''''''''''''''
The ``airflow`` PyPI basic package only installs what's needed to get started.
Subpackages can be installed depending on what will be useful in your
Subpackages can be installed depending on what will be useful in your
environment. For instance, if you don't need connectivity with Postgres,
you won't have to go through the trouble of install the ``postgres-devel`` yum
package, or whatever equivalent on the distribution you are using.
you won't have to go through the trouble of installing the ``postgres-devel``
yum package, or whatever equivalent applies on the distribution you are using.
Behind the scene, we do conditional imports on operators that require
Behind the scenes, we do conditional imports on operators that require
these extra dependencies.
Here's the list of the subpackages and that they enable:
Here's the list of the subpackages and what they enable:
+-------------+------------------------------------+---------------------------------------+
| subpackage | install command | enables |
+=============+====================================+=======================================+
| mysql | ``pip install airflow[mysql]`` | MySQL operators and hook, support as |
| mysql | ``pip install airflow[mysql]`` | MySQL operators and hook, support as |
| | | an Airflow backend |
+-------------+------------------------------------+---------------------------------------+
| postgres | ``pip install airflow[postgres]`` | Postgres operators and hook, support |
| postgres | ``pip install airflow[postgres]`` | Postgres operators and hook, support |
| | | as an Airflow backend |
+-------------+------------------------------------+---------------------------------------+
| samba | ``pip install airflow[samba]`` | ``Hive2SambaOperator`` |
@ -39,7 +39,7 @@ Here's the list of the subpackages and that they enable:
Setting up a Backend
''''''''''''''''''''
If you want to take a real test drive of Airflow, you should consider
If you want to take a real test drive of Airflow, you should consider
setting up a real database backend and switching to the LocalExecutor.
As Airflow was built to interact with its metadata using the great SqlAlchemy
@ -48,7 +48,7 @@ SqlAlchemy backend. We recommend using **MySQL** or **Postgres**.
Once you've set up your database to host Airflow, you'll need to alter the
SqlAlchemy connection string located in your configuration file
``$AIRFLOW_HOME/airflow.cfg``. You should then also change the "executor"
``$AIRFLOW_HOME/airflow.cfg``. You should then also change the "executor"
setting to use "LocalExecutor", an executor that can parallelize task
instances locally.
@ -59,10 +59,10 @@ instances locally.
Connections
'''''''''''
Airflow needs to know how to connect to your environment. Information
such as hostname, port, login and password to other systems and services is
handled ``Admin->Connection`` section of the UI. The pipeline code you will
author will reference the 'conn_id' of the Connection objects.
Airflow needs to know how to connect to your environment. Information
such as hostname, port, login and passwords to other systems and services is
handled in the ``Admin->Connection`` section of the UI. The pipeline code you
will author will reference the 'conn_id' of the Connection objects.
.. image:: img/connections.png
@ -71,23 +71,23 @@ Scaling Out
'''''''''''
CeleryExecutor is the way you can scale out the number of workers. For this
to work, you need to set up a Celery backend (**RabbitMQ**, **Redis**, ...) and
change your ``airflow.cfg`` to point the executor parameter to
change your ``airflow.cfg`` to point the executor parameter to
CeleryExecutor and provide the related Celery settings.
For more information about setting up a Celery broker, refer to the
exhaustive `Celery documentation on the topic <http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html>`_.
To kick off a worker, you need to setup Airflow and quick off the worker
To kick off a worker, you need to set up Airflow and kick off the worker
subcommand:
.. code-block:: bash
airflow worker
Your worker should start picking up tasks as soon as they get fired up in
Your worker should start picking up tasks as soon as they get fired in
its direction.
Note that you can also run "Celery Flower" a web UI build on top of Celery
Note that you can also run "Celery Flower", a web UI built on top of Celery,
to monitor your workers.
@ -96,11 +96,11 @@ Web Authentication
By default, all gates are opened. An easy way to restrict access
to the web application is to do it at the network level, or by using
ssh tunnels.
SSH tunnels.
However, it is possible to switch on
However, it is possible to switch on
authentication and define exactly how your users should log in
into your Airflow environment. Airflow uses ``flask_login`` and
to your Airflow environment. Airflow uses ``flask_login`` and
exposes a set of hooks in the ``airflow.default_login`` module. You can
alter the content of this module by overriding it as an ``airflow_login``
module. To do this, you would typically copy/paste ``airflow.default_login``

View file

@ -2,25 +2,25 @@ Plugins
=======
Airflow has a simple plugin manager built-in that can integrate external
features at its core by simply dropping files in your
features to its core by simply dropping files in your
``$AIRFLOW_HOME/plugins`` folder.
The python modules in the ``plugins`` folder get imported,
and **hooks**, **operators**, **macros**, **executors** and web **views**
The python modules in the ``plugins`` folder get imported,
and **hooks**, **operators**, **macros**, **executors** and web **views**
get integrated to Airflow's main collections and become available for use.
What for?
---------
Airflow offers a generic toolbox for working with data. Different
Airflow offers a generic toolbox for working with data. Different
organizations have different stacks and different needs. Using Airflow
plugins can be a way for companies to customize their Airflow installation
to reflect their ecosystem.
Plugins can be used as an easy way to write, share and activate new sets of
Plugins can be used as an easy way to write, share and activate new sets of
features.
There's also a need for a set of more complex application to interact with
There's also a need for a set of more complex applications to interact with
different flavors of data and metadata.
Examples:
@ -29,18 +29,18 @@ Examples:
* An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
* An auditing tool, helping understand who accesses what
* A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time
they should land, alert people and exposes visualization of outages
they should land, alert people, and expose visualizations of outages
* ...
Why build on top Airflow?
-------------------------
Why build on top of Airflow?
----------------------------
Airflow has many components that can be reused when building an application:
* A web server you can use to render your views
* A metadata database to store your models
* Access to your database, and knowledge of how to connect to them
* An array of workers that can allow your application to push workload to
* Access to your databases, and knowledge of how to connect to them
* An array of workers that your application can push workload to
* Airflow is already deployed; you can just piggyback on its deployment logistics
* Basic charting capabilities, underlying libraries and abstractions
@ -48,10 +48,10 @@ Airflow has many components that can be reused when building an application:
Interface
---------
To create a plugin you will need to derive the
To create a plugin you will need to derive the
``airflow.plugins_manager.AirflowPlugin`` class and reference the objects
you want to plug into Airflow. Here's what the class you need to derive
look like:
looks like:
.. code:: python
@ -67,7 +67,7 @@ look like:
executors = []
# A list of references to inject into the macros namespace
macros = []
# A list of objects created from a class derived
# A list of objects created from a class derived
# from flask_admin.BaseView
admin_views = []
# A list of Blueprint objects created from flask.Blueprint
@ -80,10 +80,10 @@ Example
-------
The code below defines a plugin that injects a set of dummy object
definitions in Airflow.
definitions in Airflow.
.. code:: python
# This is the class you derive to create a plugin
from airflow.plugins_manager import AirflowPlugin

View file

@ -1,27 +1,27 @@
Data Profiling
==============
Part of being a productive with data is about having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write sql and get results quickly, and a charting application
Part of being productive with data is having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write SQL and get results quickly, and a charting application
letting you visualize data.
Adhoc Queries
-------------
The adhoc query UI allows for simple SQL interaction with the database
The adhoc query UI allows for simple SQL interactions with the database
connections registered in Airflow.
.. image:: img/adhoc.png
Charts
-------------
A simple UI built on top of flask-admin and highcharts allows to build
data visualizations and charts easily. Fill in a form with a label, sql,
chart type, pick a source database from your environment's connecton,
select a few other options, and save it for later use.
------
A simple UI built on top of flask-admin and highcharts allows building
data visualizations and charts easily. Fill in a form with a label, SQL,
chart type, pick a source database from your environment's connections,
select a few other options, and save it for later use.
You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
directly in the URL.
These charts are basic, but they're easy to create, modify and share.

View file

@ -1,15 +1,15 @@
The Scheduler
=============
The Airflow scheduler monitors all tasks and all dags and schedules the
task instances whose dependencies have been met. Behinds the scene,
it monitors a folder for all dag objects it may contain,
The Airflow scheduler monitors all tasks and all DAGs and schedules the
task instances whose dependencies have been met. Behind the scenes,
it monitors a folder for all DAG objects it may contain,
and periodically inspects all tasks to see whether it can schedule the
next run.
The scheduler starts an instance of the executor specified in your
``airflow.cfg``, if it happens to be the LocalExecutor, tasks will be
executed as subprocesses, in the case of CeleryExecutor, tasks are
``airflow.cfg``. If it happens to be the LocalExecutor, tasks will be
executed as subprocesses; in the case of CeleryExecutor, tasks are
executed remotely.
To start a scheduler, simply run the command:
@ -18,25 +18,24 @@ To start a scheduler, simply run the command:
airflow scheduler
Note that:
Note that:
* It **won't parallelize** multiple instances of the same tasks, it always wait for the previous schedule to be done to move forward
* It will **not fill in gaps**, it only moves forward in time from the latest task instance on that task
* If a task instance failed and the task is set to ``depends_on_past=True``, it won't move forward from that point until the error state is cleared and runs successfully, or is marked as successful
* If no task history exist for a task, it will attempt to run it on the task's ``start_date``
* It **won't parallelize** multiple instances of the same task; it always waits for the previous schedule to be done before moving forward
* It will **not fill in gaps**; it only moves forward in time from the latest task instance on that task
* If a task instance failed and the task is set to ``depends_on_past=True``, it won't move forward from that point until the error state is cleared and the task runs successfully, or is marked as successful
* If no task history exists for a task, it will attempt to run it on the task's ``start_date``
Understanding this, you should be able to comprehend what is keeping your
tasks from running or moving forward. To allow the scheduler to move forward, you may want to clear the state
of some task instances, or mark them as successful.
Understanding this, you should be able to comprehend what is keeping your
tasks from running or moving forward. To allow the scheduler to move forward, you may want to clear the state of some task instances, or mark them as successful.
Here are some of the ways you can **unblock tasks**:
* From the UI, you can **clear** (as in delete the status of) individual task instances from the tasks instance dialog, while defining whether you want to includes the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear.
* The CLI ``airflow clear -h`` has lots of options when it comes to clearing task instances states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (``failed``, or ``success``)
* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or when the fix has been applied outside of Airflow for instance.
* The ``airflow backfill`` CLI subcommand has a flag to ``--mark_success`` and allows to select subsections of the dag as well as specifying date ranges.
* From the UI, you can **clear** (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to include the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear.
* The CLI command ``airflow clear -h`` has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (``failed``, or ``success``)
* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or for instance when the fix has been applied outside of Airflow.
* The ``airflow backfill`` CLI subcommand has a ``--mark_success`` flag and allows selecting subsections of the DAG as well as specifying date ranges.
The Airflow scheduler is designed to run as a persistent service in an
Airflow production environment. To kick it off, all you need to do is
execute ``airflow scheduler``. It will use the configuration specified in the
Airflow production environment. To kick it off, all you need to do is
execute ``airflow scheduler``. It will use the configuration specified in
``airflow.cfg``.

View file

@ -20,12 +20,12 @@ python3 (as of 2015-06).
# start the web server, default port is 8080
airflow webserver -p 8080
Upon running these commands, airflow will create the ``$AIRFLOW_HOME`` folder
and lay a "airflow.cfg" files with defaults that get you going fast. You can
Upon running these commands, Airflow will create the ``$AIRFLOW_HOME`` folder
and lay an "airflow.cfg" file with defaults that get you going fast. You can
inspect the file either in ``$AIRFLOW_HOME/airflow.cfg``, or through the UI in
the ``Admin->Configuration`` menu.
Out of the box, airflow uses a sqlite database, which you should outgrow
Out of the box, Airflow uses a sqlite database, which you should outgrow
fairly quickly since no parallelization is possible using this database
backend. It works in conjunction with the ``SequentialExecutor`` which will
only run task instances sequentially. While this is very limiting, it allows

View file

@ -2,13 +2,13 @@
Tutorial
================
This tutorial walks you through some of the fundamental Airflow concepts,
objects and their usage while writing your first pipeline.
This tutorial walks you through some of the fundamental Airflow concepts,
objects, and their usage while writing your first pipeline.
Example Pipeline definition
---------------------------
Here is an example of a basic pipeline definition. Do not worry if this looks
Here is an example of a basic pipeline definition. Do not worry if this looks
complicated; a line-by-line explanation follows below.
.. code:: python
@ -72,22 +72,22 @@ complicated, a line by line explanation follows below.
Importing Modules
-----------------
An Airflow pipeline is just a common Python script that happens to define
an Airflow DAG object. Let's start by importing the libraries we will need.
An Airflow pipeline is just a Python script that happens to define an
Airflow DAG object. Let's start by importing the libraries we will need.
.. code:: python
# The DAG object, we'll need this to instantiate a DAG
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators, we need this to operate!
# Operators; we need this to operate!
from airflow.operators import BashOperator, MySqlOperator
Default Arguments
-----------------
We're about to create a DAG and some tasks, and we have the choice to
explicitly pass a set of arguments to each task's constructor
(which would become redundant), or (better!) we can define a dictionary
We're about to create a DAG and some tasks, and we have the choice to
explicitly pass a set of arguments to each task's constructor
(which would become redundant), or (better!) we can define a dictionary
of default parameters that we can use when creating tasks.
.. code:: python
@ -107,16 +107,16 @@ For more information about the BaseOperator's parameters and what they do,
refer to the :py:class:``airflow.models.BaseOperator`` documentation.
Also, note that you could easily define different sets of arguments that
would serve different purposes. An example of that would be to have
would serve different purposes. An example of that would be to have
different settings between a production and development environment.
Instantiate a DAG
-----------------
We'll need a DAG object to nest our tasks into. Here we pass a string
We'll need a DAG object to nest our tasks into. Here we pass a string
that defines the dag_id, which serves as a unique identifier for your DAG.
We also pass the default argument dictionary that we just define.
We also pass the default argument dictionary that we just defined.
.. code:: python
@ -124,8 +124,8 @@ We also pass the default argument dictionary that we just define.
Tasks
-----
Tasks are generated when instantiating objects from operators. An object
instatiated from an operator is called a constructor. The first argument
Tasks are generated when instantiating operator objects. An object
instantiated from an operator is called a task. The first argument
``task_id`` acts as a unique identifier for the task.
.. code:: python
@ -142,10 +142,10 @@ instatiated from an operator is called a constructor. The first argument
dag=dag)
Notice how we pass a mix of operator specific arguments (``bash_command``) and
an argument common to all operators (``email_on_failure``) inherited
an argument common to all operators (``email_on_failure``) inherited
from BaseOperator to the operator's constructor. This is simpler than
passing every argument for every constructor call. Also, notice that in
the second task we override ``email_on_failure`` parameter with ``False``.
passing every argument for every constructor call. Also, notice that in
the second task we override the ``email_on_failure`` parameter with ``False``.
The precedence rules for a task are as follows:
@ -158,17 +158,17 @@ otherwise Airflow will raise an exception.
Templating with Jinja
---------------------
Airflow leverages the power of
Airflow leverages the power of
`Jinja Templating <http://jinja.pocoo.org/docs/dev/>`_ and provides
the pipeline author
with a set of built-in parameters and macros. Airflow also provides
hooks for the pipeline author to define their own parameters, macros and
templates.
This tutorial barely scratches the surfaces of what you can do
with templating in Airflow, but the goal of this section is to let you know
this feature exists, get you familiar with double
curly brackets, and point to the most common template variable: ``{{ ds }}``.
This tutorial barely scratches the surface of what you can do with
templating in Airflow, but the goal of this section is to let you know
this feature exists, get you familiar with double curly brackets, and
point to the most common template variable: ``{{ ds }}``.
.. code:: python
@ -196,31 +196,31 @@ parameters and/or objects to your templates. Please take the time
to understand how the parameter ``my_param`` makes it through to the template.
Files can also be passed to the ``bash_command`` argument, like
``bash_command='templated_command.sh'`` where the file location is relative to
``bash_command='templated_command.sh'``, where the file location is relative to
the directory containing the pipeline file (``tutorial.py`` in this case). This
may be desirable for many reasons, like separating your script's logic and
pipeline code, allowing for proper code highlighting in files composed in
different languages, and general flexibility in structuring pipelines. It is
also possible to define your ``template_searchpath`` pointing to any folder
also possible to define your ``template_searchpath`` as pointing to any folder
locations in the DAG constructor call.
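As a minimal sketch (the paths below are hypothetical), a DAG could declare
an extra template folder and a task could reference a templated script file
by name:

.. code:: python

    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('templated_files_example',
              template_searchpath=['/srv/airflow/templates'])

    t = BashOperator(
        task_id='run_templated_script',
        # resolved relative to the pipeline file or the search path above
        bash_command='templated_command.sh',
        dag=dag)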
Setting up Dependencies
-----------------------
We have two simple tasks that do not depend on each other, here's a few ways
We have two simple tasks that do not depend on each other. Here's a few ways
you can define dependencies between them:
.. code:: python
t2.set_upstream(t1)
# This means that t2 will depend on t1
# This means that t2 will depend on t1
# running successfully to run
# It is equivalent to
# t1.set_downstream(t2)
t3.set_upstream(t1)
# all of this is equivalent to
# all of this is equivalent to
# dag.set_dependencies('print_date', 'sleep')
# dag.set_dependencies('print_date', 'templated')
@ -230,7 +230,7 @@ than once.
Recap
-----
Alright, so we have a pretty basic DAG. At this point your code should look
Alright, so we have a pretty basic DAG. At this point your code should look
something like this:
.. code:: python
@ -340,24 +340,24 @@ gets rendered and executed by running this command:
# testing templated
airflow test tutorial templated 2015-01-01
This should result in displaying a verbose log of events and ultimately
This should result in displaying a verbose log of events and ultimately
running your bash command and printing the result.
Note that the ``airflow test`` command runs task instances locally, output
their log to stdout (on screen), don't bother with dependencies, and
don't communicate their state (running, success, failed, ...) to the
database. It simply allows to test a single a task instance.
Note that the ``airflow test`` command runs task instances locally, outputs
their log to stdout (on screen), doesn't bother with dependencies, and
doesn't communicate state (running, success, failed, ...) to the database.
It simply allows testing a single task instance.
Backfill
''''''''
Everything looks like it's running fine so let's run a backfill.
``backfill`` will respect your dependencies, log into files and talk to the
database to record status. If you do have a webserver up, you'll be able to
track the progress. ``airflow webserver`` will start a web server if you
are interested in tracking the progress visually as you backfill progresses.
``backfill`` will respect your dependencies, emit logs into files and talk to
the database to record status. If you do have a webserver up, you'll be able
to track the progress. ``airflow webserver`` will start a web server if you
are interested in tracking the progress visually as your backfill progresses.
Note that if you use ``depend_on_past=True``, individual task instances
depends the success of the preceding task instance, except for the
Note that if you use ``depends_on_past=True``, individual task instances
will depend on the success of the preceding task instance, except for the
start_date specified itself, for which this dependency is disregarded.
.. code-block:: bash
@ -373,7 +373,7 @@ What's Next?
-------------
That's it, you've written, tested and backfilled your very first Airflow
pipeline. Merging your code into a code repository that has a master scheduler
running on top of should get it to get triggered and run everyday.
running against it should get it triggered and run every day.
Here's a few things you might want to do next:

View file

@ -8,7 +8,7 @@ can find in the Airflow UI.
DAGs View
.........
List of the DAGs in your environment, and a set of shortcuts to useful pages.
You can see exactly how many tasks succeeded, failed and are currently
You can see exactly how many tasks succeeded, failed, or are currently
running at a glance.
------------
@ -20,8 +20,8 @@ running at a glance.
Tree View
.........
A tree representation of the DAG that spans across time. If a pipeline is
late, you can quickly see where the different steps are at and identify
A tree representation of the DAG that spans across time. If a pipeline is
late, you can quickly see where the different steps are and identify
the blocking ones.
------------
@ -32,8 +32,8 @@ the blocking ones.
Graph View
..........
The graph is perhaps the most comprehensive. Visualize your DAG's dependencies
and their current status for a specific run.
The graph view is perhaps the most comprehensive. Visualize your DAG's
dependencies and their current status for a specific run.
------------
@ -43,7 +43,7 @@ and their current status for a specific run.
Gantt Chart
...........
The Gantt chart lets you analyse task duration and overlap, you can quickly
The Gantt chart lets you analyse task duration and overlap. You can quickly
identify bottlenecks and where the bulk of the time is spent for specific
DAG runs.
@ -55,9 +55,9 @@ DAG runs.
Task Duration
.............
The duration of your different tasks over the past N runs. This view lets
The duration of your different tasks over the past N runs. This view lets
you find outliers and quickly understand where the time is spent in your
DAG, over many runs.
DAG over many runs.
------------