Adding a doc entry for Connections

Maxime Beauchemin 2015-06-02 19:47:46 -04:00
Parent a6ed0fa017
Commit 3b75027df1
3 changed files with 33 additions and 7 deletions

View file

@@ -1,7 +1,7 @@
TODO
-----
#### UI
* Run button / backfill wizard
* Backfill form
* Add templating to adhoc queries
* Charts: better error handling
@@ -16,7 +16,6 @@ TODO
#### Backend
* Add a run_only_latest flag to BaseOperator, runs only most recent task instance where deps are met
* Pickle all the THINGS!
* Add priority_weight(Int) to BaseOperator, +@property subtree_priority
* Distributed scheduler
* Add decorator to timeout imports on master process [lib](https://github.com/pnpnpn/timeout-decorator)
* Raise errors when setting dependencies on tasks in foreign DAGs

View file

@@ -4,7 +4,7 @@ Concepts
Operators
'''''''''
-Operators allows to generate a certain type of task on the graph. There
+Operators allow for generating a certain type of task on the graph. There
are 3 main types of operators:
- **Sensor:** Waits for events to happen; it could be a file appearing
@@ -58,11 +58,38 @@ arbitrary sets of tasks. The list of pools is managed in the UI
(``Menu -> Admin -> Pools``) by giving the pools a name and assigning
them a number of worker slots. Tasks can then be associated with
one of the existing pools by using the ``pool`` parameter when
-creating tasks (aka instantiating operators).
+creating tasks (instantiating operators).
The ``pool`` parameter can
be used in conjunction with ``priority_weight`` to define priorities
in the queue, and which tasks get executed first as slots open up in the
pool. The default ``priority_weight`` is ``1``, and can be bumped to any
number. When sorting the queue to evaluate which task should be executed
next, we use the ``priority_weight``, summed with the ``priority_weight``
of all tasks downstream from this task. This way you can
bump a specific important task and the whole path to that task gets
prioritized accordingly.
Tasks will be scheduled as usual while the slots fill up. Once capacity is
reached, runnable tasks get queued and their state will show as such in the
-UI. As slots free up, queued up tasks start running.
+UI. As slots free up, queued up tasks start running based on the
+``priority_weight`` (of the task and its descendants).
Note that by default tasks aren't assigned to any pool and their
execution parallelism is only limited by the executor's setting.
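As an illustration, here is a minimal sketch of two tasks sharing a pool
(the ``sql_pool`` name, the DAG setup and the bash commands are all made
up for this example, and the pool is assumed to have been created in the
UI beforehand)::

    from datetime import datetime
    from airflow import DAG
    from airflow.operators import BashOperator

    dag = DAG('pool_example', start_date=datetime(2015, 6, 1))

    # Both tasks compete for slots in the hypothetical 'sql_pool' pool;
    # the task with the higher priority_weight is picked first as
    # slots open up.
    important = BashOperator(
        task_id='important_query',
        bash_command='./run_query.sh critical',
        pool='sql_pool',
        priority_weight=10,
        dag=dag)

    routine = BashOperator(
        task_id='routine_query',
        bash_command='./run_query.sh routine',
        pool='sql_pool',  # keeps the default priority_weight of 1
        dag=dag)
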
Connections
'''''''''''
The connection information to external systems is stored in the Airflow
metadata database and managed in the UI (``Menu -> Admin -> Connections``).
A ``conn_id`` is defined there, with hostname / login / password / schema
information attached to it. Airflow pipelines can then simply refer
to the centrally managed ``conn_id`` without having to hard-code any
of this information anywhere.
Many connections with the same ``conn_id`` can be defined, and when that
is the case, and when a **hook** uses the ``get_connection`` method
from ``BaseHook``, Airflow will choose one connection at random, allowing
for some basic load balancing and fault tolerance when used in
conjunction with retries.
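As a minimal sketch, assuming a connection with ``conn_id``
``my_mysql_db`` has been created in the UI (the connection name is made
up here, and the hook is assumed to resolve its ``mysql_conn_id``
through ``BaseHook.get_connection`` as described above)::

    from airflow.hooks import MySqlHook

    # 'my_mysql_db' is a hypothetical conn_id defined under
    # Menu -> Admin -> Connections; the hook looks up the hostname /
    # login / password / schema in the metadata database, picking one
    # entry at random if several connections share this conn_id.
    hook = MySqlHook(mysql_conn_id='my_mysql_db')
    records = hook.get_records('SELECT 1')
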

View file

@@ -1,7 +1,7 @@
Data Profiling
==============
-Part of being a productive data ninja is about having the right weapons to
+Part of being productive with data is about having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write SQL and get results quickly, and a charting application
letting you visualize data.
@@ -24,7 +24,7 @@ You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
directly in the URL.
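For instance, a query in the ad-hoc interface could lean on the standard
``{{ ds }}`` macro to filter on the execution date (the table and column
names below are made up)::

    SELECT ds, COUNT(1) AS row_count
    FROM hypothetical_events
    WHERE ds = '{{ ds }}'
    GROUP BY ds
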
-These charts ain't Tableau, but they're easy to create, modify and share.
+These charts are basic, but they're easy to create, modify and share.
Chart Screenshot
................