.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

DAG Serialization
=================

To make the Airflow Webserver stateless, Airflow >=1.10.7 supports
DAG Serialization and DB Persistence.

.. image:: img/dag_serialization.png

Without DAG Serialization and persistence in the DB, both the Webserver and the Scheduler
need access to the DAG files, and both parse them.

With **DAG Serialization** we aim to decouple the Webserver from DAG parsing,
which makes the Webserver very lightweight.

As shown in the image above, when this feature is enabled,
the Scheduler parses the DAG files, serializes them in JSON format and saves them in the Metadata DB
as :class:`airflow.models.serialized_dag.SerializedDagModel` rows.

Instead of parsing the DAG files again, the Webserver reads the
serialized DAGs in JSON, de-serializes them, creates the DagBag and uses it
to render the UI.

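The round trip described above can be illustrated with a toy model. This is only a sketch
built on the standard library: the dictionary layout, the ``serialized_dag`` table created
here, and the helper names ``save_dag`` and ``load_dag`` are assumptions made for
illustration and do not reflect Airflow's actual schema or the ``SerializedDagModel`` API.

.. code-block:: python

    import json
    import sqlite3

    # In-memory database standing in for the Metadata DB (illustration only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")


    def save_dag(dag):
        """Scheduler side: serialize a parsed DAG to JSON and persist it."""
        conn.execute(
            "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
            (dag["dag_id"], json.dumps(dag, sort_keys=True)),
        )
        conn.commit()


    def load_dag(dag_id):
        """Webserver side: read the JSON back without touching any DAG file."""
        row = conn.execute(
            "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


    # Hypothetical, simplified representation of a parsed DAG.
    parsed = {"dag_id": "example", "tasks": ["extract", "load"], "schedule": "@daily"}
    save_dag(parsed)
    restored = load_dag("example")

In this sketch only ``save_dag`` needs the parsed DAG; ``load_dag`` works from the
database alone, which is the property that lets the Webserver run without the DAG files.
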
One of the key features implemented as part of DAG Serialization is that,
instead of loading an entire DagBag when the Webserver starts, each DAG is loaded on demand from the
Serialized Dag table. This reduces Webserver startup time and memory usage. The reduction is notable
when you have a large number of DAGs.

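The effect of on-demand loading can be sketched as follows. The ``LazyDagBag`` class and
the dictionary standing in for the Serialized Dag table are hypothetical and are not
Airflow's ``DagBag`` implementation; the point is only that nothing is loaded until a
specific DAG is requested.

.. code-block:: python

    class LazyDagBag:
        """Toy DagBag that loads a DAG only when it is first requested."""

        def __init__(self, fetch):
            self._fetch = fetch    # callable: dag_id -> DAG (e.g. a DB lookup)
            self._dags = {}        # DAGs loaded so far
            self.fetch_count = 0   # number of lookups performed

        def get_dag(self, dag_id):
            if dag_id not in self._dags:
                self.fetch_count += 1
                self._dags[dag_id] = self._fetch(dag_id)
            return self._dags[dag_id]


    # Stand-in for the Serialized Dag table: 1000 hypothetical DAGs.
    table = {f"dag_{i}": {"dag_id": f"dag_{i}"} for i in range(1000)}
    bag = LazyDagBag(table.__getitem__)

    bag.get_dag("dag_7")
    bag.get_dag("dag_7")  # served from memory, no second lookup

After these two calls the bag holds a single DAG and has performed a single lookup,
regardless of how many DAGs exist in the table.
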
You can also store the DAG source code in the database, which makes the Webserver completely independent of the DAG files.
This is not necessary if your files are embedded in a Docker image or you can otherwise provide
them to the Webserver. The data is stored in the :class:`airflow.models.dagcode.DagCode` model.

The last element is rendering template fields. When serialization is enabled, templates are not rendered
on request; instead, a copy of the rendered field contents is saved before the task is executed on the worker.
The data is stored in the :class:`airflow.models.renderedtifields.RenderedTaskInstanceFields` model.
To limit the growth of the database, only the most recent entries are kept and older entries
are purged.

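The retention rule can be pictured with a small sketch. The function name
``record_rendered_fields``, the in-memory ``store``, and the limit of 3 are assumptions
chosen for illustration; Airflow keeps this data in the database, and the real limit comes
from the ``max_num_rendered_ti_fields_per_task`` setting.

.. code-block:: python

    from collections import defaultdict

    MAX_PER_TASK = 3  # plays the role of max_num_rendered_ti_fields_per_task


    def record_rendered_fields(store, task_id, execution_date, fields, limit=MAX_PER_TASK):
        """Save the rendered fields of one task run, then purge the oldest entries."""
        entries = store[task_id]
        entries.append((execution_date, fields))
        entries.sort(key=lambda entry: entry[0])          # oldest first
        del entries[: max(len(entries) - limit, 0)]       # keep only the newest `limit`


    store = defaultdict(list)
    for day in ("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"):
        record_rendered_fields(store, "print_date", day, {"bash_command": f"echo {day}"})

    kept = [date for date, _ in store["print_date"]]

After five runs, ``kept`` contains only the three most recent execution dates.
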
Enable Dag Serialization
------------------------

Add the following settings in ``airflow.cfg``:

.. code-block:: ini

    [core]
    store_serialized_dags = True
    store_dag_code = True

    # You can also update the following default configurations based on your needs
    min_serialized_dag_update_interval = 30
    min_serialized_dag_fetch_interval = 10
    max_num_rendered_ti_fields_per_task = 30

* ``store_serialized_dags``: Whether to serialize DAGs and persist them in the DB.
  If set to True, the Webserver reads from the DB instead of parsing DAG files.
* ``store_dag_code``: Whether to persist the code of DAG files in the DB.
  If set to True, the Webserver reads file contents from the DB instead of trying to access files in a DAG folder.
* ``min_serialized_dag_update_interval``: The minimum interval (in seconds) between updates of
  a serialized DAG in the DB. A higher value reduces the database write rate.
* ``min_serialized_dag_fetch_interval``: How often a SerializedDAG is re-fetched
  from the DB when it is already loaded in the DagBag in the Webserver. A higher value reduces
  load on the DB, at the expense of possibly displaying a stale cached version of the DAG.
* ``max_num_rendered_ti_fields_per_task``: The maximum number of Rendered Task Instance
  Fields (Template Fields) per task to store in the database.

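The two interval options trade freshness for database load. The sketch below shows one way
such throttling can work; ``should_write`` and ``CachedDag`` are hypothetical names, the
injected ``clock`` exists only to make the example deterministic, and the fetch interval is
assumed to be expressed in seconds like the update interval. It is not Airflow's implementation.

.. code-block:: python

    import time


    def should_write(last_updated, now, min_update_interval=30):
        """Scheduler side: skip the DB write if the row was updated too recently."""
        return last_updated is None or now - last_updated >= min_update_interval


    class CachedDag:
        """Webserver side: re-fetch a DAG only once the cached copy is old enough."""

        def __init__(self, fetch, min_fetch_interval=10, clock=time.monotonic):
            self._fetch = fetch
            self._interval = min_fetch_interval
            self._clock = clock
            self._value = None
            self._fetched_at = None

        def get(self):
            now = self._clock()
            if self._fetched_at is None or now - self._fetched_at >= self._interval:
                self._value = self._fetch()
                self._fetched_at = now
            return self._value


    # Deterministic demonstration with a fake clock and two DAG "versions".
    fake_now = [0.0]
    versions = iter(["v1", "v2"])
    dag = CachedDag(lambda: next(versions), min_fetch_interval=10, clock=lambda: fake_now[0])

    first = dag.get()    # nothing cached yet: fetched
    fake_now[0] = 5.0
    second = dag.get()   # interval not elapsed: cached, possibly stale copy
    fake_now[0] = 10.0
    third = dag.get()    # interval elapsed: fetched again

The second call returns the cached copy even though a newer version exists, which is the
staleness that a higher ``min_serialized_dag_fetch_interval`` accepts in exchange for fewer reads.
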
If you are upgrading Airflow from a version earlier than 1.10.7, do not forget to run ``airflow db upgrade``.

Limitations
-----------

* When using user-defined filters and macros, the Rendered View in the Webserver might show incorrect results
  for TIs that have not yet executed, because they might rely on external modules that the Webserver won't have access to.
  In that situation, use the ``airflow tasks render`` CLI command to debug or test the rendering of your template_fields.
  Once task execution starts, the Rendered Template Fields are stored in the DB in a separate table,
  after which the correct values are shown in the Webserver (Rendered View tab).

.. note::
    You need Airflow >= 1.10.10 for a completely stateless Webserver.
    Airflow 1.10.7 to 1.10.9 needed access to DAG files in some cases.
    More Information: https://airflow.apache.org/docs/1.10.9/dag-serialization.html#limitations

Using a different JSON Library
------------------------------

To use a different JSON library, such as ``ujson``, instead of the standard ``json`` library, you need to
define a ``json`` variable in the local Airflow settings file (``airflow_local_settings.py``) as follows:

.. code-block:: python

    import ujson

    # Expose ujson under the module-level name ``json``
    json = ujson

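If the alternative library may not be installed in every environment, the same ``json``
variable can be defined with a fallback. This variant is a suggestion and not something the
settings file requires; it assumes that silently falling back to the standard library is
acceptable for your deployment.

.. code-block:: python

    # airflow_local_settings.py (sketch)
    try:
        import ujson as json  # third-party library, used when it is installed
    except ImportError:
        import json  # otherwise fall back to the standard library

Either branch leaves a module-level ``json`` name that provides ``dumps`` and ``loads``.
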